Intro to Data Engineering
Data Architecture
Data Architecture Overview
Data Architecture in Enterprise Context
- Enterprise Architecture:
- Definition: The design of systems to support change in an enterprise through flexible and reversible decisions.
- Four Main Areas:
- Business Architecture: Strategy and business model.
- Application Architecture: Structure and interaction of key applications.
- Technical Architecture: Software and hardware components.
- Data Architecture: Supports evolving data needs.
- Change Management:
- Organizations evolve, and data architecture must adapt.
- One-Way vs. Two-Way Doors:
- One-Way Doors: Irreversible decisions (e.g., Amazon deciding to sell off AWS).
- Two-Way Doors: Reversible decisions (e.g., changing an object's storage class in S3; see the sketch after this list).
- Aim for two-way door decisions to handle change effectively.
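To make the two-way door concrete, here is a minimal sketch using boto3 that moves an S3 object between storage classes. The bucket and key names are placeholders, and the same call can simply be run again to reverse the decision.
```python
# Sketch: a two-way door in practice -- moving an S3 object to a cheaper
# storage class, then reversing the decision later. Assumes boto3 is
# installed and AWS credentials are configured; bucket/key names are
# hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

def change_storage_class(bucket: str, key: str, storage_class: str) -> None:
    """Copy an object onto itself with a new storage class."""
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        StorageClass=storage_class,   # e.g. "STANDARD_IA" or "STANDARD"
        MetadataDirective="COPY",     # keep the existing object metadata
    )

# Move to infrequent access to cut costs...
change_storage_class("my-bucket", "reports/2024.parquet", "STANDARD_IA")
# ...and reverse the decision later if access patterns change.
change_storage_class("my-bucket", "reports/2024.parquet", "STANDARD")
```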
Conway’s Law
- Definition: The structure of a system mirrors the communication structure of the organization that designs it.
- Implication:
- Siloed departments → Siloed data systems.
- Cross-functional collaboration → Integrated data systems.
- Key Takeaway: Understand your organization’s communication structure to design effective data systems.
Principles of Good Data Architecture
- Choose Common Components Wisely:
- Use components like object storage, version control, and orchestration systems that facilitate collaboration.
- Avoid a one-size-fits-all approach.
- Architecture is Leadership:
- Mentor others and provide training on common components.
- Seek mentorship from data architects.
- Always Be Architecting:
- Data architecture is an ongoing process.
- Build systems that evolve with organizational needs.
- Build Loosely Coupled Systems: Use interchangeable components for flexibility.
- Make Reversible Decisions: Design systems that allow for easy changes (a sketch illustrating both principles follows this list).
- Plan for Failure: Anticipate system failures and design for resilience.
- Architect for Scalability: Build systems that scale up and down with demand.
- Prioritize Security: Implement zero-trust security (no default trust, authenticate every action).
- Embrace FinOps: Optimize costs while maximizing revenue potential.
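To illustrate loose coupling and reversible decisions together, here is a minimal Python sketch (all class and function names are illustrative, not from the course): pipeline code depends on a small storage interface rather than a concrete backend, so the backend can be swapped later without touching the pipeline.
```python
# Sketch of loose coupling: pipeline code depends on a small storage
# protocol, not a concrete backend, so swapping local disk for S3 (or
# back) stays a reversible, two-way-door decision.
# All names here are illustrative, not from the course.
from typing import Protocol

class ObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class LocalStore:
    """In-memory stand-in; an S3-backed class could implement the same protocol."""
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data
    def get(self, key: str) -> bytes:
        return self._blobs[key]

def archive_report(store: ObjectStore, name: str, body: bytes) -> None:
    # The pipeline only sees the interface; the backend is interchangeable.
    store.put(f"reports/{name}", body)

store = LocalStore()
archive_report(store, "daily.csv", b"date,total\n2024-01-01,42\n")
print(store.get("reports/daily.csv").decode())
```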
Planning for Failure
- Key Metrics:
- Availability (Uptime): Percentage of time a system is operational (e.g., 99.99% uptime allows ~53 minutes of downtime per year; see the worked example after this list).
- Reliability: Probability of a system performing its function within defined standards.
- Durability: The likelihood that stored data survives intact over time (e.g., Amazon S3 is designed for 99.999999999%, or "eleven nines", durability).
- Recovery Objectives:
- RTO (Recovery Time Objective): Maximum acceptable downtime.
- RPO (Recovery Point Objective): Maximum acceptable data loss.
- Security:
- Zero-Trust Security: Authenticate every action; no default trust.
- Avoid relying on hardened-perimeter security, which trusts everything inside the network boundary and distrusts only the outside.
- Cost and Scalability:
- Use FinOps practices to manage dynamic cloud costs (e.g., choosing between on-demand and spot instances).
- Scale systems to handle demand spikes without crashing.
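The downtime arithmetic behind availability targets is worth working out once. The sketch below derives the allowed downtime budget directly from the percentage; nothing here is course-specific.
```python
# Worked example: translate an availability target into an allowed
# downtime budget. 99.99% uptime leaves 0.01% of the year for outages.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {downtime_minutes_per_year(pct):.1f} min/year down")
# 99.99% allows ~52.6 minutes per year -- your RTO has to fit inside that budget.
```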
Batch Architecture
- Definition: Processes data in chunks (batches) at fixed intervals.
- Use Case: When real-time analysis is not critical (e.g., daily sales reports).
- ETL vs. ELT:
- ETL (Extract, Transform, Load): Transform data before loading it into the data warehouse.
- ELT (Extract, Load, Transform): Load raw data into the warehouse first, then transform it there (a toy ETL sketch follows this list).
- Data Marts:
- Subsets of a data warehouse focused on specific departments (e.g., sales, marketing).
- Improve query performance and accessibility for analysts.
- Key Considerations:
- Choose common components for collaboration.
- Plan for source system failures or schema changes.
- Optimize for cost and performance.
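To make the batch pattern concrete, here is a toy ETL job in plain Python, with a SQLite database standing in for the warehouse; the file and table names are hypothetical. In an ELT variant, the transform step would instead run inside the warehouse after loading the raw rows.
```python
# Toy batch ETL job: extract raw sales rows, transform them into a daily
# aggregate, and load the result into a "warehouse" table (SQLite stand-in).
# File and table names are illustrative, not from the course.
import csv
import sqlite3
from collections import defaultdict

def extract(path: str) -> list[dict]:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))      # expects date,amount columns

def transform(rows: list[dict]) -> list[tuple[str, float]]:
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["date"]] += float(row["amount"])   # aggregate per day
    return sorted(totals.items())

def load(daily: list[tuple[str, float]], db: str) -> None:
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS daily_sales (date TEXT, total REAL)")
    con.executemany("INSERT INTO daily_sales VALUES (?, ?)", daily)
    con.commit()
    con.close()

# Run the whole batch once per interval (e.g., nightly via an orchestrator).
load(transform(extract("sales.csv")), "warehouse.db")
```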
Streaming Architecture
- Definition: Processes data in near real-time as a continuous stream.
- Components:
- Producer: Data source (e.g., clickstream data, IoT devices).
- Consumer: Service or application that processes data.
- Streaming Broker: Coordinates data flow between producers and consumers (a toy sketch of this pattern follows this list).
- Lambda Architecture:
- Combines batch and streaming processing.
- Challenges: Maintaining two parallel systems (batch and streaming) with separate codebases.
- Kappa Architecture:
- Uses a stream processing platform as the backbone.
- Treats batch processing as a special case of streaming.
- Modern Approaches:
- Apache Beam: Unifies batch and streaming with a single codebase.
- Apache Flink: Popular stream processing tool.
- Key Takeaway: Streaming architectures are essential for real-time analytics and event-based systems.
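Below is a self-contained toy of the producer → broker → consumer pattern, with a standard-library queue standing in for the streaming broker; a production system would use a real broker such as Kafka or Kinesis instead.
```python
# Toy streaming pipeline: a producer emits events, a queue stands in for
# the streaming broker, and a consumer processes events as they arrive.
import queue
import threading
import time

broker: queue.Queue = queue.Queue()   # stand-in for the streaming broker
STOP = object()                       # sentinel marking the end of the stream

def producer() -> None:
    for i in range(5):                # e.g. clickstream or IoT events
        broker.put({"event_id": i, "ts": time.time()})
        time.sleep(0.1)
    broker.put(STOP)

def consumer() -> None:
    while (event := broker.get()) is not STOP:
        print("processed", event)     # near-real-time processing

t = threading.Thread(target=producer)
t.start()
consumer()
t.join()
```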
Architecting for Compliance
- Importance: Avoid lawsuits and fines by adhering to regulations.
- Key Regulations:
- GDPR (General Data Protection Regulation):
- Protects personal data of individuals in the EU.
- Requires consent and the ability to delete personal data on request (a minimal erasure sketch follows this section).
- HIPAA (Health Insurance Portability and Accountability Act): Protects sensitive patient data in the US.
- Sarbanes-Oxley Act: Mandates financial reporting and record-keeping standards for US public companies.
- Best Practices:
- Build systems that comply with modern regulations (e.g., GDPR).
- Design flexible, loosely coupled systems to adapt to regulatory changes.
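As a rough illustration of the GDPR deletion requirement, here is a minimal "right to erasure" sketch over hypothetical SQLite tables; a real implementation would also have to cover backups, logs, and downstream copies of the data.
```python
# Sketch of a GDPR-style "right to erasure" request: delete one user's
# personal data from every table that references it. Table names are
# hypothetical; real systems must also handle backups, logs, and
# derived/downstream datasets.
import sqlite3

USER_TABLES = ["users", "orders", "clickstream"]  # illustrative names

def erase_user(db_path: str, user_id: int) -> None:
    con = sqlite3.connect(db_path)
    for table in USER_TABLES:
        # Table names are fixed constants here; user_id is a bound parameter.
        con.execute(f"DELETE FROM {table} WHERE user_id = ?", (user_id,))
    con.commit()
    con.close()
```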
Key Takeaways
- Data Architecture: Supports evolving data needs and aligns with enterprise goals.
- Principles: Choose common components, prioritize security, and embrace FinOps.
- Batch vs. Streaming: Batch for periodic processing; streaming for real-time analytics.
- Compliance: Build systems that adhere to regulations like GDPR and HIPAA.
- Flexibility: Design systems that can adapt to changing business needs and regulations.
Source: DeepLearning.AI data engineering course.