Data Architecture Overview

Data Architecture in Enterprise Context

  • Enterprise Architecture:
    • Definition: The design of systems to support change in an enterprise through flexible and reversible decisions.
    • Four Main Areas:
      1. Business Architecture: Strategy and business model.
      2. Application Architecture: Structure and interaction of key applications.
      3. Technical Architecture: Software and hardware components.
      4. Data Architecture: Supports evolving data needs.
  • Change Management:
    • Organizations evolve, and data architecture must adapt.
    • One-Way vs. Two-Way Doors:
      • One-Way Doors: Irreversible (hard-to-undo) decisions (e.g., Amazon's decision to build and sell AWS).
      • Two-Way Doors: Reversible decisions (e.g., changing storage classes in S3).
    • Aim for two-way door decisions to handle change effectively.

Conway’s Law

  • Definition: The structure of a system reflects the communication structure of the organization.
  • Implication:
    • Siloed departments → Siloed data systems.
    • Cross-functional collaboration → Integrated data systems.
  • Key Takeaway: Understand your organization’s communication structure to design effective data systems.

Principles of Good Data Architecture

  1. Choose Common Components Wisely:
    • Use components like object storage, version control, and orchestration systems that facilitate collaboration.
    • Avoid a one-size-fits-all approach.
  2. Architecture is Leadership:
    • Mentor others and provide training on common components.
    • Seek mentorship from data architects.
  3. Always Be Architecting:
    • Data architecture is an ongoing process.
    • Build systems that evolve with organizational needs.
  4. Build Loosely Coupled Systems: Use interchangeable components for flexibility (see the sketch after this list).
  5. Make Reversible Decisions: Design systems that allow for easy changes.
  6. Plan for Failure: Anticipate system failures and design for resilience.
  7. Architect for Scalability: Build systems that scale up and down with demand.
  8. Prioritize Security: Implement zero-trust security (no default trust, authenticate every action).
  9. Embrace FinOps: Optimize costs while maximizing revenue potential.
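
As an illustration of principles 4 and 5 (loose coupling and reversible decisions), here is a minimal hypothetical Python sketch, not from the course: pipeline code depends only on a small storage interface, so the concrete backend stays a two-way-door choice.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Minimal storage interface; pipeline code depends on this, not on a vendor SDK."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    """In-memory stand-in, handy for tests and local development."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

def archive_report(store: ObjectStore, report: bytes) -> None:
    # Works unchanged whether the store is in-memory, S3-backed, etc.
    store.put("reports/daily.csv", report)

archive_report(InMemoryStore(), b"date,sales\n2024-01-01,100\n")
```

An S3-backed ObjectStore implementation (e.g., using boto3) could later replace InMemoryStore without touching archive_report, which is what keeps the storage decision reversible.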

Planning for Failure

  • Key Metrics:
    • Availability (Uptime): Percentage of time a system is operational (e.g., 99.99% uptime allows roughly 52 minutes of downtime per year; see the sketch after this list).
    • Reliability: Probability of a system performing its function within defined standards.
    • Durability: Likelihood that stored data survives without loss or corruption (e.g., Amazon S3 is designed for 99.999999999%, or 11 nines, of durability).
  • Recovery Objectives:
    • RTO (Recovery Time Objective): Maximum acceptable downtime.
    • RPO (Recovery Point Objective): Maximum acceptable data loss.
  • Security:
    • Zero-Trust Security: Authenticate every action; no default trust.
    • Avoid hardened perimeter security (trust inside, untrusted outside).
  • Cost and Scalability:
    • Use FinOps to manage dynamic cloud costs (e.g., choosing between on-demand and spot instances).
    • Scale systems to handle demand spikes without crashing.
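
To make the availability math concrete, here is a small illustrative calculation (not from the course) that converts an uptime target into the downtime budget it implies:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def downtime_budget_minutes(availability_pct: float) -> float:
    """Maximum downtime per year allowed by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100) / 60

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% uptime -> {downtime_budget_minutes(target):7.1f} minutes/year")

# 99.99% uptime works out to roughly 52 minutes of downtime per year,
# a number you can compare directly against your RTO.
```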

Batch Architecture

  • Definition: Processes data in chunks (batches) at fixed intervals.
  • Use Case: When real-time analysis is not critical (e.g., daily sales reports).
  • ETL vs. ELT:
    • ETL (Extract, Transform, Load): Transform data before loading into a data warehouse.
    • ELT (Extract, Load, Transform): Load raw data into the warehouse first, then transform it there (both patterns are sketched after this list).
  • Data Marts:
    • Subsets of a data warehouse focused on specific departments (e.g., sales, marketing).
    • Improve query performance and accessibility for analysts.
  • Key Considerations:
    • Choose common components for collaboration.
    • Plan for source system failures or schema changes.
    • Optimize for cost and performance.
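
The following hypothetical sketch (invented table and column names, with SQLite standing in for a real warehouse) shows the same daily-sales result produced both ways, ETL and ELT:

```python
import sqlite3

# Hypothetical raw records extracted from a source system.
raw_orders = [("2024-01-01", "EU", 120.0), ("2024-01-01", "US", 80.0)]

warehouse = sqlite3.connect(":memory:")

# ETL: transform in the pipeline, then load only the aggregated result.
daily_total = sum(amount for _, _, amount in raw_orders)                      # transform
warehouse.execute("CREATE TABLE daily_sales (day TEXT, total REAL)")
warehouse.execute("INSERT INTO daily_sales VALUES (?, ?)", ("2024-01-01", daily_total))  # load

# ELT: load the raw data first, then transform inside the warehouse with SQL.
warehouse.execute("CREATE TABLE raw_orders (day TEXT, region TEXT, amount REAL)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)  # load
warehouse.execute(
    "CREATE TABLE daily_sales_elt AS "
    "SELECT day, SUM(amount) AS total FROM raw_orders GROUP BY day"
)                                                                             # transform

print(warehouse.execute("SELECT * FROM daily_sales_elt").fetchall())
```

A department-specific table like daily_sales_elt is also roughly what a data mart provides: a smaller, query-friendly subset of the warehouse.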

Streaming Architecture

  • Definition: Processes data in near real-time as a continuous stream.
  • Components:
    • Producer: Data source (e.g., clickstream data, IoT devices).
    • Consumer: Service or application that processes data.
    • Streaming Broker: Coordinates the flow of data between producers and consumers (see the sketch after this list).
  • Lambda Architecture:
    • Combines batch and streaming processing.
    • Challenges: Managing parallel systems with different codebases.
  • Kappa Architecture:
    • Uses a stream processing platform as the backbone.
    • Treats batch processing as a special case of streaming.
  • Modern Approaches:
    • Apache Beam: Unifies batch and streaming with a single codebase.
    • Apache Flink: Popular stream processing tool.
  • Key Takeaway: Streaming architectures are essential for real-time analytics and event-based systems.
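
The sketch below (illustrative only) mimics the three streaming roles with an in-memory queue standing in for the broker; in production the broker would be a platform such as Kafka or Kinesis:

```python
import queue
import threading

broker = queue.Queue()  # stand-in for a streaming broker

def producer() -> None:
    """Emits click events, e.g. from a website or an IoT device."""
    for page in ("home", "cart", "checkout"):
        broker.put({"event": "click", "page": page})
    broker.put(None)  # sentinel marking end of stream (only needed for this demo)

def consumer() -> None:
    """Processes each event as it arrives (near real-time)."""
    while (event := broker.get()) is not None:
        print("processed", event)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```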

Architecting for Compliance

  • Importance: Avoid lawsuits and fines by adhering to regulations.
  • Key Regulations:
    • GDPR (General Data Protection Regulation):
      • Protects personal data in the EU.
      • Requires consent and the ability to delete personal data on request (see the sketch after this list).
    • HIPAA (Health Insurance Portability and Accountability Act): Protects sensitive patient data in the US.
    • Sarbanes-Oxley Act (SOX): Mandates financial reporting controls and record-keeping.
  • Best Practices:
    • Build systems that comply with modern regulations (e.g., GDPR).
    • Design flexible, loosely coupled systems to adapt to regulatory changes.
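
As a hypothetical illustration of GDPR's "ability to delete data" requirement (table names are invented, and a real implementation would also have to cover backups, downstream copies, and logs):

```python
import sqlite3

# Hypothetical tables that hold personal data, all keyed by user_id.
PERSONAL_DATA_TABLES = ["users", "orders", "support_tickets"]

def handle_deletion_request(db: sqlite3.Connection, user_id: int) -> None:
    """Remove one user's personal data from every table that stores it."""
    for table in PERSONAL_DATA_TABLES:
        db.execute(f"DELETE FROM {table} WHERE user_id = ?", (user_id,))
    db.commit()

# Tiny demo with an in-memory database.
db = sqlite3.connect(":memory:")
for table in PERSONAL_DATA_TABLES:
    db.execute(f"CREATE TABLE {table} (user_id INTEGER, payload TEXT)")
    db.execute(f"INSERT INTO {table} VALUES (42, 'personal data')")
handle_deletion_request(db, user_id=42)
print(db.execute("SELECT COUNT(*) FROM users").fetchone())  # (0,)
```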

Key Takeaways

  1. Data Architecture: Supports evolving data needs and aligns with enterprise goals.
  2. Principles: Choose common components, prioritize security, and embrace FinOps.
  3. Batch vs. Streaming: Batch for periodic processing; streaming for real-time analytics.
  4. Compliance: Build systems that adhere to regulations like GDPR and HIPAA.
  5. Flexibility: Design systems that can adapt to changing business needs and regulations.

Source: DeepLearning.ai data engineering course.