Intro to Data Engineering
Data Architecture
Data Architecture Overview
Data Architecture in Enterprise Context
- Enterprise Architecture:
- Definition: The design of systems to support change in an enterprise through flexible and reversible decisions.
- Four Main Areas:
- Business Architecture: Strategy and business model.
- Application Architecture: Structure and interaction of key applications.
- Technical Architecture: Software and hardware components.
- Data Architecture: Supports evolving data needs.
- Change Management:
- Organizations evolve, and data architecture must adapt.
- One-Way vs. Two-Way Doors:
- One-Way Doors: Irreversible decisions (e.g., Amazon deciding to sell off AWS).
- Two-Way Doors: Reversible decisions (e.g., changing an object's storage class in S3; see the sketch after this list).
- Aim for two-way door decisions to handle change effectively.
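To make the two-way door concrete, here is a minimal sketch using boto3 that moves an S3 object between storage classes. The bucket and key names are placeholders, and the same call can simply be run again to reverse the decision.
```python
# Sketch: a two-way door in practice -- moving an S3 object to a cheaper
# storage class, then reversing the decision later. Assumes boto3 is
# installed and AWS credentials are configured; bucket/key names are
# hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

def change_storage_class(bucket: str, key: str, storage_class: str) -> None:
    """Copy an object onto itself with a new storage class."""
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        StorageClass=storage_class,   # e.g. "STANDARD_IA" or "STANDARD"
        MetadataDirective="COPY",     # keep the existing object metadata
    )

# Move to infrequent access to cut costs...
change_storage_class("my-bucket", "reports/2024.parquet", "STANDARD_IA")
# ...and reverse the decision later if access patterns change.
change_storage_class("my-bucket", "reports/2024.parquet", "STANDARD")
```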
Conway’s Law
- Definition: The structure of a system mirrors the communication structure of the organization that designs it.
- Implication:
- Siloed departments → Siloed data systems.
- Cross-functional collaboration → Integrated data systems.
- Key Takeaway: Understand your organization’s communication structure to design effective data systems.
Principles of Good Data Architecture
- Choose Common Components Wisely:
- Use components like object storage, version control, and orchestration systems that facilitate collaboration.
- Avoid a one-size-fits-all approach.
- Architecture is Leadership:
- Mentor others and provide training on common components.
- Seek mentorship from data architects.
- Always Be Architecting:
- Data architecture is an ongoing process.
- Build systems that evolve with organizational needs.
- Build Loosely Coupled Systems: Use interchangeable components for flexibility.
- Make Reversible Decisions: Design systems that allow for easy changes (a sketch illustrating both principles follows this list).
- Plan for Failure: Anticipate system failures and design for resilience.
- Architect for Scalability: Build systems that scale up and down with demand.
- Prioritize Security: Implement zero-trust security (no default trust, authenticate every action).
- Embrace FinOps: Optimize costs while maximizing revenue potential.
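To illustrate loose coupling and reversible decisions together, here is a minimal Python sketch (all class and function names are illustrative, not from the course): pipeline code depends on a small storage interface rather than a concrete backend, so the backend can be swapped later without touching the pipeline.
```python
# Sketch of loose coupling: pipeline code depends on a small storage
# protocol, not a concrete backend, so swapping local disk for S3 (or
# back) stays a reversible, two-way-door decision.
# All names here are illustrative, not from the course.
from typing import Protocol

class ObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class LocalStore:
    """In-memory stand-in; an S3-backed class could implement the same protocol."""
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data
    def get(self, key: str) -> bytes:
        return self._blobs[key]

def archive_report(store: ObjectStore, name: str, body: bytes) -> None:
    # The pipeline only sees the interface; the backend is interchangeable.
    store.put(f"reports/{name}", body)

store = LocalStore()
archive_report(store, "daily.csv", b"date,total\n2024-01-01,42\n")
print(store.get("reports/daily.csv").decode())
```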
Planning for Failure
- Key Metrics:
- Availability (Uptime): Percentage of time a system is operational (e.g., 99.99% uptime allows ~53 minutes of downtime per year; see the worked example after this list).
- Reliability: Probability of a system performing its function within defined standards.
- Durability: The likelihood that stored data survives intact over time (e.g., Amazon S3 is designed for 99.999999999%, or "eleven nines", durability).
- Recovery Objectives:
- RTO (Recovery Time Objective): Maximum acceptable downtime.
- RPO (Recovery Point Objective): Maximum acceptable data loss.
- Security:
- Zero-Trust Security: Authenticate every action; no default trust.
- Avoid relying on hardened-perimeter security, which trusts everything inside the network boundary and distrusts only the outside.
- Cost and Scalability:
- Use FinOps practices to manage dynamic cloud costs (e.g., choosing between on-demand and spot instances).
- Scale systems to handle demand spikes without crashing.
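The downtime arithmetic behind availability targets is worth working out once. The sketch below derives the allowed downtime budget directly from the percentage; nothing here is course-specific.
```python
# Worked example: translate an availability target into an allowed
# downtime budget. 99.99% uptime leaves 0.01% of the year for outages.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {downtime_minutes_per_year(pct):.1f} min/year down")
# 99.99% allows ~52.6 minutes per year -- your RTO has to fit inside that budget.
```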
Batch Architecture
- Definition: Processes data in chunks (batches) at fixed intervals.
- Use Case: When real-time analysis is not critical (e.g., daily sales reports).
- ETL vs. ELT:
- ETL (Extract, Transform, Load): Transform data before loading it into the data warehouse.
- ELT (Extract, Load, Transform): Load raw data into the warehouse first, then transform it there (a toy ETL sketch follows this list).
- Data Marts:
- Subsets of a data warehouse focused on specific departments (e.g., sales, marketing).
- Improve query performance and accessibility for analysts.
- Key Considerations:
- Choose common components for collaboration.
- Plan for source system failures or schema changes.
- Optimize for cost and performance.
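To make the batch pattern concrete, here is a toy ETL job in plain Python, with a SQLite database standing in for the warehouse; the file and table names are hypothetical. In an ELT variant, the transform step would instead run inside the warehouse after loading the raw rows.
```python
# Toy batch ETL job: extract raw sales rows, transform them into a daily
# aggregate, and load the result into a "warehouse" table (SQLite stand-in).
# File and table names are illustrative, not from the course.
import csv
import sqlite3
from collections import defaultdict

def extract(path: str) -> list[dict]:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))      # expects date,amount columns

def transform(rows: list[dict]) -> list[tuple[str, float]]:
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["date"]] += float(row["amount"])   # aggregate per day
    return sorted(totals.items())

def load(daily: list[tuple[str, float]], db: str) -> None:
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS daily_sales (date TEXT, total REAL)")
    con.executemany("INSERT INTO daily_sales VALUES (?, ?)", daily)
    con.commit()
    con.close()

# Run the whole batch once per interval (e.g., nightly via an orchestrator).
load(transform(extract("sales.csv")), "warehouse.db")
```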
Streaming Architecture
- Definition: Processes data in near real-time as a continuous stream.
- Components:
- Producer: Data source (e.g., clickstream data, IoT devices).
- Consumer: Service or application that processes data.
- Streaming Broker: Coordinates data flow between producers and consumers (a toy sketch of this pattern follows this list).
- Lambda Architecture:
- Combines batch and streaming processing.
- Challenges: Maintaining two parallel systems (batch and streaming) with separate codebases.
- Kappa Architecture:
- Uses a stream processing platform as the backbone.
- Treats batch processing as a special case of streaming.
- Modern Approaches:
- Apache Beam: Unifies batch and streaming with a single codebase.
- Apache Flink: Popular stream processing tool.
- Key Takeaway: Streaming architectures are essential for real-time analytics and event-based systems.
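Below is a self-contained toy of the producer → broker → consumer pattern, with a standard-library queue standing in for the streaming broker; a production system would use a real broker such as Kafka or Kinesis instead.
```python
# Toy streaming pipeline: a producer emits events, a queue stands in for
# the streaming broker, and a consumer processes events as they arrive.
import queue
import threading
import time

broker: queue.Queue = queue.Queue()   # stand-in for the streaming broker
STOP = object()                       # sentinel marking the end of the stream

def producer() -> None:
    for i in range(5):                # e.g. clickstream or IoT events
        broker.put({"event_id": i, "ts": time.time()})
        time.sleep(0.1)
    broker.put(STOP)

def consumer() -> None:
    while (event := broker.get()) is not STOP:
        print("processed", event)     # near-real-time processing

t = threading.Thread(target=producer)
t.start()
consumer()
t.join()
```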
Architecting for Compliance
- Importance: Avoid lawsuits and fines by adhering to regulations.
- Key Regulations:
- GDPR (General Data Protection Regulation):
- Protects personal data of individuals in the EU.
- Requires consent and the ability to delete personal data on request (a minimal erasure sketch follows this section).
- HIPAA (Health Insurance Portability and Accountability Act): Protects sensitive patient data in the US.
- Sarbanes-Oxley Act: Mandates financial reporting and record-keeping standards for US public companies.
- Best Practices:
- Build systems that comply with modern regulations (e.g., GDPR).
- Design flexible, loosely coupled systems to adapt to regulatory changes.
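As a rough illustration of the GDPR deletion requirement, here is a minimal "right to erasure" sketch over hypothetical SQLite tables; a real implementation would also have to cover backups, logs, and downstream copies of the data.
```python
# Sketch of a GDPR-style "right to erasure" request: delete one user's
# personal data from every table that references it. Table names are
# hypothetical; real systems must also handle backups, logs, and
# derived/downstream datasets.
import sqlite3

USER_TABLES = ["users", "orders", "clickstream"]  # illustrative names

def erase_user(db_path: str, user_id: int) -> None:
    con = sqlite3.connect(db_path)
    for table in USER_TABLES:
        # Table names are fixed constants here; user_id is a bound parameter.
        con.execute(f"DELETE FROM {table} WHERE user_id = ?", (user_id,))
    con.commit()
    con.close()
```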
Key Takeaways
- Data Architecture: Supports evolving data needs and aligns with enterprise goals.
- Principles: Choose common components, prioritize security, and embrace FinOps.
- Batch vs. Streaming: Batch for periodic processing; streaming for real-time analytics.
- Compliance: Build systems that adhere to regulations like GDPR and HIPAA.
- Flexibility: Design systems that can adapt to changing business needs and regulations.
Source: DeepLearning.AI data engineering course.