Data Architecture Overview
Data Architecture in Enterprise Context
- Enterprise Architecture: The design of systems to support change in an enterprise through flexible and reversible decisions.
- Four Main Areas:
  - Business Architecture: Strategy and business model.
  - Application Architecture: Structure and interaction of key applications.
  - Technical Architecture: Software and hardware components.
  - Data Architecture: Supports the enterprise's evolving data needs.
- Change Management: Organizations evolve, and data architecture must adapt with them.
- One-Way vs. Two-Way Doors:
  - One-Way Doors: Irreversible decisions (e.g., selling AWS).
  - Two-Way Doors: Reversible decisions (e.g., changing storage classes in S3).
  - Aim for two-way-door decisions to handle change effectively.
Conway's Law
- Definition: The structure of a system reflects the communication structure of the organization that designs it.
- Implication:
  - Siloed departments → siloed data systems.
  - Cross-functional collaboration → integrated data systems.
- Key Takeaway: Understand your organization's communication structure to design effective data systems.
Principles of Good Data Architecture
- Choose Common Components Wisely: Use components such as object storage, version control, and orchestration systems that facilitate collaboration; avoid a one-size-fits-all approach.
- Architecture Is Leadership: Mentor others and provide training on common components; seek mentorship from data architects.
- Always Be Architecting: Data architecture is an ongoing process; build systems that evolve with organizational needs.
- Build Loosely Coupled Systems: Use interchangeable components for flexibility (see the sketch after this list).
- Make Reversible Decisions: Design systems that allow for easy changes.
- Plan for Failure: Anticipate system failures and design for resilience.
- Architect for Scalability: Build systems that scale up and down with demand.
- Prioritize Security: Implement zero-trust security (no default trust; authenticate every action).
- Embrace FinOps: Optimize costs while maximizing revenue potential.
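A minimal sketch of the loose-coupling and reversibility principles, assuming a hypothetical `ObjectStore` interface that is not from the course: pipeline code depends only on the interface, not a vendor SDK, so the choice of backing store stays a two-way-door decision.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Minimal storage interface so the backing store stays swappable."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalObjectStore(ObjectStore):
    """In-memory stand-in; a real implementation might wrap S3, GCS, etc."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_report(store: ObjectStore, name: str, payload: bytes) -> None:
    # Pipeline code talks to the interface, not to a specific vendor,
    # which keeps the "which object store?" decision reversible.
    store.put(f"reports/{name}", payload)


if __name__ == "__main__":
    store = LocalObjectStore()
    archive_report(store, "daily_sales.csv", b"date,amount\n2024-01-01,100\n")
    print(store.get("reports/daily_sales.csv").decode())
```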
Planning for Failure
- Key Metrics:
  - Availability (Uptime): Percentage of time a system is operational (e.g., 99.99% uptime allows roughly 52 minutes of downtime per year; see the calculation sketch after this list).
  - Reliability: Probability of a system performing its intended function within defined standards.
  - Durability: Ability to withstand data loss (e.g., Amazon S3 offers 99.999999999% durability).
- Recovery Objectives:
  - RTO (Recovery Time Objective): Maximum acceptable downtime.
  - RPO (Recovery Point Objective): Maximum acceptable data loss.
- Security:
  - Zero-Trust Security: Authenticate every action; no default trust.
  - Avoid relying on hardened-perimeter security alone (trusted inside, untrusted outside).
- Cost and Scalability:
  - Use FinOps to manage dynamic cloud costs (e.g., on-demand vs. spot instances).
  - Scale systems to handle demand spikes without crashing.
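A small sketch that turns an availability target into the downtime budget it implies over a year; the helper name `allowed_downtime_minutes` is made up for illustration.

```python
def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = 365 * 24 * 60) -> float:
    """Downtime budget implied by an availability target over one year."""
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% uptime -> ~{budget:,.1f} minutes of downtime per year")
```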
Batch Architecture
- Definition: Processes data in chunks (batches) at fixed intervals.
- Use Case: When real-time analysis is not critical (e.g., daily sales reports).
- ETL vs. ELT (see the sketch after this list):
  - ETL (Extract, Transform, Load): Transform data before loading it into the data warehouse.
  - ELT (Extract, Load, Transform): Load data into the warehouse first, then transform it there.
- Data Marts:
  - Subsets of a data warehouse focused on specific departments (e.g., sales, marketing).
  - Improve query performance and accessibility for analysts.
- Key Considerations:
  - Choose common components for collaboration.
  - Plan for source system failures or schema changes.
  - Optimize for cost and performance.
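A minimal ELT sketch using SQLite as a stand-in warehouse: raw rows are landed first, then a daily-sales table is built with SQL inside the "warehouse". The table and column names are hypothetical, not from the course.

```python
import sqlite3

# Hypothetical raw order events extracted from a source system.
RAW_ORDERS = [
    ("2024-01-01", "store_a", 120.0),
    ("2024-01-01", "store_b", 80.0),
    ("2024-01-02", "store_a", 95.5),
]


def elt_daily_sales(conn: sqlite3.Connection) -> list:
    # Load: land the raw data in the warehouse first (ELT), untransformed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders (order_date TEXT, store TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)

    # Transform: build the reporting table inside the warehouse with SQL.
    conn.execute(
        """
        CREATE TABLE daily_sales AS
        SELECT order_date, SUM(amount) AS total_sales
        FROM raw_orders
        GROUP BY order_date
        """
    )
    return conn.execute(
        "SELECT order_date, total_sales FROM daily_sales ORDER BY order_date"
    ).fetchall()


if __name__ == "__main__":
    with sqlite3.connect(":memory:") as conn:
        for row in elt_daily_sales(conn):
            print(row)
```

An ETL variant would aggregate the rows in application code (or a transformation engine) before inserting only the summarized result into the warehouse.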
Streaming Architecture
- Definition: Processes data in near real-time as a continuous stream.
- Components (see the sketch after this list):
  - Producer: Data source (e.g., clickstream data, IoT devices).
  - Consumer: Service or application that processes the data.
  - Streaming Broker: Coordinates data between producers and consumers.
- Lambda Architecture:
  - Combines batch and streaming processing.
  - Challenge: Managing parallel systems with different codebases.
- Kappa Architecture:
  - Uses a stream-processing platform as the backbone.
  - Treats batch processing as a special case of streaming.
- Modern Approaches:
  - Apache Beam: Unifies batch and streaming with a single codebase.
  - Apache Flink: Popular stream-processing tool.
- Key Takeaway: Streaming architectures are essential for real-time analytics and event-based systems.
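A toy sketch of the producer / broker / consumer roles, with a Python queue standing in for a real streaming broker such as Kafka or Kinesis; the event fields are made up for illustration.

```python
import queue
import threading
import time

# A queue stands in for the streaming broker; in production this role is
# played by a system like Kafka, Kinesis, or Pub/Sub.
broker: queue.Queue = queue.Queue()
STOP = object()  # sentinel marking the end of this toy stream


def producer() -> None:
    """Emits click events into the broker as they happen."""
    for i in range(5):
        broker.put({"event": "click", "user_id": i, "ts": time.time()})
        time.sleep(0.1)
    broker.put(STOP)


def consumer() -> None:
    """Processes each event shortly after it arrives (near real-time)."""
    while True:
        event = broker.get()
        if event is STOP:
            break
        print(f"processed {event['event']} from user {event['user_id']}")


if __name__ == "__main__":
    t = threading.Thread(target=producer)
    t.start()
    consumer()
    t.join()
```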
Architecting for Compliance
- Importance: Avoid lawsuits and fines by adhering to regulations.
- Key Regulations:
  - GDPR: Protects personal data in the EU; requires consent and the ability to delete data on request (see the sketch after this list).
  - HIPAA: Protects sensitive patient data in the US.
  - Sarbanes-Oxley Act: Mandates financial reporting and record-keeping standards.
- Best Practices:
  - Build systems that comply with modern regulations such as GDPR.
  - Design flexible, loosely coupled systems that can adapt to regulatory changes.
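A hedged sketch of servicing a GDPR right-to-erasure request, assuming a hypothetical schema where several tables carry a `user_id` column; real systems also need to handle backups, logs, and downstream copies of the data.

```python
import sqlite3


def erase_user(conn: sqlite3.Connection, user_id: int) -> int:
    """Delete all rows tied to a user across the (hypothetical) tables
    that hold personal data, as a right-to-erasure request requires."""
    removed = 0
    for table in ("user_profiles", "clickstream_events", "support_tickets"):
        # Table names come from a fixed, hypothetical list, not user input.
        cur = conn.execute(f"DELETE FROM {table} WHERE user_id = ?", (user_id,))
        removed += cur.rowcount
    conn.commit()
    return removed


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    for table in ("user_profiles", "clickstream_events", "support_tickets"):
        conn.execute(f"CREATE TABLE {table} (user_id INTEGER, payload TEXT)")
        conn.execute(f"INSERT INTO {table} VALUES (42, 'example')")
    print(f"rows removed: {erase_user(conn, 42)}")
```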
Key Takeaways
- Data Architecture: Supports evolving data needs and aligns with enterprise goals.
- Principles: Choose common components, prioritize security, and embrace FinOps.
- Batch vs. Streaming: Batch for periodic processing; streaming for real-time analytics.
- Compliance: Build systems that adhere to regulations like GDPR and HIPAA.
- Flexibility: Design systems that can adapt to changing business needs and regulations.

Source: DeepLearning.AI data engineering course.