Data Pipelines
Data Ingestion
Overview
-
What is Data Ingestion?
- The process of collecting raw data from various sources.
- Sources include databases, APIs, files, and streaming systems.
- Data is ingested into a target system for storage, processing, and analysis.
-
Batch vs. Streaming Ingestion:
- Batch Ingestion: Processing data in chunks or batches (e.g., hourly, daily, weekly).
- Streaming Ingestion: Processing data as a continuous stream (e.g., real-time).
Batch vs. Streaming Ingestion
-
Batch Ingestion:
- Processes data in bounded chunks (e.g., by time, size, or number of records).
- Examples:
- Ingesting all sales data from the past week.
- Ingesting data every 10 GB or every 1000 records.
- Micro-batch: A hybrid approach where batches are processed more frequently (e.g., every minute).
-
Streaming Ingestion:
- Processes data as a continuous, unbounded stream.
- Examples:
- Real-time stock price updates.
- User clicks on a website.
-
Continuum Between Batch and Streaming:
- Batch and streaming exist on a spectrum.
- High-frequency batch ingestion (e.g., every second) can approach streaming.
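A minimal sketch of that continuum, assuming a hypothetical event generator: the same unbounded source can be consumed one record at a time (streaming) or grouped into bounded micro-batches by a time window.

```python
import time
from datetime import datetime, timezone

def read_events():
    """Stand-in for an unbounded event source (e.g., clicks or sensor readings)."""
    while True:
        yield {"event_time": datetime.now(timezone.utc).isoformat(), "value": 42}
        time.sleep(0.1)

def micro_batches(events, window_seconds=60):
    """Group a continuous stream into bounded micro-batches by wall-clock window."""
    batch, window_end = [], time.monotonic() + window_seconds
    for event in events:
        batch.append(event)
        if time.monotonic() >= window_end:
            yield batch  # hand one bounded chunk downstream
            batch, window_end = [], time.monotonic() + window_seconds

# Shrinking window_seconds pushes this toward streaming; growing it toward classic batch.
for batch in micro_batches(read_events(), window_seconds=60):
    print(f"ingesting micro-batch of {len(batch)} events")
```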
Ingestion Patterns
-
ETL (Extract, Transform, Load):
- Traditional batch ingestion pattern.
- Extract data → Transform in a staging area → Load into a target system (e.g., data warehouse).
- Best for structured data with clear transformation rules.
-
ELT (Extract, Load, Transform):
- Extract data → Load into a target system → Transform within the target system.
- Best for exploratory analysis or when transformation requirements are unclear.
- Risk of creating a data swamp (unorganized, unmanageable data).
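A compressed sketch of the two patterns, assuming hypothetical extract and load helpers; the only structural difference is where the transform step runs.

```python
import pandas as pd

def extract() -> pd.DataFrame:
    """Pull raw records from the source system (hypothetical CSV export)."""
    return pd.read_csv("sales_export.csv")

# ETL: transform in a staging area, then load curated data into the warehouse.
def run_etl(load_to_warehouse):
    raw = extract()
    curated = raw.dropna(subset=["order_id"]).assign(amount=lambda d: d["amount"].round(2))
    load_to_warehouse(curated, table="sales_curated")

# ELT: load the raw data first, then transform inside the target system (e.g., with SQL).
def run_elt(load_to_warehouse, run_sql):
    raw = extract()
    load_to_warehouse(raw, table="sales_raw")
    run_sql("CREATE TABLE sales_curated AS "
            "SELECT * FROM sales_raw WHERE order_id IS NOT NULL")
```

Here load_to_warehouse and run_sql are injected stand-ins for whatever warehouse client is in use.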
Ingestion from Different Source Systems
-
Databases:
- Use connectors like JDBC or ODBC.
- Set up ingestion at regular intervals or when new data is recorded.
- Tools: AWS Glue ETL.
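A minimal sketch of interval-based pulls from a relational source over an ODBC connector (pyodbc); the connection string, table, and column names are placeholders.

```python
import pyodbc

# Placeholder connection string for this sketch; real credentials come from a secrets store.
CONN_STR = "DRIVER={PostgreSQL Unicode};SERVER=db.internal;DATABASE=shop;UID=ingest;PWD=***"

def ingest_new_orders(since_timestamp):
    """Pull rows recorded since the last run; schedule this at a regular interval."""
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT order_id, amount, created_at FROM orders WHERE created_at > ?",
            since_timestamp,
        )
        return cursor.fetchall()
```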
-
APIs:
- Set up connections based on API-specific protocols.
- Constraints: Rate limits, data size limits.
- Tools: API client libraries, managed data platforms.
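A sketch of paging through a REST API while respecting its rate limit. The response shape (an "items" list), limit/offset paging, and the Retry-After header are assumptions; real APIs vary.

```python
import time

import requests

def fetch_all_pages(base_url, headers, page_size=100):
    """Page through a REST API, backing off when the rate limit is hit (HTTP 429)."""
    records, offset = [], 0
    while True:
        resp = requests.get(base_url, headers=headers,
                            params={"limit": page_size, "offset": offset})
        if resp.status_code == 429:  # rate limit reached; wait and retry
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        resp.raise_for_status()
        page = resp.json().get("items", [])
        if not page:
            return records
        records.extend(page)
        offset += page_size
```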
-
Files:
- Use object storage systems (e.g., S3).
- Manual file transfer: SFTP, SCP.
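For file drops into object storage, a one-line boto3 upload is often enough; bucket and key names below are placeholders, and date-based prefixes keep raw landings organized.

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file("exports/sales_2024-06-01.csv",
               "example-raw-data-bucket",
               "sales/2024/06/01/sales.csv")
```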
-
Streaming Systems:
- Use message queues or event streaming platforms (e.g., Kafka, Kinesis).
- Example: Ingesting IoT device or sensor data.
Case Study: Batch Ingestion from a REST API
-
Scenario:
- Marketing analyst wants to analyze Spotify data (music trends) alongside product sales.
- Data will be ingested from the Spotify API.
-
Key Requirements:
- Ingest data from a third-party API.
- Store and serve data for analysis.
- Use ELT for flexibility in transformation.
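A sketch of the ELT-style batch pull for this case study: land the raw API response unchanged in object storage and transform later in the warehouse. The endpoint, response handling, and bucket layout are illustrative assumptions, not the course's exact implementation.

```python
import json
from datetime import date

import boto3
import requests

def ingest_spotify_batch(access_token, playlist_id, bucket):
    """Batch-pull raw data from a third-party API and land it unchanged (ELT)."""
    resp = requests.get(
        f"https://api.spotify.com/v1/playlists/{playlist_id}/tracks",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    resp.raise_for_status()
    key = f"spotify/raw/{date.today().isoformat()}/{playlist_id}.json"
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=json.dumps(resp.json()).encode("utf-8"))
    return key
```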
Case Study: Streaming Ingestion from Web Server Logs
-
Scenario:
- Ingest real-time user activity data from a website for a product recommender system.
- Data will be ingested from a Kinesis Data Stream.
-
Key Requirements:
- Separate user activity data from system metrics.
- Ingest data in JSON format.
- Expected message rate: ~1000 events/second (~1 MB/s).
- Retain messages in the stream for 1 day.
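A sketch of the producing side of this case study with boto3; the stream name and event fields are placeholders. At ~1000 events/s (~1 MB/s) a single shard sits at its 1 MB/s write limit, so a production setup would batch writes (put_records) and spread load across shards via the partition key.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

def publish_user_activity(event: dict):
    """Write one JSON-encoded user-activity event to the stream."""
    kinesis.put_record(
        StreamName="user-activity-stream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event.get("user_id", "anonymous"),  # routes the event to a shard
    )

publish_user_activity({"user_id": "u-123", "page": "/product/42", "action": "click"})
```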
Streaming Ingestion Details
-
Message Queues vs. Event Streaming Platforms:
- Message Queues:
- FIFO (First In, First Out).
- Messages are deleted after consumption.
- Event Streaming Platforms:
- Messages are stored in an append-only log.
- Messages can be replayed or reprocessed.
- Examples: Kafka, Kinesis.
-
Kafka:
- Topics: Categories for related events.
- Partitions: Subsets of messages within a topic.
- Consumer Groups: Groups of consumers that read from partitions.
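A small consumer sketch using the kafka-python client; topic, group, and broker address are placeholders. Consumers that share a group_id split the topic's partitions between them.

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-activity",                       # topic: category for related events
    bootstrap_servers="localhost:9092",
    group_id="recommender-service",        # consumers in this group share partitions
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    print(f"partition={message.partition} offset={message.offset} event={message.value}")
```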
-
Kinesis:
- Streams: Equivalent to Kafka topics.
- Shards: Equivalent to Kafka partitions.
- Capacity:
- Each shard supports up to 1 MB/s write and 2 MB/s read.
- On-demand mode: Automatically scales shards based on traffic.
- Provisioned mode: Manually set the number of shards.
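A sketch of creating a stream in each capacity mode with boto3; stream names are placeholders.

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: shard count is set manually.
# 2 shards -> up to 2 MB/s write and 4 MB/s read in total.
kinesis.create_stream(
    StreamName="user-activity-stream",
    ShardCount=2,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# On-demand mode: no shard count; capacity scales automatically with traffic.
kinesis.create_stream(
    StreamName="system-metrics-stream",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```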
Key Takeaways
-
Batch Ingestion:
- Processes data in chunks.
- Best for historical analysis or large datasets.
- Patterns: ETL, ELT.
-
Streaming Ingestion:
- Processes data in real-time.
- Best for real-time analytics or event-driven systems.
- Tools: Kafka, Kinesis.
-
Source Systems:
- Databases, APIs, files, and streaming systems each have unique ingestion considerations.
- Choose the right ingestion pattern based on the source system and business use case.
Change Data Capture (CDC)
What is CDC?
- Definition: CDC is a method for capturing and tracking changes (inserts, updates, deletes) in a database and making these changes available for downstream systems.
- Purpose: Ensures data synchronization between source systems (e.g., databases) and target systems (e.g., data warehouses, analytics platforms).
Strategies for Updating Data
-
Full Load (Full Snapshot):
- Process: Replace the entire dataset in the target system with a fresh copy from the source system.
- Pros:
- Simple to implement.
- Ensures consistency between source and target systems.
- Cons:
- Resource-intensive for large datasets.
- Not suitable for frequent updates.
- Use Case: Best for small datasets or when frequent updates are not required.
-
Incremental (Differential/Delta) Load:
- Process: Only load changes (inserts, updates, deletes) since the last data ingestion.
- Pros:
- Efficient for high-volume data.
- Reduces processing and memory requirements.
- Cons:
- More complex to implement.
- Requires tracking changes (e.g., using a last_updated_at column).
- Use Case: Ideal for large datasets or when frequent updates are needed.
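A minimal incremental-load sketch, assuming a last_updated_at column on the source table; connection strings and table names are placeholders.

```python
from datetime import datetime, timezone

import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string for this sketch.
source = create_engine("postgresql+psycopg2://ingest:***@source-db/shop")

def incremental_load(last_run: datetime) -> pd.DataFrame:
    """Pull only rows changed since the previous ingestion run."""
    query = text("SELECT * FROM customers WHERE last_updated_at > :since")
    with source.connect() as conn:
        changed = pd.read_sql(query, conn, params={"since": last_run})
    return changed  # upsert these rows into the target table downstream

changes = incremental_load(datetime(2024, 6, 1, tzinfo=timezone.utc))
```

Note that queries against an update timestamp miss hard deletes, which is one reason log-based CDC (covered below) exists.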
Use Cases for CDC
- Database Synchronization:
- Example: Synchronize data between a source PostgreSQL database and a cloud-based data warehouse for analytics.
- Auditing and Compliance:
- Example: Track historical changes in customer purchase data for regulatory purposes.
- Microservices Integration:
- Example: Relay order updates from a purchase order database to shipment and customer service systems.
Approaches to CDC
-
Push-Based CDC:
- Process: Source system pushes changes to the target system in real-time.
- Pros: Near real-time updates.
- Cons: Risk of data loss if the target system is unavailable.
- Example: Triggers in databases that notify downstream systems of changes.
-
Pull-Based CDC:
- Process: Target system periodically polls the source system for changes.
- Pros: More control over when updates are pulled.
- Cons: Introduces latency between changes and updates.
- Example: Querying a database for rows updated since the last pull.
CDC Implementation Patterns
-
Batch-Oriented or Query-Based CDC (Pull-Based):
- Process: Query the database to identify changes using a last_updated_at column.
- Pros: Simple to implement.
- Cons:
- Adds computational overhead to the source system.
- Requires scanning rows to identify changes.
- Use Case: Suitable for systems where real-time updates are not critical.
-
Continuous or Log-Based CDC (Pull-Based):
- Process: Read database logs (e.g., transaction logs) to capture changes in real-time.
- Pros:
- Real-time updates without computational overhead.
- No need for additional columns in the source database.
- Cons:
- Requires access to database logs.
- More complex to implement.
- Example: Using tools like Debezium to stream changes to Apache Kafka.
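A sketch of registering a Debezium Postgres connector through the Kafka Connect REST API, so changes read from the transaction log stream into Kafka. Host names, credentials, and some config keys vary by Debezium version; treat the exact keys as illustrative.

```python
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "source-db",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "secret",
        "database.dbname": "shop",
        "topic.prefix": "shop",                  # Kafka topics become shop.public.orders, etc.
        "table.include.list": "public.orders",
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector)
resp.raise_for_status()  # changes now flow from the Postgres WAL into Kafka topics
```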
-
Trigger-Based CDC (Push-Based):
- Process: Use database triggers to notify downstream systems of changes.
- Pros: Real-time updates.
- Cons:
- Can negatively impact database write performance.
- Requires careful management of triggers.
- Example: Configuring triggers to send change notifications to a streaming platform.
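A trigger-based sketch for PostgreSQL: a trigger calls pg_notify on every insert or update, pushing the changed row to any listener. Connection details, channel, and table names are placeholders; the trigger runs inside each write transaction, which is the write-performance cost noted above.

```python
import psycopg2

DDL = """
CREATE OR REPLACE FUNCTION notify_order_change() RETURNS trigger AS $$
BEGIN
    -- Push the changed row as JSON on the 'order_changes' channel.
    PERFORM pg_notify('order_changes', row_to_json(NEW)::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_cdc
AFTER INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION notify_order_change();
"""

with psycopg2.connect("dbname=shop user=cdc_user host=source-db") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```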
Key Takeaways
- CDC is essential for maintaining data consistency between source and target systems.
- Full Snapshots are simple but resource-intensive, while Incremental Loads are efficient but more complex.
- Push-Based CDC offers real-time updates but risks data loss, while Pull-Based CDC introduces latency but provides more control.
- Implementation Patterns:
- Query-Based CDC: Simple but adds overhead.
- Log-Based CDC: Real-time and efficient but complex.
- Trigger-Based CDC: Real-time but can impact performance.
Source: DeepLearning.ai source systems course.