Data Pipelines
Data Ingestion
Overview
-
What is Data Ingestion?
- The process of collecting raw data from various sources.
- Sources include databases, APIs, files, and streaming systems.
- Data is ingested into a target system for storage, processing, and analysis.
-
Batch vs. Streaming Ingestion:
- Batch Ingestion: Processing data in chunks or batches (e.g., hourly, daily, weekly).
- Streaming Ingestion: Processing data as a continuous stream (e.g., real-time).
Batch vs. Streaming Ingestion
-
Batch Ingestion:
- Processes data in bounded chunks (e.g., by time, size, or number of records).
- Examples:
- Ingesting all sales data from the past week.
- Ingesting data every 10 GB or every 1000 records.
- Micro-batch: A hybrid approach where batches are processed more frequently (e.g., every minute).
-
Streaming Ingestion:
- Processes data as a continuous, unbounded stream.
- Examples:
- Real-time stock price updates.
- User clicks on a website.
-
Continuum Between Batch and Streaming:
- Batch and streaming exist on a spectrum.
- High-frequency batch ingestion (e.g., every second) can approach streaming.
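A minimal sketch of that continuum, assuming a hypothetical event generator: the same unbounded source can be consumed one record at a time (streaming) or grouped into bounded micro-batches by a time window.

```python
import time
from datetime import datetime, timezone

def read_events():
    """Stand-in for an unbounded event source (e.g., clicks or sensor readings)."""
    while True:
        yield {"event_time": datetime.now(timezone.utc).isoformat(), "value": 42}
        time.sleep(0.1)

def micro_batches(events, window_seconds=60):
    """Group a continuous stream into bounded micro-batches by wall-clock window."""
    batch, window_end = [], time.monotonic() + window_seconds
    for event in events:
        batch.append(event)
        if time.monotonic() >= window_end:
            yield batch  # hand one bounded chunk downstream
            batch, window_end = [], time.monotonic() + window_seconds

# Shrinking window_seconds pushes this toward streaming; growing it toward classic batch.
for batch in micro_batches(read_events(), window_seconds=60):
    print(f"ingesting micro-batch of {len(batch)} events")
```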
Ingestion Patterns
-
ETL (Extract, Transform, Load):
- Traditional batch ingestion pattern.
- Extract data → Transform in a staging area → Load into a target system (e.g., data warehouse).
- Best for structured data with clear transformation rules.
-
ELT (Extract, Load, Transform):
- Extract data → Load into a target system → Transform within the target system.
- Best for exploratory analysis or when transformation requirements are unclear.
- Risk of creating a data swamp (unorganized, unmanageable data).
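A compressed sketch of the two patterns, assuming hypothetical extract and load helpers; the only structural difference is where the transform step runs.

```python
import pandas as pd

def extract() -> pd.DataFrame:
    """Pull raw records from the source system (hypothetical CSV export)."""
    return pd.read_csv("sales_export.csv")

# ETL: transform in a staging area, then load curated data into the warehouse.
def run_etl(load_to_warehouse):
    raw = extract()
    curated = raw.dropna(subset=["order_id"]).assign(amount=lambda d: d["amount"].round(2))
    load_to_warehouse(curated, table="sales_curated")

# ELT: load the raw data first, then transform inside the target system (e.g., with SQL).
def run_elt(load_to_warehouse, run_sql):
    raw = extract()
    load_to_warehouse(raw, table="sales_raw")
    run_sql("CREATE TABLE sales_curated AS "
            "SELECT * FROM sales_raw WHERE order_id IS NOT NULL")
```

Here load_to_warehouse and run_sql are injected stand-ins for whatever warehouse client is in use.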
Ingestion from Different Source Systems
-
Databases:
- Use connectors like JDBC or ODBC.
- Set up ingestion at regular intervals or when new data is recorded.
- Tools: AWS Glue ETL.
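A minimal sketch of interval-based pulls from a relational source over an ODBC connector (pyodbc); the connection string, table, and column names are placeholders.

```python
import pyodbc

# Placeholder connection string for this sketch; real credentials come from a secrets store.
CONN_STR = "DRIVER={PostgreSQL Unicode};SERVER=db.internal;DATABASE=shop;UID=ingest;PWD=***"

def ingest_new_orders(since_timestamp):
    """Pull rows recorded since the last run; schedule this at a regular interval."""
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT order_id, amount, created_at FROM orders WHERE created_at > ?",
            since_timestamp,
        )
        return cursor.fetchall()
```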
-
APIs:
- Set up connections based on API-specific protocols.
- Constraints: Rate limits, data size limits.
- Tools: API client libraries, managed data platforms.
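A sketch of paging through a REST API while respecting its rate limit. The response shape (an "items" list), limit/offset paging, and the Retry-After header are assumptions; real APIs vary.

```python
import time

import requests

def fetch_all_pages(base_url, headers, page_size=100):
    """Page through a REST API, backing off when the rate limit is hit (HTTP 429)."""
    records, offset = [], 0
    while True:
        resp = requests.get(base_url, headers=headers,
                            params={"limit": page_size, "offset": offset})
        if resp.status_code == 429:  # rate limit reached; wait and retry
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        resp.raise_for_status()
        page = resp.json().get("items", [])
        if not page:
            return records
        records.extend(page)
        offset += page_size
```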
-
Files:
- Use object storage systems (e.g., S3).
- Manual file transfer: SFTP, SCP.
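For file drops into object storage, a one-line boto3 upload is often enough; bucket and key names below are placeholders, and date-based prefixes keep raw landings organized.

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file("exports/sales_2024-06-01.csv",
               "example-raw-data-bucket",
               "sales/2024/06/01/sales.csv")
```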
-
Streaming Systems:
- Use message queues or event streaming platforms (e.g., Kafka, Kinesis).
- Example: Ingesting IoT device or sensor data.
Case Study: Batch Ingestion from a REST API
-
Scenario:
- Marketing analyst wants to analyze Spotify data (music trends) alongside product sales.
- Data will be ingested from the Spotify API.
-
Key Requirements:
- Ingest data from a third-party API.
- Store and serve data for analysis.
- Use ELT for flexibility in transformation.
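A sketch of the ELT-style batch pull for this case study: land the raw API response unchanged in object storage and transform later in the warehouse. The endpoint, response handling, and bucket layout are illustrative assumptions, not the course's exact implementation.

```python
import json
from datetime import date

import boto3
import requests

def ingest_spotify_batch(access_token, playlist_id, bucket):
    """Batch-pull raw data from a third-party API and land it unchanged (ELT)."""
    resp = requests.get(
        f"https://api.spotify.com/v1/playlists/{playlist_id}/tracks",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    resp.raise_for_status()
    key = f"spotify/raw/{date.today().isoformat()}/{playlist_id}.json"
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=json.dumps(resp.json()).encode("utf-8"))
    return key
```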
Case Study: Streaming Ingestion from Web Server Logs
-
Scenario:
- Ingest real-time user activity data from a website for a product recommender system.
- Data will be ingested from a Kinesis Data Stream.
-
Key Requirements:
- Separate user activity data from system metrics.
- Ingest data in JSON format.
- Expected message rate: ~1000 events/second (~1 MB/s).
- Retain messages in the stream for 1 day.
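A sketch of the producing side of this case study with boto3; the stream name and event fields are placeholders. At ~1000 events/s (~1 MB/s) a single shard sits at its 1 MB/s write limit, so a production setup would batch writes (put_records) and spread load across shards via the partition key.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

def publish_user_activity(event: dict):
    """Write one JSON-encoded user-activity event to the stream."""
    kinesis.put_record(
        StreamName="user-activity-stream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event.get("user_id", "anonymous"),  # routes the event to a shard
    )

publish_user_activity({"user_id": "u-123", "page": "/product/42", "action": "click"})
```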
Streaming Ingestion Details
-
Message Queues vs. Event Streaming Platforms:
- Message Queues:
- FIFO (First In, First Out).
- Messages are deleted after consumption.
- Event Streaming Platforms:
- Messages are stored in an append-only log.
- Messages can be replayed or reprocessed.
- Examples: Kafka, Kinesis.
-
Kafka:
- Topics: Categories for related events.
- Partitions: Subsets of messages within a topic.
- Consumer Groups: Groups of consumers that read from partitions.
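A small consumer sketch using the kafka-python client; topic, group, and broker address are placeholders. Consumers that share a group_id split the topic's partitions between them.

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-activity",                       # topic: category for related events
    bootstrap_servers="localhost:9092",
    group_id="recommender-service",        # consumers in this group share partitions
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    print(f"partition={message.partition} offset={message.offset} event={message.value}")
```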
-
Kinesis:
- Streams: Equivalent to Kafka topics.
- Shards: Equivalent to Kafka partitions.
- Capacity:
- Each shard supports up to 1 MB/s write and 2 MB/s read.
- On-demand mode: Automatically scales shards based on traffic.
- Provisioned mode: Manually set the number of shards.
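A sketch of creating a stream in each capacity mode with boto3; stream names are placeholders.

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: shard count is set manually.
# 2 shards -> up to 2 MB/s write and 4 MB/s read in total.
kinesis.create_stream(
    StreamName="user-activity-stream",
    ShardCount=2,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# On-demand mode: no shard count; capacity scales automatically with traffic.
kinesis.create_stream(
    StreamName="system-metrics-stream",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```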
Key Takeaways
-
Batch Ingestion:
- Processes data in chunks.
- Best for historical analysis or large datasets.
- Patterns: ETL, ELT.
-
Streaming Ingestion:
- Processes data in real-time.
- Best for real-time analytics or event-driven systems.
- Tools: Kafka, Kinesis.
-
Source Systems:
- Databases, APIs, files, and streaming systems each have unique ingestion considerations.
- Choose the right ingestion pattern based on the source system and business use case.
Change Data Capture (CDC)
What is CDC?
- Definition: CDC is a method for capturing and tracking changes (inserts, updates, deletes) in a database and making these changes available for downstream systems.
- Purpose: Ensures data synchronization between source systems (e.g., databases) and target systems (e.g., data warehouses, analytics platforms).
Strategies for Updating Data
-
Full Load (Full Snapshot):
- Process: Replace the entire dataset in the target system with a fresh copy from the source system.
- Pros:
- Simple to implement.
- Ensures consistency between source and target systems.
- Cons:
- Resource-intensive for large datasets.
- Not suitable for frequent updates.
- Use Case: Best for small datasets or when frequent updates are not required.
-
Incremental (Differential/Delta) Load:
- Process: Only load changes (inserts, updates, deletes) since the last data ingestion.
- Pros:
- Efficient for high-volume data.
- Reduces processing and memory requirements.
- Cons:
- More complex to implement.
- Requires tracking changes (e.g., using a last_updated_at column).
- Use Case: Ideal for large datasets or when frequent updates are needed.
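A minimal incremental-load sketch, assuming a last_updated_at column on the source table; connection strings and table names are placeholders.

```python
from datetime import datetime, timezone

import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string for this sketch.
source = create_engine("postgresql+psycopg2://ingest:***@source-db/shop")

def incremental_load(last_run: datetime) -> pd.DataFrame:
    """Pull only rows changed since the previous ingestion run."""
    query = text("SELECT * FROM customers WHERE last_updated_at > :since")
    with source.connect() as conn:
        changed = pd.read_sql(query, conn, params={"since": last_run})
    return changed  # upsert these rows into the target table downstream

changes = incremental_load(datetime(2024, 6, 1, tzinfo=timezone.utc))
```

Note that queries against an update timestamp miss hard deletes, which is one reason log-based CDC (covered below) exists.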
Use Cases for CDC
- Database Synchronization:
- Example: Synchronize data between a source PostgreSQL database and a cloud-based data warehouse for analytics.
- Auditing and Compliance:
- Example: Track historical changes in customer purchase data for regulatory purposes.
- Microservices Integration:
- Example: Relay order updates from a purchase order database to shipment and customer service systems.
Approaches to CDC
-
Push-Based CDC:
- Process: Source system pushes changes to the target system in real-time.
- Pros: Near real-time updates.
- Cons: Risk of data loss if the target system is unavailable.
- Example: Triggers in databases that notify downstream systems of changes.
-
Pull-Based CDC:
- Process: Target system periodically polls the source system for changes.
- Pros: More control over when updates are pulled.
- Cons: Introduces latency between changes and updates.
- Example: Querying a database for rows updated since the last pull.
CDC Implementation Patterns
-
Batch-Oriented or Query-Based CDC (Pull-Based):
- Process: Query the database to identify changes using a last_updated_at column.
- Pros: Simple to implement.
- Cons:
- Adds computational overhead to the source system.
- Requires scanning rows to identify changes.
- Use Case: Suitable for systems where real-time updates are not critical.
-
Continuous or Log-Based CDC (Pull-Based):
- Process: Read database logs (e.g., transaction logs) to capture changes in real-time.
- Pros:
- Real-time updates without computational overhead.
- No need for additional columns in the source database.
- Cons:
- Requires access to database logs.
- More complex to implement.
- Example: Using tools like Debezium to stream changes to Apache Kafka.
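A sketch of registering a Debezium Postgres connector through the Kafka Connect REST API, so changes read from the transaction log stream into Kafka. Host names, credentials, and some config keys vary by Debezium version; treat the exact keys as illustrative.

```python
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "source-db",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "secret",
        "database.dbname": "shop",
        "topic.prefix": "shop",                  # Kafka topics become shop.public.orders, etc.
        "table.include.list": "public.orders",
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector)
resp.raise_for_status()  # changes now flow from the Postgres WAL into Kafka topics
```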
-
Trigger-Based CDC (Push-Based):
- Process: Use database triggers to notify downstream systems of changes.
- Pros: Real-time updates.
- Cons:
- Can negatively impact database write performance.
- Requires careful management of triggers.
- Example: Configuring triggers to send change notifications to a streaming platform.
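A trigger-based sketch for PostgreSQL: a trigger calls pg_notify on every insert or update, pushing the changed row to any listener. Connection details, channel, and table names are placeholders; the trigger runs inside each write transaction, which is the write-performance cost noted above.

```python
import psycopg2

DDL = """
CREATE OR REPLACE FUNCTION notify_order_change() RETURNS trigger AS $$
BEGIN
    -- Push the changed row as JSON on the 'order_changes' channel.
    PERFORM pg_notify('order_changes', row_to_json(NEW)::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_cdc
AFTER INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION notify_order_change();
"""

with psycopg2.connect("dbname=shop user=cdc_user host=source-db") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```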
Key Takeaways
- CDC is essential for maintaining data consistency between source and target systems.
- Full Snapshots are simple but resource-intensive, while Incremental Loads are efficient but more complex.
- Push-Based CDC offers real-time updates but risks data loss, while Pull-Based CDC introduces latency but provides more control.
- Implementation Patterns:
- Query-Based CDC: Simple but adds overhead.
- Log-Based CDC: Real-time and efficient but complex.
- Trigger-Based CDC: Real-time but can impact performance.
Source: DeepLearning.ai source systems course.