Overview

  • What is Data Ingestion?

    • The process of collecting raw data from various sources.
    • Sources include databases, APIs, files, and streaming systems.
    • Data is ingested into a target system for storage, processing, and analysis.
  • Batch vs. Streaming Ingestion:

    • Batch Ingestion: Processing data in chunks or batches (e.g., hourly, daily, weekly).
    • Streaming Ingestion: Processing data as a continuous stream (e.g., real-time).

Batch vs. Streaming Ingestion

  • Batch Ingestion:

    • Processes data in bounded chunks (e.g., by time, size, or number of records).
    • Examples:
      • Ingesting all sales data from the past week.
      • Ingesting data every 10 GB or every 1000 records.
    • Micro-batch: A hybrid approach where batches are processed more frequently (e.g., every minute).
  • Streaming Ingestion:

    • Processes data as a continuous, unbounded stream.
    • Examples:
      • Real-time stock price updates.
      • User clicks on a website.
  • Continuum Between Batch and Streaming:

    • Batch and streaming exist on a spectrum.
    • High-frequency batch ingestion (e.g., every second) can approach streaming.

Ingestion Patterns

  • ETL (Extract, Transform, Load):

    • Traditional batch ingestion pattern.
    • Extract data → Transform in a staging area → Load into a target system (e.g., data warehouse).
    • Best for structured data with clear transformation rules.
  • ELT (Extract, Load, Transform):

    • Extract data → Load into a target system → Transform within the target system.
    • Best for exploratory analysis or when transformation requirements are unclear.
    • Risk of creating a data swamp (unorganized, unmanageable data); a sketch contrasting the two patterns follows this list.
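
The difference is easiest to see in code. Below is a minimal sketch contrasting the two patterns, using Python's built-in sqlite3 as a stand-in for the target warehouse; the table layout and source rows are hypothetical, and json_extract requires SQLite's JSON1 extension (included in most modern builds):

```python
import json
import sqlite3

# Hypothetical rows extracted from an operational source system.
raw_orders = [
    {"order_id": 1, "amount": "19.99", "country": "us"},
    {"order_id": 2, "amount": "5.00", "country": "DE"},
]

warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse

# ETL: transform in a staging step, then load only the cleaned rows.
cleaned = [(o["order_id"], float(o["amount"]), o["country"].upper()) for o in raw_orders]
warehouse.execute("CREATE TABLE orders_etl (order_id INT, amount REAL, country TEXT)")
warehouse.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)

# ELT: load the raw payloads first, then transform inside the target with SQL.
warehouse.execute("CREATE TABLE orders_raw (payload TEXT)")
warehouse.executemany(
    "INSERT INTO orders_raw VALUES (?)", [(json.dumps(o),) for o in raw_orders]
)
warehouse.execute(
    """CREATE TABLE orders_elt AS
       SELECT json_extract(payload, '$.order_id') AS order_id,
              CAST(json_extract(payload, '$.amount') AS REAL) AS amount,
              UPPER(json_extract(payload, '$.country')) AS country
       FROM orders_raw"""
)
```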

Ingestion from Different Source Systems

  • Databases:

    • Use connectors like JDBC or ODBC.
    • Set up ingestion at regular intervals or when new data is recorded.
    • Tools: AWS Glue ETL.
  • APIs:

    • Set up connections based on API-specific protocols.
    • Constraints: Rate limits, data size limits.
    • Tools: API client libraries, managed data platforms.
  • Files:

    • Use object storage systems (e.g., S3); a minimal upload sketch follows this list.
    • Manual file transfer: SFTP, SCP.
  • Streaming Systems:

    • Use message queues or event streaming platforms (e.g., Kafka, Kinesis).
    • Example: Ingesting IoT device or sensor data.
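
For file-based sources, the usual first step is landing files in object storage. A minimal boto3 sketch, assuming AWS credentials are configured and using a hypothetical bucket and key layout:

```python
import boto3

s3 = boto3.client("s3")

# Land a file produced by the source system in the raw zone of the data lake.
s3.upload_file(
    Filename="exports/sales_2024-01-01.csv",  # local file from the source system
    Bucket="my-ingestion-landing-zone",       # assumed landing-zone bucket
    Key="raw/sales/2024-01-01.csv",           # raw/ prefix keeps untransformed data separate
)
```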

Case Study: Batch Ingestion from a REST API

  • Scenario:

    • A marketing analyst wants to analyze Spotify data (music trends) alongside product sales.
    • Data will be ingested from the Spotify API.
  • Key Requirements:

    • Ingest data from a third-party API.
    • Store and serve data for analysis.
    • Use ELT for flexibility in transformation (sketched below).
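
A minimal sketch of the batch ELT flow, using Spotify's client-credentials token flow and new-releases endpoint (per the Spotify Web API docs) and a hypothetical S3 landing bucket. A production job would paginate, back off on HTTP 429 rate-limit responses, and partition keys by ingestion date:

```python
import json

import boto3
import requests

CLIENT_ID = "..."      # Spotify app credentials (placeholders)
CLIENT_SECRET = "..."

# Client-credentials flow: exchange app credentials for a bearer token.
token = requests.post(
    "https://accounts.spotify.com/api/token",
    data={"grant_type": "client_credentials"},
    auth=(CLIENT_ID, CLIENT_SECRET),
).json()["access_token"]

# Extract one page of new releases from the API.
resp = requests.get(
    "https://api.spotify.com/v1/browse/new-releases",
    headers={"Authorization": f"Bearer {token}"},
    params={"limit": 50},
)
resp.raise_for_status()

# ELT: land the raw JSON unmodified; transformation happens later in the warehouse.
boto3.client("s3").put_object(
    Bucket="my-ingestion-landing-zone",   # assumed landing bucket
    Key="raw/spotify/new_releases.json",
    Body=json.dumps(resp.json()).encode("utf-8"),
)
```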

Case Study: Streaming Ingestion from Web Server Logs

  • Scenario:

    • Ingest real-time user activity data from a website for a product recommender system.
    • Data will be ingested from a Kinesis Data Stream (a producer sketch follows this list).
  • Key Requirements:

    • Separate user activity data from system metrics.
    • Ingest data in JSON format.
    • Expected message rate: ~1000 events/second (~1 MB/s).
    • Retain messages in the stream for 1 day.
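
A minimal producer sketch with boto3; the stream name and event shape are assumptions. Kinesis' default retention is 24 hours, which matches the 1-day requirement, and the PartitionKey determines which shard receives each record, so keying on user_id keeps a given user's events ordered within one shard:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

event = {"event_type": "click", "user_id": "u-123", "page": "/products/42"}

kinesis.put_record(
    StreamName="web-activity",               # assumed stream name
    Data=json.dumps(event).encode("utf-8"),  # JSON payload, ~1 KB per event
    PartitionKey=event["user_id"],           # routes all of a user's events to one shard
)
```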

Streaming Ingestion Details

  • Message Queues vs. Event Streaming Platforms:

    • Message Queues:
      • FIFO (First In, First Out) delivery.
      • Messages are typically deleted once they are consumed and acknowledged.
    • Event Streaming Platforms:
      • Messages are stored in an append-only log.
      • Messages can be replayed or reprocessed.
      • Examples: Kafka, Kinesis.
  • Kafka:

    • Topics: Categories for related events.
    • Partitions: Subsets of messages within a topic.
    • Consumer Groups: Sets of consumers that divide a topic's partitions among themselves, so each partition is read by one group member (see the consumer sketch after this list).
  • Kinesis:

    • Streams: Equivalent to Kafka topics.
    • Shards: Equivalent to Kafka partitions.
    • Capacity:
      • Each shard supports up to 1 MB/s (or 1,000 records/s) for writes and 2 MB/s for reads.
      • On-demand mode: Automatically scales shards based on traffic.
      • Provisioned mode: Manually set the number of shards (sizing sketched after this list).
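
To make the Kafka terms concrete, here is a minimal consumer sketch with the kafka-python package; the topic name, broker address, and group id are assumptions. Within a consumer group, each partition is assigned to exactly one member, so adding consumers (up to the partition count) spreads the read load:

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "user-activity",                     # topic: category of related events
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="recommender",              # consumer group id
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.value)  # partition id and decoded event
```

And a sizing sketch for the case study's Kinesis stream: at ~1 MB/s of writes, one shard is already at its write limit, so provisioning a second leaves headroom (stream name assumed):

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: ~1 MB/s of writes saturates a single shard, so use two for headroom.
kinesis.create_stream(StreamName="web-activity", ShardCount=2)

# On-demand mode instead: capacity scales automatically with traffic.
# kinesis.create_stream(
#     StreamName="web-activity",
#     StreamModeDetails={"StreamMode": "ON_DEMAND"},
# )
```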

Key Takeaways

  • Batch Ingestion:

    • Processes data in chunks.
    • Best for historical analysis or large datasets.
    • Patterns: ETL, ELT.
  • Streaming Ingestion:

    • Processes data in real time.
    • Best for real-time analytics or event-driven systems.
    • Tools: Kafka, Kinesis.
  • Source Systems:

    • Databases, APIs, files, and streaming systems each have unique ingestion considerations.
    • Choose the right ingestion pattern based on the source system and business use case.

Change Data Capture (CDC)

What is CDC?

  • Definition: CDC is a method for capturing and tracking changes (inserts, updates, deletes) in a database and making these changes available for downstream systems.
  • Purpose: Ensures data synchronization between source systems (e.g., databases) and target systems (e.g., data warehouses, analytics platforms).

Strategies for Updating Data

  1. Full Load (Full Snapshot):

    • Process: Replace the entire dataset in the target system with a fresh copy from the source system.
    • Pros:
      • Simple to implement.
      • Ensures consistency between source and target systems.
    • Cons:
      • Resource-intensive for large datasets.
      • Not suitable for frequent updates.
    • Use Case: Best for small datasets or when frequent updates are not required.
  2. Incremental (Differential/Delta) Load:

    • Process: Only load changes (inserts, updates, deletes) since the last data ingestion.
    • Pros:
      • Efficient for high-volume data.
      • Reduces processing and memory requirements.
    • Cons:
      • More complex to implement.
      • Requires tracking changes (e.g., using a last_updated_at column; see the sketch after this list).
    • Use Case: Ideal for large datasets or when frequent updates are needed.
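
A minimal incremental-load sketch keyed on last_updated_at, using sqlite3 for both sides and assuming a customers table (with id as its primary key) already exists in source and target; the highest timestamp already loaded acts as the watermark:

```python
import sqlite3

source = sqlite3.connect("source.db")     # stands in for the operational database
target = sqlite3.connect("warehouse.db")  # stands in for the warehouse

# Watermark: the most recent last_updated_at already present in the target.
watermark = target.execute(
    "SELECT MAX(last_updated_at) FROM customers"
).fetchone()[0] or "1970-01-01T00:00:00Z"

# Pull only rows changed since the last run.
changed = source.execute(
    "SELECT id, name, email, last_updated_at FROM customers WHERE last_updated_at > ?",
    (watermark,),
).fetchall()

# Upsert so updates overwrite previous versions of each row (SQLite 3.24+).
target.executemany(
    """INSERT INTO customers (id, name, email, last_updated_at)
       VALUES (?, ?, ?, ?)
       ON CONFLICT(id) DO UPDATE SET
         name = excluded.name,
         email = excluded.email,
         last_updated_at = excluded.last_updated_at""",
    changed,
)
target.commit()
```

Note that hard deletes never match the WHERE clause, so this pattern misses them; log-based CDC (below) captures deletes as well.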

Use Cases for CDC

  1. Database Synchronization:
    • Example: Synchronize data between a source PostgreSQL database and a cloud-based data warehouse for analytics.
  2. Auditing and Compliance:
    • Example: Track historical changes in customer purchase data for regulatory purposes.
  3. Microservices Integration:
    • Example: Relay order updates from a purchase order database to shipment and customer service systems.

Approaches to CDC

  1. Push-Based CDC:

    • Process: Source system pushes changes to the target system in real time.
    • Pros: Near real-time updates.
    • Cons: Risk of data loss if the target system is unavailable.
    • Example: Triggers in databases that notify downstream systems of changes (see the trigger sketch after this list).
  2. Pull-Based CDC:

    • Process: Target system periodically polls the source system for changes.
    • Pros: More control over when updates are pulled.
    • Cons: Introduces latency between changes and updates.
    • Example: Querying a database for rows updated since the last pull.
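
A minimal push-based sketch using a PostgreSQL trigger and pg_notify, assuming an orders table and PostgreSQL 11+ (the connection string is a placeholder):

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=orders_db user=postgres")  # assumed connection details
with conn, conn.cursor() as cur:
    cur.execute("""
        -- Trigger function: publish each changed row as JSON on a channel.
        CREATE OR REPLACE FUNCTION notify_order_change() RETURNS trigger AS $$
        BEGIN
            PERFORM pg_notify('order_changes', row_to_json(NEW)::text);
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        -- Fire after every insert or update on the orders table.
        CREATE TRIGGER orders_cdc
        AFTER INSERT OR UPDATE ON orders
        FOR EACH ROW EXECUTE FUNCTION notify_order_change();
    """)
```

A downstream consumer subscribes with LISTEN order_changes. Notifications reach only currently connected listeners, which is exactly the data-loss risk noted above, and pg_notify payloads are capped (8000 bytes by default), so wide rows may need to send keys only.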

CDC Implementation Patterns

  1. Batch-Oriented or Query-Based CDC (Pull-Based):

    • Process: Query the database to identify changes using a last_updated_at column.
    • Pros: Simple to implement.
    • Cons:
      • Adds computational overhead to the source system.
      • Requires scanning rows to identify changes.
    • Use Case: Suitable for systems where real-time updates are not critical.
  2. Continuous or Log-Based CDC (Pull-Based):

    • Process: Read database logs (e.g., transaction logs) to capture changes in near real time.
    • Pros:
      • Near real-time updates without adding query load to the source system.
      • No need for additional columns in the source database.
    • Cons:
      • Requires access to database logs.
      • More complex to implement.
    • Example: Using tools like Debezium to stream changes to Apache Kafka (a registration sketch follows this list).
  3. Trigger-Based CDC (Push-Based):

    • Process: Use database triggers to notify downstream systems of changes.
    • Pros: Real-time updates.
    • Cons:
      • Can negatively impact database write performance.
      • Requires careful management of triggers.
    • Example: Configuring triggers to send change notifications to a streaming platform.
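
A minimal registration sketch for a Debezium PostgreSQL connector via the Kafka Connect REST API. The config keys follow Debezium's documentation but vary across versions (topic.prefix replaced database.server.name in Debezium 2.x), and every connection value here is an assumption:

```python
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.internal",   # assumed source database host
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "secret",        # use a secrets manager in practice
        "database.dbname": "orders_db",
        "topic.prefix": "orders",             # topics are named <prefix>.<schema>.<table>
        "table.include.list": "public.orders",
    },
}

# Kafka Connect exposes a REST API; registering the connector starts log-based CDC.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```

Once registered, Debezium tails the PostgreSQL write-ahead log and publishes each change to Kafka, so downstream consumers can replay changes without adding query load to the source database.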

Key Takeaways

  • CDC is essential for maintaining data consistency between source and target systems.
  • Full Snapshots are simple but resource-intensive, while Incremental Loads are efficient but more complex.
  • Push-Based CDC offers real-time updates but risks data loss, while Pull-Based CDC introduces latency but provides more control.
  • Implementation Patterns:
    • Query-Based CDC: Simple but adds overhead.
    • Log-Based CDC: Real-time and efficient but complex.
    • Trigger-Based CDC: Real-time but can impact performance.

Source: DeepLearning.AI Source Systems course.