Introduction to Source Systems

Key Concepts

  1. Data Engineering Lifecycle:
  • Stages: Data generation, ingestion, transformation, and serving, with storage underpinning every stage; monitoring runs alongside them as part of DataOps.
  • Importance of Ingestion: A commonly cited estimate is that roughly 80% of AI/ML work is data engineering, yet this work is often overlooked.
  1. Source Systems:
  • Types:
    • Databases: Relational (SQL) and non-relational (NoSQL).
    • Files: Structured (CSV), semi-structured (JSON), unstructured (text, images, videos).
    • Streaming Systems: Continuous data flow (e.g., IoT devices, logs).
  1. Data Types:
  • Structured: Tabular data (rows and columns).
  • Semi-structured: JSON, XML (key-value pairs, nested structures).
  • Unstructured: Text, images, audio, video.
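
The difference between structured and semi-structured data shows up as soon as you parse it. The sketch below uses only the Python standard library; the sample orders and the flatten_items helper are made up for illustration.

```python
import csv
import io
import json

# Structured: every row has the same columns, so it maps directly onto a table.
csv_text = "order_id,customer,amount\n1,alice,19.99\n2,bob,5.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: each record carries its own keys and may nest or omit fields.
json_text = (
    '[{"order_id": 1, "items": [{"sku": "A1", "qty": 2}]},'
    ' {"order_id": 2, "coupon": "WELCOME"}]'
)
records = json.loads(json_text)


def flatten_items(record):
    """Turn one nested order into zero or more flat (order_id, sku, qty) rows."""
    return [
        (record["order_id"], item["sku"], item["qty"])
        for item in record.get("items", [])  # a missing field is normal here
    ]


flat = [row for record in records for row in flatten_items(record)]

print(rows[0]["amount"])  # CSV values arrive as strings until you cast them
print(flat)
```

The second order has no "items" key at all, which a CSV reader could not express but a JSON consumer has to expect.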

Databases

  1. Relational Databases:
  • Structure: Tables with rows and columns, linked by keys.
  • CRUD Operations: Create, Read, Update, Delete.
  • Normalization: Minimizes redundancy by splitting data into related tables.
  • Primary Key: Uniquely identifies each row.
  • Foreign Key: Links tables by referencing a primary key.
  • SQL: Standard language for querying relational databases (e.g., SELECT, JOIN, WHERE).
  1. NoSQL Databases:
  • Types: Key-value, document, wide-column, graph.
  • Flexibility: No fixed schema, supports unstructured/semi-structured data.
  • Scalability: Horizontal scaling across multiple servers.
  • Eventual Consistency: Data may not be immediately consistent across nodes.
  • Examples: MongoDB (document store), DynamoDB (key-value store).
  1. ACID Principles:
  • Atomicity: Transactions are all-or-nothing.
  • Consistency: Every transaction leaves the database in a valid state that satisfies its constraints.
  • Isolation: Concurrent transactions don’t interfere.
  • Durability: Completed transactions are permanent.
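
Keys, joins, and atomicity can be seen together in a few lines. This is a minimal sketch using SQLite from the Python standard library; the customers and orders tables are hypothetical, and other relational databases differ in details such as how foreign keys are enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
""")

# Create: "with conn" commits on success and rolls back on an exception.
with conn:
    conn.execute("INSERT INTO customers VALUES (1, 'alice')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 19.99)")

# Atomicity: the second insert violates the foreign key, so the first
# insert in the same transaction is rolled back as well.
try:
    with conn:
        conn.execute("INSERT INTO orders VALUES (11, 1, 5.00)")
        conn.execute("INSERT INTO orders VALUES (12, 99, 7.50)")  # no customer 99
except sqlite3.IntegrityError:
    pass

# Read: JOIN follows the foreign key back to the customer row.
result = conn.execute("""
    SELECT c.name, o.order_id, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

print(result)  # only order 10 survives
```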

Files and Object Storage

  1. Files:
  • Types: CSV, JSON, text, images, videos.
  • Source Systems: File systems and file-sharing services (e.g., Google Drive), object storage (Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS)).
  1. Object Storage:
  • Structure: Flat, no hierarchy (despite folder-like UIs).
  • Key Features:
    • Unique identifier: Each object is addressed by a unique key (e.g., a UUID or a path-like name).
    • Metadata: Additional info (e.g., creation date, file type).
    • Immutable: Objects can't be edited in place; a change rewrites the whole object, creating a new version when versioning is enabled.
  • Use Cases: Data lakes, machine learning datasets.
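
To make the flat structure and immutability concrete, here is a toy in-memory model of a bucket. It is not how S3 or any real service is implemented; ToyBucket, StoredObject, and the keys are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StoredObject:
    data: bytes
    metadata: dict
    version: int
    created_at: datetime


class ToyBucket:
    """In-memory stand-in for a bucket: one flat mapping from key to versions."""

    def __init__(self):
        self._versions = {}  # key -> list of StoredObject, oldest first

    def put(self, key, data, **metadata):
        # No in-place edit: every put writes a complete new version.
        history = self._versions.setdefault(key, [])
        obj = StoredObject(data, metadata, len(history) + 1,
                           datetime.now(timezone.utc))
        history.append(obj)
        return obj

    def get(self, key):
        return self._versions[key][-1]  # latest version

    def list_keys(self, prefix=""):
        # "Folders" are nothing more than a shared key prefix.
        return sorted(k for k in self._versions if k.startswith(prefix))


bucket = ToyBucket()
bucket.put("raw/2024/orders.csv", b"id,amount\n1,19.99\n", content_type="text/csv")
bucket.put("raw/2024/orders.csv", b"id,amount\n1,19.99\n2,5.00\n", content_type="text/csv")
bucket.put("models/churn.pkl", b"\x00\x01", content_type="application/octet-stream")

print(bucket.list_keys("raw/"))
print(bucket.get("raw/2024/orders.csv").version)
```

Listing by the prefix "raw/" looks like opening a folder, but the bucket itself holds only keys.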

Streaming Systems

  1. Logs:
  • Definition: Append-only records of events (e.g., user activity, system errors).
  • Use Cases: Monitoring, debugging, anomaly detection.
  • Log Levels: Debug, Info, Warn, Error, Fatal.
  1. Streaming Data:
  • Components:
    • Event Producer: Generates messages (e.g., IoT devices, APIs).
    • Event Consumer: Processes messages (e.g., payment systems, inventory updates).
    • Event Router: Distributes messages (e.g., Apache Kafka, Amazon Kinesis).
  • Types:
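    • Message Queues: Hold each message only until a consumer processes it, then delete it (e.g., Amazon SQS).
    • Streaming Platforms: Keep messages in a persistent, append-only log that consumers can re-read (e.g., Apache Kafka).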
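
The queue-versus-log distinction is easiest to see side by side. The classes below are toy models, not the SQS or Kafka APIs; ToyQueue, ToyLog, and the consumer names are assumptions made for this sketch.

```python
from collections import deque


class ToyQueue:
    """Message queue: a message is gone once a consumer has received it."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class ToyLog:
    """Streaming log: messages are appended and kept; each consumer tracks an offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> index of the next record to read

    def append(self, message):
        self._records.append(message)

    def poll(self, consumer):
        offset = self._offsets.get(consumer, 0)
        batch = self._records[offset:]
        self._offsets[consumer] = len(self._records)
        return batch

    def seek(self, consumer, offset):
        self._offsets[consumer] = offset


queue, log = ToyQueue(), ToyLog()
for event in ["order_created", "order_paid"]:
    queue.send(event)  # producer side
    log.append(event)

first = queue.receive()            # removed from the queue once received
billing = log.poll("billing")      # reads both events
inventory = log.poll("inventory")  # reads both again: reading does not delete
log.seek("billing", 0)
replayed = log.poll("billing")     # replay from the start of the log
```

Two independent consumers and a replay are trivial on the log and impossible on the queue, which is the practical reason to pick one over the other.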

DataOps and Orchestration

  1. DataOps:
  • Focus: Automation, monitoring, and quality assurance in data pipelines.
  • Tools: Infrastructure as code, monitoring tools.
  1. Orchestration:
  • DAGs: Directed Acyclic Graphs for workflow management (e.g., Apache Airflow).
  • Tasks: The individual steps of a pipeline (e.g., extract, transform, data quality checks), run automatically in dependency order.
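
The core idea behind a DAG-based orchestrator is running each task only after its upstream tasks have finished. The sketch below shows that ordering with the standard library's graphlib (Python 3.9+); it is not Airflow code, and the task names and run_pipeline helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# task -> set of upstream tasks that must finish first
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "check_quality": {"join_tables"},
    "load_warehouse": {"check_quality"},
}


def run_pipeline(dag, run_task):
    """Run every task once, only after all of its upstream tasks have run."""
    completed = []
    # static_order() raises CycleError if the graph is not acyclic.
    for task in TopologicalSorter(dag).static_order():
        run_task(task)
        completed.append(task)
    return completed


order = run_pipeline(dag, lambda task: print(f"running {task}"))
```

The relative order of the two extract tasks is not fixed, since neither depends on the other; a real orchestrator could run them in parallel.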

Connecting to Source Systems

  1. Methods of Connection:
  • AWS Management Console: Manual, not repeatable or traceable.
  • Command Line Interface (CLI): More programmatic but still manual.
  • SDKs (e.g., Boto3): Programmatic and repeatable, ideal for automation.
  • API Connectors (e.g., JDBC, ODBC): Used to connect applications to databases.
  1. Key Components for Connection:
  • Endpoint: The address of the resource (e.g., database endpoint).
  • Port: The network port the service listens on (e.g., 5432 for PostgreSQL, 3306 for MySQL).
  • Credentials: Username, password, or access keys for authentication.
  1. Best Practices:
  • Avoid manual processes for repeatability and traceability.
  • Use SDKs or APIs for automation and scalability.
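
The three connection components (endpoint, port, credentials) can be gathered programmatically rather than typed by hand. This is a minimal sketch assuming a PostgreSQL-style connection URL; the environment variable names (DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, DB_NAME) and the example values are hypothetical.

```python
import os
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class ConnectionSettings:
    host: str      # the endpoint
    port: int
    user: str
    password: str
    database: str

    def url(self, scheme="postgresql"):
        # Percent-encode credentials so characters like '@' or '/' can't break the URL.
        user = quote(self.user, safe="")
        password = quote(self.password, safe="")
        return f"{scheme}://{user}:{password}@{self.host}:{self.port}/{self.database}"


def settings_from_env(env=os.environ):
    """Read connection details from environment variables instead of hard-coding them."""
    return ConnectionSettings(
        host=env["DB_HOST"],
        port=int(env.get("DB_PORT", "5432")),  # PostgreSQL's default port
        user=env["DB_USER"],
        password=env["DB_PASSWORD"],
        database=env.get("DB_NAME", "postgres"),
    )


# A plain dict stands in for the real environment in this example.
example = settings_from_env({
    "DB_HOST": "db.example.internal",
    "DB_USER": "etl_reader",
    "DB_PASSWORD": "p@ss/word",
})
print(example.url())
```

Keeping credentials out of the source file also addresses the repeatability point above: the same script runs unchanged against different environments.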

Identity and Access Management

  1. Identity and Access Management (IAM) Basics:
  • Purpose: Manage permissions for accessing cloud resources.
  • Principle of Least Privilege: Grant only the permissions needed, and only for as long as they are needed.
  1. IAM Components:
  • Root User: The account creator with full access.
  • IAM Users: Identities for individual people or applications, each with its own long-term credentials and permissions.
  • IAM Groups: Collections of users with shared permissions.
  • IAM Roles: Temporary permissions for users, applications, or services.
  • Policies: JSON documents defining permissions for resources.
  1. Common IAM Issues:
  • Misconfigured permissions.
  • Expired temporary credentials.
  • Storing credentials insecurely (e.g., public GitHub repositories).
  1. Best Practices:
  • Use roles for temporary access instead of long-term credentials.
  • Regularly review and update IAM policies.
  • Avoid granting excessive permissions.
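
A policy is just a JSON document, so least privilege can be checked mechanically. The sketch below builds a read-only S3 policy and flags statements that allow everything; the bucket name example-data-lake and the overly_broad_statements helper are hypothetical, and the check covers only the simplest case (a bare "*").

```python
import json

# Hypothetical bucket and prefix; Version/Statement/Effect/Action/Resource
# are the standard elements of an IAM JSON policy.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::example-data-lake/raw/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::example-data-lake"],
        },
    ],
}

admin_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}


def overly_broad_statements(policy):
    """Return Allow statements that use a bare '*' for Action or Resource."""
    flagged = []
    for statement in policy["Statement"]:
        actions = statement["Action"]
        resources = statement["Resource"]
        # These fields may be a single string or a list of strings.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if statement["Effect"] == "Allow" and ("*" in actions or "*" in resources):
            flagged.append(statement)
    return flagged


print(json.dumps(read_only_policy, indent=2))
print(len(overly_broad_statements(read_only_policy)))
print(len(overly_broad_statements(admin_policy)))
```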

Networking in the Cloud

  1. Key Concepts:
  • Regions and Availability Zones (AZs):
    • Regions: Geographic areas with multiple AZs.
    • AZs: One or more isolated data centers within a region.
  • Virtual Private Cloud (VPC): A logically isolated private network within an AWS region.
  • Subnets: IP address ranges within a VPC, each in a single AZ, used to group resources.
  • Internet Gateway: Allows resources in public subnets to reach, and be reached from, the internet.
  • NAT Gateway: Allows private subnets to access the internet without exposing them to inbound traffic.
  1. Route Tables: Define how traffic is routed within a VPC.
  • Public Subnets: Route traffic to the internet via an internet gateway.
  • Private Subnets: Route traffic to the internet via a NAT gateway.
  1. Security Groups and Network ACLs:
  • Security Groups:
    • Act as virtual firewalls at the instance level.
    • Stateful: Allow return traffic automatically.
    • Commonly used for EC2 instances, RDS databases, and load balancers.
  • Network ACLs:
    • Provide subnet-level security.
    • Stateless: Require explicit inbound and outbound rules.
    • Useful for granular control over traffic.
  1. Common Networking Issues:
  • Misconfigured route tables.
  • Missing or incorrect security group rules.
  • Incorrect subnet associations.
  1. Best Practices:
  • Use public subnets for internet-facing resources (e.g., load balancers).
  • Use private subnets for internal resources (e.g., databases).
  • Regularly review and update security group and network ACL rules.
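
How a route table decides where traffic goes can be sketched with the standard ipaddress module: the most specific matching route wins. The CIDR ranges and the targets igw-example and nat-example below are made up, and the real service has more target types than this toy lookup covers.

```python
import ipaddress

# Hypothetical route tables: destination CIDR -> target
PUBLIC_ROUTES = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-example"}
PRIVATE_ROUTES = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-example"}
ISOLATED_ROUTES = {"10.0.0.0/16": "local"}  # no default route at all


def resolve_route(route_table, destination_ip):
    """Pick the most specific (longest-prefix) route that contains the destination."""
    address = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if address in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no matching route: the traffic goes nowhere
    _, target = max(matches, key=lambda match: match[0].prefixlen)
    return target


print(resolve_route(PUBLIC_ROUTES, "10.0.1.25"))       # inside the VPC
print(resolve_route(PUBLIC_ROUTES, "93.184.216.34"))   # internet, public subnet
print(resolve_route(PRIVATE_ROUTES, "93.184.216.34"))  # internet, private subnet
print(resolve_route(ISOLATED_ROUTES, "93.184.216.34"))
```

The only difference between the public and private tables is the target of the default route, which is exactly what makes a subnet public or private.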

Troubleshooting Connectivity Issues

  1. Steps to Debug:
  • Verify the VPC has an internet gateway attached.
  • Check route tables for correct routing rules.
  • Ensure subnets are associated with the correct route tables.
  • Review security groups for necessary inbound/outbound rules.
  • Check network ACLs for traffic restrictions.
  • Confirm instances are in the correct subnets and associated with the right security groups.
  1. Common Scenarios:
  • Access Denied: Check IAM permissions and credentials.
  • No Internet Access: Verify route tables and NAT gateway configurations.
  • Connection Timeout: Check security groups and network ACLs.
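
The way a connection attempt fails usually narrows down which layer to inspect. Below is a minimal probe using the standard socket module; the probe function, the status labels, and the hints are this sketch's own conventions, and the mapping from symptom to cause is a rule of thumb rather than a guarantee.

```python
import socket

# Rule-of-thumb hints for each outcome (assumptions, not guarantees).
HINTS = {
    "open": "Network path is fine; look at credentials or IAM if access is still denied.",
    "timeout": "Packets are being dropped; check security groups, network ACLs, and routes.",
    "refused": "Host is reachable but nothing is listening; check the port and the service.",
    "dns_failure": "Endpoint name does not resolve; check the hostname and DNS settings.",
    "unreachable": "No usable network path; check route tables and gateways.",
}


def probe(host, port, timeout=3.0):
    """Try a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        return "dns_failure"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "unreachable"


# Self-contained demo against a throwaway local listener.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

status_open = probe("127.0.0.1", port)
listener.close()
status_closed = probe("127.0.0.1", port)

print(status_open, "->", HINTS[status_open])
print(status_closed, "->", HINTS[status_closed])
```

Against a real database you would call probe with its endpoint and port; an "Access Denied" error arrives only after the connection opens, so it points at IAM rather than networking.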

Key Takeaways

  • Data Ingestion: Critical for building robust data pipelines.
  • Source Systems: Databases, files, and streaming systems are the primary sources.
  • Data Types: Structured, semi-structured, and unstructured data require different handling.
  • Tools: SQL, NoSQL, object storage (S3), streaming platforms (Kafka), and orchestration tools (Airflow) are essential for data engineers.
  • ACID Compliance: Ensures data integrity in transactional systems.
  • IAM: Central to managing access and permissions in cloud-based architectures.
  • Networking: Understanding VPCs, subnets, route tables, and security groups is critical for building and troubleshooting data pipelines.

Source: DeepLearning.AI course on source systems.