Source Systems

Introduction to Source Systems

Key Concepts

Data Engineering Lifecycle:

Stages: Data generation, ingestion, transformation, storage, and serving, with monitoring running across every stage.
Importance of Ingestion: A commonly cited estimate is that about 80% of AI/ML work is data engineering, yet ingestion from source systems is often overlooked.

Source Systems:

Types:
- Databases: Relational (SQL) and non-relational (NoSQL).
- Files: Structured (CSV), semi-structured (JSON), unstructured (text, images, videos).
- Streaming Systems: Continuous data flow (e.g., IoT devices, logs).

Data Types:

Structured: Tabular data (rows and columns).
Semi-structured: JSON, XML (key-value pairs, nested structures).
Unstructured: Text, images, audio, video.
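
The difference between structured and semi-structured data shows up as soon as you parse it. The sketch below is only illustrative: the sample records and field names are made up, and it uses Python's standard csv and json modules.

```python
import csv
import io
import json

# Hypothetical sample data, invented for illustration.
csv_text = "id,name,city\n1,Ada,London\n2,Grace,New York\n"
json_text = (
    '{"id": 3, "name": "Linus", "tags": ["kernel", "git"],'
    ' "address": {"city": "Helsinki"}}'
)

# Structured: every row has the same columns, so access is by column name.
rows = list(csv.DictReader(io.StringIO(csv_text)))
structured_cities = [row["city"] for row in rows]

# Semi-structured: fields can nest and vary per record, so access follows a path.
record = json.loads(json_text)
nested_city = record["address"]["city"]
tag_count = len(record["tags"])
```

Note that csv.DictReader returns every value as a string, so numeric columns such as id still need explicit type conversion.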

Databases

Relational Databases:

Structure: Tables with rows and columns, linked by keys.
CRUD Operations: Create, Read, Update, Delete.
Normalization: Minimizes redundancy by splitting data into related tables.
Primary Key: Uniquely identifies each row.
Foreign Key: Links tables by referencing a primary key.
SQL: Standard language for querying relational databases (e.g., SELECT, JOIN, WHERE).
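
A minimal sketch of these ideas using Python's built-in sqlite3 module. The customers and orders tables and their contents are hypothetical; a production source system would more likely be PostgreSQL or MySQL, but the SQL is similar.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Normalized design: customer details live in one table, orders reference them.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)

# Create
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace')")
conn.execute(
    "INSERT INTO orders (id, customer_id, amount)"
    " VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0)"
)

# Update and Delete
conn.execute("UPDATE orders SET amount = 7.5 WHERE id = 11")
conn.execute("DELETE FROM orders WHERE id = 12")

# Read: JOIN follows the foreign key back to the customer row.
rows = conn.execute(
    "SELECT c.name, o.amount FROM customers AS c"
    " JOIN orders AS o ON o.customer_id = c.id"
    " WHERE o.amount > 5 ORDER BY o.id"
).fetchall()

# The foreign key rejects an order that points at a customer that does not exist.
try:
    conn.execute("INSERT INTO orders (id, customer_id, amount) VALUES (13, 99, 1.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```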

NoSQL Databases:

Types: Key-value, document, wide-column, graph.
Flexibility: No fixed schema, supports unstructured/semi-structured data.
Scalability: Horizontal scaling across multiple servers.
Eventual Consistency: In many NoSQL systems a read may briefly return stale data until all nodes converge on the latest write.
Examples: MongoDB (document store), DynamoDB (key-value store).

ACID Principles:

Atomicity: Transactions are all-or-nothing.
Consistency: Data remains valid after transactions.
Isolation: Concurrent transactions don’t interfere.
Durability: Completed transactions are permanent.
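
Atomicity is the easiest of the four to see in code. The sketch below uses sqlite3 and a hypothetical accounts table with a CHECK constraint; the transfer function is an illustration, not a prescribed pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so the CHECK fails
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer, the credit to bob has already been applied when the debit violates the constraint; the rollback undoes it, so no partial transfer is ever visible.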

Files and Object Storage

Files:

Types: CSV, JSON, text, images, videos.
Source Systems: File systems (Google Drive), object storage (Amazon S3, ADLS, GCS).

Object Storage:

Structure: Flat, no hierarchy (despite folder-like UIs).
Key Features:
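- Key: Unique identifier for each object (often described as a UUID; in Amazon S3, the object key within a bucket).
- Metadata: Additional info (e.g., creation date, file type).
- Immutable: Objects can’t be modified in place; an update rewrites the whole object, and with versioning enabled the previous version is kept.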
Use Cases: Data lakes, machine learning datasets.

Streaming Systems

Logs:

Definition: Append-only records of events (e.g., user activity, system errors).
Use Cases: Monitoring, debugging, anomaly detection.
Log Levels: Debug, Info, Warn, Error, Fatal.
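
As a small illustration of log levels, Python's standard logging module filters records below a configured threshold. The logger name and messages are made up; note that Python names the Fatal level CRITICAL.

```python
import io
import logging

stream = io.StringIO()  # capture output in memory instead of writing to a file
logger = logging.getLogger("ingest_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # drop Debug and Info records
logger.propagate = False  # keep records away from the root logger's handlers

handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

logger.debug("row 42 parsed")
logger.info("batch 7 started")
logger.warning("retrying batch 7")
logger.error("batch 7 failed")
logger.critical("source unreachable")

lines = stream.getvalue().splitlines()
```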

Streaming Data:

Components:
- Event Producer: Generates messages (e.g., IoT devices, APIs).
- Event Consumer: Processes messages (e.g., payment systems, inventory updates).
- Event Router: Distributes messages (e.g., Apache Kafka, Amazon Kinesis).
Types:
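- Message Queues: Buffer messages until a consumer processes them, after which they are deleted (e.g., Amazon SQS).
- Streaming Platforms: Persist messages in an append-only log that consumers can re-read (e.g., Apache Kafka).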
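
The practical difference between the two types is what happens to a message after it is read. The toy classes below are hypothetical and greatly simplified (no partitions, acknowledgements, or retention limits); they only contrast delete-on-consume with an offset-addressed log.

```python
from collections import deque


class ToyQueue:
    """Message queue: a message is gone once a consumer receives it."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class ToyLog:
    """Streaming log: records stay in place and consumers read from an offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset=0):
        return self._records[offset:]


events = ["order_created", "payment_received", "order_shipped"]

queue, log = ToyQueue(), ToyLog()
for event in events:  # the producer side
    queue.send(event)
    log.append(event)

first_consumer = [queue.receive() for _ in range(3)]
second_consumer = queue.receive()  # nothing left for a second consumer

billing_view = log.read(0)    # one consumer reads from the start
shipping_view = log.read(2)   # another starts at a later offset
replayed = log.read(0)        # reading again returns the same records
```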

DataOps and Orchestration

DataOps:

Focus: Automation, monitoring, and quality assurance in data pipelines.
Tools: Infrastructure as code, monitoring tools.

Orchestration:

DAGs: Directed Acyclic Graphs for workflow management (e.g., Apache Airflow).
Tasks: Units of work in a pipeline (e.g., extract, validate, load), run automatically in dependency order; data quality checks can be tasks too.
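
The core of a DAG-based orchestrator is ordering tasks so that each runs only after its dependencies. The sketch below uses Python's standard graphlib module (3.9+) rather than Airflow itself, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "quality_check": {"join"},
    "load": {"quality_check"},
}

# One valid execution order; the two extract tasks may appear in either order.
order = list(TopologicalSorter(dag).static_order())

# "Acyclic" matters: a cycle means no valid order exists.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.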

Connecting to Source Systems

Methods of Connection:

AWS Management Console: Manual, not repeatable or traceable.
Command Line Interface (CLI): Scriptable, but commands are often still typed and run by hand.
SDKs (e.g., Boto3): Programmatic and repeatable, ideal for automation.
API Connectors (e.g., JDBC, ODBC): Used to connect applications to databases.

Key Components for Connection:

Endpoint: The address of the resource (e.g., database endpoint).
Port: The network port number the service listens on (e.g., 3306 for MySQL, 5432 for PostgreSQL).
Credentials: Username, password, or access keys for authentication.
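
These three pieces are typically combined into a connection string. The sketch below is one possible way to do that in Python; the endpoint, database name, and environment variable names (DB_USER, DB_PASSWORD) are assumptions, and the demo values are set in code only so the example runs on its own.

```python
import os
from urllib.parse import quote_plus


def build_dsn(endpoint, port, database, user_env="DB_USER", password_env="DB_PASSWORD"):
    """Build a PostgreSQL-style URL, reading credentials from the environment."""
    user = os.environ[user_env]          # raises KeyError if the variable is missing
    password = os.environ[password_env]
    # Percent-encode credentials so characters like "@" or "/" don't break the URL.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{endpoint}:{port}/{database}"
    )


# Demo values only; in practice these come from the environment or a secrets manager.
os.environ.setdefault("DB_USER", "etl_user")
os.environ.setdefault("DB_PASSWORD", "p@ss/word")

dsn = build_dsn("db.example.internal", 5432, "sales")  # hypothetical endpoint
```

Keeping credentials out of the source code is what makes a script like this safe to commit to version control.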

Best Practices:

Avoid manual processes for repeatability and traceability.
Use SDKs or APIs for automation and scalability.

Identity and Access Management

Identity and Access Management (IAM) Basics:

Purpose: Manage permissions for accessing cloud resources.
Principle of Least Privilege: Grant only the permissions needed for a task, and only for as long as they are needed.

IAM Components:

Root User: The account creator with full access.
IAM Users: Individuals with specific permissions.
IAM Groups: Collections of users with shared permissions.
IAM Roles: Temporary permissions for users, applications, or services.
Policies: JSON documents defining permissions for resources.
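
For a sense of what a policy document looks like, the sketch below builds a read-only S3 policy as a Python dict and serializes it to JSON. The helper function and the bucket name are hypothetical; the Version, Statement, Effect, Action, and Resource keys follow the IAM JSON policy format.

```python
import json


def read_only_bucket_policy(bucket):
    """Return a least-privilege policy: list one bucket and read its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",      # the bucket itself
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",    # objects inside the bucket
            },
        ],
    }


policy = read_only_bucket_policy("example-raw-data")  # hypothetical bucket name
policy_json = json.dumps(policy, indent=2)
```

Nothing here grants write or delete access, which is the point of least privilege for an ingestion job that only reads.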

Common IAM Issues:

Misconfigured permissions.
Expired temporary credentials.
Storing credentials insecurely (e.g., public GitHub repositories).

Best Practices:

Use roles for temporary access instead of long-term credentials.
Regularly review and update IAM policies.
Avoid granting excessive permissions.

Networking in the Cloud

Key Concepts:

Regions and Availability Zones (AZs):
- Regions: Geographic areas with multiple AZs.
- AZs: One or more isolated data centers within a region.
Virtual Private Cloud (VPC): A private network within AWS.
Subnets: Divisions within a VPC for organizing resources.
Internet Gateway: Allows resources in public subnets to connect to the internet.
NAT Gateway: Allows resources in private subnets to initiate outbound internet connections without exposing them to inbound traffic.

Route Tables: Define how traffic is routed within a VPC.

Public Subnets: Route traffic to the internet via an internet gateway.
Private Subnets: Route traffic to the internet via a NAT gateway.
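
The relationship between a VPC, its subnets, and route tables can be sketched with Python's standard ipaddress module. The CIDR ranges and target names below are made-up examples, and next_hop is a simplified stand-in for how the most specific matching route is chosen.

```python
import ipaddress

# A VPC address range carved into /24 subnets (example values).
vpc = ipaddress.ip_network("10.0.0.0/16")
public_subnet, private_subnet = list(vpc.subnets(new_prefix=24))[:2]

# Route tables as (destination CIDR, target) pairs; target names are illustrative.
public_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "internet-gateway")]
private_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-gateway")]


def next_hop(route_table, destination_ip):
    """Return the target of the most specific route that matches the destination."""
    ip = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


inside_vpc = next_hop(public_routes, "10.0.1.25")
public_to_internet = next_hop(public_routes, "93.184.216.34")
private_to_internet = next_hop(private_routes, "93.184.216.34")
```

What makes a subnet public or private is therefore its route table: the same default route (0.0.0.0/0) points at an internet gateway in one case and a NAT gateway in the other.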

Security Groups and Network ACLs:

Security Groups:
- Act as virtual firewalls at the instance level.
- Stateful: Allow return traffic automatically.
- Commonly used for EC2 instances, RDS databases, and load balancers.
Network ACLs:
- Provide subnet-level security.
- Stateless: Require explicit inbound and outbound rules.
- Useful for granular control over traffic.

Common Networking Issues:

Misconfigured route tables.
Missing or incorrect security group rules.
Incorrect subnet associations.

Best Practices:

Use public subnets for internet-facing resources (e.g., load balancers).
Use private subnets for internal resources (e.g., databases).
Regularly review and update security group and network ACL rules.

Troubleshooting Connectivity Issues

Steps to Debug:

Verify the VPC has an internet gateway attached.
Check route tables for correct routing rules.
Ensure subnets are associated with the correct route tables.
Review security groups for necessary inbound/outbound rules.
Check network ACLs for traffic restrictions.
Confirm instances are in the correct subnets and associated with the right security groups.

Common Scenarios:

Access Denied: Check IAM permissions and credentials.
No Internet Access: Verify route tables and NAT gateway configurations.
Connection Timeout: Check security groups and network ACLs.
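
A quick TCP check can help tell these scenarios apart before digging into the console. The function below is a rough sketch using Python's socket module; the interpretation in the comments is a rule of thumb, not a guarantee.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Try a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"      # network path and listener are both fine
    except socket.timeout:
        return "timeout"       # often packets silently dropped: security group, NACL, or routing
    except ConnectionRefusedError:
        return "refused"       # host reachable, but nothing is listening on that port
    except OSError as exc:
        return f"error: {exc}"  # e.g., DNS resolution failure or unreachable network


# Example call (hypothetical endpoint):
# check_tcp("db.example.internal", 5432)
```

An "open" result followed by a failed login points back to credentials or IAM permissions rather than networking.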

Key Takeaways

Data Ingestion: Critical for building robust data pipelines.
Source Systems: Databases, files, and streaming systems are the primary sources.
Data Types: Structured, semi-structured, and unstructured data require different handling.
Tools: SQL, NoSQL, object storage (S3), streaming platforms (Kafka), and orchestration tools (Airflow) are essential for data engineers.
ACID Compliance: Ensures data integrity in transactional systems.
IAM: Central to managing access and permissions in cloud-based architectures.
Networking: Understanding VPCs, subnets, route tables, and security groups is critical for building and troubleshooting data pipelines.

Source: DeepLearning.ai source systems course.
