Data Pipelines
Source Systems
Introduction to Source Systems
Key Concepts
- Data Engineering Lifecycle:
- Stages: Data generation, ingestion, transformation, storage, serving, and monitoring.
- Importance of Ingestion: By a commonly cited estimate, about 80% of AI/ML work is data engineering, yet ingestion is often overlooked.
- Source Systems:
- Types:
- Databases: Relational (SQL) and non-relational (NoSQL).
- Files: Structured (CSV), semi-structured (JSON), unstructured (text, images, videos).
- Streaming Systems: Continuous data flow (e.g., IoT devices, logs).
- Data Types:
- Structured: Tabular data (rows and columns).
- Semi-structured: JSON, XML (key-value pairs, nested structures).
- Unstructured: Text, images, audio, video.
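The three data types above can be contrasted with a minimal sketch using only the Python standard library. The sample records (order_id, amount, items) are made up for illustration and do not come from the course.

```python
import csv
import io
import json

# Structured: every row has the same columns.
csv_text = "order_id,amount\n1,9.50\n2,20.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records can nest and can omit fields.
json_text = '[{"order_id": 1, "items": [{"sku": "A"}]}, {"order_id": 2}]'
records = json.loads(json_text)

# Unstructured: no schema at all, just raw bytes (placeholder, not a real image).
blob = b"\x89PNG..."

# Tabular data can be aggregated directly; nested data needs defensive access.
total = sum(float(r["amount"]) for r in rows)
item_counts = [len(rec.get("items", [])) for rec in records]
print(total, item_counts)
```

Note how the semi-structured records need `.get(...)` with a default because a field may be missing, while the CSV rows can rely on fixed columns.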
Databases
- Relational Databases:
- Structure: Tables with rows and columns, linked by keys.
- CRUD Operations: Create, Read, Update, Delete.
- Normalization: Minimizes redundancy by splitting data into related tables.
- Primary Key: Uniquely identifies each row.
- Foreign Key: Links tables by referencing a primary key.
- SQL: Standard language for querying relational databases (e.g., SELECT, JOIN, WHERE).
- NoSQL Databases:
- Types: Key-value, document, wide-column, graph.
- Flexibility: No fixed schema, supports unstructured/semi-structured data.
- Scalability: Horizontal scaling across multiple servers.
- Eventual Consistency: Many NoSQL systems relax strict consistency, so a read may briefly return stale data until all nodes converge.
- Examples: MongoDB (document store), DynamoDB (key-value store).
- ACID Principles:
- Atomicity: Transactions are all-or-nothing.
- Consistency: Data remains valid after transactions.
- Isolation: Concurrent transactions don’t interfere.
- Durability: Completed transactions are permanent.
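As a rough illustration of keys, joins, and atomicity, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table and column names (customers, orders) are assumptions chosen for the example, not part of the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)

# Create: both inserts are committed together when the block exits cleanly.
with conn:
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 25.0)")

# Atomicity: the second insert references a missing customer, so the
# whole transaction (including the first insert) is rolled back.
try:
    with conn:
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 10.0)")
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (99, 5.0)")
except sqlite3.IntegrityError:
    pass

# Read: join the two tables through the foreign key.
joined = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "JOIN orders o ON o.customer_id = c.id WHERE o.amount > 0"
).fetchall()
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(joined, order_count)
```

Only the order from the successful transaction remains, which is the all-or-nothing behavior described under Atomicity.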
Files and Object Storage
- Files:
- Types: CSV, JSON, text, images, videos.
- Source Systems: File storage services (e.g., Google Drive), object storage (e.g., Amazon S3, Azure Data Lake Storage, Google Cloud Storage).
- Object Storage:
- Structure: Flat, no hierarchy (despite folder-like UIs).
- Key Features:
- Key (UUID): Unique identifier for each object, used to retrieve it.
- Metadata: Additional info (e.g., creation date, file type).
- Immutable: Objects can't be modified in place; an update rewrites the whole object, and with versioning enabled the previous copy is kept as an older version.
- Use Cases: Data lakes, machine learning datasets.
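The flat namespace, metadata, and immutability ideas above can be sketched with a toy in-memory store. ToyObjectStore and its methods are hypothetical names invented for this illustration; this is not the API of S3 or any other real service.

```python
import uuid
from datetime import datetime, timezone


class ToyObjectStore:
    """In-memory stand-in for a bucket; not a real storage client."""

    def __init__(self):
        # key -> list of versions, oldest first
        self._objects = {}

    def put(self, key, data, content_type="application/octet-stream"):
        version = {
            "version_id": uuid.uuid4().hex,
            "data": bytes(data),
            "metadata": {
                "content_type": content_type,
                "created": datetime.now(timezone.utc).isoformat(),
            },
        }
        # Existing versions are never modified; a write appends a new one.
        self._objects.setdefault(key, []).append(version)
        return version["version_id"]

    def get(self, key):
        # Return the data of the latest version.
        return self._objects[key][-1]["data"]

    def list_keys(self, prefix=""):
        # "Folders" are just shared key prefixes in a flat namespace.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def versions(self, key):
        return [v["version_id"] for v in self._objects[key]]


store = ToyObjectStore()
v1 = store.put("raw/2024/orders.csv", b"id,amount\n1,9.5\n", "text/csv")
v2 = store.put("raw/2024/orders.csv", b"id,amount\n1,9.5\n2,20.0\n", "text/csv")
store.put("raw/2023/orders.csv", b"id,amount\n", "text/csv")
print(store.list_keys("raw/2024/"))
```

Listing by prefix mimics a folder view even though the store itself only holds a flat set of keys.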
Streaming Systems
- Logs:
- Definition: Append-only records of events (e.g., user activity, system errors).
- Use Cases: Monitoring, debugging, anomaly detection.
- Log Levels: Debug, Info, Warn, Error, Fatal.
- Streaming Data:
- Components:
- Event Producer: Generates messages (e.g., IoT devices, APIs).
- Event Consumer: Processes messages (e.g., payment systems, inventory updates).
- Event Router: Distributes messages (e.g., Apache Kafka, Amazon Kinesis).
- Types:
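- Message Queues: Hold messages temporarily until a consumer processes and deletes them (e.g., Amazon SQS).
- Streaming Platforms: Keep messages in a persistent, append-only log that consumers can replay (e.g., Apache Kafka).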
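The difference between a message queue and a streaming log can be sketched in a few lines. ToyQueue and ToyLog are hypothetical in-memory classes written for this example; real systems such as SQS or Kafka add acknowledgement, partitioning, and retention, which are left out here.

```python
from collections import deque


class ToyQueue:
    """Message queue: a message is gone once a consumer has received it."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class ToyLog:
    """Streaming log: append-only; each consumer tracks its own offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}

    def append(self, record):
        self._records.append(record)

    def poll(self, consumer):
        # Return everything this consumer has not read yet, then advance its offset.
        offset = self._offsets.get(consumer, 0)
        batch = self._records[offset:]
        self._offsets[consumer] = len(self._records)
        return batch


queue, log = ToyQueue(), ToyLog()
for event in ["order_created", "order_paid"]:  # the event producer
    queue.send(event)
    log.append(event)

first = queue.receive()            # removed from the queue for everyone
billing = log.poll("billing")      # reads both records
inventory = log.poll("inventory")  # a second consumer still sees both
print(first, billing, inventory)
```

In the log, records stay in place and only the per-consumer offset moves, which is what lets several consumers read the same events independently.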
DataOps and Orchestration
- DataOps:
- Focus: Automation, monitoring, and quality assurance in data pipelines.
- Tools: Infrastructure as code, monitoring tools.
- Orchestration:
- DAGs: Directed Acyclic Graphs for workflow management (e.g., Apache Airflow).
- Tasks: Each node in the DAG is a task (e.g., ingest, transform, run a data quality check); the orchestrator runs tasks in dependency order.
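To make the DAG idea concrete, here is a minimal sketch using graphlib from the Python standard library (Python 3.9+). It is not Airflow's API; the task names and the run function are assumptions for illustration only.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
dag = {
    "ingest": set(),
    "transform": {"ingest"},
    "quality_check": {"transform"},
    "serve": {"quality_check"},
}

executed = []


def run(task):
    # Placeholder for real work; here we only record the execution order.
    executed.append(task)


# static_order() yields tasks so that every dependency comes first.
for task in TopologicalSorter(dag).static_order():
    run(task)

print(executed)
```

An orchestrator adds scheduling, retries, and monitoring on top of this ordering, but the dependency resolution is the core of what a DAG provides.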
Connecting to Source Systems
- Methods of Connection:
- AWS Management Console: Manual, not repeatable or traceable.
- Command Line Interface (CLI): More programmatic but still manual.
- SDKs (e.g., Boto3): Programmatic and repeatable, ideal for automation.
- API Connectors (e.g., JDBC, ODBC): Used to connect applications to databases.
- Key Components for Connection:
- Endpoint: The address of the resource (e.g., database endpoint).
- Port: The network port the service listens on (e.g., 5432 for PostgreSQL, 3306 for MySQL).
- Credentials: Username, password, or access keys for authentication.
- Best Practices:
- Avoid manual processes for repeatability and traceability.
- Use SDKs or APIs for automation and scalability.
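One way to keep connections repeatable is to load the endpoint, port, and credentials from the environment instead of typing them by hand. The sketch below assumes hypothetical variable names (DB_ENDPOINT, DB_PORT, DB_USER, DB_PASSWORD) and a made-up endpoint; it only builds the settings and does not open a real connection.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConnectionSettings:
    endpoint: str
    port: int
    user: str
    # repr=False keeps the password out of logs and error messages.
    password: str = field(repr=False)


def load_settings(env=os.environ):
    """Read connection settings from a mapping (the process environment by default)."""
    return ConnectionSettings(
        endpoint=env["DB_ENDPOINT"],
        port=int(env.get("DB_PORT", "5432")),  # assumed default: PostgreSQL port
        user=env["DB_USER"],
        password=env["DB_PASSWORD"],
    )


# Example values for illustration only; in practice these come from the environment.
example_env = {
    "DB_ENDPOINT": "db.example.internal",
    "DB_USER": "etl_user",
    "DB_PASSWORD": "not-a-real-password",
}
settings = load_settings(example_env)
print(settings)
```

The resulting settings object could then be passed to whichever database driver or SDK client the pipeline uses.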
Identity and Access Management
- Identity and Access Management (IAM) Basics:
- Purpose: Manage permissions for accessing cloud resources.
- Principle of Least Privilege: Grant only the necessary permissions for a limited time.
- IAM Components:
- Root User: The identity created with the account; it has full access and should not be used for everyday tasks.
- IAM Users: Individuals with specific permissions.
- IAM Groups: Collections of users with shared permissions.
- IAM Roles: Identities that users, applications, or services assume to obtain temporary credentials.
- Policies: JSON documents defining permissions for resources.
- Common IAM Issues:
- Misconfigured permissions.
- Expired temporary credentials.
- Storing credentials insecurely (e.g., public GitHub repositories).
- Best Practices:
- Use roles for temporary access instead of long-term credentials.
- Regularly review and update IAM policies.
- Avoid granting excessive permissions.
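A policy is a JSON document, so least privilege can be illustrated by writing one that allows a single action on a single prefix. The bucket name example-data-lake is hypothetical, and is_allowed is a deliberately simplified toy check; real IAM evaluation also handles explicit denies, conditions, and multiple policy types.

```python
import json
from fnmatch import fnmatchcase

# Read-only access to one prefix of one (hypothetical) bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::example-data-lake/raw/*"],
        }
    ],
}


def is_allowed(policy, action, resource):
    """Toy check: allowed only if some Allow statement matches; otherwise denied."""
    for stmt in policy["Statement"]:
        if (
            stmt["Effect"] == "Allow"
            and any(fnmatchcase(action, a) for a in stmt["Action"])
            and any(fnmatchcase(resource, r) for r in stmt["Resource"])
        ):
            return True
    return False


print(json.dumps(policy, indent=2))
print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::example-data-lake/raw/orders.csv"))
print(is_allowed(policy, "s3:DeleteObject", "arn:aws:s3:::example-data-lake/raw/orders.csv"))
```

Anything not explicitly allowed is denied by default, which is why a missing statement usually shows up as an Access Denied error.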
Networking in the Cloud
- Key Concepts:
- Regions and Availability Zones (AZs):
- Regions: Geographic areas with multiple AZs.
- AZs: One or more isolated data centers within a region.
- Virtual Private Cloud (VPC): A logically isolated private network within AWS.
- Subnets: Divisions within a VPC for organizing resources.
- Internet Gateway: Allows public subnets to connect to the internet.
- NAT Gateway: Allows private subnets to access the internet without exposing them to inbound traffic.
- Route Tables: Define how traffic is routed within a VPC.
- Public Subnets: Route traffic to the internet via an internet gateway.
- Private Subnets: Have no direct route to the internet gateway; outbound internet traffic goes through a NAT gateway.
- Security Groups and Network ACLs:
- Security Groups:
- Act as virtual firewalls at the instance level.
- Stateful: Allow return traffic automatically.
- Commonly used for EC2 instances, RDS databases, and load balancers.
- Network ACLs:
- Provide subnet-level security.
- Stateless: Require explicit inbound and outbound rules.
- Support both allow and deny rules, which is useful for granular control over traffic.
- Common Networking Issues:
- Misconfigured route tables.
- Missing or incorrect security group rules.
- Incorrect subnet associations.
- Best Practices:
- Use public subnets for internet-facing resources (e.g., load balancers).
- Use private subnets for internal resources (e.g., databases).
- Regularly review and update security group and network ACL rules.
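The relationship between a VPC, its subnets, and route tables can be sketched with the standard ipaddress module. The CIDR ranges and the target labels (local, internet-gateway, nat-gateway) are assumptions for illustration; this models the routing decision only and calls no AWS API.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")

# Carve the VPC range into /24 subnets and take the first two.
subnets = list(vpc.subnets(new_prefix=24))
public_subnet, private_subnet = subnets[0], subnets[1]

# Route tables: destination CIDR -> target (labels are illustrative).
route_tables = {
    "public": {"10.0.0.0/16": "local", "0.0.0.0/0": "internet-gateway"},
    "private": {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-gateway"},
}


def next_hop(table, destination):
    """Pick the most specific (longest-prefix) route that contains the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in table.items()
        if addr in ipaddress.ip_network(cidr)
    ]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(public_subnet, private_subnet)
print(next_hop(route_tables["private"], "10.0.0.15"))  # stays inside the VPC
print(next_hop(route_tables["private"], "8.8.8.8"))    # leaves through the NAT gateway
```

What makes a subnet public or private in this model is only the target of its default route, which is why a wrong route table association is a common cause of connectivity problems.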
Troubleshooting Connectivity Issues
- Steps to Debug:
- Verify the VPC has an internet gateway attached.
- Check route tables for correct routing rules.
- Ensure subnets are associated with the correct route tables.
- Review security groups for necessary inbound/outbound rules.
- Check network ACLs for traffic restrictions.
- Confirm instances are in the correct subnets and associated with the right security groups.
- Common Scenarios:
- Access Denied: Check IAM permissions and credentials.
- No Internet Access: Verify route tables and NAT gateway configurations.
- Connection Timeout: Check security groups and network ACLs.
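A quick first check when debugging is whether a TCP connection to the endpoint and port succeeds at all. The helper below is a minimal sketch using the standard socket module; check_tcp is a hypothetical name, and its result is only a hint, not a diagnosis.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return 'open', 'refused', 'timeout', or 'error: ...' for a TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all; packets are probably being dropped on the way.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


# Example call (hypothetical endpoint), left as a comment so the block has no side effects:
# print(check_tcp("db.example.internal", 5432))
```

A "timeout" result often points to security groups, network ACLs, or routing, since blocked traffic tends to be dropped silently, while "refused" usually means the host was reachable but the service is not listening on that port.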
Key Takeaways
- Data Ingestion: Critical for building robust data pipelines.
- Source Systems: Databases, files, and streaming systems are the primary sources.
- Data Types: Structured, semi-structured, and unstructured data require different handling.
- Tools: SQL, NoSQL, object storage (S3), streaming platforms (Kafka), and orchestration tools (Airflow) are essential for data engineers.
- ACID Compliance: Ensures data integrity in transactional systems.
- IAM: Central to managing access and permissions in cloud-based architectures.
- Networking: Understanding VPCs, subnets, route tables, and security groups is critical for building and troubleshooting data pipelines.
Source: DeepLearning.AI source systems course.