Source Systems
​
Introduction to Source Systems
​
Key Concepts
Data Engineering Lifecycle
:
Stages
: Data generation, ingestion, transformation, and serving, with storage supporting every stage; monitoring runs alongside them as part of DataOps.
Importance of Ingestion
: By a commonly cited estimate, about 80% of AI/ML work is data engineering, yet it is often overlooked.
Source Systems
:
Types
:
Databases
: Relational (SQL) and non-relational (NoSQL).
Files
: Structured (CSV), semi-structured (JSON), unstructured (text, images, videos).
Streaming Systems
: Continuous data flow (e.g., IoT devices, logs).
Data Types
:
Structured
: Tabular data (rows and columns).
Semi-structured
: JSON, XML (key-value pairs, nested structures).
Unstructured
: Text, images, audio, video.
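As a rough illustration of the difference between structured and semi-structured data, the sketch below parses a small CSV and a small JSON payload with the standard library. The sample records, field names, and the `customer_name` helper are made up for this example.

```python
import csv
import io
import json

# Structured: every row has the same columns.
csv_text = "order_id,customer,amount\n1,alice,20.5\n2,bob,13.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records may nest values and may omit fields entirely.
json_text = (
    '[{"order_id": 1, "customer": {"name": "alice", "tags": ["vip"]}},'
    ' {"order_id": 2}]'
)
records = json.loads(json_text)


def customer_name(record):
    """Return the nested customer name, or None when the field is absent."""
    return record.get("customer", {}).get("name")


csv_total = sum(float(row["amount"]) for row in rows)
names = [customer_name(record) for record in records]

print(csv_total)  # 33.5
print(names)      # ['alice', None]
```

The CSV rows can be processed column by column without checks, while the JSON records need defensive access because the schema is not fixed.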
​
Databases
Relational Databases
:
Structure
: Tables with rows and columns, linked by keys.
CRUD Operations
: Create, Read, Update, Delete.
Normalization
: Minimizes redundancy by splitting data into related tables.
Primary Key
: Uniquely identifies each row.
Foreign Key
: Links tables by referencing a primary key.
SQL
: Standard language for querying relational databases (e.g., SELECT, JOIN, WHERE).
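The ideas above (normalized tables, primary and foreign keys, and a SELECT with JOIN and WHERE) can be sketched with Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are hypothetical; note that SQLite only enforces foreign keys once the `foreign_keys` pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Normalized design: customer details live in one table, orders reference them by key.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
INSERT INTO orders VALUES (10, 1, 20.5), (11, 1, 4.5), (12, 2, 13.0);
""")

# JOIN follows the foreign key back to the customer row.
totals = conn.execute("""
SELECT c.name, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE o.amount > 1
GROUP BY c.name
ORDER BY c.name
""").fetchall()
print(totals)  # [('alice', 25.0), ('bob', 13.0)]

# An order pointing at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 1.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)  # True
```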
NoSQL Databases
:
Types
: Key-value, document, wide-column, graph.
Flexibility
: No fixed schema, supports unstructured/semi-structured data.
Scalability
: Horizontal scaling across multiple servers.
Eventual Consistency
: Data may not be immediately consistent across nodes.
Examples
: MongoDB (document store), DynamoDB (key-value store).
ACID Principles
:
Atomicity
: Transactions are all-or-nothing.
Consistency
: Data remains valid after transactions.
Isolation
: Concurrent transactions don’t interfere.
Durability
: Completed transactions are permanent.
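Atomicity is the easiest of these properties to demonstrate in a few lines. The sketch below uses `sqlite3` and a hypothetical `accounts` table with a CHECK constraint; the `transfer` helper is an illustration, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed)  # True False
print(balances)    # {'alice': 70, 'bob': 80}
```

In the failed transfer the credit to `bob` had already been applied when the debit was rejected, yet the rollback undoes it, so the balances reflect only the first transfer.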
​
Files and Object Storage
Files
:
Types
: CSV, JSON, text, images, videos.
Source Systems
: File systems (Google Drive), object storage (Amazon S3, ADLS, GCS).
Object Storage
:
Structure
: Flat, no hierarchy (despite folder-like UIs).
Key Features
:
Unique Key
: Identifies each object; some systems use a UUID, while Amazon S3 uses a user-chosen key string.
Metadata
: Additional info (e.g., creation date, file type).
Immutable
: Objects can't be modified in place; an update writes a whole new object or version.
Use Cases
: Data lakes, machine learning datasets.
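To make the "flat, no hierarchy" point concrete, here is a toy in-memory model of a bucket. It is not how any real object store is implemented; `bucket`, `put_object`, and `list_keys` are hypothetical names used only to show that folders are shared key prefixes and that a write replaces the whole object.

```python
# A toy "bucket": a flat mapping from key to (data, metadata).
bucket = {}


def put_object(key, data, content_type):
    """Store a whole object under a key; writing again replaces it entirely."""
    bucket[key] = (data, {"content_type": content_type, "size": len(data)})


def list_keys(prefix=""):
    """Return keys starting with prefix; 'folders' are only shared prefixes."""
    return sorted(key for key in bucket if key.startswith(prefix))


put_object("raw/2024/01/events.json", b'{"id": 1}', "application/json")
put_object("raw/2024/02/events.json", b'{"id": 2}', "application/json")
put_object("curated/events.csv", b"id\n1\n2\n", "text/csv")

print(list_keys("raw/2024/"))
# ['raw/2024/01/events.json', 'raw/2024/02/events.json']
print(bucket["curated/events.csv"][1])
# {'content_type': 'text/csv', 'size': 7}
```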
​
Streaming Systems
Logs
:
Definition
: Append-only records of events (e.g., user activity, system errors).
Use Cases
: Monitoring, debugging, anomaly detection.
Log Levels
: Debug, Info, Warn, Error, Fatal.
Streaming Data
:
Components
:
Event Producer
: Generates messages (e.g., IoT devices, APIs).
Event Consumer
: Processes messages (e.g., payment systems, inventory updates).
Event Router
: Distributes messages (e.g., Apache Kafka, Amazon Kinesis).
Types
:
Message Queues
: Temporary storage (e.g., Amazon SQS).
Streaming Platforms
: Persistent logs (e.g., Apache Kafka).
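The contrast between a message queue and a streaming platform can be sketched with two small classes. This is a simplified mental model, not the behavior of Amazon SQS or Apache Kafka in detail; `MessageQueue` and `EventLog` are hypothetical names.

```python
from collections import deque


class MessageQueue:
    """Temporary storage: a message is gone once a consumer receives it."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class EventLog:
    """Persistent append-only log: consumers read by offset and can replay."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read(self, offset=0):
        return self._events[offset:]


queue, log = MessageQueue(), EventLog()
for event in ("order_created", "order_paid"):  # the producer side
    queue.send(event)
    log.append(event)

received = [queue.receive(), queue.receive(), queue.receive()]
print(received)     # ['order_created', 'order_paid', None]
print(log.read(0))  # ['order_created', 'order_paid']
print(log.read(1))  # ['order_paid']
```

Once the queue is drained there is nothing left to read, whereas any consumer of the log can start again from offset 0.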
​
DataOps and Orchestration
DataOps
:
Focus
: Automation, monitoring, and quality assurance in data pipelines.
Tools
: Infrastructure as code, monitoring tools.
Orchestration
:
DAGs
: Directed Acyclic Graphs for workflow management (e.g., Apache Airflow).
Tasks
: Automating pipeline tasks, ensuring data quality.
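A DAG is just a set of tasks plus their dependencies. The sketch below uses the standard-library `graphlib` module (Python 3.9+) rather than Apache Airflow, and the task names are invented, but the core idea an orchestrator relies on is the same: run each task only after everything it depends on, and reject cycles.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "quality_check": {"join_tables"},
    "load_warehouse": {"quality_check"},
}

# A valid execution order: every task appears after its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)

# A graph with a cycle is not a DAG and cannot be scheduled.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```

The two extract tasks have no dependency on each other, so their relative order is not significant and an orchestrator could run them in parallel.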
​
Connecting to Source Systems
Methods of Connection
:
AWS Management Console
: Manual, not repeatable or traceable.
Command Line Interface (CLI)
: Scriptable, but still manual when commands are typed by hand.
SDKs (e.g., Boto3)
: Programmatic and repeatable, ideal for automation.
API Connectors (e.g., JDBC, ODBC)
: Used to connect applications to databases.
Key Components for Connection
:
Endpoint
: The address of the resource (e.g., database endpoint).
Port
: The network port the service listens on (e.g., 3306 for MySQL, 5432 for PostgreSQL).
Credentials
: Username, password, or access keys for authentication.
Best Practices
:
Avoid manual processes for repeatability and traceability.
Use SDKs or APIs for automation and scalability.
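One way to keep connections repeatable is to assemble the endpoint, port, and credentials from configuration instead of typing them in. The sketch below assumes a PostgreSQL-style URL and hypothetical environment variable names (`DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`); it only builds the URL and does not open a connection.

```python
import os
from urllib.parse import quote


def build_connection_url(env=os.environ):
    """Assemble a database URL from endpoint, port, and credentials."""
    host = env["DB_HOST"]                          # endpoint: address of the database
    port = int(env.get("DB_PORT", "5432"))         # port: defaults to PostgreSQL's 5432
    user = quote(env["DB_USER"], safe="")          # credentials are percent-encoded
    password = quote(env["DB_PASSWORD"], safe="")  # so special characters stay valid
    database = env.get("DB_NAME", "postgres")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Example with an explicit mapping instead of the real environment.
url = build_connection_url(
    {
        "DB_HOST": "db.example.internal",
        "DB_USER": "etl",
        "DB_PASSWORD": "p@ss/word",
    }
)
print(url)  # postgresql://etl:p%40ss%2Fword@db.example.internal:5432/postgres
```

Reading credentials from the environment (or a secrets manager) also keeps them out of source control.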
​
Identity and Access Management
Identity and Access Management (IAM) Basics
:
Purpose
: Manage permissions for accessing cloud resources.
Principle of Least Privilege
: Grant only the necessary permissions for a limited time.
IAM Components
:
Root User
: The account creator with full access.
IAM Users
: Individuals with specific permissions.
IAM Groups
: Collections of users with shared permissions.
IAM Roles
: Temporary permissions for users, applications, or services.
Policies
: JSON documents defining permissions for resources.
Common IAM Issues
:
Misconfigured permissions.
Expired temporary credentials.
Storing credentials insecurely (e.g., public GitHub repositories).
Best Practices
:
Use roles for temporary access instead of long-term credentials.
Regularly review and update IAM policies.
Avoid granting excessive permissions.
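A policy is a JSON document, so least privilege can be illustrated by generating one. The sketch below builds a read-only policy scoped to a single prefix of a single S3 bucket; the bucket name `example-raw-data`, the prefix, and the helper function are hypothetical, and a real policy should be validated against your own account's requirements.

```python
import json


def read_only_bucket_policy(bucket_name, prefix):
    """Build a policy that allows listing and reading one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, restricted to the prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reading is granted only on objects under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/{prefix}*",
            },
        ],
    }


policy = read_only_bucket_policy("example-raw-data", "orders/")
actions = [action for stmt in policy["Statement"] for action in stmt["Action"]]
print(actions)  # ['s3:ListBucket', 's3:GetObject']
print(json.dumps(policy, indent=2))
```

No write or delete actions appear, so a pipeline using this policy can read its input but cannot modify it.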
​
Networking in the Cloud
Key Concepts
:
Regions and Availability Zones (AZs)
:
Regions
: Geographic areas with multiple AZs.
AZs
: One or more isolated data centers within a region.
Virtual Private Cloud (VPC)
: A private network within AWS.
Subnets
: Divisions within a VPC for organizing resources.
Internet Gateway
: Allows public subnets to connect to the internet.
NAT Gateway
: Allows private subnets to access the internet without exposing them to inbound traffic.
Route Tables
: Define how traffic is routed within a VPC.
Public Subnets
: Route traffic to the internet via an internet gateway.
Private Subnets
: Route traffic to the internet via a NAT gateway.
Security Groups and Network ACLs
:
Security Groups
:
Act as virtual firewalls at the instance level.
Stateful: Allow return traffic automatically.
Commonly used for EC2 instances, RDS databases, and load balancers.
Network ACLs
:
Provide subnet-level security.
Stateless: Require explicit inbound and outbound rules.
Useful for granular control over traffic.
Common Networking Issues
:
Misconfigured route tables.
Missing or incorrect security group rules.
Incorrect subnet associations.
Best Practices
:
Use public subnets for internet-facing resources (e.g., load balancers).
Use private subnets for internal resources (e.g., databases).
Regularly review and update security group and network ACL rules.
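Subnets are address ranges carved out of the VPC's range. The sketch below uses the standard-library `ipaddress` module to split a hypothetical `10.0.0.0/16` VPC into `/24` subnets and to check which subnet an address falls in; the CIDR blocks and the public/private labels are assumptions for illustration.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))  # every /24 block inside the VPC range
public_subnet, private_subnet = subnets[0], subnets[1]


def subnet_for(address):
    """Return which of the two example subnets contains the address, if any."""
    ip = ipaddress.ip_address(address)
    for name, network in (("public", public_subnet), ("private", private_subnet)):
        if ip in network:
            return name
    return None


print(len(subnets))             # 256
print(public_subnet)            # 10.0.0.0/24
print(private_subnet)           # 10.0.1.0/24
print(subnet_for("10.0.1.25"))  # private
print(subnet_for("10.0.9.9"))   # None
```

A check like this is handy when an instance's address does not match the subnet you expected it to be in.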
​
Troubleshooting Connectivity Issues
Steps to Debug
:
Verify the VPC has an internet gateway attached.
Check route tables for correct routing rules.
Ensure subnets are associated with the correct route tables.
Review security groups for necessary inbound/outbound rules.
Check network ACLs for traffic restrictions.
Confirm instances are in the correct subnets and associated with the right security groups.
Common Scenarios
:
Access Denied
: Check IAM permissions and credentials.
No Internet Access
: Verify route tables and NAT gateway configurations.
Connection Timeout
: Check security groups and network ACLs.
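A quick way to tell these scenarios apart is to attempt a plain TCP connection and look at how it fails. The `check_tcp` helper below is a hypothetical sketch built on the standard `socket` module; the host and port in the usage comment are placeholders.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"      # something is listening and traffic is allowed
    except ConnectionRefusedError:
        return "refused"       # host reachable, but nothing listens on that port
    except (socket.timeout, TimeoutError):
        return "timeout"       # packets are likely dropped along the way
    except OSError:
        return "error"         # e.g. the hostname could not be resolved


# Usage (placeholder endpoint and port):
# print(check_tcp("db.example.internal", 5432))
```

A `timeout` result usually points at security groups, network ACLs, or routing, while `refused` suggests the network path is fine and the service or port is the problem.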
​
Key Takeaways
Data Ingestion
: Critical for building robust data pipelines.
Source Systems
: Databases, files, and streaming systems are the primary sources.
Data Types
: Structured, semi-structured, and unstructured data require different handling.
Tools
: SQL, NoSQL, object storage (S3), streaming platforms (Kafka), and orchestration tools (Airflow) are essential for data engineers.
ACID Compliance
: Ensures data integrity in transactional systems.
IAM
: Central to managing access and permissions in cloud-based architectures.
Networking
: Understanding VPCs, subnets, route tables, and security groups is critical for building and troubleshooting data pipelines.
Source
: DeepLearning.AI source systems course.