Incremental Data Processing
1. Delta Lake ACID Transactions
Where ACID is Provided
- Delta Lake ensures ACID (Atomicity, Consistency, Isolation, Durability) for:
  - Writes: `INSERT`, `UPDATE`, `MERGE`, `DELETE`.
  - Reads: Concurrent readers see consistent snapshots.
  - Metadata Operations: Schema changes, partition updates.
Benefits of ACID
- Atomicity: All operations in a transaction succeed or fail together.
- Consistency: Valid data state after each transaction.
- Isolation: Concurrent transactions don't interfere.
- Durability: Committed data survives failures.
Checking ACID Compliance
- Delta Log: Transactions are logged in `_delta_log/` (JSON files).
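The transaction log can be inspected from SQL; a minimal sketch, assuming a hypothetical table name:

```sql
-- Each row of the history corresponds to one JSON commit file under _delta_log/.
DESCRIBE HISTORY my_table;
```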
2. Data vs. Metadata
| Data | Metadata |
|---|---|
| Raw content (e.g., rows in a table). | Information about data (e.g., schema, partitions). |
| Stored in Parquet files. | Stored in the Delta Log (`_delta_log/`). |
| Example: `{"id": 1, "name": "Alice"}`. | Example: `{"schema": "id INT, name STRING"}`. |
3. Managed vs. External Tables
| Managed Table | External Table |
|---|---|
| Databricks controls lifecycle (`DROP` deletes data). | User manages data (`DROP` only removes metadata). |
| Stored in the DBFS root (e.g., `dbfs:/user/hive/warehouse/`). | Stored in user-defined cloud storage (e.g., S3, ADLS). |
| Use Case: Ephemeral data (e.g., ETL intermediates). | Use Case: Shared datasets (e.g., raw zone in a Lakehouse). |
Creating a Managed Table
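A minimal sketch with hypothetical names — note the absence of a `LOCATION` clause, which is what makes the table managed:

```sql
-- Files land under dbfs:/user/hive/warehouse/ by default.
CREATE TABLE managed_events (
  id INT,
  name STRING
) USING DELTA;
```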
Identifying Table Location
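One way to check where (and whether) Databricks manages a table's files, again with a hypothetical table name:

```sql
DESCRIBE DETAIL managed_events;    -- the `location` column shows the storage path
DESCRIBE EXTENDED managed_events;  -- the `Type` row shows MANAGED or EXTERNAL
```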
4. Delta Lake File Structure
- Directory Layout: data files sit alongside a transaction-log directory.
- Key Files:
  - `_delta_log/000000N.json`: Transaction metadata.
  - `*.parquet`: Data files (optimized by `OPTIMIZE`).
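A typical on-disk layout might look like the sketch below (file names illustrative):

```
my_table/
├── _delta_log/
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
├── part-00000-...-c000.snappy.parquet
└── part-00001-...-c000.snappy.parquet
```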
5. Time Travel & Versioning
Viewing History
Rolling Back
Querying a Version
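The three operations above can be sketched as follows (table name and version number hypothetical):

```sql
-- Viewing history
DESCRIBE HISTORY my_table;

-- Rolling back
RESTORE TABLE my_table TO VERSION AS OF 5;

-- Querying a version
SELECT * FROM my_table VERSION AS OF 5;
```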
6. Optimization Techniques
Z-Ordering
- Benefit: Co-locates related data (improves query speed for filters).
Vacuum
- Deletes Files: Removes data files from old versions that fall outside the retention period.
Optimize (File Compaction)
- Merges small files into larger ones (improves read performance).
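The three techniques above, as a SQL sketch (table and column names hypothetical):

```sql
-- Compact small files and co-locate rows on a commonly filtered column
OPTIMIZE my_table ZORDER BY (event_date);

-- Remove unreferenced files older than the retention period (default 7 days)
VACUUM my_table RETAIN 168 HOURS;
```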
7. Table Management
CTAS (Create Table As Select)
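A minimal CTAS sketch (names hypothetical); the new table's schema is inferred from the query:

```sql
CREATE TABLE daily_sales AS
SELECT order_date, SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;
```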
Generated Columns
- Automatically computed from other columns.
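A sketch of a generated column in Delta SQL (names hypothetical):

```sql
CREATE TABLE events (
  event_time TIMESTAMP,
  -- derived from event_time; never written directly
  event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
) USING DELTA;
```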
Table Comments
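Comments can be attached at or after creation; a sketch with hypothetical names:

```sql
COMMENT ON TABLE events IS 'Raw events ingested from the web app';

-- or inline at creation time
CREATE TABLE events_v2 (id INT) COMMENT 'Hypothetical example table';
```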
8. CREATE OR REPLACE vs. INSERT OVERWRITE

| CREATE OR REPLACE TABLE | INSERT OVERWRITE TABLE |
|---|---|
| Drops and recreates the table (resets metadata). | Overwrites data but preserves metadata (e.g., partitions). |
| Use Case: Schema changes. | Use Case: Full refreshes without dropping. |
9. MERGE Command
When to Use
- Upserts: Insert new records + update existing ones.
- Deduplication: Merge only latest records.
Benefits
- Single atomic operation (avoids separate `INSERT`/`UPDATE` statements).
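A minimal upsert sketch (table names hypothetical; `UPDATE SET *` / `INSERT *` assume source and target share a schema):

```sql
MERGE INTO customers AS t
USING customer_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```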
10. COPY INTO
Why No Duplicates?
- Idempotent: Skips already loaded files (checks file names/sizes).
When to Use
- Incremental file ingestion (e.g., new CSV files in S3).
Example
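A sketch with a hypothetical bucket path:

```sql
COPY INTO sales
FROM 's3://my-bucket/landing/sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');
-- Re-running this statement skips files that were already loaded.
```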
11. Delta Live Tables (DLT) Pipelines
Components
- Notebook Libraries: Define transformations (Python/SQL).
- Target: Destination dataset (e.g., Delta table).
- Pipeline Settings: Trigger mode, clusters, storage.
Triggered vs. Continuous
| Triggered | Continuous |
|---|---|
| Runs on-demand/schedule. | Real-time (low latency). |
| Cost: Lower (no always-on cluster). | Cost: Higher (always-running cluster). |
12. Auto Loader
Identifying Auto Loader Usage
- Source Location: Uses the `cloudFiles` format.
When to Use
- Incremental file processing (e.g., streaming JSON logs).
Inferred Schema as STRING
- Default Behavior: Auto Loader infers all fields as
STRING
for flexibility. - Fix: Provide schema or use
schemaEvolutionMode
.
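A DLT SQL sketch of Auto Loader ingestion (path and names hypothetical); here `cloudFiles.inferColumnTypes` asks Auto Loader to infer real types instead of defaulting to `STRING`:

```sql
CREATE OR REFRESH STREAMING LIVE TABLE raw_logs
AS SELECT * FROM cloud_files(
  's3://my-bucket/logs/',
  'json',
  map('cloudFiles.inferColumnTypes', 'true')
);
```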
13. Constraint Violations
Default Behavior
- `ON VIOLATION FAIL UPDATE`: Aborts the job on violation.
- `ON VIOLATION DROP ROW`: Silently drops invalid rows.
Example
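A DLT expectation sketch (names hypothetical) showing both violation policies:

```sql
CREATE OR REFRESH LIVE TABLE clean_orders (
  CONSTRAINT valid_id    EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_total EXPECT (total >= 0)           ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM LIVE.raw_orders;
```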
14. Change Data Capture (CDC)
APPLY CHANGES INTO
- Behavior: Merges CDC data (inserts/updates/deletes).
- Example:
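A sketch of the command (dataset and column names hypothetical):

```sql
APPLY CHANGES INTO LIVE.customers_silver
FROM STREAM(LIVE.customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY event_ts            -- orders changes so the latest one wins
COLUMNS * EXCEPT (operation, event_ts);
```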
15. DLT Troubleshooting
Common Issues
- Missing `LIVE` Keyword: References to other datasets in the pipeline must use the `LIVE.` prefix (e.g., `LIVE.bronze_table`).
- Missing `STREAM` for Streaming: Streaming sources must be read with `STREAM(...)`.
- Error Identification: Check the Events tab in the DLT UI for failed notebooks.
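A correctly formed DLT SQL query, as a sketch with hypothetical dataset names, showing both the `LIVE.` prefix and `STREAM(...)`:

```sql
CREATE OR REFRESH STREAMING LIVE TABLE silver_events
AS SELECT * FROM STREAM(LIVE.bronze_events);
```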
Section 3: Incremental Data Processing - Detailed Expanded Notes
1. Delta Lake ACID Transactions
What are ACID Transactions?
ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee database transactions are processed reliably. In Delta Lake:
- Atomicity: Ensures all operations in a transaction complete successfully or none do.
  - Example: If an `UPDATE` fails midway, no partial changes are committed.
- Consistency: Ensures data meets all validation rules before/after transactions.
- Isolation: Concurrent transactions don't interfere with each other.
- Durability: Once committed, changes persist even after system failures.
Where Delta Lake Provides ACID
- All writes (`INSERT`, `UPDATE`, `MERGE`, `DELETE`)
- Schema changes (e.g., `ALTER TABLE`)
- Partition updates
Benefits of ACID in Delta Lake
- Safe Concurrent Writes: Multiple users can modify data simultaneously without corruption.
- Rollback Capability: Failed transactions leave no partial data.
- Time Travel: View data at any point in time (leverages transaction log).
How to Verify ACID Compliance
Check the Delta transaction log in `_delta_log/`; each JSON file there represents one ACID-compliant transaction.
2. Data vs. Metadata
What is Data vs. Metadata?
- Data: The actual content stored in tables (e.g., rows like
{"id": 1, "name": "Alice"}
). - Metadata: Information about the data:
- Schema (column names, types)
- Partitioning scheme
- Table properties (owner, description)
Key Differences
| Data | Metadata |
|---|---|
| Stored in Parquet files | Stored in the Delta Log (`_delta_log/`) |
| Large volume (TB/PB) | Small (KB/MB) |
| Accessed via queries | Managed by Delta Lake |
Example
When you run `DESCRIBE TABLE`, you're viewing metadata. When you `SELECT *`, you're accessing data.
3. Managed vs. External Tables
What are Managed & External Tables?
- Managed Tables:
  - Databricks controls both metadata and data.
  - Dropping the table (`DROP TABLE`) deletes the underlying files.
  - Stored in the DBFS root by default (`dbfs:/user/hive/warehouse/`).
- External Tables:
  - Databricks manages only metadata.
  - Dropping the table keeps the data files in your cloud storage.
  - Path points to external locations (e.g., `s3://bucket/path/`).
When to Use External Tables
- Shared Data: Multiple teams/processes access the same files.
- Avoid Accidental Deletion: Critical datasets shouldn't vanish if a table is dropped.
- Cost Savings: Store cold data in cheaper object storage.
How to Create
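A sketch of both variants (names and bucket path hypothetical):

```sql
-- Managed: no LOCATION clause; Databricks owns the files
CREATE TABLE managed_tbl (id INT) USING DELTA;

-- External: data lives at a user-supplied path; DROP TABLE keeps the files
CREATE TABLE external_tbl (id INT) USING DELTA
LOCATION 's3://my-bucket/tables/external_tbl/';
```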
4. Delta Lake File Structure
What is the Directory Structure?
Delta tables organize files for reliability and performance:
Key Components
- Delta Log (`_delta_log/`):
  - Tracks all changes (ACID compliance).
  - Enables time travel.
- Data Files (`.parquet`):
  - Actual data in columnar format.
  - Optimized via `ZORDER` and `OPTIMIZE`.
5. Time Travel & Versioning
What is Time Travel?
Delta Lake's ability to query or restore data from a previous state using version history.
How It Works
- View History:
- Query Past Data:
- Restore a Version:
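The three steps above can be sketched as follows (table name, version number, and timestamp hypothetical):

```sql
-- 1. View history
DESCRIBE HISTORY my_table;

-- 2. Query past data, by version or by timestamp
SELECT * FROM my_table VERSION AS OF 3;
SELECT * FROM my_table TIMESTAMP AS OF '2024-01-01';

-- 3. Restore a version
RESTORE TABLE my_table TO VERSION AS OF 3;
```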
Use Cases
- Debugging ("What changed?")
- Recovery from accidental deletes.
- Reproducing reports from historical data.
6. Optimization Techniques
Z-Ordering
What: Co-locates related data physically (like an index).
Why: Speeds up queries filtering on Z-Ordered columns.
Vacuum
What: Removes old files no longer referenced.
Why: Reduces storage costs.
OPTIMIZE
What: Compacts small files into larger ones.
Why: Improves read performance.
7. Table Management
CTAS (Create Table As Select)
What: Creates a new table from query results.
Why: Simplifies ETL pipelines.
Generated Columns
What: Columns computed from other columns.
Why: Avoids redundant storage.
8. MERGE Command
What is MERGE?
A SQL operation that combines INSERT
, UPDATE
, and DELETE
in a single atomic transaction.
When to Use
- Upserts: Insert new records or update existing ones.
- CDC (Change Data Capture): Apply incremental changes.
- Deduplication: Keep only the latest records.
Example
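A CDC-style sketch combining all three actions (table and column names hypothetical; `op` is an assumed change-type column in the source):

```sql
MERGE INTO customers AS t
USING customer_changes AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED AND s.op != 'DELETE' THEN
  INSERT (customer_id, name) VALUES (s.customer_id, s.name);
```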
9. Delta Live Tables (DLT)
What is DLT?
A declarative framework for building reliable ETL pipelines with built-in quality controls.
Key Components
- Notebook Libraries: Define transformations (Python/SQL).
- Target: Output datasets (Delta tables).
- Pipeline Settings:
- Trigger mode (scheduled vs. continuous).
- Data quality rules (expectations).
Triggered vs. Continuous
Triggered | Continuous |
---|---|
Runs on schedule | Real-time processing |
Lower cost | Higher cost (always-on cluster) |
Higher latency | Low latency |
10. Auto Loader
What is Auto Loader?
A scalable file ingestion tool that incrementally processes new files in cloud storage.
When to Use
- Streaming file ingestion (e.g., log files, CDC data).
- Avoiding reprocessing of already loaded files.