1. Delta Lake ACID Transactions

Where ACID is Provided

  • Delta Lake ensures ACID (Atomicity, Consistency, Isolation, Durability) for:
    • Writes: INSERT, UPDATE, MERGE, DELETE.
    • Reads: Concurrent readers see consistent snapshots.
    • Metadata Operations: Schema changes, partition updates.

Benefits of ACID

  • Atomicity: All operations in a transaction succeed or fail together.
  • Consistency: Valid data state after each transaction.
  • Isolation: Concurrent transactions don’t interfere.
  • Durability: Committed data survives failures.

Checking ACID Compliance

  • Delta Log: Transactions are logged in _delta_log/ (JSON files).
  • Example:
    DESCRIBE HISTORY my_table; -- Shows transaction log
    

2. Data vs. Metadata

Data                                   | Metadata
---------------------------------------|----------------------------------------------------
Raw content (e.g., rows in a table).   | Information about data (e.g., schema, partitions).
Stored in Parquet files.               | Stored in the Delta log (_delta_log/).
Example: {"id": 1, "name": "Alice"}.   | Example: {"schema": "id INT, name STRING"}.

3. Managed vs. External Tables

Managed Table                                               | External Table
------------------------------------------------------------|------------------------------------------------------------
Databricks controls the lifecycle (DROP deletes data).      | User manages the data (DROP only removes metadata).
Stored in the DBFS root (e.g., dbfs:/user/hive/warehouse/). | Stored in user-defined cloud storage (e.g., S3, ADLS).
Use Case: Ephemeral data (e.g., ETL intermediates).         | Use Case: Shared datasets (e.g., raw zone in a Lakehouse).

Creating a Managed Table

CREATE TABLE managed_sales (id INT, amount DOUBLE);

Identifying Table Location

DESCRIBE EXTENDED my_table; -- Shows `Location`

4. Delta Lake File Structure

  • Directory Layout:
    /delta/table_name/
      β”œβ”€β”€ _delta_log/       # Transaction logs (JSON)
      β”œβ”€β”€ part-00000.parquet # Data files
      └── part-00001.parquet
    
  • Key Files:
    • _delta_log/<version>.json: Transaction metadata, one zero-padded file per commit (e.g., 00000000000000000000.json).
    • *.parquet: Data files (compacted by OPTIMIZE).

5. Time Travel & Versioning

Viewing History

DESCRIBE HISTORY my_table; -- Shows versions, timestamps, users

Rolling Back

RESTORE TABLE my_table TO VERSION AS OF 3; -- Reverts to version 3

Querying a Version

SELECT * FROM my_table VERSION AS OF 2; -- Time travel query

6. Optimization Techniques

Z-Ordering

  • Benefit: Co-locates related data (improves query speed for filters).
    OPTIMIZE my_table ZORDER BY (date); -- Speeds up `WHERE date = '2023-01-01'`
    

Vacuum

  • Deletes Files: Removes data files that are no longer referenced and are older than the retention period (this also limits how far back time travel can go).
    VACUUM my_table RETAIN 168 HOURS; -- Keeps 7 days of history
    

Optimize (File Compaction)

  • Merges small files into larger ones (improves read performance).
    OPTIMIZE my_table;
    

7. Table Management

CTAS (Create Table As Select)

CREATE TABLE new_table AS SELECT * FROM old_table WHERE amount > 0;

Generated Columns

  • Automatically computed from other columns.
    CREATE TABLE sales (
      id INT,
      amount DOUBLE,
      tax DOUBLE GENERATED ALWAYS AS (amount * 0.1)
    );
    

Table Comments

COMMENT ON TABLE my_table IS 'Stores cleaned sales data';

8. CREATE OR REPLACE vs. INSERT OVERWRITE

CREATE OR REPLACE TABLE                                     | INSERT OVERWRITE TABLE
------------------------------------------------------------|------------------------------------------------------------
Replaces the table definition and data (schema and          | Overwrites the data but preserves metadata (e.g., schema,
properties are reset).                                      | partitions).
Use Case: Schema changes.                                   | Use Case: Full refreshes without dropping.
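
A sketch of the two refresh styles side by side, run via spark.sql; sales_refresh and staged_sales are hypothetical names:

# redefines the table: schema and properties are reset,
# but the change is still a new version in the Delta log
spark.sql("""
  CREATE OR REPLACE TABLE sales_refresh AS
  SELECT * FROM staged_sales
""")

# replaces the data only: the schema must match, and
# table metadata (comments, grants) is preserved
spark.sql("""
  INSERT OVERWRITE TABLE sales_refresh
  SELECT * FROM staged_sales
""")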

9. MERGE Command

When to Use

  • Upserts: Insert new records + update existing ones.
  • Deduplication: Keep only the latest version of each record.

Benefits

  • Single atomic operation (avoids separate INSERT/UPDATE).
  • Example:
    MERGE INTO target t
    USING source s ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;
    

10. COPY INTO

Why No Duplicates?

  • Idempotent: COPY INTO tracks which files it has already loaded and skips them, so re-running the command never creates duplicates.

When to Use

  • Incremental file ingestion (e.g., new CSV files in S3).

Example

COPY INTO sales FROM '/path/to/files/' FILEFORMAT = CSV;
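
Because the command tracks loaded files, re-running it is safe; a minimal sketch issuing the same statement twice (path as above):

load_cmd = """
  COPY INTO sales
  FROM '/path/to/files/'
  FILEFORMAT = CSV
"""
spark.sql(load_cmd)  # first run: ingests every new file
spark.sql(load_cmd)  # second run: no-op, already-loaded files are skipped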

11. Delta Live Tables (DLT) Pipelines

Components

  1. Notebook Libraries: Define transformations (Python/SQL).
  2. Target: Destination dataset (e.g., Delta table).
  3. Pipeline Settings: Trigger mode, clusters, storage.

Triggered vs. Continuous

Triggered                             | Continuous
--------------------------------------|------------------------------------------
Runs on demand or on a schedule.      | Real-time (low latency).
Cost: Lower (no always-on cluster).   | Cost: Higher (always-running cluster).

12. Auto Loader

Identifying Auto Loader Usage

  • Source Format: Auto Loader reads are identified by the cloudFiles source format.
    df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/path/"))
    

When to Use

  • Incremental file processing (e.g., streaming JSON logs).

Inferred Schema as STRING

  • Default Behavior: For JSON and CSV sources, Auto Loader infers every column as STRING, the safest type when values vary.
  • Fix: Provide an explicit schema, set cloudFiles.inferColumnTypes to true, or pin individual columns with cloudFiles.schemaHints (see the sketch below).
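
A sketch of those fixes in one reader; cloudFiles.inferColumnTypes, cloudFiles.schemaHints, and cloudFiles.schemaLocation are real Auto Loader options, while the paths and column names are hypothetical:

df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.inferColumnTypes", "true")                 # infer real types instead of all-STRING
  .option("cloudFiles.schemaHints", "id BIGINT, amount DOUBLE")  # or pin known columns explicitly
  .option("cloudFiles.schemaLocation", "/raw/_schemas/events/")  # where the inferred schema is persisted
  .load("/raw/events/"))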

13. Constraint Violations

Violation Handling

  • Default (no ON VIOLATION clause): Invalid rows are kept in the target, and the violations are recorded in the pipeline's data quality metrics.
  • ON VIOLATION DROP ROW: Silently drops invalid rows (counts still appear in metrics).
  • ON VIOLATION FAIL UPDATE: Aborts the pipeline update on the first violation.

Example

CREATE OR REFRESH LIVE TABLE sales (
  CONSTRAINT valid_id EXPECT (id > 0) ON VIOLATION FAIL UPDATE
) AS SELECT * FROM LIVE.raw_sales;  -- raw_sales: an upstream dataset in the same pipeline

14. Change Data Capture (CDC)

APPLY CHANGES INTO

  • Behavior: Merges CDC data (inserts/updates/deletes).
  • Example:
    APPLY CHANGES INTO LIVE.target_table
    FROM STREAM(LIVE.source_stream)
    KEYS (id)
    APPLY AS DELETE WHEN operation = 'DELETE'
    SEQUENCE BY event_time;  -- required ordering column (name illustrative)
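
The Python equivalent in a DLT pipeline goes through dlt.apply_changes; a sketch assuming the source carries an event_time ordering column (dlt.create_streaming_table is the current API name):

import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table("target_table")

dlt.apply_changes(
    target = "target_table",
    source = "source_stream",          # another dataset in the same pipeline
    keys = ["id"],
    sequence_by = col("event_time"),   # resolves out-of-order change events
    apply_as_deletes = expr("operation = 'DELETE'"),
)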
    

15. DLT Troubleshooting

Common Issues

  • Missing LIVE Keyword (on the table and on references to pipeline datasets):
    CREATE LIVE TABLE my_table AS SELECT * FROM LIVE.source; -- Required in DLT
    
  • Missing STREAM for Streaming:
    FROM STREAM(LIVE.source_table) -- Required for streaming sources
    
  • Error Identification: Check Events tab in DLT UI for failed notebooks.

Section 3: Incremental Data Processing - Detailed Expanded Notes

1. Delta Lake ACID Transactions

What are ACID Transactions?

ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee database transactions are processed reliably. In Delta Lake:

  • Atomicity: Ensures all operations in a transaction complete successfully or none do.
    • Example: If an UPDATE fails midway, no partial changes are committed.
  • Consistency: Ensures data meets all validation rules before/after transactions.
  • Isolation: Concurrent transactions don’t interfere with each other.
  • Durability: Once committed, changes persist even after system failures.

Where Delta Lake Provides ACID

  • All writes (INSERT, UPDATE, MERGE, DELETE)
  • Schema changes (e.g., ALTER TABLE)
  • Partition updates

Benefits of ACID in Delta Lake

  1. Safe Concurrent Writes: Multiple users can modify data simultaneously without corruption.
  2. Rollback Capability: Failed transactions leave no partial data.
  3. Time Travel: View data at any point in time (leverages transaction log).

How to Verify ACID Compliance

Check the Delta transaction log:

dbfs:/delta/table_name/_delta_log/

Each JSON file represents an ACID-compliant transaction.
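
The log can also be inspected directly; a sketch that loads the commit JSON into a DataFrame (table path as above):

# each commit holds actions: add/remove (data files), metaData (schema),
# and commitInfo (operation, user, timestamp)
log = spark.read.json("dbfs:/delta/table_name/_delta_log/*.json")
(log.where("commitInfo IS NOT NULL")
    .select("commitInfo.operation", "commitInfo.timestamp")
    .show())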


2. Data vs. Metadata

What is Data vs. Metadata?

  • Data: The actual content stored in tables (e.g., rows like {"id": 1, "name": "Alice"}).
  • Metadata: Information about the data:
    • Schema (column names, types)
    • Partitioning scheme
    • Table properties (owner, description)

Key Differences

Data                        | Metadata
----------------------------|------------------------------------------
Stored in Parquet files     | Stored in the Delta log (_delta_log/)
Large volume (TB/PB)        | Small (KB/MB)
Accessed via queries        | Managed by Delta Lake

Example

When you run DESCRIBE TABLE, you’re viewing metadata. When you SELECT *, you’re accessing data.
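
Both sides are easy to poke at from a notebook: DESCRIBE DETAIL returns a one-row DataFrame of pure metadata, while spark.table scans the data files (the sales table name is illustrative):

spark.sql("DESCRIBE DETAIL sales").select("format", "location", "numFiles").show()  # metadata only
spark.table("sales").limit(5).show()  # touches the Parquet data files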


3. Managed vs. External Tables

What are Managed & External Tables?

  • Managed Tables:

    • Databricks controls both metadata and data.
    • Dropping the table (DROP TABLE) deletes the underlying files.
    • Stored in DBFS root by default (dbfs:/user/hive/warehouse/).
  • External Tables:

    • Databricks manages only metadata.
    • Dropping the table keeps the data files in your cloud storage.
    • Path points to external locations (e.g., s3://bucket/path/).

When to Use External Tables

  1. Shared Data: Multiple teams/processes access the same files.
  2. Avoid Accidental Deletion: Critical datasets shouldn’t vanish if a table is dropped.
  3. Cost Savings: Store cold data in cheaper object storage.

How to Create

-- Managed
CREATE TABLE managed_sales (id INT, amount DOUBLE);

-- External
CREATE TABLE external_sales (id INT, amount DOUBLE)
LOCATION 's3://bucket/sales/';
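
A quick way to tell the two apart from Python, using the catalog API (a sketch; table names from the example above):

# tableType comes back as 'MANAGED' or 'EXTERNAL'
for t in spark.catalog.listTables():
    print(t.name, t.tableType)

# or check the 'Type' row for a single table
spark.sql("DESCRIBE EXTENDED managed_sales").where("col_name = 'Type'").show()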

4. Delta Lake File Structure

What is the Directory Structure?

Delta tables organize files for reliability and performance:

/delta/table_name/
β”œβ”€β”€ _delta_log/       # Transaction logs (JSON)
β”‚   β”œβ”€β”€ 000000.json   # Version 0
β”‚   └── 000001.json   # Version 1
β”œβ”€β”€ part-00000.parquet # Data file 
└── part-00001.parquet

Key Components

  1. Delta Log (_delta_log/):
    • Tracks all changes (ACID compliance).
    • Enables time travel.
  2. Data Files (.parquet):
    • Actual data in columnar format.
    • Optimized via ZORDER and OPTIMIZE.
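
In a Databricks notebook, both halves of the layout are visible with the dbutils filesystem helper (path illustrative):

# data files alongside the transaction log directory
display(dbutils.fs.ls("/delta/table_name/"))

# one JSON commit file per transaction
display(dbutils.fs.ls("/delta/table_name/_delta_log/"))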

5. Time Travel & Versioning

What is Time Travel?

Delta Lake’s ability to query or restore data from a previous state using version history.

How It Works

  1. View History:
    DESCRIBE HISTORY sales;  -- Shows versions, timestamps, users
    
  2. Query Past Data:
    -- Query version 2
    SELECT * FROM sales VERSION AS OF 2;  
    
  3. Restore a Version:
    RESTORE TABLE sales TO VERSION AS OF 2;  -- Rollback
    

Use Cases

  • Debugging (β€œWhat changed?”)
  • Recovery from accidental deletes.
  • Reproducing reports from historical data.
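
The same reads are available from PySpark on recent Databricks runtimes; a minimal sketch (on older runtimes, use spark.read.format("delta").option(...).load(path) against the table path instead):

df_v2 = spark.read.option("versionAsOf", 2).table("sales")                # snapshot at version 2
df_old = spark.read.option("timestampAsOf", "2024-01-01").table("sales")  # snapshot at a point in time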

6. Optimization Techniques

Z-Ordering

What: Co-locates related data physically (like an index).
Why: Speeds up queries filtering on Z-Ordered columns.

OPTIMIZE sales ZORDER BY (date, customer_id);  

Vacuum

What: Removes old files no longer referenced.
Why: Reduces storage costs.

VACUUM sales RETAIN 168 HOURS;  -- Keeps 7 days of history

OPTIMIZE

What: Compacts small files into larger ones.
Why: Improves read performance.

OPTIMIZE sales;  

7. Table Management

CTAS (Create Table As Select)

What: Creates a new table from query results.
Why: Simplifies ETL pipelines.

CREATE TABLE high_value_sales AS 
SELECT * FROM sales WHERE amount > 1000;  

Generated Columns

What: Columns whose values are computed automatically from other columns at write time.
Why: Keeps derived values consistent without extra application logic.

CREATE TABLE sales (
  id INT,
  amount DOUBLE,
  tax DOUBLE GENERATED ALWAYS AS (amount * 0.1)  -- Auto-calculated
);  
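
Usage note: the generated column is computed at write time, so inserts supply only the base columns; a quick check (values illustrative):

spark.sql("INSERT INTO sales (id, amount) VALUES (1, 100.0)")  # tax is not supplied
spark.table("sales").show()  # tax comes back as 10.0, computed from amount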

8. MERGE Command

What is MERGE?

A SQL operation that combines INSERT, UPDATE, and DELETE in a single atomic transaction.

When to Use

  1. Upserts: Insert new records or update existing ones.
  2. CDC (Change Data Capture): Apply incremental changes.
  3. Deduplication: Keep only the latest records.

Example

MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;  
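
The same upsert through the Python DeltaTable API; a sketch assuming source_df is a DataFrame holding the incoming batch:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target")

(target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")  # join on the key
    .whenMatchedUpdateAll()       # UPDATE SET *
    .whenNotMatchedInsertAll()    # INSERT *
    .execute())                   # one atomic commit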

9. Delta Live Tables (DLT)

What is DLT?

A declarative framework for building reliable ETL pipelines with built-in quality controls.

Key Components

  1. Notebook Libraries: Define transformations (Python/SQL).
  2. Target: Output datasets (Delta tables).
  3. Pipeline Settings:
    • Trigger mode (scheduled vs. continuous).
    • Data quality rules (expectations).

Triggered vs. Continuous

Triggered                 | Continuous
--------------------------|----------------------------------
Runs on schedule          | Real-time processing
Lower cost                | Higher cost (always-on cluster)
Higher latency            | Low latency
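
A minimal DLT notebook in Python, combining @dlt.table with a data quality expectation (the source path and the rule are hypothetical):

import dlt

@dlt.table(comment="Cleaned sales records")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # drop rows that fail the rule
def clean_sales():
    # cloudFiles = Auto Loader; see section 10
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/raw/_schemas/sales/")
        .load("/raw/sales/"))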

10. Auto Loader

What is Auto Loader?

A scalable file ingestion tool that incrementally processes new files in cloud storage.

When to Use

  • Streaming file ingestion (e.g., log files, CDC data).
  • Avoiding reprocessing of already loaded files.

Example

df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "s3://logs/_schemas/")  # required when the schema is inferred
  .load("s3://logs/")
)
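
To land the stream in a Delta table, pair the read with writeStream plus a checkpoint; a sketch (sink name and paths hypothetical):

(df.writeStream
  .format("delta")
  .option("checkpointLocation", "s3://logs/_checkpoints/bronze/")  # remembers which files were processed
  .trigger(availableNow=True)   # drain everything new, then stop (batch-style)
  .toTable("bronze_logs"))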