Incremental Data Processing
1. Delta Lake ACID Transactions
Where ACID is Provided
- Delta Lake ensures ACID (Atomicity, Consistency, Isolation, Durability) for:
  - Writes: `INSERT`, `UPDATE`, `MERGE`, `DELETE`.
  - Reads: Concurrent readers see consistent snapshots.
  - Metadata Operations: Schema changes, partition updates.
Benefits of ACID
- Atomicity: All operations in a transaction succeed or fail together.
- Consistency: Valid data state after each transaction.
- Isolation: Concurrent transactions don't interfere.
- Durability: Committed data survives failures.
Checking ACID Compliance
- Delta Log: Transactions are logged in `_delta_log/` (JSON files).
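The transaction log can be inspected from SQL; a minimal sketch, assuming a hypothetical table name:

```sql
-- Each row of the history corresponds to one JSON commit file under _delta_log/.
DESCRIBE HISTORY my_table;
```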
2. Data vs. Metadata
| Data | Metadata |
|---|---|
| Raw content (e.g., rows in a table). | Information about data (e.g., schema, partitions). |
| Stored in Parquet files. | Stored in the Delta Log (`_delta_log/`). |
| Example: `{"id": 1, "name": "Alice"}`. | Example: `{"schema": "id INT, name STRING"}`. |
3. Managed vs. External Tables
| Managed Table | External Table |
|---|---|
| Databricks controls lifecycle (`DROP` deletes data). | User manages data (`DROP` only removes metadata). |
| Stored in the DBFS root (e.g., `dbfs:/user/hive/warehouse/`). | Stored in user-defined cloud storage (e.g., S3, ADLS). |
| Use Case: Ephemeral data (e.g., ETL intermediates). | Use Case: Shared datasets (e.g., raw zone in a Lakehouse). |
Creating a Managed Table
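A minimal sketch with hypothetical names — note the absence of a `LOCATION` clause, which is what makes the table managed:

```sql
-- Files land under dbfs:/user/hive/warehouse/ by default.
CREATE TABLE managed_events (
  id INT,
  name STRING
) USING DELTA;
```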
Identifying Table Location
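One way to check where (and whether) Databricks manages a table's files, again with a hypothetical table name:

```sql
DESCRIBE DETAIL managed_events;    -- the `location` column shows the storage path
DESCRIBE EXTENDED managed_events;  -- the `Type` row shows MANAGED or EXTERNAL
```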
4. Delta Lake File Structure
- Directory Layout: data files sit alongside a transaction-log directory.
- Key Files:
  - `_delta_log/000000N.json`: Transaction metadata.
  - `*.parquet`: Data files (optimized by `OPTIMIZE`).
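A typical on-disk layout might look like the sketch below (file names illustrative):

```
my_table/
├── _delta_log/
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
├── part-00000-...-c000.snappy.parquet
└── part-00001-...-c000.snappy.parquet
```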
5. Time Travel & Versioning
Viewing History
Rolling Back
Querying a Version
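The three operations above can be sketched as follows (table name and version number hypothetical):

```sql
-- Viewing history
DESCRIBE HISTORY my_table;

-- Rolling back
RESTORE TABLE my_table TO VERSION AS OF 5;

-- Querying a version
SELECT * FROM my_table VERSION AS OF 5;
```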
6. Optimization Techniques
Z-Ordering
- Benefit: Co-locates related data (improves query speed for filters).
Vacuum
- Deletes Files: Removes data files from old versions that fall outside the retention period.
Optimize (File Compaction)
- Merges small files into larger ones (improves read performance).
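The three techniques above, as a SQL sketch (table and column names hypothetical):

```sql
-- Compact small files and co-locate rows on a commonly filtered column
OPTIMIZE my_table ZORDER BY (event_date);

-- Remove unreferenced files older than the retention period (default 7 days)
VACUUM my_table RETAIN 168 HOURS;
```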
7. Table Management
CTAS (Create Table As Select)
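A minimal CTAS sketch (names hypothetical); the new table's schema is inferred from the query:

```sql
CREATE TABLE daily_sales AS
SELECT order_date, SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;
```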
Generated Columns
- Automatically computed from other columns.
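A sketch of a generated column in Delta SQL (names hypothetical):

```sql
CREATE TABLE events (
  event_time TIMESTAMP,
  -- derived from event_time; never written directly
  event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
) USING DELTA;
```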
Table Comments
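Comments can be attached at or after creation; a sketch with hypothetical names:

```sql
COMMENT ON TABLE events IS 'Raw events ingested from the web app';

-- or inline at creation time
CREATE TABLE events_v2 (id INT) COMMENT 'Hypothetical example table';
```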
8. CREATE OR REPLACE vs. INSERT OVERWRITE

| CREATE OR REPLACE TABLE | INSERT OVERWRITE TABLE |
|---|---|
| Drops and recreates the table (resets metadata). | Overwrites data but preserves metadata (e.g., partitions). |
| Use Case: Schema changes. | Use Case: Full refreshes without dropping. |
9. MERGE Command
When to Use
- Upserts: Insert new records + update existing ones.
- Deduplication: Merge only latest records.
Benefits
- Single atomic operation (avoids separate `INSERT`/`UPDATE` statements).
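A minimal upsert sketch (table names hypothetical; `UPDATE SET *` / `INSERT *` assume source and target share a schema):

```sql
MERGE INTO customers AS t
USING customer_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```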
10. COPY INTO
Why No Duplicates?
- Idempotent: Skips already loaded files (checks file names/sizes).
When to Use
- Incremental file ingestion (e.g., new CSV files in S3).
Example
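A sketch with a hypothetical bucket path:

```sql
COPY INTO sales
FROM 's3://my-bucket/landing/sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');
-- Re-running this statement skips files that were already loaded.
```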
11. Delta Live Tables (DLT) Pipelines
Components
- Notebook Libraries: Define transformations (Python/SQL).
- Target: Destination dataset (e.g., Delta table).
- Pipeline Settings: Trigger mode, clusters, storage.
Triggered vs. Continuous
| Triggered | Continuous |
|---|---|
| Runs on-demand/schedule. | Real-time (low latency). |
| Cost: Lower (no always-on cluster). | Cost: Higher (always-running cluster). |
12. Auto Loader
Identifying Auto Loader Usage
- Source Location: Uses the `cloudFiles` format.
When to Use
- Incremental file processing (e.g., streaming JSON logs).
Inferred Schema as STRING
- Default Behavior: Auto Loader infers all fields as
STRING
for flexibility. - Fix: Provide schema or use
schemaEvolutionMode
.
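A DLT SQL sketch of Auto Loader ingestion (path and names hypothetical); here `cloudFiles.inferColumnTypes` asks Auto Loader to infer real types instead of defaulting to `STRING`:

```sql
CREATE OR REFRESH STREAMING LIVE TABLE raw_logs
AS SELECT * FROM cloud_files(
  's3://my-bucket/logs/',
  'json',
  map('cloudFiles.inferColumnTypes', 'true')
);
```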
13. Constraint Violations
Default Behavior
- `ON VIOLATION FAIL UPDATE`: Aborts the job on violation.
- `ON VIOLATION DROP ROW`: Silently drops invalid rows.
Example
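A DLT expectation sketch (names hypothetical) showing both violation policies:

```sql
CREATE OR REFRESH LIVE TABLE clean_orders (
  CONSTRAINT valid_id    EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_total EXPECT (total >= 0)           ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM LIVE.raw_orders;
```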
14. Change Data Capture (CDC)
APPLY CHANGES INTO
- Behavior: Merges CDC data (inserts/updates/deletes).
- Example:
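A sketch of the command (dataset and column names hypothetical):

```sql
APPLY CHANGES INTO LIVE.customers_silver
FROM STREAM(LIVE.customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY event_ts            -- orders changes so the latest one wins
COLUMNS * EXCEPT (operation, event_ts);
```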
15. DLT Troubleshooting
Common Issues
- Missing `LIVE` Keyword: References to other datasets in the pipeline must use the `LIVE.` prefix (e.g., `LIVE.bronze_table`).
- Missing `STREAM` for Streaming: Streaming sources must be read with `STREAM(...)`.
- Error Identification: Check the Events tab in the DLT UI for failed notebooks.
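A correctly formed DLT SQL query, as a sketch with hypothetical dataset names, showing both the `LIVE.` prefix and `STREAM(...)`:

```sql
CREATE OR REFRESH STREAMING LIVE TABLE silver_events
AS SELECT * FROM STREAM(LIVE.bronze_events);
```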
Section 3: Incremental Data Processing - Detailed Expanded Notes
1. Delta Lake ACID Transactions
What are ACID Transactions?
ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee database transactions are processed reliably. In Delta Lake:
- Atomicity: Ensures all operations in a transaction complete successfully or none do.
  - Example: If an `UPDATE` fails midway, no partial changes are committed.
- Consistency: Ensures data meets all validation rules before/after transactions.
- Isolation: Concurrent transactions don't interfere with each other.
- Durability: Once committed, changes persist even after system failures.
Where Delta Lake Provides ACID
- All writes (`INSERT`, `UPDATE`, `MERGE`, `DELETE`)
- Schema changes (e.g., `ALTER TABLE`)
- Partition updates
Benefits of ACID in Delta Lake
- Safe Concurrent Writes: Multiple users can modify data simultaneously without corruption.
- Rollback Capability: Failed transactions leave no partial data.
- Time Travel: View data at any point in time (leverages transaction log).
How to Verify ACID Compliance
Check the Delta transaction log in `_delta_log/`; each JSON file there represents one ACID-compliant transaction.
2. Data vs. Metadata
What is Data vs. Metadata?
- Data: The actual content stored in tables (e.g., rows like
{"id": 1, "name": "Alice"}
). - Metadata: Information about the data:
- Schema (column names, types)
- Partitioning scheme
- Table properties (owner, description)
Key Differences
| Data | Metadata |
|---|---|
| Stored in Parquet files | Stored in the Delta Log (`_delta_log/`) |
| Large volume (TB/PB) | Small (KB/MB) |
| Accessed via queries | Managed by Delta Lake |
Example
When you run `DESCRIBE TABLE`, you're viewing metadata. When you `SELECT *`, you're accessing data.
3. Managed vs. External Tables
What are Managed & External Tables?
- Managed Tables:
  - Databricks controls both metadata and data.
  - Dropping the table (`DROP TABLE`) deletes the underlying files.
  - Stored in the DBFS root by default (`dbfs:/user/hive/warehouse/`).
- External Tables:
  - Databricks manages only metadata.
  - Dropping the table keeps the data files in your cloud storage.
  - Path points to external locations (e.g., `s3://bucket/path/`).
When to Use External Tables
- Shared Data: Multiple teams/processes access the same files.
- Avoid Accidental Deletion: Critical datasets shouldn't vanish if a table is dropped.
- Cost Savings: Store cold data in cheaper object storage.
How to Create
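A sketch of both variants (names and bucket path hypothetical):

```sql
-- Managed: no LOCATION clause; Databricks owns the files
CREATE TABLE managed_tbl (id INT) USING DELTA;

-- External: data lives at a user-supplied path; DROP TABLE keeps the files
CREATE TABLE external_tbl (id INT) USING DELTA
LOCATION 's3://my-bucket/tables/external_tbl/';
```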
4. Delta Lake File Structure
What is the Directory Structure?
Delta tables organize files for reliability and performance:
Key Components
- Delta Log (`_delta_log/`):
  - Tracks all changes (ACID compliance).
  - Enables time travel.
- Data Files (`.parquet`):
  - Actual data in columnar format.
  - Optimized via `ZORDER` and `OPTIMIZE`.
5. Time Travel & Versioning
What is Time Travel?
Delta Lake's ability to query or restore data from a previous state using version history.
How It Works
- View History:
- Query Past Data:
- Restore a Version:
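The three steps above can be sketched as follows (table name, version number, and timestamp hypothetical):

```sql
-- 1. View history
DESCRIBE HISTORY my_table;

-- 2. Query past data, by version or by timestamp
SELECT * FROM my_table VERSION AS OF 3;
SELECT * FROM my_table TIMESTAMP AS OF '2024-01-01';

-- 3. Restore a version
RESTORE TABLE my_table TO VERSION AS OF 3;
```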
Use Cases
- Debugging ("What changed?")
- Recovery from accidental deletes.
- Reproducing reports from historical data.
6. Optimization Techniques
Z-Ordering
What: Co-locates related data physically (like an index).
Why: Speeds up queries filtering on Z-Ordered columns.
Vacuum
What: Removes old files no longer referenced.
Why: Reduces storage costs.
OPTIMIZE
What: Compacts small files into larger ones.
Why: Improves read performance.
7. Table Management
CTAS (Create Table As Select)
What: Creates a new table from query results.
Why: Simplifies ETL pipelines.
Generated Columns
What: Columns computed from other columns.
Why: Avoids redundant storage.
8. MERGE Command
What is MERGE?
A SQL operation that combines INSERT
, UPDATE
, and DELETE
in a single atomic transaction.
When to Use
- Upserts: Insert new records or update existing ones.
- CDC (Change Data Capture): Apply incremental changes.
- Deduplication: Keep only the latest records.
Example
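A CDC-style sketch combining all three actions (table and column names hypothetical; `op` is an assumed change-type column in the source):

```sql
MERGE INTO customers AS t
USING customer_changes AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED AND s.op != 'DELETE' THEN
  INSERT (customer_id, name) VALUES (s.customer_id, s.name);
```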
9. Delta Live Tables (DLT)
What is DLT?
A declarative framework for building reliable ETL pipelines with built-in quality controls.
Key Components
- Notebook Libraries: Define transformations (Python/SQL).
- Target: Output datasets (Delta tables).
- Pipeline Settings:
- Trigger mode (scheduled vs. continuous).
- Data quality rules (expectations).
Triggered vs. Continuous
Triggered | Continuous |
---|---|
Runs on schedule | Real-time processing |
Lower cost | Higher cost (always-on cluster) |
Higher latency | Low latency |
10. Auto Loader
What is Auto Loader?
A scalable file ingestion tool that incrementally processes new files in cloud storage.
When to Use
- Streaming file ingestion (e.g., log files, CDC data).
- Avoiding reprocessing of already loaded files.