Delta Lake
  • One layer of the Delta Lakehouse architecture.
  • Provides reliability and performance on top of a data lake.
  • Delta Lake is a protocol for reading and writing files in cloud object storage.
  • A lakehouse combines the benefits of a data lake and a data warehouse.
  • Open source
Delta Lake features
  • ACID transactions on Spark
  • Scalable metadata handling
  • Unifies streaming and batch data processing
  • Structured Streaming support
  • Schema enforcement and evolution
  • Time travel (data versioning)
  • Data skipping and Z-ordering
  • Inserts, updates, upserts, merges and deletes (see the MERGE sketch after this list)
  • Fully configurable/optimizable
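A minimal sketch of an upsert using MERGE (the table and column names are hypothetical):
MERGE INTO customers AS target
USING customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *      -- update rows that already exist
WHEN NOT MATCHED THEN INSERT *;     -- insert rows that are new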
Delta Lake components
  • Delta lake storage layer
    • It resides in cloud object storage (AWS S3, Azure Blob Storage, Google Cloud Storage)
    • Low-cost, easily scalable object storage.
    • Ensures consistency.
    • It allows for flexibility.
  • Delta tables
    • Data is stored in Parquet files kept in object storage.
    • Transaction logs (_delta_log) are kept alongside the data in object storage (JSON files).
    • Can optionally be registered in a metastore (Hive metastore, Unity Catalog).
  • Delta engine
    • File management optimizations. Keep statistics about the data in the parquet files.
    • Auto-optimized writes (small files are automatically combined toward a target file size of about 128 MB)
    • Performance optimization via Delta Disk caching (SSD)
    • Efficient data skipping, Z-ordering, bin-packing and Bloom filters (see the OPTIMIZE sketch after this list)
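A minimal sketch of the related file-management commands (table and column names are hypothetical):
OPTIMIZE sales;                          -- bin-packing: compact small files into fewer, larger files
OPTIMIZE sales ZORDER BY (customer_id);  -- Z-ordering: co-locate related values so data skipping prunes more files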
What is the Delta transaction log? (_delta_log/)
  • Ordered record of the transactions performed on a Delta table.
  • It is a directory of JSON files; each file is one transaction (see the inspection sketch after this list).
  • It is a single source of truth for the table.
  • The table is versioned: each transaction produces a new version of the table.
  • It is a mechanism that the Delta engine uses to guarantee ACID transactions.
  • .crc files contain metadata about each write and are used to verify the integrity of the corresponding JSON commit file.
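Since each commit is just a JSON file in object storage, it can be inspected directly. A minimal sketch using Spark SQL's ability to query files by path (the path and file name here are hypothetical):
-- Each row in the result corresponds to one logged action (add, remove, metaData, commitInfo, ...)
SELECT * FROM json.`/mnt/tables/sales/_delta_log/00000000000000000000.json`;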
How does the Transaction log work? Delta Lake breaks operations down into one or more of these steps (recorded in JSON files):
  • Add file - Add a new file to the table.
  • Remove file - Remove a file from the table.
  • Update Metadata - Add or update the metadata of the table.
  • Change Protocol - Add or update the protocol of the table.
  • Commit info - Record provenance information about the commit (user, timestamp, operation).
  • Set transaction - Record a transaction ID for a streaming write (enables idempotent writes). A rough mapping from SQL operations to these actions is sketched after this list.
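A minimal sketch of which actions common SQL operations typically record, assuming a hypothetical sales table (the exact set of actions varies by operation and table state):
CREATE TABLE sales (id INT, amount DOUBLE) USING DELTA;  -- Change Protocol + Update Metadata + Commit info
INSERT INTO sales VALUES (1, 10.0);                       -- Add file + Commit info
DELETE FROM sales WHERE id = 1;                           -- Remove file (plus Add file for rewritten data) + Commit info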
Delta transaction log at the file level
| my_table/
|-- _delta_log/
|   |-- 00000.json (Transaction log file)
|   |-- 00001.json
|-- date=2025-03-01/ (Partitioned by date. It is an optional directory)
|   |-- file-1.parquet (Data file)
|   |-- file-2.parquet
Directory Partitioning
  • Not to be confused with Memory-Partition which applies to Partitions of data in Spark Memory
  • Disk-Partitioning applies to Partitions of data on disk.
  • Disk-Partitioning uses a directory structure to aggregate ‘like’ data together. df.write.format("delta").partitionBy("region").save("/mnt/tables/sales")
  • The above command will create a directory structure like this (a SQL equivalent is sketched after this list):
    dbfs:/mnt/tables/sales/region=US/
    dbfs:/mnt/tables/sales/region=EU/
    dbfs:/mnt/tables/sales/region=APAC/
  • Each region has its own directory, and that region's data files are stored inside it.
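A SQL equivalent of the PySpark write above might look like this (the schema is a hypothetical sketch):
CREATE TABLE sales (id INT, amount DOUBLE, region STRING)
USING DELTA
PARTITIONED BY (region)            -- creates one region=<value>/ directory per distinct value
LOCATION '/mnt/tables/sales';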
Adding commits (JSON files) to the transaction log
  • Each commit is a JSON file that contains the metadata of the transaction.
  • The JSON file contains the following information:
    • The version of the table
    • The timestamp of the transaction
    • The operation performed (add, remove, update, etc.)
    • The files added or removed from the table. The data file paths are in the commit files (JSON files).
    • The metadata of the table (schema, partitioning, etc.)
e.g.
  00000.json - Add 1.parquet, Add 2.parquet
  00001.json - Remove 1.parquet, Remove 2.parquet, Add 3.parquet
DESCRIBE commands (SQL)
  • DESCRIBE returns basic metadata of an existing table.
    • Ex. Column names, data types, comments.
  • DESCRIBE EXTENDED returns the basic metadata plus detailed table information.
    • Ex. Provider/format, location, partitioning information, table properties.
  • DESCRIBE FORMATTED returns essentially the same detailed output as DESCRIBE EXTENDED.
  • DESCRIBE DETAIL (Delta-specific) returns a single row of table-level details.
    • Ex. Format, location, creation and last-modified time, partition columns, number of files, size in bytes, table properties.
  • DESCRIBE HISTORY (Delta-specific) returns the operation history of a Delta table.
    • One row per commit: version, timestamp, user, operation, operation parameters, operation metrics.
    • Useful for finding the VERSION or TIMESTAMP to use with time travel. Usage examples follow this list.
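Usage sketch (the table name is hypothetical):
DESCRIBE TABLE sales;      -- columns, data types, comments
DESCRIBE EXTENDED sales;   -- plus detailed table information (location, provider, properties)
DESCRIBE DETAIL sales;     -- Delta-specific: one row of table details (location, numFiles, sizeInBytes, ...)
DESCRIBE HISTORY sales;    -- Delta-specific: one row per commit (version, timestamp, operation, operationMetrics)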
Delta Lake Time Travel via versioning
  • When you query a Delta table, the output by default is the current version (tombstoned files are ignored).
  • However, time travel allows you to query an older snapshot of a Delta table. In effect, the JSON log files are used to read the ‘tombstoned’ Parquet files (files removed from the current version but still present on disk). These log files are retained for 30 days by default, so time travel is limited to roughly 30 days’ worth of logs.
  • The log retention duration is configurable (the delta.logRetentionDuration table property) if you want to keep the logs for longer.
SELECT * FROM sales TIMESTAMP AS OF '2025-03-01 00:00:00'
SELECT * FROM sales VERSION AS OF 1
  • How do you know which TIMESTAMP or VERSION # to use?
    • Use the DESCRIBE HISTORY command: for each commit it returns the version, timestamp, operation performed and files added or removed, so you can pick the version or timestamp to query.
Delta Lake time travel: query an older snapshot of a Delta table. Common use cases:
  • Re-creating analysis, reports or outputs
  • Reverting to a previous version of the table to fix mistakes (Oops! I deleted more rows than I should have!); see the RESTORE sketch after this list.
  • Auditing and compliance
  • Debugging and troubleshooting
  • Providing snapshot isolation
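A minimal sketch of rolling a table back, assuming the hypothetical sales table from earlier:
RESTORE TABLE sales TO VERSION AS OF 1;   -- TIMESTAMP AS OF '<timestamp>' also works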
VACUUM on Delta table
  • VACUUM removes all files from the table directory that are no longer referenced by the latest state of the transaction log and are older than a retention threshold (default 168 hours = 7 days).
  • VACUUM my_table RETAIN 0 HOURS - deletes every file that is not part of the latest table version, with no retention window.
  • Can add a DRY RUN flag to the end of the VACUUM command to see which files would be deleted without actually deleting them.
  • VACUUM my_table RETAIN 0 HOURS DRY RUN - shows which files would be deleted without actually deleting them.
  • If you run VACUUM on a Delta table, you lose the ability to time travel back to versions older than the specified data retention period. Each time a checkpoint is written (by default every 10 commits, a Parquet checkpoint file is saved in _delta_log), Databricks automatically cleans up log entries older than the log retention interval. The relevant retention settings are sketched below.
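A sketch of the retention settings involved (the values shown are the documented defaults; the table name is hypothetical):
-- How long commit (JSON) log entries are kept: bounds how far back time travel can go
ALTER TABLE my_table SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days');
-- How long tombstoned data files are kept before VACUUM is allowed to delete them
ALTER TABLE my_table SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days');
-- Remove tombstoned files older than the retention threshold
VACUUM my_table;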

QnA

Q: Why was Delta Lake named Delta Lake?