Delta Lake
  • One layer of the Delta Lakehouse architecture.
  • Provides reliability and performance on top of a data lake.
  • Delta Lake is a protocol for reading and writing files in cloud object storage.
  • A lakehouse combines the benefits of a data lake and a data warehouse.
  • Open source
Delta Lake features
  • ACID transactions on Spark
  • Scalable metadata handling
  • Unifies streaming and batch data processing
  • Structured Streaming support
  • Schema enforcement and evolution
  • Time travel (data versioning)
  • Data skipping and Z-ordering
  • Inserts, updates, upserts, merges and deletes (see the MERGE sketch after this list)
  • Fully configurable/optimizable
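A minimal sketch of an upsert using MERGE (the table and column names are hypothetical):
MERGE INTO customers AS target
USING customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *      -- update rows that already exist
WHEN NOT MATCHED THEN INSERT *;     -- insert rows that are new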
Delta Lake components
  • Delta lake storage layer
    • It resides in cloud object storage (AWS S3, Azure Blob Storage, Google Cloud Storage)
    • Low-cost, easily scalable object storage.
    • Ensures consistency.
    • It allows for flexibility.
  • Delta tables
    • Data is stored in Parquet files kept in object storage.
    • Transaction logs (_delta_log) are kept alongside the data in object storage (JSON files).
    • Can optionally be registered in a metastore (Hive metastore, Unity Catalog).
  • Delta engine
    • File management optimizations. Keep statistics about the data in the parquet files.
    • Auto-optimized writes (small files are automatically combined toward a target file size of about 128 MB)
    • Performance optimization via Delta Disk caching (SSD)
    • Efficient data skipping, Z-ordering, bin-packing and Bloom filters (see the OPTIMIZE sketch after this list)
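A minimal sketch of the related file-management commands (table and column names are hypothetical):
OPTIMIZE sales;                          -- bin-packing: compact small files into fewer, larger files
OPTIMIZE sales ZORDER BY (customer_id);  -- Z-ordering: co-locate related values so data skipping prunes more files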
What is the Delta transaction log? (_delta_log/)
  • Ordered record of the transactions performed on a Delta table.
  • It is a directory of JSON files; each file is one transaction (see the inspection sketch after this list).
  • It is a single source of truth for the table.
  • The table is versioned: each transaction produces a new version of the table.
  • It is a mechanism that the Delta engine uses to guarantee ACID transactions.
  • .crc files contain metadata about each write and are used to verify the integrity of the corresponding JSON commit file.
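Since each commit is just a JSON file in object storage, it can be inspected directly. A minimal sketch using Spark SQL's ability to query files by path (the path and file name here are hypothetical):
-- Each row in the result corresponds to one logged action (add, remove, metaData, commitInfo, ...)
SELECT * FROM json.`/mnt/tables/sales/_delta_log/00000000000000000000.json`;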
How does the Transaction log work? Delta Lake breaks operations down into one or more of these steps (recorded in JSON files):
  • Add file - Add a new file to the table.
  • Remove file - Remove a file from the table.
  • Update Metadata - Add or update the metadata of the table.
  • Change Protocol - Add or update the protocol of the table.
  • Commit info - Record provenance information about the commit (user, timestamp, operation).
  • Set transaction - Record a transaction ID for a streaming write (enables idempotent writes). A rough mapping from SQL operations to these actions is sketched after this list.
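A minimal sketch of which actions common SQL operations typically record, assuming a hypothetical sales table (the exact set of actions varies by operation and table state):
CREATE TABLE sales (id INT, amount DOUBLE) USING DELTA;  -- Change Protocol + Update Metadata + Commit info
INSERT INTO sales VALUES (1, 10.0);                       -- Add file + Commit info
DELETE FROM sales WHERE id = 1;                           -- Remove file (plus Add file for rewritten data) + Commit info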
Delta transaction log at the file level
| my_table/
|-- _delta_log/
|   |-- 00000.json (Transaction log file)
|   |-- 00001.json
|-- date=2025-03-01/ (Partitioned by date. It is an optional directory)
|   |-- file-1.parquet (Data file)
|   |-- file-2.parquet
Directory Partitioning
  • Not to be confused with Memory-Partition which applies to Partitions of data in Spark Memory
  • Disk-Partitioning applies to Partitions of data on disk.
  • Disk-Partitioning uses a directory structure to aggregate ‘like’ data together. df.write.format("delta").partitionBy("region").save("/mnt/tables/sales")
  • The above command will create a directory structure like this (a SQL equivalent is sketched after this list):
    dbfs:/mnt/tables/sales/region=US/
    dbfs:/mnt/tables/sales/region=EU/
    dbfs:/mnt/tables/sales/region=APAC/
  • Each region has its own directory, and that region's data files are stored inside it.
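A SQL equivalent of the PySpark write above might look like this (the schema is a hypothetical sketch):
CREATE TABLE sales (id INT, amount DOUBLE, region STRING)
USING DELTA
PARTITIONED BY (region)            -- creates one region=<value>/ directory per distinct value
LOCATION '/mnt/tables/sales';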
Adding commits (JSON files) to the transaction log
  • Each commit is a JSON file that contains the metadata of the transaction.
  • The JSON file contains the following information:
    • The version of the table
    • The timestamp of the transaction
    • The operation performed (add, remove, update, etc.)
    • The files added or removed from the table. The data file paths are in the commit files (JSON files).
    • The metadata of the table (schema, partitioning, etc.)
e.g.
  00000.json - Add 1.parquet, Add 2.parquet
  00001.json - Remove 1.parquet, Remove 2.parquet, Add 3.parquet
DESCRIBE commands (SQL)
  • DESCRIBE returns basic metadata of an existing table.
    • Ex. Column names, data types, comments.
  • DESCRIBE EXTENDED returns the basic metadata plus detailed table information.
    • Ex. Provider/format, location, partitioning information, table properties.
  • DESCRIBE FORMATTED returns essentially the same detailed output as DESCRIBE EXTENDED.
  • DESCRIBE DETAIL (Delta-specific) returns a single row of table-level details.
    • Ex. Format, location, creation and last-modified time, partition columns, number of files, size in bytes, table properties.
  • DESCRIBE HISTORY (Delta-specific) returns the operation history of a Delta table.
    • One row per commit: version, timestamp, user, operation, operation parameters, operation metrics.
    • Useful for finding the VERSION or TIMESTAMP to use with time travel. Usage examples follow this list.
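Usage sketch (the table name is hypothetical):
DESCRIBE TABLE sales;      -- columns, data types, comments
DESCRIBE EXTENDED sales;   -- plus detailed table information (location, provider, properties)
DESCRIBE DETAIL sales;     -- Delta-specific: one row of table details (location, numFiles, sizeInBytes, ...)
DESCRIBE HISTORY sales;    -- Delta-specific: one row per commit (version, timestamp, operation, operationMetrics)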
Delta Lake Time Travel via versioning
  • When you query a Delta table, the output by default is the current version (tombstoned files are ignored).
  • However, time travel allows you to query an older snapshot of a Delta table. In effect, the JSON log files are used to read the ‘tombstoned’ Parquet files (files removed from the current version but still present on disk). These log files are retained for 30 days by default, so time travel is limited to roughly 30 days’ worth of logs.
  • The log retention duration is configurable (the delta.logRetentionDuration table property) if you want to keep the logs for longer.
SELECT * FROM sales TIMESTAMP AS OF '2025-03-01 00:00:00'
SELECT * FROM sales VERSION AS OF 1
  • How do you know which TIMESTAMP or VERSION # to use?
    • Use the DESCRIBE HISTORY command: for each commit it returns the version, timestamp, operation performed and files added or removed, so you can pick the version or timestamp to query.
Delta Lake time travel: query an older snapshot of a Delta table. Common use cases:
  • Re-creating analysis, reports or outputs
  • Reverting to a previous version of the table to fix mistakes (Oops! I deleted more rows than I should have!); see the RESTORE sketch after this list.
  • Auditing and compliance
  • Debugging and troubleshooting
  • Providing snapshot isolation
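A minimal sketch of rolling a table back, assuming the hypothetical sales table from earlier:
RESTORE TABLE sales TO VERSION AS OF 1;   -- TIMESTAMP AS OF '<timestamp>' also works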
VACUUM on Delta table
  • VACUUM removes all files from the table directory that are no longer referenced by the latest state of the transaction log and are older than a retention threshold (default 168 hours = 7 days).
  • VACUUM my_table RETAIN 0 HOURS - deletes every file that is not part of the latest table version, with no retention window.
  • Can add a DRY RUN flag to the end of the VACUUM command to see which files would be deleted without actually deleting them.
  • VACUUM my_table RETAIN 0 HOURS DRY RUN - shows which files would be deleted without actually deleting them.
  • If you run VACUUM on a Delta table, you lose the ability to time travel back to versions older than the specified data retention period. Each time a checkpoint is written (by default every 10 commits, a Parquet checkpoint file is saved in _delta_log), Databricks automatically cleans up log entries older than the log retention interval. The relevant retention settings are sketched below.
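A sketch of the retention settings involved (the values shown are the documented defaults; the table name is hypothetical):
-- How long commit (JSON) log entries are kept: bounds how far back time travel can go
ALTER TABLE my_table SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days');
-- How long tombstoned data files are kept before VACUUM is allowed to delete them
ALTER TABLE my_table SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days');
-- Remove tombstoned files older than the retention threshold
VACUUM my_table;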

QnA

Q: Why was Delta Lake named Delta Lake?