`vacuum` command).
Lazy deletion offers several advantages:
The `vacuum` command removes the physical files, reclaiming storage space.

- Data files: Often stored in partition directories (e.g., `part1=value1/part2=value2/...`). The partitioning is for organizational purposes and isn’t strictly required by the protocol itself; the actual partition values are tracked in the transaction log.
- Change data files: Stored in the `_change_data` directory, these files record changes made to the table in a specific version. This is useful for change data capture (CDC) applications, allowing efficient tracking of data modifications over time.
- Delta log entries: Stored in the `_delta_log` directory, these JSON files record every action performed on the table. This log is crucial for maintaining the table’s history and enabling consistent reads and writes.
- Checkpoints: Also stored in `_delta_log`, checkpoints are snapshots of the table’s state at a specific version. Reading a checkpoint is more efficient than processing the entire transaction log, which significantly improves read performance. Three checkpoint types exist: UUID-named, classic, and the deprecated multi-part format.
- Log compaction files: Stored in `_delta_log`, these files aggregate actions from a range of commits. They reduce the number of commit files a reader has to process, which makes replaying the transaction log more efficient over time.
- Last checkpoint file: Located at `_delta_log/_last_checkpoint`, this file points to the most recent checkpoint, providing quick access to the latest table state.
- Checksum files: Stored in `_delta_log`, these contain checksums to verify the integrity of the table’s data and metadata.

The transaction log can contain the following actions:

- `metaData`: Modifies the table’s metadata (schema, properties, etc.).
- `add` and `remove`: Add or remove data files or other files from the table.
- `addCDCFile`: Adds a change data file.
- `txn`: Records application-specific transaction identifiers, linking actions to specific transactions.
- `protocol`: Updates the protocol version used by the table.
- `commitInfo`: Stores information about the commit operation, such as timestamps and committer information.
- `domainMetadata`: Allows for storing custom metadata within named domains.
- `sidecar`: References sidecar files containing file actions (used in V2 checkpoints).

Enabled table features are listed under `readerFeatures` and `writerFeatures` within the `protocol` action in the transaction log. Readers use this information to correctly interpret and process the data. The specific requirements for writers and readers vary depending on the feature. For example, a reader might need to support deletion vectors to correctly handle deleted rows, while a writer might need to generate appropriate metadata for clustered tables.
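As a rough illustration of how a reader might use the `protocol` action, the sketch below parses a commit written as newline-delimited JSON (one action per line, keyed by the action name) and reports any listed reader feature the client does not implement. The sample commit, the `SUPPORTED_READER_FEATURES` set, and both helper functions are assumptions made for this example; they are not part of any Delta Lake library.

```python
import json

# Hypothetical set of reader features this example client implements.
SUPPORTED_READER_FEATURES = {"deletionVectors", "columnMapping"}

# A made-up commit: one JSON object per line, each keyed by its action name.
SAMPLE_COMMIT = "\n".join([
    json.dumps({"commitInfo": {"timestamp": 1700000000000, "operation": "WRITE"}}),
    json.dumps({"protocol": {
        "minReaderVersion": 3,
        "minWriterVersion": 7,
        "readerFeatures": ["deletionVectors"],
        "writerFeatures": ["deletionVectors", "domainMetadata"],
    }}),
    json.dumps({"add": {"path": "part-0001.parquet", "size": 1024, "dataChange": True}}),
])


def parse_actions(commit_text):
    """Return a list of (action_name, payload) pairs from a commit's text."""
    actions = []
    for line in commit_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the sample input
        obj = json.loads(line)
        # Each line is expected to hold one action, keyed by its name.
        for name, payload in obj.items():
            actions.append((name, payload))
    return actions


def unsupported_reader_features(actions, supported):
    """Collect reader features named by a protocol action but missing from `supported`."""
    missing = set()
    for name, payload in actions:
        if name == "protocol":
            missing |= set(payload.get("readerFeatures", [])) - set(supported)
    return sorted(missing)


actions = parse_actions(SAMPLE_COMMIT)
missing = unsupported_reader_features(actions, SUPPORTED_READER_FEATURES)
if missing:
    print("cannot read table; unsupported reader features:", missing)
else:
    print("all listed reader features are supported")
```

Failing closed when a feature is unknown is the point of the check: a client that ignored an unrecognized feature such as deletion vectors could silently return rows that have been deleted.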
The Delta Lake specification details the requirements for each feature, ensuring interoperability between different implementations and tools. The features are designed to be modular and extensible, allowing for future additions and improvements without breaking compatibility with existing systems. They provide a powerful mechanism for customizing Delta Lake tables to meet the specific needs of different applications and workflows.
Let’s explore some key table features:
- Vacuum protocol check: Ensures that the `VACUUM` command (used for physically removing tombstoned files) adheres to the Delta Lake protocol. It helps maintain data consistency and prevents accidental data loss during the cleanup process, adding a layer of safety to the `VACUUM` operation.
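To illustrate the cleanup step that `VACUUM` performs, here is a simplified sketch that selects tombstoned files whose `remove` action is older than a retention window. It assumes each `remove` action carries a `path` and an optional `deletionTimestamp` in milliseconds since the epoch. The function name, the seven-day retention value, and the sample data are hypothetical; a real implementation also has to deal with files the log never referenced, and it should confirm that it supports the table's protocol before deleting anything.

```python
from datetime import timedelta


def files_eligible_for_vacuum(remove_actions, now_ms, retention):
    """Return paths whose tombstone is older than the retention window."""
    cutoff_ms = now_ms - int(retention.total_seconds() * 1000)
    return sorted(
        action["path"]
        for action in remove_actions
        # A tombstone without a timestamp is conservatively kept.
        if action.get("deletionTimestamp", now_ms) < cutoff_ms
    )


DAY_MS = 24 * 60 * 60 * 1000
now_ms = 1_700_000_000_000  # fixed "current time" so the example is repeatable

tombstones = [
    {"path": "part-0001.parquet", "deletionTimestamp": now_ms - 10 * DAY_MS},
    {"path": "part-0002.parquet", "deletionTimestamp": now_ms - 1 * DAY_MS},
    {"path": "part-0003.parquet"},  # no timestamp recorded
]

eligible = files_eligible_for_vacuum(tombstones, now_ms, timedelta(days=7))
print(eligible)
```

Only the file removed ten days earlier falls outside the window here. The other two stay on disk, which is what lets readers of older table versions keep working until the retention period has passed.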