Delta Lake Protocol
The Delta Lake protocol offers ACID properties for managing large datasets stored as files in distributed file systems or object stores. Its key design goals include:
- Serializable ACID Writes: Ensuring transactions are processed in a consistent and reliable manner, even with concurrent writes.
- Snapshot Isolation for Reads: Readers always see a consistent snapshot of the data, regardless of ongoing writes.
- Scalability: Designed to handle large datasets and high write throughput.
- Self-Description: The protocol’s metadata allows for easy understanding of the table’s structure and state.
- Support for Incremental Processing: Consumers can efficiently process only the changes made since their last read.
Delta Lake uses a multi-version concurrency control (MVCC) mechanism. The core components are:
1. Optimistic Concurrency Control (OCC)
Writers create new data files and then commit changes by appending a log entry to the transaction log. This log entry details files to add or remove, and any metadata modifications. Delta Lake employs OCC to handle concurrent writes efficiently. Instead of locking data, writers optimistically assume that no conflicts will occur. A writer performs the following steps:
- Write Phase: The writer creates new data files containing the updated data. These files are written to the storage location but are not yet considered part of the table’s official state.
- Commit Phase: Once the writer has finished writing all the new data files, it commits the changes by appending a transaction log entry. This entry describes the changes made (files added, files removed, metadata updates) and includes a unique transaction ID.
- Conflict Detection: The transaction log is used to detect conflicts. If another writer has modified the same data during the first writer’s write phase, a conflict is detected during the commit phase. In that case, the committing writer’s transaction is aborted and must be retried; the specific conflict resolution strategy depends on the implementation.
OCC avoids the performance overhead of traditional locking mechanisms, making it suitable for high-concurrency environments. However, it does introduce the possibility of conflicts and the need for retry mechanisms.
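To make the write/commit/conflict cycle concrete, here is a minimal sketch of an optimistic commit loop in Python. The `table` object and its methods (`newer_commits`, `conflicts_with`, `log_store.put_if_absent`) are hypothetical stand-ins rather than any real library’s API; the point is the shape of the retry logic, not the exact calls.

```python
import json
import time


class CommitConflict(Exception):
    """Raised when a concurrent writer changed data this transaction depends on."""


def commit_with_retries(table, new_files, read_version, max_retries=10):
    """Optimistic commit sketch: data files are written first (by the caller),
    then we try to append the next log entry, retrying if another writer wins."""
    version = read_version
    for attempt in range(max_retries):
        # Conflict detection: look at commits that landed after our snapshot.
        newer = table.newer_commits(since=version)
        if table.conflicts_with(newer, new_files):
            raise CommitConflict("a concurrent commit overlaps this transaction")
        version += len(newer)  # rebase on top of non-conflicting commits
        entry = "\n".join(json.dumps({"add": f}) for f in new_files)
        target = f"_delta_log/{version + 1:020d}.json"
        # Commit phase: an atomic "create if absent" on the log store decides the winner.
        if table.log_store.put_if_absent(target, entry):
            return version + 1
        time.sleep(0.1 * (attempt + 1))  # lost the race; back off and retry
    raise RuntimeError("gave up after repeated commit conflicts")
```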
2. Transaction Log
The transaction log is the central component of Delta Lake’s architecture. It is an append-only log that records every action performed on the table. Each entry represents a transaction and includes details like added/removed files, metadata updates, and transaction identifiers. Checkpoints within this log provide efficient read access by allowing readers to avoid scanning the entire log (see the Checkpoints section below).
Each entry in the log represents a committed transaction and includes:
- Transaction ID: A unique identifier for the transaction.
- Actions: A list of actions performed during the transaction (e.g., adding files, removing files, updating metadata).
- Timestamp: The time the transaction was committed.
- Other Metadata: Additional information relevant to the transaction.
The transaction log is crucial for:
- Reproducing Table State: By replaying the log entries, one can reconstruct the table’s state at any point in time.
- Maintaining Consistency: The log ensures that all changes are recorded and applied in the correct order, maintaining data consistency.
- Supporting Snapshot Isolation: The log allows readers to construct consistent snapshots of the data.
- Enabling Data Recovery: In case of failures, the log can be used to recover the table’s state.
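Concretely, each commit is a newline-delimited JSON file under `_delta_log/`, with one action per line. The snippet below parses a small, invented example; the field values are illustrative, not taken from a real table.

```python
import json

# Invented contents of a commit file such as _delta_log/00000000000000000003.json:
# one JSON-encoded action per line.
commit_lines = """\
{"commitInfo": {"timestamp": 1700000000000, "operation": "WRITE"}}
{"add": {"path": "part-00000-aaa.snappy.parquet", "size": 1048576, "modificationTime": 1700000000000, "dataChange": true, "partitionValues": {}}}
{"remove": {"path": "part-00000-old.snappy.parquet", "deletionTimestamp": 1700000000000, "dataChange": true}}
"""

for line in commit_lines.splitlines():
    action = json.loads(line)
    kind, payload = next(iter(action.items()))  # each line carries exactly one action
    print(f"{kind}: {payload.get('path', payload)}")
```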
3. Snapshot Isolation
Readers in Delta Lake always see a consistent snapshot of the data, regardless of concurrent writes. This is achieved by:
- Reading the Transaction Log: A reader reads the transaction log to determine the latest committed version of the table.
- Constructing a Snapshot: Based on the latest committed version, the reader identifies the set of data files that constitute the consistent snapshot.
- Reading Data Files: The reader then reads the data from the identified data files.
Snapshot isolation ensures that readers never see partially written or inconsistent data. It guarantees that each read operation sees a consistent view of the table, even if writes are occurring concurrently.
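Here is a minimal sketch of how a reader could fold parsed log entries into the set of live data files for one version, building on the parsing shown earlier. Real readers also start from a checkpoint rather than replaying from version zero.

```python
def files_at_version(commits):
    """Replay commits oldest-first: an `add` brings a file into the snapshot,
    a `remove` tombstones it. Each commit is a list of (kind, payload) pairs."""
    live = {}
    for actions in commits:
        for kind, payload in actions:
            if kind == "add":
                live[payload["path"]] = payload
            elif kind == "remove":
                live.pop(payload["path"], None)
    return list(live.values())


# Example: the file added in version 1 is replaced in version 2.
v1 = [("add", {"path": "part-0.parquet"})]
v2 = [("remove", {"path": "part-0.parquet"}), ("add", {"path": "part-1.parquet"})]
print(files_at_version([v1, v2]))  # -> [{'path': 'part-1.parquet'}]
```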
4. Lazy Deletion
Instead of immediately removing deleted files, Delta Lake employs lazy deletion. When a file is deleted, it is not physically removed from storage; instead, it is marked as a “tombstone” in the transaction log. The actual removal of the physical files is performed later by a separate process, typically the `vacuum` command (a usage sketch follows the list of advantages below).
Lazy deletion offers several advantages:
- Improved Performance: Avoiding immediate file deletion improves write performance.
- Simplified Recovery: Tombstoned files can be easily recovered if needed.
- Reduced Storage Overhead (eventually): The `vacuum` command removes the physical files, reclaiming storage space.
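As a usage sketch, the `deltalake` Python package (the delta-rs bindings) exposes a `vacuum` method; the path and retention below are placeholders, and the exact defaults may differ between releases.

```python
from deltalake import DeltaTable

# Placeholder path; assumes the deltalake (delta-rs) package is installed.
dt = DeltaTable("/data/events")

# dry_run=True only lists files that would be removed, without deleting them.
candidates = dt.vacuum(retention_hours=168, dry_run=True)
print(f"{len(candidates)} files are old enough to be removed")
```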
5. Checkpoints
Checkpoints are periodic snapshots of the transaction log. They contain a summarized representation of the log entries up to a specific point in time. Checkpoints significantly improve read performance because readers can use them to avoid scanning the entire transaction log. They provide a more efficient way to reconstruct the table’s state at a given version. Different checkpoint types exist (UUID-named, classic, and deprecated multi-part), each with its own characteristics.
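A sketch of how a reader might use the `_last_checkpoint` pointer: read it, then replay only the JSON commits newer than the checkpointed version. The table path is a placeholder and the filename patterns are simplified.

```python
import json
from pathlib import Path

log_dir = Path("/data/events/_delta_log")  # placeholder table path

# _last_checkpoint is a small JSON pointer, e.g. {"version": 10, "size": 25}.
last = json.loads((log_dir / "_last_checkpoint").read_text())
ckpt_version = last["version"]

# Only commits newer than the checkpoint need to be replayed on top of it.
newer_commits = sorted(
    p for p in log_dir.glob("*.json")
    if p.stem.isdigit() and int(p.stem) > ckpt_version
)
print(f"checkpoint at version {ckpt_version}; replay {len(newer_commits)} newer commits")
```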
6. File Types
Delta Lake uses several file types to manage its data and metadata efficiently. These files are organized within the table’s directory structure.
- Data Files: These files contain the actual table data. They can be stored in the root directory or in subdirectories, often organized by partitioning schemes (e.g., `part1=value1/part2=value2/...`). The directory layout is for organizational purposes and is not strictly required by the protocol itself; the actual partition values are tracked in the transaction log.
- Deletion Vector Files: Located in the root directory, these files track deleted rows. Instead of physically removing data, Delta Lake uses these vectors to mark rows as deleted, improving performance and allowing for efficient data recovery. Each vector is associated with a specific data file and indicates which rows within that file are considered deleted.
- Change Data Files (CDC): Stored in the `_change_data` directory, these files record the changes made to the table in a specific version. This is useful for change data capture (CDC) applications, allowing efficient tracking of data modifications over time.
- Delta Log Entries (JSON files): Found in the `_delta_log` directory, these JSON files record every action performed on the table. This log is crucial for maintaining the table’s history and enabling consistent reads and writes.
- Checkpoints (Parquet files): Also residing in `_delta_log`, checkpoints are snapshots of the table’s state at a specific version. They provide a more efficient way to reconstruct the table’s state than processing the entire transaction log, significantly improving read performance.
- Log Compaction Files: These files, also in `_delta_log`, aggregate actions from a range of commits. They help reduce the size and improve the efficiency of the transaction log over time.
- Last Checkpoint File: Located at `_delta_log/_last_checkpoint`, this file points to the most recent checkpoint, providing quick access to the latest table state.
- Version Checksum File (optional): If present in `_delta_log`, this file contains checksums used to verify the integrity of the table’s data and metadata.
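The sketch below classifies paths inside a table directory according to the layout just described. The patterns are simplified; for example, deletion vector and compaction file names are more structured in practice.

```python
import re


def classify(relative_path):
    """Rough classification of a file within a Delta table directory."""
    if relative_path.startswith("_delta_log/"):
        name = relative_path.split("/", 1)[1]
        if name == "_last_checkpoint":
            return "last checkpoint pointer"
        if re.fullmatch(r"\d{20}\.json", name):
            return "commit (Delta log entry)"
        if ".checkpoint" in name and name.endswith(".parquet"):
            return "checkpoint"
        if name.endswith(".crc"):
            return "version checksum"
        return "other log file"
    if relative_path.startswith("_change_data/"):
        return "change data (CDC) file"
    if relative_path.endswith(".parquet"):
        return "data file"
    return "other"


for p in ["date=2024-01-01/part-00000-aaa.parquet",
          "_change_data/cdc-00000-bbb.parquet",
          "_delta_log/00000000000000000007.json",
          "_delta_log/00000000000000000005.checkpoint.parquet",
          "_delta_log/_last_checkpoint"]:
    print(f"{p:55s} -> {classify(p)}")
```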
7. Table State
The state of a Delta table at any given version (snapshot) is defined by the following:
- Protocol Version: Specifies the version of the Delta Lake protocol used.
- Table Features: A set of features enabled for the table, influencing its behavior and capabilities (e.g., support for specific data types, partitioning schemes, or compatibility with other systems).
- Metadata: This includes schema information (column names, data types), table properties (e.g., partition columns), and other descriptive information about the table.
- Set of Files: A list of all data files, deletion vector files, and change data files that constitute the table at that version. This includes their paths and metadata.
- Set of Tombstones: A list of files that have been logically deleted but not yet physically removed.
- Set of Application-Specific Transactions: Information about successfully committed transactions, providing provenance and context for the data.
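As an illustrative data structure (not the spec’s exact schema), the state above could be held in a container like the following, reusing the replay idea from the snapshot isolation section:

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotState:
    """Illustrative container for the pieces of table state listed above;
    the field shapes are simplified, not the protocol's exact schema."""
    version: int
    protocol: dict = field(default_factory=dict)   # reader/writer versions, features
    metadata: dict = field(default_factory=dict)   # schema, partition columns, properties
    files: dict = field(default_factory=dict)        # path -> add action payload
    tombstones: dict = field(default_factory=dict)   # path -> remove action payload
    app_transactions: dict = field(default_factory=dict)  # appId -> last txn version

    def apply(self, kind, payload):
        """Fold one parsed action into the state."""
        if kind == "add":
            self.files[payload["path"]] = payload
            self.tombstones.pop(payload["path"], None)
        elif kind == "remove":
            self.files.pop(payload["path"], None)
            self.tombstones[payload["path"]] = payload
        elif kind == "metaData":
            self.metadata = payload
        elif kind == "protocol":
            self.protocol = payload
        elif kind == "txn":
            self.app_transactions[payload["appId"]] = payload["version"]
```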
8. Actions
Actions are the fundamental operations that modify the table’s state. They are recorded in the Delta log and checkpoints. Key actions include:
- `metaData`: Modifies the table’s metadata (schema, properties, etc.).
- `add` and `remove`: Add or remove data files (and other files) from the table.
- `addCDCFile`: Adds a change data file.
- `txn`: Records application-specific transaction identifiers, linking actions to specific transactions.
- `protocol`: Updates the protocol version used by the table.
- `commitInfo`: Stores information about the commit operation, such as timestamps and committer information.
- `domainMetadata`: Allows custom metadata to be stored within named domains.
- `sidecar`: References sidecar files containing file actions (used in V2 checkpoints).
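For illustration, here are single-action log lines for a few of these action types; every field value is invented for the example, and optional fields are omitted.

```python
import json

# A tiny schema in the JSON form carried by schemaString (fields simplified).
schema = {"type": "struct",
          "fields": [{"name": "id", "type": "long", "nullable": True, "metadata": {}}]}

examples = [
    {"metaData": {"id": "11111111-2222-3333-4444-555555555555",
                  "format": {"provider": "parquet", "options": {}},
                  "schemaString": json.dumps(schema),
                  "partitionColumns": [],
                  "configuration": {}}},
    {"txn": {"appId": "streaming-job-42", "version": 917}},
    {"commitInfo": {"timestamp": 1700000000000, "operation": "MERGE"}},
    {"domainMetadata": {"domain": "myApp.settings", "configuration": "{}", "removed": False}},
]
for action in examples:
    print(json.dumps(action))
```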
9. Table features
Delta Lake’s table features extend its core functionality, providing advanced capabilities and compatibility with various data processing tools and workflows. These features are enabled on a per-table basis and are specified within the table’s metadata. They are controlled through the `readerFeatures` and `writerFeatures` fields of the `protocol` action in the transaction log. Readers use this information to correctly interpret and process the data. The specific requirements for writers and readers vary depending on the feature. For example, a reader might need to support deletion vectors to correctly handle deleted rows, while a writer might need to generate appropriate metadata for clustered tables.
The Delta Lake specification details the requirements for each feature, ensuring interoperability between different implementations and tools. The features are designed to be modular and extensible, allowing for future additions and improvements without breaking compatibility with existing systems. They provide a powerful mechanism for customizing Delta Lake tables to meet the specific needs of different applications and workflows.
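For example, a `protocol` action announcing table features might look like the following; the version numbers and feature names are real parts of the table-features mechanism, but the particular combination shown is chosen only for illustration.

```python
import json

# A protocol action with explicit table features (illustrative values).
# A reader must support every feature listed in readerFeatures before reading
# the table; a writer must support every feature in writerFeatures before writing.
protocol_action = {
    "protocol": {
        "minReaderVersion": 3,
        "minWriterVersion": 7,
        "readerFeatures": ["columnMapping", "deletionVectors"],
        "writerFeatures": ["columnMapping", "deletionVectors", "rowTracking"],
    }
}
print(json.dumps(protocol_action, indent=2))
```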
Let’s explore some key table features:
- Column Mapping: This feature allows for renaming or retyping columns within a Delta table. This is particularly useful when dealing with schema evolution, allowing you to maintain compatibility with older data while adding or modifying columns. It enables seamless transitions between different schema versions without requiring data rewriting.
- Deletion Vectors: As previously discussed, deletion vectors are a crucial part of Delta Lake’s efficient handling of deleted rows. This feature enables the tracking of deleted rows without physically removing them, improving performance and simplifying data recovery. The deletion vectors are stored separately from the data files, allowing for efficient filtering of deleted rows during read operations.
- Iceberg Compatibility (V1 and V2): Delta Lake offers compatibility with the Iceberg table format, a popular open-source format for large-scale data lakes. This interoperability allows for easier data exchange and integration between Delta Lake and Iceberg-based systems. The compatibility modes (V1 and V2) refer to different versions of the Iceberg specification.
- Timestamp Without Timezone: This feature allows for storing timestamps without timezone information. This is important for applications where timezone information is not relevant or where handling timezones would add unnecessary complexity. It simplifies data storage and processing in such cases.
- V2 Checkpoints: Delta Lake’s V2 checkpoints offer improvements over the classic checkpoint format. They provide enhanced performance and efficiency in reading and reconstructing the table’s state. The improvements often involve optimized data structures and more efficient encoding.
- Row Tracking: This feature allows for tracking individual rows within the table. This is useful for auditing purposes, lineage tracking, and other applications where it’s important to know the history of individual rows. It provides a detailed record of row-level changes over time.
- VACUUM Protocol Check: This feature ensures that the `VACUUM` command (used for removing tombstones) adheres to the Delta Lake protocol. It helps maintain data consistency and prevents accidental data loss during the cleanup process, adding a layer of safety to the `VACUUM` operation.
- Clustered Tables: This feature allows for organizing data within the table based on clustering keys. This improves query performance by physically co-locating related data. It’s particularly beneficial for analytical workloads where data is frequently accessed based on specific criteria.
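In Spark-based deployments, features are typically enabled through table properties. The sketch below assumes a Spark session that already has the Delta Lake extensions configured; the property names follow common Delta usage but should be verified against the release in use.

```python
from pyspark.sql import SparkSession

# Assumes Delta Lake is on the classpath and configured for this session.
spark = SparkSession.builder.appName("enable-delta-features").getOrCreate()

# Enable features via table properties (names per common Delta usage;
# check them against the Delta version you run).
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.enableDeletionVectors' = 'true',
        'delta.columnMapping.mode' = 'name'
    )
""")
```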
Summary
Delta Lake uses a sophisticated combination of a transaction log, MVCC, and checkpoints to provide a robust and scalable solution for managing large datasets with ACID properties.