Update in a Delta Table
How an update operation works in a Delta Lake table.
Let's walk through what happens when you update a row in a Delta Table, specifically when you change the name in one of the rows. We'll use the same example table and go step by step, covering how the transaction log (`_delta_log`) and the Parquet files change.
Scenario
- You have a Delta Table with the following data stored in two Parquet files:
  - `1.parquet`: contains rows `(1, "Arun")` and `(2, "Bala")`.
  - `2.parquet`: contains row `(3, "Raj")`.
- The transaction log (`_delta_log`) has two entries:
  - `0.json`: records the creation of `1.parquet`.
  - `1.json`: records the addition of `2.parquet`.
- You want to update the name of the row with `id=2` from `"Bala"` to `"Mala"`.
Step-by-Step Process
1. Initial State
- The table currently looks like this:

  | id | name |
  |----|------|
  | 1  | Arun |
  | 2  | Bala |
  | 3  | Raj  |

- The transaction log reflects the current state of the table:
  - `0.json`: `1.parquet` is added.
  - `1.json`: `2.parquet` is added.
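The first two commits can be sketched as minimal JSON actions. This is a deliberately simplified model: real Delta commit files are newline-delimited JSON and carry many more fields (`metaData`, `commitInfo`, file statistics, timestamps, and so on).

```python
import json

# Simplified sketch of the first two commits in _delta_log.
# Real commit files are newline-delimited JSON with extra fields
# (metaData, commitInfo, stats, partitionValues, timestamps, ...).
commit_0 = [{"add": {"path": "1.parquet", "dataChange": True}}]
commit_1 = [{"add": {"path": "2.parquet", "dataChange": True}}]

for name, actions in [("0.json", commit_0), ("1.json", commit_1)]:
    print(name, "->", json.dumps(actions))
```

Each commit so far contains a single `add` action registering one data file with the table.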
2. Updating the Row
When you update the row with `id=2` to change the name from `"Bala"` to `"Mala"`, the following happens:

*Update operation on a Delta table*

- Identify the file containing the row: the row with `id=2` is located in `1.parquet`.
- Mark the row as invalid: Delta Tables use a deletion vector to mark the row with `id=2` as invalid in `1.parquet`. This avoids rewriting the entire file.
- Create a new Parquet file: a new Parquet file (`3.parquet`) is created, containing only the updated row: `(2, "Mala")`.
- Update the transaction log: a new transaction log entry (`2.json`) is created to record the changes:
  - `1.parquet` is "removed" (marked as invalid for the row with `id=2`).
  - `1.parquet` is "re-added" with a deletion vector specifying that the row with `id=2` should be excluded.
  - `3.parquet` is added, containing the updated row.
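The shape of that third commit can be sketched the same way. The field names here (such as `deletionVector`) loosely mirror the Delta protocol but are simplified for illustration:

```python
# Simplified sketch of 2.json: 1.parquet is removed, then re-added
# with a deletion vector masking one row, and 3.parquet (holding the
# updated row) is added. Field names are illustrative, not exact.
commit_2 = [
    {"remove": {"path": "1.parquet", "dataChange": True}},
    {"add": {"path": "1.parquet", "dataChange": True,
             "deletionVector": {"cardinality": 1}}},  # masks the row id=2
    {"add": {"path": "3.parquet", "dataChange": True}},
]

paths_added = [a["add"]["path"] for a in commit_2 if "add" in a]
print(paths_added)  # ['1.parquet', '3.parquet']
```

The remove-then-re-add pair is how the same physical file gets new metadata attached without being rewritten.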
3. Final State
After the update:
- The table now has three Parquet files:
  - `1.parquet`: contains the original rows `(1, "Arun")` and `(2, "Bala")`, but the row with `id=2` is marked as invalid.
  - `2.parquet`: contains the original row `(3, "Raj")`.
  - `3.parquet`: contains the updated row `(2, "Mala")`.
- The transaction log (`_delta_log`) now has three entries:
  - `0.json`: records the creation of `1.parquet`.
  - `1.json`: records the addition of `2.parquet`.
  - `2.json`: records the update (removal, re-addition with a deletion vector, and addition of `3.parquet`).
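To see how the three commits net out to this final state, here is a toy replay function (my own sketch, not Delta's actual reader code) that folds the ordered actions into the set of live files:

```python
# Toy replay of the log: fold ordered actions into the live file set.
# A "remove" drops a file; a later "add" of the same path re-registers
# it, possibly carrying a deletion vector. (Illustrative only.)
commits = [
    [{"add": {"path": "1.parquet"}}],
    [{"add": {"path": "2.parquet"}}],
    [{"remove": {"path": "1.parquet"}},
     {"add": {"path": "1.parquet", "deletionVector": {"cardinality": 1}}},
     {"add": {"path": "3.parquet"}}],
]

def live_files(commits):
    files = {}
    for actions in commits:
        for action in actions:
            if "remove" in action:
                files.pop(action["remove"]["path"], None)
            elif "add" in action:
                files[action["add"]["path"]] = action["add"].get("deletionVector")
    return files

print(live_files(commits))
```

Replaying all three commits leaves `1.parquet` live but tagged with a deletion vector, alongside `2.parquet` and `3.parquet`.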
How Queries Work
When you query the table after the update:
- The transaction log is consulted to determine which Parquet files are valid and which rows are excluded (via deletion vectors).
- The query reads:
  - All valid rows from `1.parquet` (excluding the row with `id=2`).
  - All rows from `2.parquet`.
  - The updated row from `3.parquet`.
- The result is a combined view of the data:

  | id | name |                                |
  |----|------|--------------------------------|
  | 1  | Arun |                                |
  | 2  | Mala | (updated row from `3.parquet`) |
  | 3  | Raj  |                                |
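This merge-on-read behaviour can be sketched with in-memory stand-ins for the Parquet files, modelling a deletion vector as a set of row positions to skip (a simplification; Delta stores it as a compressed bitmap):

```python
# In-memory stand-ins for the three Parquet files. A deletion vector
# is modelled here as a set of row positions to skip.
files = {
    "1.parquet": {"rows": [(1, "Arun"), (2, "Bala")], "deleted": {1}},
    "2.parquet": {"rows": [(3, "Raj")], "deleted": set()},
    "3.parquet": {"rows": [(2, "Mala")], "deleted": set()},
}

def scan(files):
    """Yield every row not masked by its file's deletion vector."""
    for f in files.values():
        for pos, row in enumerate(f["rows"]):
            if pos not in f["deleted"]:
                yield row

print(sorted(scan(files)))  # [(1, 'Arun'), (2, 'Mala'), (3, 'Raj')]
```

The stale `(2, "Bala")` row is still physically present in `1.parquet`; it is simply filtered out at read time.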
Why Not Just Rewrite the File?
You might wonder why Delta Tables don't simply rewrite `1.parquet` with the updated row. Here's why:
- Immutability of Parquet files:
  - Parquet files are immutable, meaning they cannot be modified once written. Updating a file in place would break this immutability.
- Efficiency:
  - Rewriting an entire Parquet file for a single-row update would be extremely inefficient, especially for large files.
  - By using deletion vectors and adding new files, Delta Tables minimize the amount of data that needs to be rewritten.
- Concurrency and consistency:
  - Delta Tables are designed to handle concurrent reads and writes. The transaction log ensures that readers always see a consistent snapshot of the data, even while updates are happening.
  - Rewriting files in place would make it much harder to maintain this consistency.
- Time travel:
  - Delta Tables support time travel, allowing users to query previous versions of the table. This is possible because the transaction log keeps track of all changes over time.
  - Rewriting files would make it difficult to maintain this history.
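Because every commit is retained, a reader can reconstruct any past snapshot by replaying the log only up to a chosen version. A toy sketch of this idea (not Delta's actual implementation):

```python
# Toy time travel: replay commits only up to `as_of_version`.
commits = {
    0: [{"add": {"path": "1.parquet"}}],
    1: [{"add": {"path": "2.parquet"}}],
    2: [{"remove": {"path": "1.parquet"}},
        {"add": {"path": "1.parquet", "deletionVector": {"cardinality": 1}}},
        {"add": {"path": "3.parquet"}}],
}

def snapshot(commits, as_of_version):
    files = {}
    for version in sorted(commits):
        if version > as_of_version:
            break
        for action in commits[version]:
            if "remove" in action:
                files.pop(action["remove"]["path"], None)
            elif "add" in action:
                files[action["add"]["path"]] = action["add"]
    return sorted(files)

print(snapshot(commits, 1))  # ['1.parquet', '2.parquet']
print(snapshot(commits, 2))  # ['1.parquet', '2.parquet', '3.parquet']
```

Querying "as of version 1" sees the pre-update table, because the commit that attached the deletion vector and added `3.parquet` is simply never replayed.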
Summary
When you update a row in a Delta Table:
- The existing row is marked as invalid using a deletion vector.
- A new Parquet file is created to store the updated row.
- The transaction log is updated to reflect these changes.
- Queries combine data from all relevant Parquet files, excluding invalid rows, to provide a consistent view of the table.