Letās walk through what happens when you update a row in a Delta Table, specifically when you update theDocumentation Index
Fetch the complete documentation index at: https://rajanand.org/llms.txt
Use this file to discover all available pages before exploring further.
name in one of the rows. Weāll use the same example table and go step by step, including how the transaction log (_delta_log) and Parquet files are updated.
Scenario
-
You have a Delta Table with the following data stored in two Parquet files:
1.parquet: Contains rows(1, "Arun")and(2, "Bala").2.parquet: Contains row(3, "Raj").
-
The transaction log (
_delta_log) has two entries:0.json: Records the creation of1.parquet.1.json: Records the addition of2.parquet.
-
You want to update the name of the row with
id=2from"Bala"to"Mala".
Step-by-Step Process
1. Initial State
- The table currently looks like this:
| id | name |
|---|---|
| 1 | Arun |
| 2 | Bala |
| 3 | Raj |
- The transaction log reflects the current state of the table:
0.json:1.parquetis added.1.json:2.parquetis added.
2. Updating the Row
When you update the row withid=2 to change the name from "Bala" to "Mala", the following happens:
- Identify the File Containing the Row: The row with
id=2is located in1.parquet. - Mark the Row as Invalid: Delta Tables use a deletion vector to mark the row with
id=2as invalid in1.parquet. This avoids rewriting the entire file. - Create a New Parquet File: A new Parquet file (
3.parquet) is created, containing only the updated row:(2, "Mala"). - Update the Transaction Log: A new transaction log entry (
2.json) is created to record the changes:1.parquetis āremovedā (marked as invalid for the row withid=2).1.parquetis āre-addedā with a deletion vector specifying that the row withid=2should be excluded.3.parquetis added, containing the updated row.
2.json
3. Final State
After the update:-
The table now has three Parquet files:
1.parquet: Contains the original rows(1, "Arun")and(2, "Bala"), but the row withid=2is marked as invalid.2.parquet: Contains the original row(3, "Raj").3.parquet: Contains the updated row(2, "Mala").
-
The transaction log (
_delta_log) now has three entries:0.json: Records the creation of1.parquet.1.json: Records the addition of2.parquet.2.json: Records the update (removal, re-addition with deletion vector, and addition of3.parquet).
How Queries Work
When you query the table after the update:- The transaction log is consulted to determine which Parquet files are valid and which rows are excluded (via deletion vectors).
- The query reads:
- All valid rows from
1.parquet(excluding the row withid=2). - All rows from
2.parquet. - The updated row from
3.parquet.
- All valid rows from
- The result is a combined view of the data:
| id | name | |
|---|---|---|
| 1 | Arun | |
| 2 | Mala | (updated row from 3.parquet) |
| 3 | Raj |
Why Not Just Rewrite the File?
You might wonder why Delta Tables donāt simply rewrite1.parquet with the updated row. Hereās why:
-
Immutability of Parquet Files:
- Parquet files are immutable, meaning they cannot be modified once written. Updating a file would break this immutability.
-
Efficiency:
- Rewriting an entire Parquet file for a single row update would be extremely inefficient, especially for large files.
- By using deletion vectors and adding new files, Delta Tables minimize the amount of data that needs to be rewritten.
-
Concurrency and Consistency:
- Delta Tables are designed to handle concurrent reads and writes. The transaction log ensures that readers always see a consistent snapshot of the data, even while updates are happening.
- Rewriting files would make it much harder to maintain this consistency.
-
Time Travel:
- Delta Tables support time travel, allowing users to query previous versions of the table. This is possible because the transaction log keeps track of all changes over time.
- Rewriting files would make it difficult to maintain this history.
Summary
When you update a row in a Delta Table:- The existing row is marked as invalid using a deletion vector.
- A new Parquet file is created to store the updated row.
- The transaction log is updated to reflect these changes.
- Queries combine data from all relevant Parquet files, excluding invalid rows, to provide a consistent view of the table.