How an update operation works in a Delta Lake table.

This walkthrough updates the `name` in one of the rows. We’ll use the same example table and go step by step, including how the transaction log (`_delta_log`) and Parquet files are updated.
Initial state:

- `1.parquet`: contains rows (1, "Arun") and (2, "Bala").
- `2.parquet`: contains row (3, "Raj").
- The transaction log (`_delta_log`) has two entries:
  - `0.json`: records the creation of `1.parquet`.
  - `1.json`: records the addition of `2.parquet`.

The goal is to update the row with `id=2` from "Bala" to "Mala".

id | name |
---|---|
1 | Arun |
2 | Bala |
3 | Raj |
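
The initial state above can be sketched as a small log replay. This is a toy model, not Delta Lake's actual reader: the `files` dictionary stands in for Parquet files, each commit is a simplified list of actions, and the names `active_files` and `read_table` are hypothetical helpers introduced only for illustration.

```python
# Toy model: Parquet files and JSON commits represented as plain Python data.
files = {
    "1.parquet": [(1, "Arun"), (2, "Bala")],
    "2.parquet": [(3, "Raj")],
}

# One list of simplified actions per commit, ordered by version.
log = [
    [{"add": {"path": "1.parquet"}}],  # 0.json
    [{"add": {"path": "2.parquet"}}],  # 1.json
]


def active_files(log):
    """Replay commits in order and return the paths that are currently live."""
    live = []
    for commit in log:
        for action in commit:
            if "add" in action:
                live.append(action["add"]["path"])
            elif "remove" in action:
                live.remove(action["remove"]["path"])
    return live


def read_table(log, files):
    """Collect the rows of every live file."""
    rows = []
    for path in active_files(log):
        rows.extend(files[path])
    return sorted(rows)


table = read_table(log, files)
print(table)  # [(1, 'Arun'), (2, 'Bala'), (3, 'Raj')]
```

The point of the sketch is that the table contents are never stored directly: they are derived by replaying the log to find the live files and then reading those files.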

Transaction log:

- `0.json`: `1.parquet` is added.
- `1.json`: `2.parquet` is added.

When the row with `id=2` is updated to change the name from "Bala" to "Mala", the following happens:
[Figure: Update operation on a Delta table]

- The row with `id=2` is located in `1.parquet`.
- The row with `id=2` is marked as invalid in `1.parquet` (via a deletion vector). This avoids rewriting the entire file.
- A new Parquet file (`3.parquet`) is created, containing only the updated row: (2, "Mala").
- A new log entry (`2.json`) is created to record the changes:
  - `1.parquet` is “removed” (marked as invalid for the row with `id=2`).
  - `1.parquet` is “re-added” with a deletion vector specifying that the row with `id=2` should be excluded.
  - `3.parquet` is added, containing the updated row.

After the update:

- `1.parquet`: contains the original rows (1, "Arun") and (2, "Bala"), but the row with `id=2` is marked as invalid.
- `2.parquet`: contains the original row (3, "Raj").
- `3.parquet`: contains the updated row (2, "Mala").
- The transaction log (`_delta_log`) now has three entries:
  - `0.json`: records the creation of `1.parquet`.
  - `1.json`: records the addition of `2.parquet`.
  - `2.json`: records the update (removal, re-addition with deletion vector, and addition of `3.parquet`).

A read of the table now combines:

- `1.parquet` (excluding the row with `id=2`).
- `2.parquet`.
- `3.parquet`.

id | name | |
---|---|---|
1 | Arun | |
2 | Mala | (updated row from `3.parquet`) |
3 | Raj | |
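
The update and the subsequent read can be sketched in the same toy style. Everything here is an assumption made for illustration: `deleted_rows` is a plain Python set standing in for a real deletion vector, the action dictionaries are heavily simplified compared with real log entries, and `snapshot`, `read_table`, and `update_name` are hypothetical helpers rather than Delta Lake APIs.

```python
# Toy model: Parquet files and JSON commits represented as plain Python data.
files = {
    "1.parquet": [(1, "Arun"), (2, "Bala")],
    "2.parquet": [(3, "Raj")],
}
log = [
    [{"add": {"path": "1.parquet", "deleted_rows": set()}}],  # 0.json
    [{"add": {"path": "2.parquet", "deleted_rows": set()}}],  # 1.json
]


def snapshot(log):
    """Replay the log: map each live file to its set of deleted row positions."""
    live = {}
    for commit in log:
        for action in commit:  # in this toy model, order inside a commit matters
            if "remove" in action:
                live.pop(action["remove"]["path"], None)
            elif "add" in action:
                live[action["add"]["path"]] = set(action["add"]["deleted_rows"])
    return live


def read_table(log, files):
    """Read every live file, skipping positions listed in its deletion set."""
    rows = []
    for path, deleted in snapshot(log).items():
        rows.extend(r for pos, r in enumerate(files[path]) if pos not in deleted)
    return sorted(rows)


def update_name(log, files, target_id, new_name, new_path):
    """Append one commit that soft-deletes matching rows and adds a new file."""
    commit, new_rows = [], []
    for path, deleted in snapshot(log).items():
        hits = {pos for pos, (row_id, _) in enumerate(files[path])
                if row_id == target_id and pos not in deleted}
        if not hits:
            continue  # untouched files produce no log actions
        new_rows.extend((target_id, new_name) for _ in hits)
        commit.append({"remove": {"path": path}})
        commit.append({"add": {"path": path, "deleted_rows": deleted | hits}})
    if new_rows:
        files[new_path] = new_rows  # only the updated rows are written
        commit.append({"add": {"path": new_path, "deleted_rows": set()}})
        log.append(commit)


# Corresponds to writing 2.json in the walkthrough.
update_name(log, files, target_id=2, new_name="Mala", new_path="3.parquet")
print(read_table(log, files))  # [(1, 'Arun'), (2, 'Mala'), (3, 'Raj')]
print(files["1.parquet"])      # [(1, 'Arun'), (2, 'Bala')]
```

Note that `files["1.parquet"]` still holds the old (2, "Bala") row after the update. Only the log changed, and the reader filters that row out using the deletion set recorded in the latest commit.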

Delta Lake does not overwrite `1.parquet` with the updated row. Here’s why: