Let’s walk through what happens when you update a row in a Delta table — specifically, changing the name in one of its rows. We’ll use the same example table and go step by step, showing how the transaction log (_delta_log) and the Parquet data files change.

Scenario

  • You have a Delta Table with the following data stored in two Parquet files:

    • 1.parquet: Contains rows (1, "Arun") and (2, "Bala").
    • 2.parquet: Contains row (3, "Raj").
  • The transaction log (_delta_log) has two entries:

    • 0.json: Records the creation of 1.parquet.
    • 1.json: Records the addition of 2.parquet.
  • You want to update the name of the row with id=2 from "Bala" to "Mala".

Step-by-Step Process

1. Initial State

  • The table currently looks like this:
    id | name
    ---|-----
     1 | Arun
     2 | Bala
     3 | Raj
  • The transaction log reflects the current state of the table:
    • 0.json: 1.parquet is added.
    • 1.json: 2.parquet is added.

2. Updating the Row

When you update the row with id=2 to change the name from "Bala" to "Mala", the following happens:

(Figure: update operation on a Delta table)

  1. Identify the File Containing the Row: The row with id=2 is located in 1.parquet.
  2. Mark the Row as Invalid: When the deletion vectors table feature is enabled, Delta marks the row with id=2 as invalid in 1.parquet via a deletion vector instead of rewriting the whole file. (Without deletion vectors, the update falls back to copy-on-write and rewrites the entire file.)
  3. Create a New Parquet File: A new Parquet file (3.parquet) is created, containing only the updated row: (2, "Mala").
  4. Update the Transaction Log: A new transaction log entry (2.json) is created to record the changes:
    • the previous entry for 1.parquet (the one without a deletion vector) is removed from the table’s state;
    • 1.parquet is “re-added”, now carrying a deletion vector specifying that the row with id=2 should be excluded;
    • 3.parquet is added, containing the updated row.
2.json (simplified for illustration; a real Delta commit file is newline-delimited JSON with one action per line, there is no "version" wrapper since the version is the file name, and the deletionVector field is a descriptor with fields such as storageType, pathOrInlineDv, sizeInBytes, and cardinality rather than an inline id):

{"remove": {"path": "1.parquet", "dataChange": true}}
{"add": {"path": "1.parquet", "size": 1234, "dataChange": true, "deletionVector": {"cardinality": 1}}}
{"add": {"path": "3.parquet", "size": 567, "dataChange": true, "stats": "{\"numRecords\": 1}"}}
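Replaying this commit on top of the first two can be sketched as follows. Again a simplified model: real deletion vectors are compressed bitmaps of row positions referenced from the add action, but here a plain set of deleted ids keeps the mechanics visible.

```python
def replay(commits):
    """Return {path: set_of_deleted_ids} for every file active after replaying all commits."""
    state = {}
    for actions in commits:
        for action in actions:
            if "add" in action:
                add = action["add"]
                # A re-add replaces the file's previous entry, including its deletion vector.
                state[add["path"]] = set(add.get("deleted_ids", []))
            elif "remove" in action:
                state.pop(action["remove"]["path"], None)
    return state

commits = [
    [{"add": {"path": "1.parquet"}}],                          # 0.json
    [{"add": {"path": "2.parquet"}}],                          # 1.json
    [                                                          # 2.json: the update of id=2
        {"remove": {"path": "1.parquet"}},
        {"add": {"path": "1.parquet", "deleted_ids": [2]}},
        {"add": {"path": "3.parquet"}},
    ],
]

snapshot = replay(commits)
print(snapshot)  # each active file mapped to the ids its deletion vector masks
```

After replay, 1.parquet is still active but carries one masked row, while 2.parquet and 3.parquet are read in full.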

3. Final State

After the update:

  • The table now has three Parquet files:

    • 1.parquet: Contains the original rows (1, "Arun") and (2, "Bala"), but the row with id=2 is marked as invalid.
    • 2.parquet: Contains the original row (3, "Raj").
    • 3.parquet: Contains the updated row (2, "Mala").
  • The transaction log (_delta_log) now has three entries:

    • 0.json: Records the creation of 1.parquet.
    • 1.json: Records the addition of 2.parquet.
    • 2.json: Records the update (removal, re-addition with deletion vector, and addition of 3.parquet).

How Queries Work

When you query the table after the update:

  1. The transaction log is consulted to determine which Parquet files are valid and which rows are excluded (via deletion vectors).
  2. The query reads:
    • All valid rows from 1.parquet (excluding the row with id=2).
    • All rows from 2.parquet.
    • The updated row from 3.parquet.
  3. The result is a combined view of the data:
    id | name
    ---|-----
     1 | Arun
     2 | Mala   (updated row from 3.parquet)
     3 | Raj
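The scan described above can be sketched as a small Python function. Note the simplification: real deletion vectors mask row positions within a file, not ids, but keying on id keeps the sketch readable.

```python
def query(files, snapshot):
    """Scan every active file, skipping rows masked by its deletion vector."""
    rows = []
    for path, deleted_ids in snapshot.items():
        for row in files.get(path, []):
            if row["id"] not in deleted_ids:
                rows.append(row)
    return sorted(rows, key=lambda r: r["id"])

# In-memory stand-ins for the Parquet files' contents.
files = {
    "1.parquet": [{"id": 1, "name": "Arun"}, {"id": 2, "name": "Bala"}],
    "2.parquet": [{"id": 3, "name": "Raj"}],
    "3.parquet": [{"id": 2, "name": "Mala"}],
}
# Active-file state after the update: 1.parquet masks id=2.
snapshot = {"1.parquet": {2}, "2.parquet": set(), "3.parquet": set()}

print(query(files, snapshot))
# [{'id': 1, 'name': 'Arun'}, {'id': 2, 'name': 'Mala'}, {'id': 3, 'name': 'Raj'}]
```

The stale (2, "Bala") row is physically still in 1.parquet, but the deletion vector filters it out at read time.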

Why Not Just Rewrite the File?

You might wonder why Delta Tables don’t simply rewrite 1.parquet with the updated row. Here’s why:

  1. Immutability of Parquet Files:

    • Parquet files are immutable, meaning they cannot be modified once written. Updating a file would break this immutability.
  2. Efficiency:

    • Rewriting an entire Parquet file for a single row update would be extremely inefficient, especially for large files.
    • By using deletion vectors and adding new files, Delta Tables minimize the amount of data that needs to be rewritten.
  3. Concurrency and Consistency:

    • Delta Tables are designed to handle concurrent reads and writes. The transaction log ensures that readers always see a consistent snapshot of the data, even while updates are happening.
    • Rewriting files would make it much harder to maintain this consistency.
  4. Time Travel:

    • Delta Tables support time travel, allowing users to query previous versions of the table. This is possible because the transaction log keeps track of all changes over time.
    • Rewriting files would make it difficult to maintain this history.
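Time travel falls out of the same replay idea: to see the table as of version N, replay only commits 0 through N. A minimal sketch, using the same toy commit format as above:

```python
def snapshot_at(commits, version):
    """Rebuild the table state as of a given commit version (time travel)."""
    state = {}
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                state[action["add"]["path"]] = set(action["add"].get("deleted_ids", []))
            elif "remove" in action:
                state.pop(action["remove"]["path"], None)
    return state

commits = [
    [{"add": {"path": "1.parquet"}}],                          # version 0
    [{"add": {"path": "2.parquet"}}],                          # version 1
    [                                                          # version 2: the update
        {"remove": {"path": "1.parquet"}},
        {"add": {"path": "1.parquet", "deleted_ids": [2]}},
        {"add": {"path": "3.parquet"}},
    ],
]

print(snapshot_at(commits, 1))  # state before the update: both files, no masked rows
print(snapshot_at(commits, 2))  # state after the update: 1.parquet masks id=2
```

Because old files and old log entries are kept until they are vacuumed, any past version remains reconstructible.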

Summary

When you update a row in a Delta Table:

  1. The existing row is marked as invalid using a deletion vector.
  2. A new Parquet file is created to store the updated row.
  3. The transaction log is updated to reflect these changes.
  4. Queries combine data from all relevant Parquet files, excluding invalid rows, to provide a consistent view of the table.