When you delete a row in a Delta Table, the process is similar to updating a row. Delta Tables use the transaction log (_delta_log) and deletion vectors to mark rows as invalid without physically rewriting the entire Parquet file. Let’s walk through an example to understand what happens step by step.

Scenario

  • You have a Delta Table with the following data stored in two Parquet files:

    • 1.parquet: Contains rows (1, "Arun") and (2, "Bala").
    • 2.parquet: Contains row (3, "Raj").
  • The transaction log (_delta_log) has two entries:

    • 0.json: Records the creation of 1.parquet.
    • 1.json: Records the addition of 2.parquet.
  • You want to delete the row with id=2.
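
To make this scenario concrete, here is a minimal PySpark sketch that produces a table of this shape. It assumes the delta-spark package is installed and uses an illustrative path (/tmp/delta/people); a real Delta table names its data files with generated UUIDs rather than 1.parquet and 2.parquet.

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Local Spark session wired up for Delta Lake (assumes delta-spark is installed).
builder = (
    SparkSession.builder.appName("delete-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/people"  # illustrative location, not from the original article

# Commit 0 (0.json): one data file holding (1, "Arun") and (2, "Bala")
spark.createDataFrame([(1, "Arun"), (2, "Bala")], ["id", "name"]) \
    .coalesce(1).write.format("delta").save(path)

# Commit 1 (1.json): a second data file holding (3, "Raj")
spark.createDataFrame([(3, "Raj")], ["id", "name"]) \
    .coalesce(1).write.format("delta").mode("append").save(path)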

Step-by-Step Process

1. Initial State

  • The table currently looks like this:
    id | name
    ---|-----
    1  | Arun
    2  | Bala
    3  | Raj
  • The transaction log reflects the current state of the table:
    • 0.json: 1.parquet is added.
    • 1.json: 2.parquet is added.
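
Continuing the sketch above (same assumed path), reading the table at this point returns all three rows:

spark.read.format("delta").load("/tmp/delta/people").orderBy("id").show()
# Returns three rows: (1, Arun), (2, Bala), (3, Raj)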

2. Deleting the Row

When you delete the row with id=2 (a runnable sketch of the command appears at the end of this step), the following happens:

[Figure: Delete operation on a Delta table]

  1. Identify the File Containing the Row: The row with id=2 is located in 1.parquet.

  2. Mark the Row as Invalid: Delta Tables use a deletion vector to mark the row with id=2 as invalid in 1.parquet. This avoids rewriting the entire file.

  3. Update the Transaction Log:

    • A new transaction log entry (2.json) is created to record the changes:
      • The existing entry for 1.parquet is “removed”, which logically drops the old version of that file from the current table snapshot.
      • 1.parquet is “re-added” along with a deletion vector specifying that the row with id=2 should be skipped when the file is read.
2.json (simplified for illustration; in the actual Delta log, each action is written on its own line, and the deletion vector is recorded as a descriptor pointing to a bitmap of deleted row positions, not the id value itself):
{
  "version": 2,
  "actions": [
    {
      "remove": {
        "path": "1.parquet",
        "dataChange": true
      }
    },
    {
      "add": {
        "path": "1.parquet",
        "size": 1234,
        "dataChange": true,
        "deletionVector": {
          "storageType": "u",
          "pathOrInlineDv": "<reference to the deletion vector file>",
          "cardinality": 1
        }
      }
    }
  ]
}
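
In practice, all of the above is triggered by a single command. Here is a sketch that continues the earlier setup; in open-source Delta Lake, deletion vectors are a table feature that must be enabled first (otherwise the delete falls back to rewriting 1.parquet without the deleted row).

from delta.tables import DeltaTable

path = "/tmp/delta/people"  # assumed path from the earlier sketch

# Enable the deletion-vector table feature (requires a Delta Lake release that supports it).
spark.sql(
    f"ALTER TABLE delta.`{path}` "
    "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')"
)

# Delete the row with id = 2; this commit is what 2.json above describes.
DeltaTable.forPath(spark, path).delete("id = 2")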

3. Final State

After the delete operation:

  • The table still has two Parquet files:

    • 1.parquet: Contains the original rows (1, "Arun") and (2, "Bala"), but the row with id=2 is marked as invalid.
    • 2.parquet: Contains the original row (3, "Raj").
  • The transaction log (_delta_log) now has three entries:

    • 0.json: Records the creation of 1.parquet.
    • 1.json: Records the addition of 2.parquet.
    • 2.json: Records the deletion (removal and re-addition of 1.parquet with a deletion vector).
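
The commit files are plain newline-delimited JSON, so you can inspect them directly. A small sketch, assuming the table path used earlier (in a real table the file paths inside the actions are UUID-based):

import glob
import json
import os

log_dir = "/tmp/delta/people/_delta_log"  # assumed table location

# Each commit file holds one JSON action per line (add, remove, metaData, commitInfo, ...).
for commit in sorted(glob.glob(os.path.join(log_dir, "*.json"))):
    print(os.path.basename(commit))
    with open(commit) as f:
        for line in f:
            action = json.loads(line)
            for kind in ("add", "remove"):
                if kind in action:
                    # Show which data file the action touches and whether it carries a deletion vector.
                    has_dv = "deletionVector" in action[kind]
                    print(f"  {kind}: {action[kind]['path']} (deletionVector: {has_dv})")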

How Queries Work

When you query the table after the delete operation:

  1. The transaction log is consulted to determine which Parquet files are valid and which rows are excluded (via deletion vectors).
  2. The query reads:
    • All valid rows from 1.parquet (excluding the row with id=2).
    • All rows from 2.parquet.
  3. The result is a combined view of the data:
    id | name
    ---|-----
    1  | Arun
    3  | Raj
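
Continuing the earlier sketch, the same read as before now skips the position flagged by the deletion vector:

spark.read.format("delta").load("/tmp/delta/people").orderBy("id").show()
# Returns two rows: (1, Arun) and (3, Raj); the row with id=2 is filtered out at read time.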

Why Not Physically Delete the Row?

You might wonder why Delta Tables don’t physically remove the row from 1.parquet. Here’s why:

  1. Immutability of Parquet Files:

    • Parquet files are immutable, meaning they cannot be modified once written. Physically deleting a row would require rewriting the entire file, which is inefficient.
  2. Efficiency:

    • Rewriting an entire Parquet file for a single row deletion would be extremely inefficient, especially for large files.
    • By using deletion vectors, Delta Tables minimize the amount of data that needs to be rewritten.
  3. Concurrency and Consistency:

    • Delta Tables are designed to handle concurrent reads and writes. The transaction log ensures that readers always see a consistent snapshot of the data, even while deletions are happening.
    • Physically deleting rows would make it much harder to maintain this consistency.
  4. Time Travel:

    • Delta Tables support time travel, allowing users to query previous versions of the table. This is possible because the transaction log keeps track of all changes over time.
    • Physically deleting rows would make it difficult to maintain this history.
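
For example, using the table from the earlier sketch, the snapshot before the delete can still be read back (the VERSION AS OF syntax requires a recent Delta Lake release):

# DataFrame API: read the snapshot as of version 1, which still contains (2, "Bala").
spark.read.format("delta").option("versionAsOf", 1).load("/tmp/delta/people").show()

# Equivalent SQL form:
spark.sql("SELECT * FROM delta.`/tmp/delta/people` VERSION AS OF 1").show()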

Summary

When you delete a row in a Delta Table:

  1. The row is marked as invalid using a deletion vector.
  2. The transaction log is updated to reflect this change.
  3. Queries exclude the invalid row when reading data, providing a consistent view of the table.