Let’s walk through an example of what happens when a new row is added to a Delta Table. We’ll go step by step, including how the transaction log (_delta_log) and Parquet files are updated.

Scenario

  • You have a Delta Table with the following data stored in a single Parquet file (1.parquet):
    id    name
    1     Arun
    2     Bala
  • You want to add a new row to the table: (3, "Raj").

Step-by-Step Process

1. Initial State

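  • The table is created with 1.parquet, and the transaction log (0.json) records this initial state. The entry below is a simplified illustration: a real Delta log file has a zero-padded name such as 00000000000000000000.json and stores one JSON action per line. The "version" and "actions" wrapper and the // comments are annotations for this walkthrough, not part of the format: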
0.json
{
  "version": 0,
  "actions": [
    {
      "add": {
        "path": "1.parquet",
        "size": 1234, // size of the file
        "dataChange": true,
        "stats": "{\"numRecords\": 2}" // stats about the file
      }
    }
  ]
}
  • At this point, the table contains only 1.parquet, and the transaction log (0.json) reflects that.
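
To make the shape of a log entry concrete, here is a minimal Python sketch that builds the same "add" action, writes it as one JSON action per line, and reads it back. It uses only the standard library; the field values are the illustrative ones from this walkthrough, and the variable names are hypothetical.

```python
import json

# Simplified "add" action mirroring the 0.json example above.
add_action = {
    "add": {
        "path": "1.parquet",
        "size": 1234,  # file size in bytes (illustrative value)
        "dataChange": True,
        # The stats field is a JSON-encoded string, not a nested object.
        "stats": json.dumps({"numRecords": 2}),
    }
}

# A commit file holds one JSON action per line.
commit_text = "\n".join([json.dumps(add_action)])

# Reading it back: parse each line, then decode the nested stats string.
parsed = [json.loads(line) for line in commit_text.splitlines()]
num_records = json.loads(parsed[0]["add"]["stats"])["numRecords"]
print(parsed[0]["add"]["path"], num_records)
```

Note that the stats value has to be decoded a second time, which is why it appears with escaped quotes in the entry above.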

2. Adding a New Row

When you add the new row (3, "Raj"), the following happens:

[Figure: Insert operation on a Delta Table]

  1. A new Parquet file (2.parquet) is created: This file contains only the new row: (3, "Raj").

  2. The transaction log is updated: A new transaction log entry (1.json) is created to record the addition of 2.parquet.

1.json
{
  "version": 1,
  "actions": [
    {
      "add": {
        "path": "2.parquet",
        "size": 567, // size of the new file
        "dataChange": true,
        "stats": "{\"numRecords\": 1}" // stats about the new file
      }
    }
  ]
}
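
The two steps above can be sketched in plain Python. This is a toy model under stated assumptions, not how a Delta writer is implemented: the hypothetical append_rows helper writes JSON lines as a stand-in for real Parquet data, and it uses the simplified file names from this walkthrough (1.parquet, 0.json, and so on).

```python
import json
import os
import tempfile

def append_rows(table_dir, rows):
    """Append rows by writing a new data file, then committing a new log entry."""
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)

    # The next version number equals the number of commits already in the log.
    version = len([f for f in os.listdir(log_dir) if f.endswith(".json")])

    # Step 1: write the new rows to a brand-new data file.
    # JSON lines stand in for Parquet so the sketch needs no extra libraries.
    data_name = f"{version + 1}.parquet"
    data_path = os.path.join(table_dir, data_name)
    with open(data_path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Step 2: record the new file in a new log entry.
    action = {
        "add": {
            "path": data_name,
            "size": os.path.getsize(data_path),
            "dataChange": True,
            "stats": json.dumps({"numRecords": len(rows)}),
        }
    }
    # Mode "x" fails if the entry already exists, so two writers
    # cannot both claim the same version number.
    with open(os.path.join(log_dir, f"{version}.json"), "x") as f:
        f.write(json.dumps(action) + "\n")
    return version

table_dir = tempfile.mkdtemp()
v0 = append_rows(table_dir, [{"id": 1, "name": "Arun"}, {"id": 2, "name": "Bala"}])
v1 = append_rows(table_dir, [{"id": 3, "name": "Raj"}])
print(v0, v1)
```

The existing data file is never opened for writing during the second call; only a new data file and a new log entry are created.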

3. Final State

After the new row is added:

  • The table now has two Parquet files:

    • 1.parquet: Contains the original rows (1, "Arun") and (2, "Bala").
    • 2.parquet: Contains the new row (3, "Raj").

  • The transaction log (_delta_log) now has two entries:

    • 0.json: Records the creation of 1.parquet.
    • 1.json: Records the addition of 2.parquet.

How Queries Work

When you query the table after adding the new row:

  1. The transaction log is consulted to determine which Parquet files are part of the current version of the table.
  2. The query reads data from both 1.parquet and 2.parquet.
  3. The result is a combined view of the data:

    id    name
    1     Arun
    2     Bala
    3     Raj
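
The read path described above can be sketched as a small in-memory model. The dictionaries below are stand-ins for the real data files and log entries, and the helper names active_files and scan are hypothetical.

```python
# In-memory stand-ins for the data files and the transaction log.
data_files = {
    "1.parquet": [(1, "Arun"), (2, "Bala")],
    "2.parquet": [(3, "Raj")],
}
log = [
    {"add": {"path": "1.parquet"}},  # 0.json
    {"add": {"path": "2.parquet"}},  # 1.json
]

def active_files(log):
    """Step 1: replay the log in order to find the files in the current version."""
    return [entry["add"]["path"] for entry in log if "add" in entry]

def scan(log, data_files):
    """Steps 2 and 3: read each active file and combine the rows."""
    rows = []
    for path in active_files(log):
        rows.extend(data_files[path])
    return rows

print(scan(log, data_files))  # [(1, 'Arun'), (2, 'Bala'), (3, 'Raj')]
```

The reader never lists the table directory to find data; the log alone decides which files belong to the table.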

Why Not Append to the Existing File?

You might wonder why Delta Tables don’t simply append the new row to the existing 1.parquet. There are a few reasons for this:

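  1. Immutability of Parquet Files: A Parquet file stores its metadata in a footer at the end of the file, so rows cannot be appended in place; adding a row would mean rewriting the whole file. Delta Tables therefore treat data files as immutable once written.
  2. Efficiency: Writing a small new file costs work proportional to the new data, whereas rewriting 1.parquet costs work proportional to everything already in it. The difference grows with the size of the existing file.
  3. Concurrency: Delta Tables are designed to handle concurrent reads and writes. Because existing files are never modified, a reader working from an earlier version of the log keeps seeing a consistent set of files while a write is in progress.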
  4. Time Travel: By creating a new file and updating the transaction log, Delta Tables maintain a history of changes, enabling features like time travel (querying previous versions of the table).
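
The time travel point can be illustrated with a short sketch: replaying the log only up to a chosen version yields the set of files that made up the table at that version. This covers appends only, which is all this walkthrough uses, and the files_as_of helper is a hypothetical name.

```python
# Version number -> files added by that commit (from 0.json and 1.json).
log = {
    0: ["1.parquet"],
    1: ["2.parquet"],
}

def files_as_of(log, version):
    """Replay commits 0..version and return the files visible at that version."""
    files = []
    for v in sorted(log):
        if v > version:
            break
        files.extend(log[v])
    return files

print(files_as_of(log, 0))  # ['1.parquet']
print(files_as_of(log, 1))  # ['1.parquet', '2.parquet']
```

Because 1.parquet was never modified, the version 0 view is still exactly what it was before the insert.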

Summary

When a new row is added to a Delta Table:

  1. A new Parquet file is created to store the new row.
  2. The transaction log is updated to record the addition of the new file.
  3. Queries combine data from all relevant Parquet files to provide a consistent view of the table.