Insert in Delta Table
How an insert operation works in a Delta Lake table.
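Let’s walk through an example of what happens when a new row is added to a Delta Table. We’ll go step by step, including how the transaction log (_delta_log) and Parquet files are updated. File names are simplified throughout: real Delta tables name log entries with zero-padded version numbers (for example, 00000000000000000000.json) and give Parquet files generated names.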
Scenario
- You have a Delta Table with the following data stored in a single Parquet file (1.parquet):
id | name |
---|---|
1 | Arun |
2 | Bala |
- You want to add a new row to the table: (3, "Raj").
Step-by-Step Process
1. Initial State
- The table is created with 1.parquet, and the first transaction log entry (0.json) records this initial state.
- At this point, the table contains only 1.parquet, and 0.json is the only entry in the log.
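To make the initial state more concrete, here is a rough sketch of what 0.json might hold. A Delta log entry is newline-delimited JSON with one action per line. The actions below are heavily trimmed, and the table id is a made-up placeholder; a real entry carries more fields such as the schema, file sizes, and timestamps.

```python
import json

# Simplified actions for the first commit (0.json).
# Assumption: fields are trimmed for readability and the id is hypothetical.
commit_0 = [
    {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
    {"metaData": {"id": "example-table-id", "partitionColumns": []}},
    {"add": {"path": "1.parquet", "dataChange": True}},
]

# Serialize as newline-delimited JSON: one action per line.
log_entry_text = "\n".join(json.dumps(action) for action in commit_0)
print(log_entry_text)

# Reading the entry back: collect the data files it adds to the table.
added_files = []
for line in log_entry_text.splitlines():
    action = json.loads(line)
    if "add" in action:
        added_files.append(action["add"]["path"])
```

The add action is what matters for this walkthrough: it is the line that makes 1.parquet part of the table.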
2. Adding a New Row
When you add the new row (3, "Raj"), the following happens:
[Figure: Insert operation on a Delta table]
- A new Parquet file (2.parquet) is created: This file contains only the new row, (3, "Raj").
- The transaction log is updated: A new transaction log entry (1.json) is created to record the addition of 2.parquet.
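The two steps can be imitated with a small toy model. This is only a sketch, not Delta Lake's implementation: the helper names insert_rows and next_version are hypothetical, the data files hold JSON as a stand-in for real Parquet, and the log entries use the simplified names from this walkthrough.

```python
import json
import tempfile
from pathlib import Path


def next_version(log_dir: Path) -> int:
    """Return the next commit number based on the entries already in the log."""
    return len(list(log_dir.glob("*.json")))


def insert_rows(table_dir: Path, file_name: str, rows: list) -> int:
    """Toy insert: write a new data file, then commit it with a new log entry."""
    log_dir = table_dir / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: write the rows to a brand-new data file.
    # Assumption: JSON content stands in for Parquet to keep the sketch stdlib-only.
    (table_dir / file_name).write_text(json.dumps(rows))

    # Step 2: record the new file in a new log entry. Mode "x" raises an error
    # if the entry already exists, so an existing version is never overwritten.
    version = next_version(log_dir)
    with open(log_dir / f"{version}.json", "x") as entry:
        entry.write(json.dumps({"add": {"path": file_name}}))
    return version


table = Path(tempfile.mkdtemp())
v0 = insert_rows(table, "1.parquet", [[1, "Arun"], [2, "Bala"]])  # initial state
v1 = insert_rows(table, "2.parquet", [[3, "Raj"]])                # the insert
log_files = sorted(p.name for p in (table / "_delta_log").iterdir())
print(v0, v1, log_files)
```

Note that 1.parquet is never reopened during the insert. With Spark and the Delta Lake library, the equivalent operation is typically an append write such as df.write.format("delta").mode("append").save(path), which handles both steps for you.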
3. Final State
After the new row is added, the table has two Parquet files:
- 1.parquet: Contains the original rows (1, "Arun") and (2, "Bala").
- 2.parquet: Contains the new row (3, "Raj").
The transaction log (_delta_log) now has two entries:
- 0.json: Records the creation of 1.parquet.
- 1.json: Records the addition of 2.parquet.
How Queries Work
When you query the table after adding the new row:
- The transaction log is consulted to determine which Parquet files are part of the current version of the table.
- The query reads data from both 1.parquet and 2.parquet.
- The result is a combined view of the data:
id | name |
---|---|
1 | Arun |
2 | Bala |
3 | Raj |
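The read path can be sketched the same way. The snippet below is a toy in-memory model, assuming the log contains only add actions (which is all a plain insert produces); the dictionaries and the functions active_files and scan are hypothetical stand-ins for the files on disk and the query engine.

```python
# In-memory stand-ins for the data files on disk (toy model, not real Parquet).
data_files = {
    "1.parquet": [(1, "Arun"), (2, "Bala")],
    "2.parquet": [(3, "Raj")],
}

# The transaction log: version number -> list of actions in that commit.
log = {
    0: [{"add": {"path": "1.parquet"}}],
    1: [{"add": {"path": "2.parquet"}}],
}


def active_files(log: dict) -> list:
    """Replay the log in version order to find the files in the current table."""
    files = []
    for version in sorted(log):
        for action in log[version]:
            if "add" in action:
                files.append(action["add"]["path"])
    return files


def scan(log: dict, data_files: dict) -> list:
    """Read every active file and combine the rows into one result."""
    rows = []
    for path in active_files(log):
        rows.extend(data_files[path])
    return rows


result = scan(log, data_files)
print(result)
```

The key point is that the log, not the directory listing, decides which files a query reads.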
Why Not Append to the Existing File?
You might wonder why Delta Tables don’t simply append the new row to the existing 1.parquet. There are a few reasons for this:
- Immutability of Parquet Files: Delta Lake treats Parquet data files as immutable. The Parquet format also stores its metadata in a footer at the end of the file, so rows cannot be appended in place; adding a row to 1.parquet would mean rewriting the whole file.
- Efficiency: Writing a new file costs work proportional to the new rows only, while rewriting an existing file costs work proportional to everything already in it. The gap grows with the size of the dataset.
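- Concurrency: Delta Tables are designed to handle concurrent reads and writes. Readers work from the set of files listed in the transaction log, so a file that is still being written stays invisible until its log entry is committed, and existing files are never changed underneath a reader.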
- Time Travel: By creating a new file and updating the transaction log, Delta Tables maintain a history of changes, enabling features like time travel (querying previous versions of the table).
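Time travel falls out of this design almost for free: reading an older version just means replaying fewer log entries. Here is a minimal sketch under the same toy assumptions as the rest of this walkthrough (a list index plays the role of the version number, and read_as_of is a hypothetical helper).

```python
# Toy log: the list index is the version number, each entry adds one file.
log = [
    {"add": "1.parquet"},  # version 0: table created
    {"add": "2.parquet"},  # version 1: row (3, "Raj") inserted
]

# Stand-ins for the immutable data files.
data_files = {
    "1.parquet": [(1, "Arun"), (2, "Bala")],
    "2.parquet": [(3, "Raj")],
}


def read_as_of(version: int) -> list:
    """Return the rows visible at the given version by replaying the log up to it."""
    rows = []
    for entry in log[: version + 1]:
        rows.extend(data_files[entry["add"]])
    return rows


before_insert = read_as_of(0)
after_insert = read_as_of(1)
print(before_insert)
print(after_insert)
```

Because 1.parquet was never modified, version 0 is still fully readable after the insert. With Spark and Delta Lake, a past version is typically read with the versionAsOf option, for example spark.read.format("delta").option("versionAsOf", 0).load(path).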
Summary
When a new row is added to a Delta Table:
- A new Parquet file is created to store the new row.
- The transaction log is updated to record the addition of the new file.
- Queries combine data from all relevant Parquet files to provide a consistent view of the table.