> ## Documentation Index
> Fetch the complete documentation index at: https://rajanand.org/llms.txt
> Use this file to discover all available pages before exploring further.

# Update in Delta Table

> How an update operation works in a Delta Lake table.

Let’s walk through what happens when you **update a row** in a Delta Table, specifically when you update the `name` in one of the rows. We’ll use the same example table and go step by step, including how the transaction log (`_delta_log`) and Parquet files are updated.

### Scenario

* You have a Delta Table with the following data stored in two Parquet files:
  * `1.parquet`: Contains rows `(1, "Arun")` and `(2, "Bala")`.
  * `2.parquet`: Contains row `(3, "Raj")`.

* The transaction log (`_delta_log`) has two entries:
  * `0.json`: Records the creation of `1.parquet`.
  * `1.json`: Records the addition of `2.parquet`.

* You want to **update the name** of the row with `id=2` from `"Bala"` to `"Mala"`.

### Step-by-Step Process

#### 1. Initial State

* The table currently looks like this:

| id | name |
| -- | ---- |
| 1  | Arun |
| 2  | Bala |
| 3  | Raj  |

* The transaction log reflects the current state of the table:
  * `0.json`: `1.parquet` is added.
  * `1.json`: `2.parquet` is added.

#### 2. Updating the Row

When you update the row with `id=2` to change the name from `"Bala"` to `"Mala"`, the following happens:

<Frame caption="Update operation on a delta table">
  <img src="https://mintcdn.com/rajanand/CvUuxxe_kZPoEvJ-/assets/data/delta-table-update.svg?fit=max&auto=format&n=CvUuxxe_kZPoEvJ-&q=85&s=d3835e2a1e25eb471370ec8006fa798f" alt="Delta table update" width="1442" height="1409" data-path="assets/data/delta-table-update.svg" />
</Frame>

1. **Identify the File Containing the Row**: The row with `id=2` is located in `1.parquet`.
2. **Mark the Row as Invalid**: Delta Tables use a **deletion vector** to mark the row with `id=2` as invalid in `1.parquet`. This avoids rewriting the entire file.
3. **Create a New Parquet File**: A new Parquet file (`3.parquet`) is created, containing only the updated row: `(2, "Mala")`.
4. **Update the Transaction Log**: A new transaction log entry (`2.json`) is created to record the changes:
   * `1.parquet` is "removed" (marked as invalid for the row with `id=2`).
   * `1.parquet` is "re-added" with a deletion vector specifying that the row with `id=2` should be excluded.
   * `3.parquet` is added, containing the updated row.

```json 2.json theme={"system"}
{
  "version": 2,
  "actions": [
    {
      "remove": {
        "path": "1.parquet",
        "dataChange": true
      }
    },
    {
      "add": {
        "path": "1.parquet",
        "size": 1234,
        "dataChange": true,
        "deletionVector": "{\"id\": 2}" // Marks row with id=2 as invalid
      }
    },
    {
      "add": {
        "path": "3.parquet",
        "size": 567,
        "dataChange": true,
        "stats": "{\"numRecords\": 1}" // Contains the updated row
      }
    }
  ]
}
```

#### 3. Final State

After the update:

* The table now has three Parquet files:
  * `1.parquet`: Contains the original rows `(1, "Arun")` and `(2, "Bala")`, but the row with `id=2` is marked as invalid.
  * `2.parquet`: Contains the original row `(3, "Raj")`.
  * `3.parquet`: Contains the updated row `(2, "Mala")`.

* The transaction log (`_delta_log`) now has three entries:
  * `0.json`: Records the creation of `1.parquet`.
  * `1.json`: Records the addition of `2.parquet`.
  * `2.json`: Records the update (removal, re-addition with deletion vector, and addition of `3.parquet`).

### How Queries Work

When you query the table after the update:

1. The transaction log is consulted to determine which Parquet files are valid and which rows are excluded (via deletion vectors).
2. The query reads:
   * All valid rows from `1.parquet` (excluding the row with `id=2`).
   * All rows from `2.parquet`.
   * The updated row from `3.parquet`.
3. The result is a combined view of the data:

| id | name |                                |
| -- | ---- | ------------------------------ |
| 1  | Arun |                                |
| 2  | Mala | (updated row from `3.parquet`) |
| 3  | Raj  |                                |

### Why Not Just Rewrite the File?

You might wonder why Delta Tables don’t simply rewrite `1.parquet` with the updated row. Here’s why:

1. **Immutability of Parquet Files**:
   * Parquet files are immutable, meaning they cannot be modified once written. Updating a file would break this immutability.

2. **Efficiency**:
   * Rewriting an entire Parquet file for a single row update would be extremely inefficient, especially for large files.
   * By using deletion vectors and adding new files, Delta Tables minimize the amount of data that needs to be rewritten.

3. **Concurrency and Consistency**:
   * Delta Tables are designed to handle concurrent reads and writes. The transaction log ensures that readers always see a consistent snapshot of the data, even while updates are happening.
   * Rewriting files would make it much harder to maintain this consistency.

4. **Time Travel**:
   * Delta Tables support time travel, allowing users to query previous versions of the table. This is possible because the transaction log keeps track of all changes over time.
   * Rewriting files would make it difficult to maintain this history.

### Summary

When you update a row in a Delta Table:

1. The existing row is marked as invalid using a **deletion vector**.
2. A new Parquet file is created to store the updated row.
3. The transaction log is updated to reflect these changes.
4. Queries combine data from all relevant Parquet files, excluding invalid rows, to provide a consistent view of the table.

<CardGroup cols={3}>
  <Card title="Insert in Delta Table" icon="arrow-right-to-bracket" href="/data/delta-table-insert" horizontal />

  <Card title="Update in Delta Table" icon="file-pen" href="/data/delta-table-update" horizontal />

  <Card title="Delete in Delta Table" icon="trash" href="/data/delta-table-delete" horizontal />
</CardGroup>
