INSERT
, UPDATE
, MERGE
, DELETE
._delta_log/
(JSON files).Data | Metadata |
---|---|
Raw content (e.g., rows in a table). | Information about data (e.g., schema, partitions). |
Stored in Parquet files. | Stored in Delta Log (_delta_log/ ). |
Example: {"id": 1, "name": "Alice"} . | Example: {"schema": "id INT, name STRING"} . |
Managed Table | External Table |
---|---|
Databricks controls lifecycle (DROP deletes data). | User manages data (DROP only removes metadata). |
Stored in DBFS root (e.g., dbfs:/user/hive/warehouse/ ). | Stored in user-defined cloud storage (e.g., S3, ADLS). |
Use Case: Ephemeral data (e.g., ETL intermediates). | Use Case: Shared datasets (e.g., raw zone in Lakehouse). |
_delta_log/000000N.json
: Transaction metadata..parquet
: Data files (optimized by OPTIMIZE
).CREATE OR REPLACE
vs. INSERT OVERWRITE
CREATE OR REPLACE TABLE | INSERT OVERWRITE TABLE |
---|---|
Drops and recreates table (resets metadata). | Overwrites data but preserves metadata (e.g., partitions). |
Use Case: Schema changes. | Use Case: Full refreshes without dropping. |
INSERT
/UPDATE
).Triggered | Continuous |
---|---|
Runs on-demand/schedule. | Real-time (low latency). |
Cost: Lower (no always-on cluster). | Cost: Higher (always-running). |
cloudFiles
format.
STRING
for flexibility.schemaEvolutionMode
.ON VIOLATION FAIL UPDATE
: Aborts job on violation.ON VIOLATION DROP ROW
: Silently drops invalid rows.LIVE
Keyword:
STREAM
for Streaming:
Events
tab in DLT UI for failed notebooks.UPDATE
fails midway, no partial changes are committed.INSERT
, UPDATE
, MERGE
, DELETE
)ALTER TABLE
){"id": 1, "name": "Alice"}
).Data | Metadata |
---|---|
Stored in Parquet files | Stored in Delta Log (_delta_log/ ) |
Large volume (TB/PB) | Small (KB/MB) |
Accessed via queries | Managed by Delta Lake |
DESCRIBE TABLE
, youβre viewing metadata. When you SELECT *
, youβre accessing data.
DROP TABLE
) deletes the underlying files.dbfs:/user/hive/warehouse/
).s3://bucket/path/
)._delta_log/
):
.parquet
):
ZORDER
and OPTIMIZE
.INSERT
, UPDATE
, and DELETE
in a single atomic transaction.
Triggered | Continuous |
---|---|
Runs on schedule | Real-time processing |
Lower cost | Higher cost (always-on cluster) |
Higher latency | Low latency |