Data Engineer Associate
Databricks Lakehouse Platform
1. Relationship Between Data Lakehouse and Data Warehouse
Data Warehouse:
- Traditional structured storage optimized for SQL analytics.
- Uses Extract, Transform, Load (ETL) to process structured data.
- Limited support for unstructured/semi-structured data.
- Examples: Snowflake, Redshift, BigQuery.
Data Lakehouse:
- Combines data lake (flexible storage) + data warehouse (performance & governance).
- Supports structured, semi-structured (JSON, Parquet), and unstructured data (images, logs).
- Uses Delta Lake for ACID transactions, schema enforcement, and time travel.
- Enables BI, ML, and streaming on a single platform.
Key Differences:
Feature | Data Warehouse | Data Lakehouse |
---|---|---|
Data Types | Structured only | Structured, semi-structured, unstructured |
Cost | Expensive storage | Cost-effective (object storage) |
Performance | Optimized for SQL | Optimized for SQL + ML + Streaming |
Schema | Schema-on-write | Schema-on-read & schema enforcement |
Reference: Databricks Lakehouse Whitepaper
2. Improvement in Data Quality in Lakehouse Over Data Lake
Data Lakes (Traditional Issues):
- No schema enforcement → “Data Swamp” (poor quality).
- No ACID transactions → Dirty reads/writes.
- Limited metadata management → Hard to govern.
Lakehouse Improvements (via Delta Lake):
- ACID Transactions: Ensures consistency (e.g., concurrent writes).
- Schema Enforcement: Prevents bad data ingestion.
- Time Travel: Roll back to previous versions (e.g., `RESTORE TABLE`).
- Data Quality Checks: Expectations and `CHECK` constraints.
- Unified Governance: Fine-grained access control (row/column-level).
Example:
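A minimal sketch of these features in Delta SQL (the `sales` table, constraint name, and version number are illustrative):

```sql
-- Schema enforcement: writes that don't match the declared schema are rejected
CREATE TABLE sales (id BIGINT, amount DOUBLE) USING DELTA;

-- Data quality check: reject rows where amount is not positive
ALTER TABLE sales ADD CONSTRAINT positive_amount CHECK (amount > 0);

-- Time travel: roll the table back to an earlier version
RESTORE TABLE sales TO VERSION AS OF 3;
```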
Reference: Delta Lake Documentation
3. Silver vs. Gold Tables & Workload Sources
- Medallion Architecture:
- Bronze (Raw): Raw ingested data (e.g., JSON logs, IoT streams).
- Silver (Cleaned): Validated, deduplicated, enriched data.
- Gold (Aggregated): Business-level aggregates (e.g., KPIs, reports).
Table Type | Use Case | Source | Example |
---|---|---|---|
Bronze | Raw ingestion | Kafka, Files | sales_raw |
Silver | Cleaning & Enrichment | Bronze | sales_cleaned |
Gold | Reporting & Aggregation | Silver | sales_monthly_metrics |
Workload Examples:
- Bronze as Source: Stream processing (e.g., Spark Structured Streaming).
- Gold as Source: BI dashboards (Tableau, Power BI).
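The Silver and Gold steps above can be sketched in Delta SQL (table and column names are illustrative):

```sql
-- Silver: validated, deduplicated view of the raw Bronze data
CREATE OR REPLACE TABLE sales_cleaned AS
SELECT DISTINCT id, customer_id, amount, CAST(ts AS TIMESTAMP) AS ts
FROM sales_raw
WHERE amount IS NOT NULL;

-- Gold: business-level monthly aggregate for BI dashboards
CREATE OR REPLACE TABLE sales_monthly_metrics AS
SELECT date_trunc('month', ts) AS month, SUM(amount) AS total_sales
FROM sales_cleaned
GROUP BY date_trunc('month', ts);
```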
4. Databricks Platform Architecture (Control vs. Data Plane)
Control Plane:
- Managed by Databricks (in their cloud).
- Includes:
- UI/API management.
- Job scheduler.
- User authentication (via Databricks account).
Data Plane:
- Runs in customer’s cloud account (AWS/Azure/GCP).
- Includes:
- Compute (Clusters, SQL Warehouses).
- Storage (DBFS, External Blob Storage).
- Delta Lake tables.
Reference: Databricks Architecture
5. All-Purpose vs. Jobs Clusters
Feature | All-Purpose Cluster | Jobs Cluster |
---|---|---|
Use Case | Interactive (Notebooks) | Scheduled/Automated Jobs |
Cost | More expensive (always-on) | Cheaper (terminates after job) |
Access | Multiple users | Single job execution |
Management | Manual start/stop | Auto-terminates |
Example:
- All-Purpose: Data scientists exploring data.
- Jobs Cluster: Nightly ETL pipeline.
6. Cluster Software Versioning (Databricks Runtime)
- Databricks Runtime (DBR): Optimized Spark + extra features (Delta, ML).
- Versions: e.g., `DBR 10.4 LTS`, `DBR 11.3 LTS`.
- LTS: Long-Term Support (stable, supported for an extended period).
- GPU/ML Runtimes: Pre-installed libraries (TensorFlow, PyTorch).
Example:
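A hedged sketch of pinning the runtime version in a cluster spec (JSON fragment; the exact `spark_version` string and node type vary by cloud and release):

```json
{
  "cluster_name": "nightly-etl",
  "spark_version": "10.4.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "autotermination_minutes": 30
}
```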
7. Filtering Accessible Clusters
- Methods:
- UI:
Compute
→ Filter by “Can Attach To”. - API:
GET /api/2.0/clusters/list
→ Checkcan_manage
flag.
- UI:
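A minimal client-side sketch of filtering a `clusters/list`-style response by state (the sample payload is illustrative, not a real API response):

```python
# Filter a clusters/list-style JSON payload on the client side.

def running_clusters(response: dict) -> list[str]:
    """Return the names of clusters currently in the RUNNING state."""
    return [
        c["cluster_name"]
        for c in response.get("clusters", [])
        if c.get("state") == "RUNNING"
    ]

sample = {
    "clusters": [
        {"cluster_name": "etl-cluster", "state": "RUNNING"},
        {"cluster_name": "adhoc-cluster", "state": "TERMINATED"},
    ]
}

print(running_clusters(sample))  # ['etl-cluster']
```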
8. Terminating a Cluster & Impact
Termination:
- Manual: via UI or API (`POST /api/2.0/clusters/delete`).
- Auto-Termination: after a configurable idle time.
Impact:
- Running jobs fail.
- Notebook sessions are disconnected.
- Storage persists (DBFS/Delta tables unaffected).
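A hedged sketch of terminating a cluster via the REST API (workspace host, token, and cluster ID are placeholders):

```shell
curl -X POST "https://<workspace-host>/api/2.0/clusters/delete" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{"cluster_id": "<cluster-id>"}'
```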
9. When to Restart a Cluster
- Scenarios:
- Library install requires restart.
- Configuration changes (e.g., autoscaling).
- Cluster becomes unresponsive.
10. Using Multiple Languages in a Notebook
- Magic Commands: `%python`, `%sql`, `%scala`, `%r`.
Example:
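A hedged sketch of mixing languages in one notebook (each magic command is its own cell; names are illustrative):

```
%python
df = spark.read.table("sales_cleaned")
display(df.limit(10))

%sql
SELECT COUNT(*) AS row_count FROM sales_cleaned
```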
11. Running a Notebook from Another Notebook
%run
Command:dbutils.notebook.run()
(for jobs):
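A hedged sketch of both approaches (paths and parameter values are illustrative):

```
%run ./shared/helpers

# Run a notebook as a child job and capture its string exit value
result = dbutils.notebook.run("./etl/load_sales", 600, {"date": "2024-01-01"})
```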
12. Sharing Notebooks
- Methods:
- Export: `.dbc` or `.ipynb` format.
- Permissions: `Can View`, `Can Edit`, `Can Run`.
- Workspace Access Control: admin-managed.
13. Databricks Repos for CI/CD
- Features:
- Git integration (GitHub, GitLab, Bitbucket).
- Sync notebooks with remote repos.
- Branching/Merging directly in UI.
Example Workflow:
- Clone repo → `git checkout feature-branch`.
- Develop → commit → push.
- Open PR → merge to `main`.
14. Git Operations in Databricks Repos
- Supported: `git clone`, `pull`, `push`, `commit`, `branch`.
- Not Supported: `git rebase`, submodules.
15. Notebook Version Control Limitations (vs. Repos)
- Notebook History:
- Only saves local revisions (no Git integration).
- Repos Advantage:
- Full Git history.
- Collaboration (PRs, branching).
Reference: Databricks Repos Docs