Data Engineer Associate
Databricks Lakehouse Platform
1. Relationship Between Data Lakehouse and Data Warehouse
Data Warehouse:
- Traditional structured storage optimized for SQL analytics.
- Uses Extract, Transform, Load (ETL) to process structured data.
- Limited support for unstructured/semi-structured data.
- Examples: Snowflake, Redshift, BigQuery.
Data Lakehouse:
- Combines data lake (flexible storage) + data warehouse (performance & governance).
- Supports structured, semi-structured (JSON, Parquet), and unstructured data (images, logs).
- Uses Delta Lake for ACID transactions, schema enforcement, and time travel.
- Enables BI, ML, and streaming on a single platform.
Key Differences:
Feature | Data Warehouse | Data Lakehouse |
---|---|---|
Data Types | Structured only | Structured, semi-structured, unstructured |
Cost | Expensive storage | Cost-effective (object storage) |
Performance | Optimized for SQL | Optimized for SQL + ML + Streaming |
Schema | Schema-on-write | Schema-on-read & schema enforcement |
Reference: Databricks Lakehouse Whitepaper
2. Improvement in Data Quality in Lakehouse Over Data Lake
Data Lakes (Traditional Issues):
- No schema enforcement → “Data Swamp” (poor quality).
- No ACID transactions → Dirty reads/writes.
- Limited metadata management → Hard to govern.
Lakehouse Improvements (via Delta Lake):
- ACID Transactions: Ensures consistency (e.g., concurrent writes).
- Schema Enforcement: Prevents bad data ingestion.
- Time Travel: Roll back to previous versions (e.g., `RESTORE TABLE`).
- Data Quality Checks: Expectations and `CHECK` constraints.
- Unified Governance: Fine-grained access control (row/column-level).
Example:
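A minimal sketch of these features in Delta SQL (the `sales` table, constraint name, and version number are illustrative):

```sql
-- Schema enforcement: writes that don't match the declared schema are rejected
CREATE TABLE sales (id BIGINT, amount DOUBLE) USING DELTA;

-- Data quality check: reject rows where amount is not positive
ALTER TABLE sales ADD CONSTRAINT positive_amount CHECK (amount > 0);

-- Time travel: roll the table back to an earlier version
RESTORE TABLE sales TO VERSION AS OF 3;
```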
Reference: Delta Lake Documentation
3. Silver vs. Gold Tables & Workload Sources
- Medallion Architecture:
- Bronze (Raw): Raw ingested data (e.g., JSON logs, IoT streams).
- Silver (Cleaned): Validated, deduplicated, enriched data.
- Gold (Aggregated): Business-level aggregates (e.g., KPIs, reports).
Table Type | Use Case | Source | Example |
---|---|---|---|
Bronze | Raw ingestion | Kafka, Files | sales_raw |
Silver | Cleaning & Enrichment | Bronze | sales_cleaned |
Gold | Reporting & Aggregation | Silver | sales_monthly_metrics |
Workload Examples:
- Bronze as Source: Stream processing (e.g., Spark Structured Streaming).
- Gold as Source: BI dashboards (Tableau, Power BI).
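The Silver and Gold steps above can be sketched in Delta SQL (table and column names are illustrative):

```sql
-- Silver: validated, deduplicated view of the raw Bronze data
CREATE OR REPLACE TABLE sales_cleaned AS
SELECT DISTINCT id, customer_id, amount, CAST(ts AS TIMESTAMP) AS ts
FROM sales_raw
WHERE amount IS NOT NULL;

-- Gold: business-level monthly aggregate for BI dashboards
CREATE OR REPLACE TABLE sales_monthly_metrics AS
SELECT date_trunc('month', ts) AS month, SUM(amount) AS total_sales
FROM sales_cleaned
GROUP BY date_trunc('month', ts);
```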
4. Databricks Platform Architecture (Control vs. Data Plane)
Control Plane:
- Managed by Databricks (in their cloud).
- Includes:
- UI/API management.
- Job scheduler.
- User authentication (via Databricks account).
Data Plane:
- Runs in customer’s cloud account (AWS/Azure/GCP).
- Includes:
- Compute (Clusters, SQL Warehouses).
- Storage (DBFS, External Blob Storage).
- Delta Lake tables.
Reference: Databricks Architecture
5. All-Purpose vs. Jobs Clusters
Feature | All-Purpose Cluster | Jobs Cluster |
---|---|---|
Use Case | Interactive (Notebooks) | Scheduled/Automated Jobs |
Cost | More expensive (always-on) | Cheaper (terminates after job) |
Access | Multiple users | Single job execution |
Management | Manual start/stop | Auto-terminates |
Example:
- All-Purpose: Data scientists exploring data.
- Jobs Cluster: Nightly ETL pipeline.
6. Cluster Software Versioning (Databricks Runtime)
- Databricks Runtime (DBR): Optimized Spark + extra features (Delta, ML).
- Versions: e.g., `DBR 10.4 LTS`, `DBR 11.3 LTS`.
- LTS: Long-Term Support (stable, supported for an extended period).
- GPU/ML Runtimes: Pre-installed libraries (TensorFlow, PyTorch).
Example:
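A hedged sketch of pinning the runtime version in a cluster spec (JSON fragment; the exact `spark_version` string and node type vary by cloud and release):

```json
{
  "cluster_name": "nightly-etl",
  "spark_version": "10.4.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "autotermination_minutes": 30
}
```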
7. Filtering Accessible Clusters
- Methods:
- UI:
Compute
→ Filter by “Can Attach To”. - API:
GET /api/2.0/clusters/list
→ Checkcan_manage
flag.
- UI:
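A minimal client-side sketch of filtering a `clusters/list`-style response by state (the sample payload is illustrative, not a real API response):

```python
# Filter a clusters/list-style JSON payload on the client side.

def running_clusters(response: dict) -> list[str]:
    """Return the names of clusters currently in the RUNNING state."""
    return [
        c["cluster_name"]
        for c in response.get("clusters", [])
        if c.get("state") == "RUNNING"
    ]

sample = {
    "clusters": [
        {"cluster_name": "etl-cluster", "state": "RUNNING"},
        {"cluster_name": "adhoc-cluster", "state": "TERMINATED"},
    ]
}

print(running_clusters(sample))  # ['etl-cluster']
```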
8. Terminating a Cluster & Impact
Termination:
- Manual: via UI or API (`POST /api/2.0/clusters/delete`).
- Auto-Termination: after a configurable idle time.
Impact:
- Running jobs fail.
- Notebook sessions are disconnected.
- Storage persists (DBFS/Delta tables unaffected).
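A hedged sketch of terminating a cluster via the REST API (workspace host, token, and cluster ID are placeholders):

```shell
curl -X POST "https://<workspace-host>/api/2.0/clusters/delete" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{"cluster_id": "<cluster-id>"}'
```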
9. When to Restart a Cluster
- Scenarios:
- Library install requires restart.
- Configuration changes (e.g., autoscaling).
- Cluster becomes unresponsive.
10. Using Multiple Languages in a Notebook
- Magic Commands: `%python`, `%sql`, `%scala`, `%r`.
Example:
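A hedged sketch of mixing languages in one notebook (each magic command is its own cell; names are illustrative):

```
%python
df = spark.read.table("sales_cleaned")
display(df.limit(10))

%sql
SELECT COUNT(*) AS row_count FROM sales_cleaned
```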
11. Running a Notebook from Another Notebook
%run
Command:dbutils.notebook.run()
(for jobs):
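A hedged sketch of both approaches (paths and parameter values are illustrative):

```
%run ./shared/helpers

# Run a notebook as a child job and capture its string exit value
result = dbutils.notebook.run("./etl/load_sales", 600, {"date": "2024-01-01"})
```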
12. Sharing Notebooks
- Methods:
- Export: `.dbc` or `.ipynb` format.
- Permissions: `Can View`, `Can Edit`, `Can Run`.
- Workspace Access Control: admin-managed.
13. Databricks Repos for CI/CD
- Features:
- Git integration (GitHub, GitLab, Bitbucket).
- Sync notebooks with remote repos.
- Branching/Merging directly in UI.
Example Workflow:
- Clone repo → `git checkout feature-branch`.
- Develop → commit → push.
- Open PR → merge to `main`.
14. Git Operations in Databricks Repos
- Supported: `git clone`, `pull`, `push`, `commit`, `branch`.
- Not Supported: `git rebase`, submodules.
15. Notebook Version Control Limitations (vs. Repos)
- Notebook History:
- Only saves local revisions (no Git integration).
- Repos Advantage:
- Full Git history.
- Collaboration (PRs, branching).
Reference: Databricks Repos Docs