1. Relationship Between Data Lakehouse and Data Warehouse

  • Data Warehouse:

    • Traditional structured storage optimized for SQL analytics.
    • Uses Extract, Transform, Load (ETL) to process structured data.
    • Limited support for unstructured/semi-structured data.
    • Examples: Snowflake, Redshift, BigQuery.
  • Data Lakehouse:

    • Combines data lake (flexible storage) + data warehouse (performance & governance).
    • Supports structured, semi-structured (JSON, Parquet), and unstructured data (images, logs).
    • Uses Delta Lake for ACID transactions, schema enforcement, and time travel.
    • Enables BI, ML, and streaming on a single platform.

Key Differences:

Feature      | Data Warehouse      | Data Lakehouse
Data Types   | Structured only     | Structured, semi-structured, unstructured
Cost         | Expensive storage   | Cost-effective (object storage)
Performance  | Optimized for SQL   | Optimized for SQL + ML + streaming
Schema       | Schema-on-write     | Schema-on-read & schema enforcement
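
Example (a minimal sketch of the "single platform" point; the sales table and its columns are hypothetical):

# BI: SQL aggregate over a Delta table (spark is predefined in notebooks)
kpis = spark.sql("SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region")

# ML: the same table read as a DataFrame for feature preparation, no copy needed
features = spark.table("sales").select("amount", "discount", "region")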

Reference: Databricks Lakehouse Whitepaper


2. Improvement in Data Quality in Lakehouse Over Data Lake

  • Data Lakes (Traditional Issues):

    • No schema enforcement → “Data Swamp” (poor quality).
    • No ACID transactions → Dirty reads/writes.
    • Limited metadata management → Hard to govern.
  • Lakehouse Improvements (via Delta Lake):

    • ACID Transactions: Ensures consistency (e.g., concurrent writes).
    • Schema Enforcement: Prevents bad data ingestion.
    • Time Travel: Roll back to previous versions (e.g., RESTORE TABLE).
    • Data Quality Checks: Expectations (e.g., CHECK constraints).
    • Unified Governance: Fine-grained access control (Row/Column-level).

Example:

# Schema enforcement is on by default: a write with a mismatched schema fails
df.write.format("delta").mode("append").save("/mnt/delta/events")
# To evolve the schema intentionally, opt in with mergeSchema
df.write.format("delta").mode("append").option("mergeSchema", "true").save("/mnt/delta/events")
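
The time-travel and data-quality bullets above can be sketched the same way; CHECK constraints and RESTORE are Delta Lake SQL commands, shown here via spark.sql against a hypothetical events table:

# Data quality expectation: writes violating the constraint are rejected
spark.sql("ALTER TABLE events ADD CONSTRAINT valid_ts CHECK (event_ts IS NOT NULL)")

# Time travel: roll the table back to an earlier version
spark.sql("RESTORE TABLE events TO VERSION AS OF 5")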

Reference: Delta Lake Documentation


3. Silver vs. Gold Tables & Workload Sources

  • Medallion Architecture:
    • Bronze (Raw): Raw ingested data (e.g., JSON logs, IoT streams).
    • Silver (Cleaned): Validated, deduplicated, enriched data.
    • Gold (Aggregated): Business-level aggregates (e.g., KPIs, reports).

Table Type | Use Case                | Source       | Example
Bronze     | Raw ingestion           | Kafka, files | sales_raw
Silver     | Cleaning & enrichment   | Bronze       | sales_cleaned
Gold       | Reporting & aggregation | Silver       | sales_monthly_metrics

Workload Examples:

  • Bronze as Source: Stream processing (e.g., Spark Structured Streaming).
  • Gold as Source: BI dashboards (Tableau, Power BI).
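
A minimal sketch of the "Bronze as Source" case with Spark Structured Streaming, reusing the hypothetical tables from the table above (sales_raw as Bronze, sales_cleaned as Silver) plus an assumed order_id key and checkpoint path:

# Read Bronze as a stream, apply Silver-level cleaning, append to Silver
bronze = spark.readStream.table("sales_raw")

silver = (
    bronze
    .dropDuplicates(["order_id"])      # deduplicate on a business key
    .filter("amount IS NOT NULL")      # drop invalid rows
)

(silver.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/sales_cleaned")
    .outputMode("append")
    .toTable("sales_cleaned"))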

4. Databricks Platform Architecture (Control vs. Data Plane)

  • Control Plane:

    • Managed by Databricks (in their cloud).
    • Includes:
      • UI/API management.
      • Job scheduler.
      • User authentication (via Databricks account).
  • Data Plane:

    • Runs in customer’s cloud account (AWS/Azure/GCP).
    • Includes:
      • Compute (Clusters, SQL Warehouses).
      • Storage (DBFS, External Blob Storage).
      • Delta Lake tables.

Reference: Databricks Architecture


5. All-Purpose vs. Jobs Clusters

Feature     | All-Purpose Cluster        | Jobs Cluster
Use Case    | Interactive (notebooks)    | Scheduled/automated jobs
Cost        | More expensive (always on) | Cheaper (terminates after job)
Access      | Multiple users             | Single job execution
Management  | Manual start/stop          | Auto-terminates

Example:

  • All-Purpose: Data scientists exploring data.
  • Jobs Cluster: Nightly ETL pipeline.
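
The difference shows up directly in the Jobs API (2.1): a task that declares new_cluster gets an ephemeral jobs cluster, while existing_cluster_id attaches to an always-on all-purpose cluster. A hedged sketch with hypothetical names and IDs:

# Payload for POST /api/2.1/jobs/create
job_payload = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Shared/etl"},
            # Jobs cluster: created for this run, terminated when it finishes
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "i3.xlarge",  # cloud-specific; AWS example
                "num_workers": 2,
            },
            # Alternative: reuse an all-purpose cluster (stays running afterwards)
            # "existing_cluster_id": "0123-456789-abcde123",
        }
    ],
}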

6. Cluster Software Versioning (Databricks Runtime)

  • Databricks Runtime (DBR): Optimized Spark + extra features (Delta, ML).
    • Versions: e.g., DBR 10.4 LTS, DBR 11.3 LTS.
    • LTS: Long-Term Support (stable).
    • GPU/ML Runtimes: Pre-installed libraries (TensorFlow, PyTorch).

Example:

# Cluster config: DBR 11.3 LTS (Spark 3.3), spark_version "11.3.x-scala2.12"
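
From inside a notebook, the runtime version can be checked at execution time; DATABRICKS_RUNTIME_VERSION is an environment variable the platform sets, and spark is predefined:

import os

# Databricks Runtime version, e.g., "11.3"
print(os.environ.get("DATABRICKS_RUNTIME_VERSION"))

# Underlying Spark version, e.g., "3.3.0"
print(spark.version)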

7. Filtering Accessible Clusters

  • Methods:
    • UI: Compute → filter the cluster list; attaching to a cluster requires the “Can Attach To” permission.
    • API: GET /api/2.0/clusters/list returns the clusters visible to the caller; per-cluster permission levels (e.g., CAN_ATTACH_TO, CAN_MANAGE) come from the Permissions API.
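
A minimal sketch of the API route using the Python requests library; the environment variable names are assumptions, and the token is a personal access token:

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g., https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

resp = requests.get(f"{host}/api/2.0/clusters/list",
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

# Print the clusters visible to the caller
for c in resp.json().get("clusters", []):
    print(c["cluster_id"], c["cluster_name"], c["state"])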

8. Terminating a Cluster & Impact

  • Termination:

    • Manual: UI or API (POST /api/2.0/clusters/delete terminates the cluster).
    • Auto-Termination: After idle time (configurable).
  • Impact:

    • Running jobs fail.
    • Notebook sessions are disconnected (in-memory state is lost).
    • Storage persists (DBFS and Delta tables are unaffected).
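
A minimal sketch of manual termination over the REST API (same host/token conventions as the listing example above; the cluster ID is hypothetical). clusters/delete terminates the cluster but keeps its configuration, so it can be restarted later:

import os
import requests

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/clusters/delete",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"cluster_id": "0123-456789-abcde123"},  # hypothetical cluster ID
)
resp.raise_for_status()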

9. When to Restart a Cluster

  • Scenarios:
    • Library install requires restart.
    • Configuration changes (e.g., autoscaling).
    • Cluster becomes unresponsive (restart via UI or POST /api/2.0/clusters/restart).

10. Using Multiple Languages in a Notebook

  • Magic Commands:
    • %python, %sql, %scala, %r.
    • Example (each magic command sets the language for its own cell):
      %sql
      SELECT * FROM sales_cleaned LIMIT 10;
      
      %python
      df = spark.table("sales_cleaned")
      

11. Running a Notebook from Another Notebook

  • %run (inlines the target notebook; its functions and variables become available in the current session):
    %run "/Shared/notebook1"

  • dbutils.notebook.run() (runs the target as a separate ephemeral job and returns its exit value as a string):
    dbutils.notebook.run("notebook1", 60, {"param1": "value1"})  # path, timeout_seconds, arguments
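
The round trip is worth sketching: the child notebook returns a value with dbutils.notebook.exit, and the caller receives it as a string (paths and values are hypothetical):

# In the child notebook ("/Shared/notebook1"): return a result to the caller
dbutils.notebook.exit("42")

# In the parent notebook: run() blocks until the child finishes,
# then returns the child's exit value as a string
result = dbutils.notebook.run("/Shared/notebook1", 60, {"param1": "value1"})
print(result)  # "42"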
    

12. Sharing Notebooks

  • Methods:
    • Export: .dbc or .ipynb format.
    • Permissions: Can Read, Can Run, Can Edit, Can Manage.
    • Workspace Access Control: Admin-managed.

13. Databricks Repos for CI/CD

  • Features:
    • Git integration (GitHub, GitLab, Bitbucket).
    • Sync notebooks with remote repos.
    • Branching/Merging directly in UI.

Example Workflow:

  1. Clone repo → git checkout feature-branch.
  2. Develop → Commit → Push.
  3. Open PR → Merge to main.
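
For the automation side, the Repos API can move a workspace repo to the tip of a branch after a merge (e.g., as a CD step). A minimal sketch with the same host/token conventions as earlier examples and a hypothetical repo ID:

import os
import requests

repo_id = "123"  # hypothetical; discoverable via GET /api/2.0/repos

# Check out the latest commit of main in the workspace repo (like a pull)
resp = requests.patch(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"branch": "main"},
)
resp.raise_for_status()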

14. Git Operations in Databricks Repos

  • Supported:
    • git clone, pull, push, commit, branch.
  • Not Supported:
    • git rebase, submodules.

15. Notebook Version Control Limitations (vs. Repos)

  • Notebook History:
    • Only saves local revisions (no Git integration).
  • Repos Advantage:
    • Full Git history.
    • Collaboration (PRs, branching).

Reference: Databricks Repos Docs