Introduction to Databricks

  • Founded by the original creators of Apache Spark; Databricks also created Delta Lake and MLflow.
  • Data lakehouse architecture that decouples compute from storage.
  • Databricks combines the best of data warehouses and data lakes.
  • It provides a simple, open, and unified data platform as a SaaS offering.
  • It is simple to deploy and maintain, and can run on AWS, Azure, or GCP.

Data Warehouse

Pros:

  • Business Intelligence
  • Analytics
  • Structured and clean data.
  • Single source of truth.
  • Predefined schemas.

Cons:

  • No support for semi-structured or unstructured data.
  • Inflexible schemas.
  • Long processing times.
  • Struggles with increases in data volume and velocity.

Data Lakes

Pros:

  • Flexible data storage. It supports structured, semi-structured and unstructured data.
  • It supports streaming.
  • It supports AI and ML workloads.
  • Cost-efficient in the cloud.

Cons:

  • No transactional support.
  • Poor data reliability.
  • Slow analysis performance.
  • Data governance concerns.

Data Lakehouse

Key features of a data lakehouse:

  • Transaction support.
  • Schema enforcement and governance.
  • Data governance.
  • BI support.
  • Decoupled storage from compute.
  • Open storage formats (e.g., Apache Parquet).
  • Support for diverse data types.
  • Support for diverse workloads.
  • End-to-end streaming for real-time data applications.

Databricks Lakehouse Platform

  • One platform to unify all your data, analytics, and AI workloads.
  • Simple: Unifies your data warehousing and AI use cases on a single platform.
  • Open: Built on open source and open standards.
  • Multicloud: One consistent data platform across clouds.
  • An open and reliable data platform to efficiently handle all data types.
  • One security and governance approach for all data assets on all clouds.
  • All AI, SQL, BI, and streaming use cases.
  • Workloads:
    • Data engineering: Ingest and transform
    • Data warehousing: SQL and business insights
    • Data streaming: Real-time insights
    • Data science: ML to predict outcomes
  • Cloud data lake: all structured and unstructured data, deployed on AWS, Azure, or GCP.
  • Delta Lake: provides data reliability and performance.
  • Unity Catalog: provides data governance.
    • Unified governance solution built into the lakehouse platform.
    • Provides auditing and data lineage capabilities.
    • Secure data sharing via Delta Sharing, integrated directly into Unity Catalog. Delta Sharing lets you share data without copying it from one place to another.
    • Existing tables and views can be upgraded to Unity Catalog.
  • Persona based services:
    • Data Engineer
    • Data Analyst
    • Data Scientist
  • It turns specialized teams working in silos into a unified team with shared responsibility.
  • Focus areas for a successful data and AI strategy
    • Processes: for handling and using data.
    • People: who make up the data teams.
    • Platform: that will be used.

Working with Databricks teams

  • Account Executive
  • Solution Architect
  • Delivery Solutions Architect
  • Databricks Professional Services Team

Databricks supports data governance and security

How Databricks brings down the total cost of ownership

  • Eliminate cost and complexity.
    • Retire existing software.
    • Reduce data processing costs.
    • Improve data team productivity.
  • Accelerate innovation
    • Streaming, data warehousing, data science in a single platform.
  • Purchase offerings
    • Pay as you go
    • Committed use
  • Product pricing
    • Products are priced in Databricks Units (DBUs), a normalized unit of processing capability per unit of time.
  • Optimization and performance
    • Databricks runtimes.
    • Delta Lake.
    • Photon runtime: a next-generation query engine that accelerates SQL workloads and cuts costs by speeding up data processing tasks.
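
The pricing model above boils down to simple arithmetic: total cost is DBUs consumed multiplied by the per-DBU rate for the workload. A minimal sketch, where the rates and figures are made-up illustrative values rather than real Databricks pricing:

```python
# Rough DBU cost estimate. The numbers here are invented for
# illustration; actual DBU rates vary by cloud, tier, and workload.

def estimate_cost(dbu_per_hour: float, hours: float, rate_per_dbu: float) -> float:
    # Total cost = DBUs consumed (rate * time) * price per DBU.
    return dbu_per_hour * hours * rate_per_dbu

# e.g. a job cluster consuming 8 DBU/hour for 3 hours at $0.15/DBU
print(f"${estimate_cost(8, 3, 0.15):.2f}")  # → $3.60
```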

Databricks Lakehouse Platform Architecture and Security Fundamentals

Problems encountered when using data lakes

  • Lack of ACID transaction support
  • Lack of schema enforcement, leading to inconsistent and low-quality data.
  • Lack of integration with a data catalog, so there is no single source of truth.
  • Data is kept as immutable blob files, which leads to issues such as ineffective partitioning and too many small files.

Databricks addresses these issues with two technologies:

  • Delta Lake
    • File-based, open-source storage format.
    • It supports ACID transactions.
    • Scalable data and metadata handling: it leverages Spark to scale out metadata processing.
    • Audit history and time travel: a transaction log records every change to the data and allows rollback to earlier versions.
    • Schema enforcement and schema evolution: it prevents the insertion of data with the wrong schema while also allowing the table schema to be explicitly and safely changed.
    • Support for deletes, updates, and merges, which is rare for a distributed processing framework. This enables change data capture, slowly changing dimensions, and streaming upserts.
    • Unified streaming and batch data processing.
    • Compatible with Apache Spark.
    • Delta tables are based on Apache Parquet, a common format for structured data, so an existing Parquet table can easily be converted to a Delta table.
    • It maintains a Delta Lake transaction log.
    • Delta Lake is an open-source project.
  • Photon
    • Photon is compatible with Spark APIs.
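
Two of the Delta Lake ideas above, the versioned transaction log with time travel and schema enforcement on write, can be sketched with a toy in-memory model. `ToyDeltaTable` is invented for illustration and is not the real Delta Lake API:

```python
# Toy model of a Delta-style table: every commit appends a new
# immutable version to a log, and writes are schema-checked.

class ToyDeltaTable:
    def __init__(self, schema):
        self.schema = set(schema)  # expected column names
        self._log = [[]]           # version 0: empty table

    def append(self, rows):
        # Schema enforcement: reject writes with the wrong columns.
        for row in rows:
            if set(row) != self.schema:
                raise ValueError(f"schema mismatch: {sorted(row)}")
        # Each commit produces a new immutable version.
        self._log.append(self._log[-1] + list(rows))

    def read(self, version=None):
        # Read the latest version, or time-travel to an earlier one.
        return self._log[-1 if version is None else version]

t = ToyDeltaTable(schema={"id", "amount"})
t.append([{"id": 1, "amount": 10}])
t.append([{"id": 2, "amount": 20}])
print(t.read(version=1))  # time travel: [{'id': 1, 'amount': 10}]
```

Real Delta tables expose the same idea through SQL (`SELECT ... VERSION AS OF n`) and `RESTORE`, with Parquet data files plus a JSON transaction log underneath.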

Unified governance and security

Unity Catalog

  • It provides fine-grained row-, column-, and view-level access control via SQL.

  • It provides an audit trail to understand who has performed what action against the data.

  • Built-in data search and discovery.

  • Automated lineage for all workloads.

Delta Sharing

Say there is a Delta Lake table whose data we share through a Delta Sharing server, which holds the access permissions. Using the Delta Sharing protocol, you can share the data with a recipient without replicating it. The recipient could be any tool (Power BI, Tableau, pandas, Spark, etc.), any use case (BI, analytics, data science), or any environment (Azure, AWS, GCP, or on-premises).

  • Open cross-platform sharing.

  • Share live data without copying it.

  • Centralized administration and governance.
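
Unity Catalog expresses this fine-grained access control declaratively in SQL; purely as a conceptual sketch, the toy filter below mirrors row- and column-level rules in plain Python (the user, rules, and data are invented for illustration):

```python
# Toy row- and column-level access control: filter rows and project
# columns per user before returning data. Conceptual only; Unity
# Catalog does this with SQL grants, row filters, and column masks.

ROW_FILTER = {"analyst": lambda row: row["region"] == "EU"}
COLUMN_ALLOW = {"analyst": {"id", "region"}}  # columns the user may see

def read_table(user, rows):
    allowed = COLUMN_ALLOW.get(user, set())
    keep = ROW_FILTER.get(user, lambda row: False)  # deny by default
    return [{k: v for k, v in row.items() if k in allowed}
            for row in rows if keep(row)]

data = [{"id": 1, "region": "EU", "ssn": "x"},
        {"id": 2, "region": "US", "ssn": "y"}]
print(read_table("analyst", data))  # → [{'id': 1, 'region': 'EU'}]
```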

Security

Control plane and data plane

  • Control plane: hosts Databricks' backend services, managed by Databricks.
  • Data plane: where compute resources process your data.

Encryption

  • Data-at-rest encryption
  • Data-in-motion encryption

Metastore

  • Catalog
    • Schema
      • Tables
        • Managed table.
        • External table.
      • Views
      • Functions
  • Storage Credential
  • External Location
  • Share (related to Delta Sharing)
  • Recipient (related to Delta Sharing)
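
The hierarchy above is Unity Catalog's three-level namespace: tables are addressed as catalog.schema.table. A toy resolver over that structure (the names `main`, `sales`, `orders`, and `events` are made up for illustration):

```python
# Toy three-level namespace lookup: catalog -> schema -> table.

metastore = {
    "main": {                     # catalog
        "sales": {                # schema
            "orders": "managed",  # table and its type
            "events": "external",
        },
    },
}

def resolve(full_name: str) -> str:
    catalog, schema, table = full_name.split(".")
    return metastore[catalog][schema][table]

print(resolve("main.sales.orders"))  # → managed
```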