One platform to unify all your data, analytics, and AI workloads.
Simple: Unifies your data warehousing and AI use cases on a single platform.
Open: Built on open source and open standards.
Multicloud: One consistent data platform across clouds.
An open and reliable data platform to efficiently handle all data types.
One security and governance approach for all data assets on all clouds.
All AI, SQL, BI, and streaming use cases.
Workloads:
Data engineering: Ingest and transform
Data warehousing: SQL and business insights
Data streaming: Real-time insights
Data science: ML to predict outcomes
Cloud data lake: All structured and unstructured data, deployed in AWS, Azure, or GCP.
Delta Lake: Provides data reliability and performance.
Unity Catalog: Provides data governance.
Unified governance solution built into the lakehouse platform.
Provides auditing and data lineage capabilities.
Secure data sharing with Delta Sharing, integrated directly into Unity Catalog. Delta Sharing shares data without copying it from one place to another.
Existing tables and views can be upgraded to Unity Catalog.
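A minimal sketch of one upgrade path, copying a Hive metastore table into a Unity Catalog table with CTAS from a PySpark notebook; the catalog, schema, and table names below are assumptions for illustration only:

    # Copy an existing workspace (hive_metastore) table into a Unity Catalog table.
    # main.sales.orders and hive_metastore.default.orders are hypothetical names.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.sales.orders
        AS SELECT * FROM hive_metastore.default.orders
    """)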
Persona-based services:
Data Engineer
Data Analyst
Data Scientist
It turns specialized teams working in silos into a unified team with shared responsibility.
Challenges with traditional data lakes:
Lack of schema enforcement, leading to inconsistent and low-quality data.
Lack of integration with a data catalog, so there is no single source of truth for the data.
Data is kept as immutable blob files. This leads to issues such as ineffective partitioning and too many small files.
Databricks solves these issues with two technologies.
Delta Lake
A file-based, open-source storage format.
It supports ACID transactions.
Scalable data and metadata handling. It leverages Spark to scale out all the metadata processing.
Audit history and time travel. It has a transaction log with details about every change to the data and the ability to roll back to earlier versions.
Schema enforcement and schema evolution: It prevents the insertion of data with the wrong schema while also allowing the table schema to be explicitly and safely changed.
Support for deletes, updates, and merges, which is rare for a distributed processing framework. This enables change data capture, slowly changing dimensions, and streaming upserts (see the sketch below).
Unified streaming and batch data processing
Compatible with Apache Spark.
Delta tables are based on Apache Parquet, a common format for structured data, so you can easily convert an existing Parquet table to a Delta table.
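A minimal PySpark sketch of the features above (converting an existing Parquet table, audit history, time travel, schema evolution, and MERGE). The paths, column names, and the new_rows/updates DataFrames are assumptions, and spark is an existing SparkSession with Delta Lake enabled:

    from delta.tables import DeltaTable

    # Convert an existing Parquet table in place to a Delta table (path is hypothetical).
    DeltaTable.convertToDelta(spark, "parquet.`/data/events_parquet`")

    # Audit history: the transaction log records every change to the table.
    DeltaTable.forPath(spark, "/data/events").history().show()

    # Time travel: read an earlier version of the table.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/events")

    # Schema evolution: let new columns in appended data extend the table schema.
    new_rows.write.format("delta").mode("append") \
        .option("mergeSchema", "true").save("/data/events")

    # Deletes, updates, and merges: upsert change data with MERGE.
    target = DeltaTable.forPath(spark, "/data/events")
    (target.alias("t")
        .merge(updates.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())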
Unity Catalog
It provides fine-grained row-, column-, and view-level access control via SQL.
It provides an audit trail to understand who has performed what action against the data.
Built-in data search and discovery.
Automated lineage for all workloads.
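A minimal sketch of what SQL-based access control can look like, run from a notebook; the group name data-analysts and the table main.sales.orders are hypothetical, and the exact privileges available depend on your metastore setup:

    # Grant read access on a table to a group (names are hypothetical).
    spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data-analysts`")

    # Column-level control: expose only selected columns through a view
    # and grant access to the view instead of the underlying table.
    spark.sql("""
        CREATE VIEW IF NOT EXISTS main.sales.orders_redacted AS
        SELECT order_id, order_date, region FROM main.sales.orders
    """)
    spark.sql("GRANT SELECT ON TABLE main.sales.orders_redacted TO `data-analysts`")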
Delta Sharing
Let’s say there is a Delta Lake table and we share its data through a Delta Sharing server, which holds the access permissions. Using the Delta Sharing protocol, you can share the data with a recipient without replicating it. The recipient can be any tool (Power BI, Tableau, pandas, Spark, etc.), serve any use case (BI, analytics, data science), and run on any cloud (Azure, AWS, GCP) or on-premises.
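For example, a recipient could read a shared table with the open source delta-sharing Python client; the profile path and the share, schema, and table names below are assumptions for illustration:

    import delta_sharing

    # The .share profile file comes from the data provider and holds the
    # sharing server endpoint plus a bearer token (path is hypothetical).
    profile = "/path/to/config.share"

    # Table URL format: <profile-path>#<share>.<schema>.<table> (names are hypothetical).
    table_url = profile + "#sales_share.sales.orders"

    # Read the shared table straight into pandas; the provider does not
    # replicate the data ahead of time.
    df = delta_sharing.load_as_pandas(table_url)

    # The same table could also be read into Spark:
    # spark_df = delta_sharing.load_as_spark(table_url)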