Databricks for business leaders - Notes
Azure Databricks
- Fully managed, cloud based data analytics platform
- Built on Apache Spark
- Provisioned as an Azure resource
- Standard tier
- Premium tier
- Trial
Key concepts
- Apache Spark clusters - provide highly scalable parallel compute for distributed data processing.
- Databricks File System (DBFS) - provides distributed shared storage for data lakes.
- Notebooks - provide an interactive environment for combining code, notes, documentation and images.
- Metastore - provides a relational abstraction layer, enabling you to define tables based on data in files.
- Delta Lake - builds on the metastore to enable common relational database capabilities (e.g., ACID compliance, DML).
- SQL warehouses - provide relational compute endpoints for querying data in tables.
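To make the metastore and Delta Lake bullets concrete, here is a minimal sketch that builds the kind of SQL statements involved: one that registers files as a table, one DML update, and one query. The table name `sales`, the path `/mnt/datalake/sales` and both helper functions are hypothetical and exist only for illustration. The sketch only builds and prints the statements, so it runs without a cluster.

```python
def create_table_sql(table: str, path: str) -> str:
    """Return a statement that registers Delta files at `path` as a metastore table."""
    return f"CREATE TABLE IF NOT EXISTS {table} USING DELTA LOCATION '{path}'"


def update_sql(table: str, assignment: str, condition: str) -> str:
    """Return a DML statement of the kind Delta Lake makes possible on file-based tables."""
    return f"UPDATE {table} SET {assignment} WHERE {condition}"


# Hypothetical table name and DBFS path, used only for illustration.
statements = [
    create_table_sql("sales", "/mnt/datalake/sales"),
    update_sql("sales", "status = 'shipped'", "order_id = 42"),
    "SELECT status, COUNT(*) AS orders FROM sales GROUP BY status",
]

for stmt in statements:
    print(stmt)
    # In a Databricks notebook, where `spark` is predefined, each statement
    # could be submitted with: spark.sql(stmt)
```

The data stays in files throughout; the metastore only records where the files live and how to read them as a table.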
Differences between Azure Synapse Analytics and Azure Databricks
- Databricks offers users a single compute engine (Spark).
- Azure Databricks uses a data lakehouse architecture to work with data.
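- Synapse uses two % signs (%%) for magic commands, whereas Databricks uses a single % sign.
- The display() function is a Databricks notebook built-in rather than part of open source Apache Spark. Synapse notebooks provide their own separate display() implementation.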
- Databricks uses an optimized wrapper around the open source Apache Spark framework.
- Medallion architecture - data is refined in stages through bronze (raw), silver (cleaned) and gold (business-ready) layers.
- DBFS mounts the data lake onto the compute cluster to give access to its files. This storage can be linked to a Databricks SQL warehouse.
- In Azure Databricks, a SQL warehouse is equivalent to a lake database in the Synapse Analytics serverless SQL pool rather than a data warehouse in the dedicated SQL pool, because the data is stored in files in DBFS rather than in relational storage.
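The magic-command difference noted in the list above can be illustrated with a small sketch that rewrites the leading magic of a Synapse notebook cell in Databricks style. The mapping table and the function `to_databricks_cell` are hypothetical helpers written for these notes, not part of either product, and the mapping covers only a few common language magics.

```python
# Synapse cell magics (two % signs) mapped to Databricks magics (one % sign).
# Illustrative subset only; other magics are left untouched.
SYNAPSE_TO_DATABRICKS = {
    "%%pyspark": "%python",
    "%%spark": "%scala",
    "%%sql": "%sql",
}


def to_databricks_cell(cell: str) -> str:
    """Rewrite a cell's first-line Synapse magic as the Databricks equivalent."""
    first, sep, rest = cell.partition("\n")
    magic = first.strip()
    if magic in SYNAPSE_TO_DATABRICKS:
        return SYNAPSE_TO_DATABRICKS[magic] + sep + rest
    return cell  # no recognised magic on the first line: leave the cell unchanged


print(to_databricks_cell("%%sql\nSELECT * FROM sales"))
```

Cells without a recognised magic on the first line are returned as they are, so plain Python cells are unaffected.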