Data Engineer Associate
Data Governance
1. The Four Areas of Data Governance
What is Data Governance?
A framework for managing data availability, usability, integrity, and security in an organization.
The Four Key Areas
- Data Discovery
  - Metadata management (e.g., tagging, searchable catalogs).
  - Example: Unity Catalog’s searchable metadata.
- Data Quality
  - Ensures accuracy via validation rules (e.g., NOT NULL constraints).
  - Example: Delta Lake schema enforcement.
- Data Security
  - Access control (RBAC), encryption, masking.
  - Example: Unity Catalog’s fine-grained permissions.
- Data Lineage
  - Tracks data flow from source to consumption.
  - Example: Databricks Lineage Tracking in notebooks.
2. Metastores vs. Catalogs
Hive Metastore (Legacy)
- What: Stores table metadata (schema, location) for Spark SQL.
- Limitations:
  - No centralized governance.
  - Scoped to a single workspace, so metadata and permissions are not shared across workspaces.
Unity Catalog (Modern)
- What: A unified governance layer for data across workspaces/clouds.
- Benefits:
  - Centralized permissions (RBAC).
  - Three-level namespace (catalog.schema.table).
Feature | Hive Metastore | Unity Catalog |
---|---|---|
Scope | Workspace-local | Multi-workspace, multi-cloud |
Access Control | Legacy table ACLs, managed per workspace | Fine-grained, centralized (RBAC) |
Lineage | Limited | Built-in |
3. Unity Catalog Securables
What are Securables?
Objects that can have permissions applied in Unity Catalog.
Key Securables
- Catalog: Top-level container (e.g., sales_catalog).
- Schema (Database): Logical grouping of tables (e.g., sales_catalog.europe).
- Table/View: Data objects (e.g., sales_catalog.europe.orders).
- Function: UDFs (e.g., sales_catalog.utils.discount).
Permission Examples
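Reading a table in Unity Catalog generally requires privileges on each level of the hierarchy: USE CATALOG on the catalog, USE SCHEMA on the schema, and SELECT on the table. The sketch below renders those three GRANT statements as strings. The helper name `read_access_grants` and the `analysts` group are hypothetical, and the statements would still have to be run in a notebook or SQL editor by someone allowed to grant them.

```python
def read_access_grants(table_name: str, principal: str) -> list:
    """Render the GRANT statements a principal needs to read one table.

    `table_name` must be a three-level name: catalog.schema.table.
    """
    parts = table_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected a name of the form catalog.schema.table")
    catalog, schema, _table = parts
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
        f"GRANT SELECT ON TABLE {table_name} TO `{principal}`;",
    ]


# `analysts` is a hypothetical group name used for illustration.
for statement in read_access_grants("sales_catalog.europe.orders", "analysts"):
    print(statement)
```

Granting at each level separately keeps the catalog and schema grants reusable: a second table in the same schema only needs one more SELECT grant.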
4. Service Principals
What is a Service Principal?
A non-human identity (e.g., for automation) with controlled permissions.
Use Cases
- CI/CD pipelines.
- Automated jobs (e.g., nightly ETL).
How to Create
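- In Databricks: Add the service principal in the account console or workspace admin settings (or through the SCIM API), then grant it privileges like any other principal.
- Azure: The service principal can be backed by a Microsoft Entra ID (formerly Azure Active Directory) application.
- AWS and GCP: The service principal is a Databricks-managed identity. IAM roles and GCP service accounts are cloud identities used for things like storage access, not Unity Catalog principals.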
Best Practice
- Use service principals instead of personal accounts for production workflows.
5. Cluster Security Modes for Unity Catalog
Compatible Modes
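- Single User (called Dedicated in newer workspace UIs)
  - Runs as a single identity (user or service principal).
  - Use Case: Jobs clusters.
- Shared (called Standard in newer workspace UIs)
  - Multiple users, with isolation between users.
  - Use Case: All-purpose clusters with UC.
Incompatible Modes
- No Isolation Shared: Multiple users with no data isolation; cannot access Unity Catalog data.
- Legacy table access control clusters: Enforce Hive metastore table ACLs rather than UC’s permissions.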
6. Creating a UC-Enabled All-Purpose Cluster
Steps
- Navigate to Compute → Create Cluster.
- Select:
  - Access Mode: “Single User” or “Shared”.
  - Unity Catalog: There is no separate attach step on the cluster; the workspace must already be assigned to a Unity Catalog metastore.
- Set permissions:
Example Configuration
Setting | Value |
---|---|
Access Mode | Single User |
Single User Access | The user or service principal the cluster runs as |
Unity Catalog | Enabled (follows from the access mode) |
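For clusters created through automation rather than the UI, the same choice appears as the `data_security_mode` field of a Clusters API request. The sketch below builds such a payload and rejects access modes that cannot reach Unity Catalog. It assumes the field values `SINGLE_USER` and `USER_ISOLATION` for the Single User and Shared modes. The function name, the cluster name, and the identity string are hypothetical, and the runtime version and node type are left as placeholders to fill in for a real workspace.

```python
import json
from typing import Optional

# UI access modes mapped to the `data_security_mode` values that work with
# Unity Catalog. No Isolation Shared ("NONE") is deliberately left out.
UC_ACCESS_MODES = {
    "Single User": "SINGLE_USER",
    "Shared": "USER_ISOLATION",
}


def uc_cluster_spec(name: str, access_mode: str,
                    single_user: Optional[str] = None) -> dict:
    """Build a cluster payload that is compatible with Unity Catalog."""
    if access_mode not in UC_ACCESS_MODES:
        raise ValueError(f"{access_mode!r} cannot access Unity Catalog")
    spec = {
        "cluster_name": name,
        "spark_version": "<runtime-version>",  # placeholder, not a real value
        "node_type_id": "<node-type>",         # placeholder, not a real value
        "num_workers": 1,
        "data_security_mode": UC_ACCESS_MODES[access_mode],
    }
    if access_mode == "Single User":
        if not single_user:
            raise ValueError("Single User mode needs the identity to run as")
        spec["single_user_name"] = single_user
    return spec


# Hypothetical cluster name and identity, for illustration only.
spec = uc_cluster_spec("nightly-etl", "Single User",
                       single_user="etl-service-principal-id")
print(json.dumps(spec, indent=2))
```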
7. Creating a Databricks SQL (DBSQL) Warehouse
What is a DBSQL Warehouse?
A compute resource for running SQL queries, available in classic, pro, and serverless types.
Steps
- Navigate to SQL → Warehouses → Create.
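- Configure:
  - Size: 2X-Small to 4X-Large.
  - Auto-stop: 10 mins to 24 hours.
- Unity Catalog: No attach step is needed; a warehouse in a workspace assigned to a Unity Catalog metastore can query UC objects.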
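The configuration step above can also be expressed as a request payload for the SQL Warehouses API. This is a minimal sketch that only builds and validates the payload; it assumes the field names `cluster_size`, `auto_stop_mins`, `warehouse_type`, and `enable_serverless_compute`, and the function and warehouse names are hypothetical.

```python
# T-shirt sizes from smallest to largest, as listed in the warehouse UI.
WAREHOUSE_SIZES = (
    "2X-Small", "X-Small", "Small", "Medium", "Large",
    "X-Large", "2X-Large", "3X-Large", "4X-Large",
)


def warehouse_spec(name: str, size: str = "2X-Small",
                   auto_stop_mins: int = 10,
                   serverless: bool = False) -> dict:
    """Build a SQL warehouse payload after basic validation."""
    if size not in WAREHOUSE_SIZES:
        raise ValueError(f"unknown warehouse size: {size!r}")
    if auto_stop_mins < 0:
        raise ValueError("auto_stop_mins must not be negative")
    return {
        "name": name,
        "cluster_size": size,
        "auto_stop_mins": auto_stop_mins,
        # Serverless warehouses are requested as the PRO type plus this flag.
        "warehouse_type": "PRO",
        "enable_serverless_compute": serverless,
    }


# Hypothetical warehouse name, for illustration only.
print(warehouse_spec("reporting-wh", size="Small", auto_stop_mins=15))
```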
8. Querying a Three-Layer Namespace
Syntax
SELECT * FROM <catalog>.<schema>.<table>;
Example
SELECT * FROM sales_catalog.europe.orders;
Benefits
- Avoids naming conflicts (e.g., sales.transactions vs. hr.transactions).
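When queries are assembled in code, it helps to build the three-level name in one place so that every query is fully qualified. The sketch below is one way to do that, quoting each part with backticks. The helper names `qualify` and `select_all` are hypothetical, and the `reporting.transactions` tables are illustrative names placed in the catalogs this guide uses as examples.

```python
def qualify(catalog: str, schema: str, table: str) -> str:
    """Return a backtick-quoted catalog.schema.table name."""
    parts = (catalog, schema, table)
    for part in parts:
        if not part or "`" in part:
            raise ValueError(f"invalid name part: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


def select_all(catalog: str, schema: str, table: str, limit: int = 10) -> str:
    """Render a SELECT over a fully qualified table name."""
    return f"SELECT * FROM {qualify(catalog, schema, table)} LIMIT {int(limit)};"


# Two tables share a schema and table name but live in different catalogs,
# so the fully qualified names never collide.
print(select_all("marketing_catalog", "reporting", "transactions"))
print(select_all("finance_catalog", "reporting", "transactions"))
```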
9. Implementing Data Object Access Control
Steps
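- Grant Permissions: GRANT <privilege> ON <securable_type> <securable_name> TO <principal>;
- Revoke Permissions: REVOKE <privilege> ON <securable_type> <securable_name> FROM <principal>;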
Supported Privileges
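SELECT, MODIFY, USE CATALOG, USE SCHEMA, CREATE TABLE, ALL PRIVILEGES.
The USAGE and CREATE privileges belong to the legacy Hive metastore model. In Unity Catalog they are replaced by USE CATALOG / USE SCHEMA and by object-specific privileges such as CREATE SCHEMA and CREATE TABLE.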
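The grant and revoke steps above mirror each other: GRANT ... TO adds a privilege and REVOKE ... FROM removes it. The sketch below renders both statements for a table and refuses privileges outside a small allow-list. The helper names and the `analysts` group are hypothetical, and the allow-list covers only table-level privileges named in this section.

```python
# Table-level privileges this sketch is willing to render.
TABLE_PRIVILEGES = ("SELECT", "MODIFY", "ALL PRIVILEGES")


def _statement(verb: str, privilege: str, table: str,
               preposition: str, principal: str) -> str:
    """Shared renderer so GRANT and REVOKE stay symmetrical."""
    if privilege not in TABLE_PRIVILEGES:
        raise ValueError(f"unsupported table privilege: {privilege!r}")
    return f"{verb} {privilege} ON TABLE {table} {preposition} `{principal}`;"


def grant(privilege: str, table: str, principal: str) -> str:
    return _statement("GRANT", privilege, table, "TO", principal)


def revoke(privilege: str, table: str, principal: str) -> str:
    return _statement("REVOKE", privilege, table, "FROM", principal)


# `analysts` is a hypothetical group name used for illustration.
print(grant("SELECT", "sales_catalog.europe.orders", "analysts"))
print(revoke("SELECT", "sales_catalog.europe.orders", "analysts"))
```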
10. Best Practices
Colocate Metastores with Workspaces
- Why: Reduces latency and simplifies management.
- How: Deploy metastore in the same region as workspaces.
Use Service Principals for Connections
- Why: Avoid personal credentials in automation.
- Example:
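One way to keep personal credentials out of automation is to read the service principal's OAuth credentials from environment variables that the job runner or CI system injects. The sketch below only collects and checks those values; it does not open a connection. It assumes the variable names `DATABRICKS_HOST`, `DATABRICKS_CLIENT_ID`, and `DATABRICKS_CLIENT_SECRET`, and the function name and sample values are hypothetical.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = (
    "DATABRICKS_HOST",
    "DATABRICKS_CLIENT_ID",
    "DATABRICKS_CLIENT_SECRET",
)


def service_principal_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect service principal credentials, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "host": env["DATABRICKS_HOST"],
        "client_id": env["DATABRICKS_CLIENT_ID"],
        "client_secret": env["DATABRICKS_CLIENT_SECRET"],
    }


# Hypothetical values; in a real pipeline these come from the secret store.
sample_env = {
    "DATABRICKS_HOST": "https://example.cloud.databricks.com",
    "DATABRICKS_CLIENT_ID": "00000000-0000-0000-0000-000000000000",
    "DATABRICKS_CLIENT_SECRET": "not-a-real-secret",
}
config = service_principal_config(sample_env)
print(sorted(config))  # print the keys only, never the secret itself
```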
Segregate Business Units by Catalog
- Why: Isolate data and permissions by department.
- Example: marketing_catalog, finance_catalog
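A per-department layout like this can be scripted so that every business unit gets the same setup. The sketch below renders a CREATE CATALOG statement and a matching USE CATALOG grant for each unit. The function name and the group names are hypothetical; only the `<unit>_catalog` naming pattern comes from the example above.

```python
def catalog_setup(unit: str, group: str) -> list:
    """Render the statements that create a unit's catalog and open it to a group."""
    if not unit.isidentifier():
        raise ValueError(f"invalid business unit name: {unit!r}")
    catalog = f"{unit}_catalog"
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog};",
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`;",
    ]


# Hypothetical mapping of business units to the groups that own them.
units = {"marketing": "marketing_team", "finance": "finance_team"}
for unit, group in units.items():
    for statement in catalog_setup(unit, group):
        print(statement)
```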
Summary Table: Key Concepts
Topic | Key Takeaway |
---|---|
Four Governance Areas | Discovery, Quality, Security, Lineage. |
Unity Catalog Securables | Catalogs, schemas, tables, functions. |
Service Principals | Non-human identities for automation. |
Three-Layer Namespace | catalog.schema.table for unified queries. |
Best Practices | Colocate metastores, use service principals, segregate catalogs by team. |