1. The Four Areas of Data Governance

What is Data Governance?

A framework for managing data availability, usability, integrity, and security in an organization.

The Four Key Areas

  1. Data Discovery

    • Metadata management (e.g., tagging, searchable catalogs).
    • Example: Unity Catalog’s searchable metadata.
  2. Data Quality

    • Ensures accuracy via validation rules (e.g., NOT NULL constraints).
    • Example: Delta Lake schema enforcement.
  3. Data Security

    • Access control (RBAC), encryption, masking.
    • Example: Unity Catalog’s fine-grained permissions.
  4. Data Lineage

    • Tracks data flow from source to consumption.
    • Example: Databricks Lineage Tracking in notebooks.
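
As a rough illustration, three of the four areas map onto SQL statements that a notebook could pass to spark.sql. The sketch below only builds the statement strings and executes nothing. The table name is borrowed from the securables section of these notes; the order_id column, the comment text, and the analyst_group principal are illustrative assumptions. Lineage has no statement here because Unity Catalog records it automatically as queries run.

```python
# Sketch: one SQL statement per governance area (strings only, nothing is executed).
TABLE = "sales_catalog.europe.orders"  # example name reused from these notes

governance_sql = {
    # Discovery: describe the table so it is easier to find in catalog search.
    "discovery": f"COMMENT ON TABLE {TABLE} IS 'European sales orders'",
    # Quality: reject rows where the (assumed) order_id column is NULL.
    "quality": f"ALTER TABLE {TABLE} ALTER COLUMN order_id SET NOT NULL",
    # Security: read-only access for an (assumed) analyst group.
    "security": f"GRANT SELECT ON TABLE {TABLE} TO `analyst_group`",
}

for area, statement in governance_sql.items():
    print(f"{area}: {statement}")
    # In a Databricks notebook this would be: spark.sql(statement)
```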

2. Metastores vs. Catalogs

Hive Metastore (Legacy)

  • What: Stores table metadata (schema, location) for Spark SQL.
  • Limitations:
    • No centralized governance.
    • Scoped to a single workspace; metadata and permissions are not shared across workspaces.

Unity Catalog (Modern)

  • What: A unified governance layer for data across workspaces/clouds.
  • Benefits:
    • Centralized permissions (RBAC).
    • Three-level namespace (catalog.schema.table).

| Feature | Hive Metastore | Unity Catalog |
| --- | --- | --- |
| Scope | Workspace-local | Multi-workspace, multi-cloud |
| Access Control | Basic (POSIX-like) | Fine-grained (RBAC) |
| Lineage | Limited | Built-in |

3. Unity Catalog Securables

What are Securables?

Objects that can have permissions applied in Unity Catalog.

Key Securables

  1. Catalog: Top-level container (e.g., sales_catalog).
  2. Schema (Database): Logical grouping of tables (e.g., sales_catalog.europe).
  3. Table/View: Data objects (e.g., sales_catalog.europe.orders).
  4. Function: UDFs (e.g., sales_catalog.utils.discount).

Permission Examples

GRANT SELECT ON TABLE sales_catalog.europe.orders TO analyst_group;
GRANT USE SCHEMA ON SCHEMA sales_catalog.europe TO marketing_team;
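
A SELECT grant on a table is not enough by itself: in Unity Catalog the principal also needs USE CATALOG on the parent catalog and USE SCHEMA on the parent schema. The helper below is a minimal sketch that derives all three statements from a three-level table name. The function name grants_to_read is hypothetical, and the code only returns strings for you to review or pass to spark.sql.

```python
def grants_to_read(table: str, principal: str) -> list[str]:
    """Return the GRANT statements a principal needs to SELECT from a table."""
    parts = table.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected catalog.schema.table, got {table!r}")
    catalog, schema, _ = parts
    return [
        # Access to the parent containers comes first...
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        # ...then the privilege on the table itself.
        f"GRANT SELECT ON TABLE {table} TO `{principal}`",
    ]


for statement in grants_to_read("sales_catalog.europe.orders", "analyst_group"):
    print(statement)
```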

4. Service Principals

What is a Service Principal?

A non-human identity (e.g., for automation) with controlled permissions.

Use Cases

  • CI/CD pipelines.
  • Automated jobs (e.g., nightly ETL).

How to Create

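  1. Azure: Register an application in Microsoft Entra ID (formerly Azure Active Directory) and add it to Databricks, or create a Databricks-managed service principal.
  2. AWS: Create the service principal in the Databricks account console. It is a Databricks identity, not an IAM role.
  3. GCP: Create the service principal in the Databricks account console. It is a Databricks identity, separate from a GCP service account.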

Best Practice

  • Use instead of personal accounts for production workflows.
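
One way to apply this practice is to have a scheduled job run as a service principal rather than as its creator. The sketch below builds a job settings dictionary in the shape I understand the Jobs API to accept, where run_as.service_principal_name holds the principal's application ID. The job name, notebook path, and ID are placeholders, and the compute settings a real task needs are left out.

```python
def job_settings(name: str, notebook_path: str, sp_application_id: str) -> dict:
    """Build job settings that run a notebook task as a service principal."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                # Compute settings (job cluster, existing cluster, ...) omitted.
            }
        ],
        # The job runs with this identity's permissions, not the creator's.
        "run_as": {"service_principal_name": sp_application_id},
    }


settings = job_settings(
    name="nightly-etl",                       # placeholder job name
    notebook_path="/Shared/etl/nightly",      # placeholder notebook path
    sp_application_id="<application-id>",     # placeholder, never a real secret
)
print(settings["run_as"])
```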

5. Cluster Security Modes for Unity Catalog

Compatible Modes

  1. Single User

    • Runs as a single identity (user or service principal).
    • Use Case: Jobs clusters.
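  2. Shared

    • Multiple users, with isolation between them.
    • Use Case: All-purpose clusters with UC.

Incompatible Modes

  • No Isolation Shared: Provides no isolation between users, so it cannot enforce UC permissions.
  • Legacy modes such as Shared with Table Access Control: Built for the Hive metastore's table ACLs, not UC's privilege model.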

6. Creating a UC-Enabled All-Purpose Cluster

Steps

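  1. Navigate to Compute > Create Cluster.
  2. Select:
    • Access Mode: “Single User” or “Shared” (“No Isolation Shared” cannot access Unity Catalog).
    • Unity Catalog: Confirm the workspace is attached to a metastore; metastores are assigned to workspaces, not to individual clusters.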
  3. Set permissions:
    GRANT ALL PRIVILEGES ON CATALOG sales_catalog TO cluster_user;
    

Example Configuration

| Setting | Value |
| --- | --- |
| Access Mode | Single User |
| Single User Identity | Service principal |
| Unity Catalog | Enabled |
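
The same kind of cluster can be described as a Clusters API spec. This is a sketch under assumptions: the data_security_mode values (SINGLE_USER, USER_ISOLATION for Shared, NONE for No Isolation Shared) reflect my understanding of the API, and the cluster name, runtime version, node type, and application ID are placeholders. The helper supports_unity_catalog is hypothetical and only inspects the spec.

```python
# Access modes that can enforce Unity Catalog permissions (assumed API values).
UC_ACCESS_MODES = {"SINGLE_USER", "USER_ISOLATION"}

cluster_spec = {
    "cluster_name": "uc-all-purpose",          # placeholder
    "spark_version": "<runtime-version>",      # placeholder
    "node_type_id": "<node-type>",             # placeholder
    "num_workers": 2,
    "data_security_mode": "SINGLE_USER",
    # Identity the cluster is assigned to; here a service principal.
    "single_user_name": "<service-principal-application-id>",
}


def supports_unity_catalog(spec: dict) -> bool:
    """Return True when the spec uses an access mode that works with UC."""
    return spec.get("data_security_mode") in UC_ACCESS_MODES


print(supports_unity_catalog(cluster_spec))
print(supports_unity_catalog({"data_security_mode": "NONE"}))
```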

7. Creating a Databricks SQL (DBSQL) Warehouse

What is a DBSQL Warehouse?

A compute resource for running SQL queries, available in serverless, pro, and classic types.

Steps

  1. Navigate to SQL > Warehouses > Create.
  2. Configure:
    • Size: 2X-Small to 4X-Large.
    • Auto-stop: 10 mins to 24 hours.
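  3. Grant access:
    • SQL warehouses work with Unity Catalog without extra setup.
    • Give users or groups (e.g., bi_team) the “Can use” permission on the warehouse (e.g., analytics_wh) from its Permissions dialog. Warehouse access is a workspace permission, not a SQL GRANT.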
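
The size and auto-stop choices can also be scripted. The sketch below validates them and builds a request body using the name, cluster_size, and auto_stop_mins fields as I understand the SQL Warehouses API; the function name warehouse_payload and the default values are assumptions, and nothing is sent to any endpoint.

```python
# T-shirt sizes offered for SQL warehouses, smallest to largest.
WAREHOUSE_SIZES = [
    "2X-Small", "X-Small", "Small", "Medium", "Large",
    "X-Large", "2X-Large", "3X-Large", "4X-Large",
]


def warehouse_payload(name: str, size: str = "2X-Small", auto_stop_mins: int = 10) -> dict:
    """Validate the settings and return a create-warehouse request body."""
    if size not in WAREHOUSE_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {WAREHOUSE_SIZES}")
    if auto_stop_mins < 0:
        raise ValueError("auto_stop_mins must not be negative")
    return {"name": name, "cluster_size": size, "auto_stop_mins": auto_stop_mins}


print(warehouse_payload("analytics_wh", size="Small", auto_stop_mins=30))
```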
    

8. Querying a Three-Level Namespace

Syntax

SELECT * FROM catalog.schema.table;

Example

-- Query a table in 'prod' catalog, 'finance' schema
SELECT * FROM prod.finance.transactions;

Benefits

  • Avoids naming conflicts (e.g., sales.transactions vs. hr.transactions).
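
Shorter names still work because one- and two-part names are resolved against the session's current catalog and schema (set with USE CATALOG and USE SCHEMA). The function below is a simplified model of that resolution, not Databricks code: qualify is a hypothetical helper, and it ignores backtick-quoted identifiers that contain dots.

```python
def qualify(name: str, current_catalog: str, current_schema: str) -> str:
    """Expand a 1-, 2-, or 3-part table name to catalog.schema.table."""
    parts = name.split(".")
    if len(parts) == 1:
        # table -> current catalog and current schema
        return f"{current_catalog}.{current_schema}.{parts[0]}"
    if len(parts) == 2:
        # schema.table -> current catalog
        return f"{current_catalog}.{parts[0]}.{parts[1]}"
    if len(parts) == 3:
        # already fully qualified
        return name
    raise ValueError(f"too many name parts in {name!r}")


print(qualify("transactions", "prod", "finance"))
print(qualify("finance.transactions", "prod", "default"))
print(qualify("prod.finance.transactions", "dev", "default"))
```

All three calls resolve to prod.finance.transactions, which is why fully qualified names are the safer choice in shared code: they do not depend on session state.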

9. Implementing Data Object Access Control

Steps

  1. Grant Permissions:
    GRANT SELECT ON TABLE prod.finance.transactions TO finance_team;
    
  2. Revoke Permissions:
    REVOKE SELECT ON TABLE prod.finance.transactions FROM contractor;
    

Supported Privileges

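  • SELECT, MODIFY, CREATE TABLE, CREATE SCHEMA, USE CATALOG, USE SCHEMA, EXECUTE, ALL PRIVILEGES.
  • USAGE and CREATE are the legacy Hive metastore names; Unity Catalog uses USE CATALOG / USE SCHEMA and CREATE TABLE / CREATE SCHEMA instead.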
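
When grants are managed from code, a common pattern is to compare the grants that exist with the grants you want and emit only the difference. The sketch below does that with plain sets of (privilege, securable, principal) tuples. The reconcile helper and the tuple layout are assumptions for illustration; reading the current grants (for example with SHOW GRANTS) is out of scope here.

```python
Grant = tuple[str, str, str]  # (privilege, securable, principal)


def reconcile(current: set[Grant], desired: set[Grant]) -> list[str]:
    """Return GRANT/REVOKE statements that turn `current` into `desired`."""
    statements = []
    # Grants that are wanted but missing.
    for privilege, securable, principal in sorted(desired - current):
        statements.append(f"GRANT {privilege} ON {securable} TO `{principal}`")
    # Grants that exist but are no longer wanted.
    for privilege, securable, principal in sorted(current - desired):
        statements.append(f"REVOKE {privilege} ON {securable} FROM `{principal}`")
    return statements


current = {("SELECT", "TABLE prod.finance.transactions", "contractor")}
desired = {("SELECT", "TABLE prod.finance.transactions", "finance_team")}
for statement in reconcile(current, desired):
    print(statement)
```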

10. Best Practices

Colocate Metastores with Workspaces

  • Why: Reduces latency and simplifies management.
  • How: Deploy metastore in the same region as workspaces.

Use Service Principals for Connections

  • Why: Avoid personal credentials in automation.
  • Example:
    spark.read.table("prod.sales.orders")  # Uses the service principal's permissions when the job runs as that principal
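
For code that connects from outside a Databricks job, the same practice means reading the service principal's OAuth credentials from the environment instead of hard-coding them. The sketch below assumes the DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET variable names used by Databricks machine-to-machine authentication; load_service_principal_config is a hypothetical helper.

```python
import os

REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET")


def load_service_principal_config(env=os.environ) -> dict:
    """Read service principal credentials from the environment, failing loudly if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "host": env["DATABRICKS_HOST"],
        "client_id": env["DATABRICKS_CLIENT_ID"],
        "client_secret": env["DATABRICKS_CLIENT_SECRET"],
    }
```

Passing a plain dictionary as env makes the function easy to test without touching real credentials.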
    

Segregate Business Units by Catalog

  • Why: Isolate data and permissions by department.
  • Example:
    • marketing_catalog
    • finance_catalog
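
A per-unit layout like this can be generated rather than typed by hand. The sketch below returns the statements to create one catalog per business unit and give the owning group access to it. The catalog_setup helper and the group names are assumptions, and the `<unit>_catalog` naming simply follows the example above.

```python
def catalog_setup(unit: str, group: str) -> list[str]:
    """Return statements that create a unit's catalog and let its group use it."""
    catalog = f"{unit}_catalog"
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`",
    ]


# Hypothetical mapping of business unit -> owning group.
units = {"marketing": "marketing_team", "finance": "finance_team"}
for unit, group in units.items():
    for statement in catalog_setup(unit, group):
        print(statement)
```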

Summary Table: Key Concepts

| Topic | Key Takeaway |
| --- | --- |
| Four Governance Areas | Discovery, Quality, Security, Lineage. |
| Unity Catalog Securables | Catalogs, schemas, tables, functions. |
| Service Principals | Non-human identities for automation. |
| Three-Level Namespace | catalog.schema.table for unified queries. |
| Best Practices | Colocate metastores, use service principals, segregate catalogs by team. |