Data Governance

Introduction

  • Data governance is a framework of policies, processes, roles, and technical controls that ensures an organization's data is secure, trustworthy, and used responsibly throughout its lifecycle. Its core disciplines include:

  • Access control and security: Implementing fine-grained permissions and security measures to protect data from unauthorized access while enabling appropriate use.

  • Data lineage and observability: Tracking data flows and transformations to understand data origins, dependencies, and usage patterns.

  • Data quality management: Ensuring data is accurate, complete, consistent, and reliable for decision-making and analytics.

  • Metadata management: Capturing and maintaining information about data assets to improve discoverability and understanding.

  • Compliance enforcement: Meeting regulatory requirements and organizational policies for data privacy, retention, and usage.

Unity Catalog

  • A centralized data catalog that provides governance for both structured and unstructured data in multiple formats. It offers fine-grained access control and governance of AI assets such as machine learning models.

  • A metastore is a top-level container for data objects in Databricks.

  • A catalog is a grouping of databases/schemas within a metastore.

  • Unity Catalog introduces a new model in which one metastore can contain multiple catalogs, giving every table a three-level name: catalog.schema.table.
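A minimal sketch of how a three-level name resolves against this hierarchy (pure Python, illustrative only; the catalog, schema, and table names are hypothetical):

```python
# Illustrative sketch of Unity Catalog's object hierarchy:
# metastore -> catalog -> schema -> table. All names are hypothetical.

metastore = {
    "prod_marketing": {            # catalog
        "gold": {                  # schema
            "campaigns": "table",  # table
        }
    }
}

def resolve(full_name: str) -> str:
    """Resolve a catalog.schema.table name against the metastore."""
    catalog, schema, table = full_name.split(".")
    return metastore[catalog][schema][table]

print(resolve("prod_marketing.gold.campaigns"))  # -> "table"
```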

Access Control

Layers of access control

Access control in Unity Catalog is built on the following complementary models:

  • Workspace-level restrictions control where users can access data, by limiting objects to specific workspaces.

  • Privileges and ownership control who can access what, using grants on securable objects.

  • Attribute-based policies (ABAC) control what data users can access, using governed tags and centralized policies.

  • Table-level filtering and masking control what data users can see within tables using table-specific filters and views.

These models work together to enforce secure, fine-grained access across your data environment.

Workspace-level restrictions

  • Purpose: Limit which workspaces can access specific catalogs, external locations, and storage credentials.

  • Mechanism: Workspace-level bindings.

Privileges and ownership

  • Purpose: Control access to catalogs, schemas, tables, and other objects.

  • Mechanism: Privilege grants to users and groups, and object ownership.

Attribute-based policies

  • Purpose: Use tags and policies to dynamically apply filters and masks.

  • Mechanism: ABAC policies and governed tags.

Table-level filtering and masking

  • Purpose: Control what data users can see within tables.

  • Mechanism: Row filters, column masks, and dynamic views.
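As a rough illustration of what a column mask and a row filter do, here is a pure-Python sketch. This is not the Databricks API; the group name, masking rule, and region filter are all made up:

```python
# Pure-Python sketch of column masking and row filtering. It mimics the
# behavior of Unity Catalog mask/filter functions; group names and rules
# are hypothetical.

def mask_ssn(ssn: str, user_groups: set) -> str:
    # Members of the hypothetical 'hr_admins' group see the raw value.
    if "hr_admins" in user_groups:
        return ssn
    # Everyone else sees only the last four digits.
    return "***-**-" + ssn[-4:]

def filter_rows(rows: list, user_groups: set) -> list:
    # Row filter sketch: non-admins only see rows for the US region.
    if "hr_admins" in user_groups:
        return rows
    return [r for r in rows if r["region"] == "US"]

print(mask_ssn("123-45-6789", {"analysts"}))  # -> "***-**-6789"
```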

Securable Object

  • A securable object is an object defined in the Unity Catalog metastore on which privileges can be granted to a principal.

Privilege types

  • The privileges that can be granted on each securable object, such as SELECT, MODIFY, USE CATALOG, USE SCHEMA, and CREATE TABLE.

Principal

  • There are three types of identities, or principals: users, service principals, and groups.

  • Users are individual people, uniquely identified by their email addresses. A user can hold an admin role to perform administrative tasks important to Unity Catalog, such as managing and assigning metastores to workspaces and managing other users.

  • A service principal is an identity for use with automated tools and applications. It is uniquely identified by an application ID.

  • A group collects users and service principals into a single entity. Groups can be nested within other groups; for example, a parent group called employees can contain two inner groups, HR and Finance.
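The nesting described above means a principal's effective membership must be resolved recursively. A small sketch (group layout and member names are hypothetical):

```python
# Sketch of resolving effective members of nested groups. The group
# layout and member identities below are hypothetical.

groups = {
    "employees": {"members": [], "subgroups": ["hr", "finance"]},
    "hr": {"members": ["alice@example.com"], "subgroups": []},
    "finance": {"members": ["bob@example.com", "svc-etl-app-id"], "subgroups": []},
}

def effective_members(group: str) -> set:
    """Collect direct members plus members of all nested subgroups."""
    g = groups[group]
    members = set(g["members"])
    for sub in g["subgroups"]:
        members |= effective_members(sub)
    return members

print(sorted(effective_members("employees")))
```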

Command

  • SHOW GRANTS ON <object>: show the permissions granted on the object.

  • GRANT <privilege> ON <object> TO <principal>: grant a permission on the object.
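A small helper that builds these statements as strings. In a Databricks notebook they would be executed with spark.sql(...); here we only construct them, and the object and principal names are placeholders:

```python
# Build Unity Catalog permission statements as strings. In a Databricks
# notebook they would be run via spark.sql(...); this sketch only
# constructs them. Object and principal names are hypothetical.

def show_grants(object_type: str, object_name: str) -> str:
    return f"SHOW GRANTS ON {object_type} {object_name}"

def grant(privilege: str, object_type: str, object_name: str, principal: str) -> str:
    return f"GRANT {privilege} ON {object_type} {object_name} TO `{principal}`"

print(grant("SELECT", "TABLE", "main.default.customers", "analysts"))
# GRANT SELECT ON TABLE main.default.customers TO `analysts`
```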

Delta Sharing

  • Secure data sharing with other organizations regardless of the computing platforms they use.

  • There are two ways to share data using Delta Sharing: Databricks-to-Databricks Sharing and the Databricks Open Sharing Protocol.

  • Databricks-to-Databricks Sharing lets you share data with other Databricks users whose workspaces use Unity Catalog. It supports sharing not only tables but also views, volumes, and even notebooks.

  • The Databricks Open Sharing Protocol lets you share data that you manage in Unity Catalog with users who don't use Databricks.
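With the open sharing protocol, a recipient receives a credential file (a "profile") and addresses shared tables by share, schema, and table name. A sketch of roughly what such a profile contains; the field values here are placeholders, and in practice the file is downloaded from an activation link and read by a Delta Sharing client library:

```python
import json

# Rough shape of a Delta Sharing profile file a recipient would receive.
# All values are placeholders, not real credentials.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "<redacted>",
}

# A shared table is then addressed by share, schema, and table name,
# e.g. "my_share.my_schema.my_table" (hypothetical names).
table_coordinate = "my_share.my_schema.my_table"

print(json.dumps(profile, indent=2))
```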

Lakehouse Federation

  • Allows users and applications to run queries across diverse data sources, such as data lakes, warehouses, and databases, without physically migrating the data into Databricks. This reduces data duplication and streamlines access, enabling a unified query experience across distributed environments.

Best Practice

1. Identity Management (The "Who")

  • Account-Level Identities: All users, groups, and service principals must be defined at the Account level (not workspace level) to be used in Unity Catalog.

  • SCIM Provisioning: Use SCIM to sync identities from your Identity Provider (IdP) directly to the Databricks account.

  • Group-Based Access: Avoid granting permissions to individual users. Instead, create groups in your IdP and assign permissions to those groups.

  • Service Principals for Jobs: Always use service principals for automated production workflows to prevent job failure if an individual leaves the company.

2. Privilege & Access Control

  • The Power of Inheritance: Use the Catalog > Schema > Table hierarchy. Permissions granted at the Catalog level automatically flow down to all Schemas and Tables within it.

  • "USE" vs. "SELECT": Remember that USE CATALOG and USE SCHEMA are prerequisites. A user cannot query a table unless they have USE permissions on the parent catalog and schema, even if they have SELECT on the table itself.

  • BROWSE Privilege: Grant the BROWSE privilege to "All Users" at the catalog level to allow data discovery (viewing metadata) without giving them actual access to the data until requested.

  • Limit Admin Roles: Be very sparing with the metastore admin and ALL PRIVILEGES roles. Assign object ownership to groups rather than individuals.
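The USE-prerequisite rule above can be sketched as a simple check. The grants, group, and object names below are hypothetical, and this sketch deliberately omits ownership and catalog-level inheritance:

```python
# Sketch of the USE CATALOG / USE SCHEMA prerequisite rule: SELECT on a
# table is not enough without USE on both parents. All grants below are
# hypothetical, and inheritance/ownership are not modeled.

grants = {
    ("analysts", "CATALOG", "prod_marketing"): {"USE CATALOG"},
    ("analysts", "SCHEMA", "prod_marketing.gold"): {"USE SCHEMA"},
    ("analysts", "TABLE", "prod_marketing.gold.campaigns"): {"SELECT"},
}

def can_query(group: str, table: str) -> bool:
    """True only if the group holds USE on both parents plus SELECT."""
    catalog, schema, _ = table.split(".")
    return (
        "USE CATALOG" in grants.get((group, "CATALOG", catalog), set())
        and "USE SCHEMA" in grants.get((group, "SCHEMA", f"{catalog}.{schema}"), set())
        and "SELECT" in grants.get((group, "TABLE", table), set())
    )

print(can_query("analysts", "prod_marketing.gold.campaigns"))  # -> True
```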

3. Data Organization (Catalogs & Schemas)

  • Organizational Alignment: Create catalogs that reflect your business units, teams, or environments (e.g., prod_marketing, dev_finance).

  • Medallion Layers: Use schemas within those catalogs to represent your data stages (e.g., bronze, silver, gold).

  • Workspace Binding: For strict isolation, bind specific catalogs to specific workspaces (e.g., a "Finance" catalog only visible in the "Finance" workspace).

4. Storage & Table Types

  • Prioritize Managed Tables: Use Managed Tables (Delta or Iceberg) by default. They offer the best performance (auto-compaction, metadata caching) and full lifecycle management by Unity Catalog.

  • External Tables for Migration: Only use External Tables if you must keep the data in its original cloud location or if it needs to be accessed by non-Databricks tools.

  • External Locations: Limit who can create EXTERNAL LOCATIONS. These should be handled by admins to connect cloud storage (S3/ADLS) to Unity Catalog securely. Never mount these locations to DBFS.

5. Governance Features

  • Audit Logging: Ensure account-level audit logging is enabled to track every access and permission change within Unity Catalog.

  • Information Schema: Use the information_schema within each catalog to programmatically audit permissions and data lineage.

  • Managed Volumes: Use Volumes for non-tabular data (PDFs, images, machine learning models) so they can be governed with the same ACLs as your tables.
