Data Governance
Introduction
A framework of policies, processes, roles, and technical controls that ensures your organization's data is secure, trustworthy, and used responsibly throughout its lifecycle.
Access control and security: Implementing fine-grained permissions and security measures to protect data from unauthorized access while enabling appropriate use.
Data lineage and observability: Tracking data flows and transformations to understand data origins, dependencies, and usage patterns.
Data quality management: Ensuring data is accurate, complete, consistent, and reliable for decision-making and analytics.
Metadata management: Capturing and maintaining information about data assets to improve discoverability and understanding.
Compliance enforcement: Meeting regulatory requirements and organizational policies for data privacy, retention, and usage.
Unity Catalog

A centralized data catalog that provides governance for both structured and unstructured data in multiple formats. It offers fine-grained access control and governance of AI assets such as machine learning models.
A metastore is a top-level container for data objects in Databricks.
A catalog is a grouping of databases/schemas within a metastore.
Unity Catalog introduces a new model where one metastore can contain multiple catalogs.
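Concretely, this hierarchy means every table is addressed by a three-level name, catalog.schema.table. A minimal sketch of how such a name decomposes (the parse_table_name helper and the example names are illustrative, not a Databricks API):

```python
def parse_table_name(full_name: str) -> dict:
    """Split a Unity Catalog three-level name into its parts.

    Hypothetical helper for illustration only; Unity Catalog itself
    resolves names of the form catalog.schema.table.
    """
    parts = full_name.split(".")
    if len(parts) != 3:
        raise ValueError("expected catalog.schema.table")
    catalog, schema, table = parts
    return {"catalog": catalog, "schema": schema, "table": table}
```

For example, parse_table_name("prod_marketing.gold.daily_sales") separates the catalog (prod_marketing), the schema (gold), and the table (daily_sales).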
Access Control
Layers of access control
Access control in Unity Catalog is built on the following complementary models:
Workspace-level restrictions control where users can access data, by limiting objects to specific workspaces.
Privileges and ownership control who can access what, using grants on securable objects.
Attribute-based policies (ABAC) control what data users can access, using governed tags and centralized policies.
Table-level filtering and masking control what data users can see within tables using table-specific filters and views.
These models work together to enforce secure, fine-grained access across your data environment.
Workspace-level restrictions
Limit which workspaces can access specific catalogs, external locations, and storage credentials
Workspace-level bindings
Privileges and ownership
Control access to catalogs, schemas, tables, and other objects
Privilege grants to users and groups, object ownership
Attribute-based policies
Use tags and policies to dynamically apply filters and masks
ABAC policies and governed tags
Table-level filtering and masking
Control what data users can see within tables
Row filters, column masks, dynamic views
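The row-filter and column-mask ideas can be pictured in plain Python. This is only a sketch of the logic; in Unity Catalog you would implement these as SQL functions attached to a table, and the mask_email/filter_rows helpers here are made up for illustration:

```python
def mask_email(value: str, is_privileged: bool) -> str:
    """Column-mask sketch: privileged users see the raw value,
    everyone else sees a redacted form."""
    if is_privileged:
        return value
    local, _, domain = value.partition("@")
    return local[0] + "***@" + domain

def filter_rows(rows, region_key, user_regions):
    """Row-filter sketch: keep only rows whose region value the
    current user is entitled to see."""
    return [r for r in rows if r[region_key] in user_regions]
```

The same idea in Unity Catalog: a masking function is applied to a column, and a row-filter function is applied to the table, so every query sees only the permitted view of the data.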
Securable Object

A securable object is an object defined in the Unity Catalog metastore on which privileges can be granted to a principal.
Privilege types
The permissions that can be granted to a principal on each object.
Principal
There are three types of identities, or principals: users, service principals, and groups.
Users are individual people, uniquely identified by their email addresses. A user can have an admin role to perform administrative tasks important to Unity Catalog, such as managing and assigning metastores to workspaces and managing other users.
A service principal is an individual identity for use with automated tools and applications. It is uniquely identified by an application ID.
A group collects users and service principals into a single entity. Groups can be nested within other groups; for example, a parent group called employees can contain two inner groups, HR and Finance.
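Nested-group membership can be pictured as a recursive expansion. A minimal sketch (the expand_group helper and the example group and member names are illustrative; Unity Catalog resolves nesting itself, and this sketch assumes no cyclic nesting):

```python
def expand_group(group, groups):
    """Recursively expand a possibly nested group into the set of
    user and service-principal identities it contains.

    `groups` maps a group name to its direct members; any member
    that is itself a key in `groups` is treated as a nested group.
    Illustrative only -- assumes no cycles in the nesting.
    """
    members = set()
    for m in groups.get(group, []):
        if m in groups:
            members |= expand_group(m, groups)
        else:
            members.add(m)
    return members

# Example: an employees parent group containing HR and Finance.
groups = {
    "employees": ["hr", "finance"],
    "hr": ["ana@corp.com"],
    "finance": ["bo@corp.com", "etl-sp-1234"],
}
```

Granting a privilege to employees effectively reaches every identity in the expanded set, which is why group-based grants are preferred over per-user grants.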
Command

SHOW GRANTS ON <object> shows the permissions granted on the object.
GRANT <privilege> ON <object> TO <principal> grants a permission on the object.
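These commands are SQL statements run against the metastore. A hedged sketch that only formats the statement text so it stays self-contained (in a Databricks notebook you would execute the resulting string with spark.sql; the object and principal names below are made up):

```python
def grant_statement(privilege, object_type, object_name, principal):
    """Build a Unity Catalog GRANT statement as a string.

    Formatting only -- execution (e.g. via spark.sql) is left to
    the caller; names here are placeholders for illustration.
    """
    return f"GRANT {privilege} ON {object_type} {object_name} TO `{principal}`"

def show_grants_statement(object_type, object_name):
    """Build a SHOW GRANTS statement as a string."""
    return f"SHOW GRANTS ON {object_type} {object_name}"
```

For example, grant_statement("SELECT", "TABLE", "main.default.trips", "analysts") yields a statement granting the analysts group read access to one table.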
Delta Sharing

Secure data sharing with other organizations regardless of the computing platforms they use.
There are two ways to share data using Delta Sharing: Databricks-to-Databricks sharing and the Databricks open sharing protocol.
Databricks-to-Databricks sharing lets you share data with other Databricks customers who use Unity Catalog. It supports sharing not only tables but also views, volumes, and even notebooks.
The Databricks open sharing protocol lets you share data that you manage in Unity Catalog with users who don't use Databricks.
Lakehouse Federation
Allows users and applications to run queries across diverse data sources, such as data lakes, warehouses, and databases, without requiring the physical migration of data into Databricks. This reduces data duplication and streamlines access, enabling a unified query experience across distributed environments.
Best Practice
1. Identity Management (The "Who")
Account-Level Identities: All users, groups, and service principals must be defined at the Account level (not workspace level) to be used in Unity Catalog.
SCIM Provisioning: Use SCIM to sync identities from your Identity Provider (IdP) directly to the Databricks account.
Group-Based Access: Avoid granting permissions to individual users. Instead, create groups in your IdP and assign permissions to those groups.
Service Principals for Jobs: Always use service principals for automated production workflows to prevent job failure if an individual leaves the company.
2. Privilege & Access Control
The Power of Inheritance: Use the Catalog > Schema > Table hierarchy. Permissions granted at the Catalog level automatically flow down to all Schemas and Tables within it.
"USE" vs. "SELECT": Remember that USE CATALOG and USE SCHEMA are prerequisites. A user cannot query a table unless they have USE permissions on the parent catalog and schema, even if they have SELECT on the table itself.
BROWSE Privilege: Grant the BROWSE privilege to "All Users" at the catalog level to allow data discovery (viewing metadata) without giving them actual access to the data until requested.
Limit Admin Roles: Be very sparing with the metastore admin and ALL PRIVILEGES roles. Assign object ownership to groups rather than individuals.
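The inheritance and USE-vs-SELECT rules can be modeled in a few lines. This is an illustrative model of the check, not the real enforcement logic; the user_privs mapping and the object names used in the example are assumptions:

```python
def can_query(user_privs, catalog, schema, table):
    """Sketch of the USE-vs-SELECT rule: querying a table requires
    USE CATALOG on the catalog, USE SCHEMA on the schema, and SELECT
    on the table. SELECT may also be inherited from a grant higher
    up (on the schema or catalog).

    `user_privs` maps an object name to the set of privileges the
    user holds on it. Illustrative model only.
    """
    def has(obj, priv):
        return priv in user_privs.get(obj, set())

    inherited_select = (has(table, "SELECT") or has(schema, "SELECT")
                        or has(catalog, "SELECT"))
    return (has(catalog, "USE CATALOG")
            and has(schema, "USE SCHEMA")
            and inherited_select)
```

Note how SELECT alone is never enough: without USE on both parents the query is denied, which is exactly the pitfall the note above warns about.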
3. Data Organization (Catalogs & Schemas)
Organizational Alignment: Create catalogs that reflect your business units, teams, or environments (e.g., prod_marketing, dev_finance).
Medallion Layers: Use schemas within those catalogs to represent your data stages (e.g., bronze, silver, gold).
Workspace Binding: For strict isolation, bind specific catalogs to specific workspaces (e.g., a "Finance" catalog only visible in the "Finance" workspace).
4. Storage & Table Types
Prioritize Managed Tables: Use Managed Tables (Delta or Iceberg) by default. They offer the best performance (auto-compaction, metadata caching) and full lifecycle management by Unity Catalog.
External Tables for Migration: Only use External Tables if you must keep the data in its original cloud location or if it needs to be accessed by non-Databricks tools.
External Locations: Limit who can create EXTERNAL LOCATIONS. These should be handled by admins to connect cloud storage (S3/ADLS) to Unity Catalog securely. Never mount these locations to DBFS.
5. Governance Features
Audit Logging: Ensure account-level audit logging is enabled to track every access and permission change within the Unity Catalog.
Information Schema: Use the information_schema within each catalog to programmatically audit permissions and data lineage.
Managed Volumes: Use Volumes for non-tabular data (PDFs, images, machine learning models) so they can be governed with the same ACLs as your tables.
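As a sketch, a query against the table_privileges view of a catalog's information_schema can list who holds which privileges. The query text below is an assumption to illustrate the idea (my_catalog is a placeholder; in a notebook you would run the string with spark.sql):

```python
# Illustrative audit query: list grantees and their privileges per
# table in a catalog. Catalog name is a placeholder; adjust columns
# to your environment before running.
AUDIT_GRANTS_SQL = """
SELECT grantee, privilege_type, table_schema, table_name
FROM my_catalog.information_schema.table_privileges
ORDER BY grantee, table_schema, table_name
"""
```

Running such a query on a schedule gives a programmatic view of grants that complements the account-level audit logs mentioned above.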