UC Managed Tables in Databricks: Performance, Governance & Secure Sharing

Date: 04/22/2025

Discover how UC Managed Tables in Databricks enhance performance, enable secure data sharing with Delta Sharing, and simplify governance. Learn key features, integration tips, and migration steps.

Shakti Prasad Mohapatra
Senior Solutions Architect

Table of contents


  • What is UC Managed Table?
  • Key Benefits
  • Deep Dive into Features
  • Open API Integration Example (Product Pricing Comparison Use Case)
  • Use Case: Real-Time Product Pricing with External API Integration
  • Step-by-Step Implementation in Databricks
  • Performance Optimization (Predictive, Statistics, Liquid Clustering)
  • Migration Path & Myths Debunked
  • Do you want to know how to migrate external tables to UC Managed Tables to get all the native benefits?


UC Managed Tables are Databricks’ flagship table format. At the recently concluded Technical Kickoff (TKO), Databricks revealed some of the coolest features of UC Managed Tables along with the future roadmap. Below are a few highlights.

What is UC Managed Table?

Databricks supports different types of tables as highlighted below:

Foreign tables:

  • These are tables created in foreign catalogs such as Hive Metastore or AWS Glue and federated with Databricks.
  • Databricks cannot write to these tables; they are read-only from Databricks, although third-party clients can read and write them.
  • They do not get native Databricks benefits such as automated performance optimization or faster ETL/ELT.
  • The main benefit is that they can be governed within UC and are discoverable and accessible in one place.

External tables:

  • These are created within Databricks with an explicitly set location.
  • The file system is the catalog’s source of truth, so anyone with direct access to the file system can bypass governance.
  • Databricks can read data from and write data into the table.
  • But there is no automated performance optimization.

Managed tables:

  • Created within Databricks without specifying any location (see the sketch after this list).
  • The data is stored in UC-managed storage locations inside customer-managed cloud storage.
  • Reads and writes are possible both from Databricks and from third-party clients.
  • The biggest benefit is automated performance optimization powered by the Databricks Data Intelligence Platform.
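
The practical difference shows up at creation time. The sketch below, which uses placeholder catalog, schema, column, and path names, contrasts the two: an external table pins its data to an explicit LOCATION, while a managed table omits the location and lets UC place the data within customer-managed cloud storage.

# External table: an explicit LOCATION ties the data to a path you manage directly
spark.sql("""
    CREATE TABLE my_catalog.my_schema.sales_external (id INT, amount DOUBLE)
    USING DELTA
    LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/sales'
""")

# Managed table: no LOCATION clause; UC manages the path inside customer-managed storage
spark.sql("""
    CREATE TABLE my_catalog.my_schema.sales_managed (id INT, amount DOUBLE)
    USING DELTA
""")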

Key Benefits

  • UC governs access for all clients.
  • The storage location is managed by UC but resides within customer-managed cloud storage.
  • Allows reads and writes from third-party clients via open APIs, against a single copy of data within Unity Catalog.
  • Automated optimization powered by the Databricks Data Intelligence Platform.
  • Continuous improvements on the Databricks roadmap.

Deep Dive into Features

  • UC Managed tables are stored in open formats such as Parquet, Delta, and Iceberg, with metadata on top.
  • They can be accessed by different clients via open catalog APIs (UC REST APIs, Iceberg REST Catalog APIs).
  • They can be shared across organizations using Delta Sharing, with a single copy maintained within UC.

Delta Sharing in Action

Ref: What is Delta Sharing? — Azure Databricks | Microsoft Learn

With Delta Sharing, managed table data can be shared with third-party clients over the secure and reliable Delta Sharing protocol, using the single copy that is managed and governed by UC.

As shown in the diagram, the objects that can be shared are:

  • Tables
  • AI Models
  • Files
  • Functions
  • Views

Let’s now dive into a simple example of how to create and consume a Delta share securely.

Step 1: Enable Delta Sharing (If Not Already Enabled)

In Databricks, enable Delta Sharing in the workspace settings.

Step 2: Create a Delta Share

CREATE SHARE my_data_share;

Step 3: Add a Delta Table to the Share

ALTER SHARE my_data_share
ADD TABLE my_catalog.my_schema.my_table;

Step 4: Grant Access to a Recipient

CREATE RECIPIENT my_client;
GRANT SELECT ON SHARE my_data_share TO RECIPIENT my_client;

For open sharing, Databricks generates an activation link for the recipient, who downloads the credential (profile) file used in the consumer example below.

Access Delta Share (Python) — Consumer Side — Data Receiver

Install Delta Sharing Client

pip install delta-sharing

Read Data from Delta Share (using Pyspark)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaSharingConsumer") \
    .getOrCreate()

# Path to the credential (profile) file downloaded from the provider's activation link
profile_file = "/path/to/config.share"

df = spark.read.format("deltaSharing") \
    .load(profile_file + "#share_name.schema_name.table_name")

df.show()
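
For consumers that do not run Spark, the delta-sharing client installed above can load the shared table straight into pandas. This is a minimal sketch; the credential file path and the share, schema, and table names are placeholders.

import delta_sharing

# Credential (profile) file downloaded from the provider's activation link
profile_file = "/path/to/config.share"

# Table URL format: <profile-file>#<share-name>.<schema-name>.<table-name>
table_url = profile_file + "#share_name.schema_name.table_name"

# Load the shared table directly into a pandas DataFrame
pdf = delta_sharing.load_as_pandas(table_url)
print(pdf.head())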

Open API Integration Example (Product Pricing Comparison Use Case)

  • Delta Lake tables can also be accessed via Unity Catalog REST APIs (a minimal example follows this list).
  • Iceberg tables can be accessed using Iceberg REST Catalog APIs.
  • Both work against a single copy of data within UC. Clients authenticate themselves to UC and use short-lived storage credentials, issued just for them, to access only the table being shared.
  • UC Managed Tables also allow third-party writes, a one-of-a-kind capability.
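
As a quick illustration of the open catalog APIs mentioned above, the sketch below reads a table’s metadata through the Unity Catalog REST API. The workspace URL, token secret, and table name are placeholder assumptions.

import requests

WORKSPACE_URL = "https://<your-workspace-host>"
TOKEN = dbutils.secrets.get(scope="openapi-secrets", key="databricks-token")  # hypothetical secret
TABLE_FULL_NAME = "my_catalog.my_schema.my_table"

# Fetch table metadata (table type, columns, storage location) from Unity Catalog
resp = requests.get(
    f"{WORKSPACE_URL}/api/2.1/unity-catalog/tables/{TABLE_FULL_NAME}",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
table_info = resp.json()
print(table_info["table_type"], [c["name"] for c in table_info["columns"]])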

Here are the steps for accessing external data via a generic OpenAPI (REST API)

Install Required Library (if not available)

%pip install requests

Use Case: Real-Time Product Pricing with External API Integration

Business Objective:

An e-commerce company wants to:

  • Pull real-time competitor pricing data from a third-party pricing intelligence API.
  • Combine it with their internal product catalog.
  • Recommend competitive pricing using business rules or ML models.
  • Push results to dashboards or downstream systems.

Generic Open API request example in Databricks (mock pricing API):

GET https://api.pricingintel.com/v1/competitor-price?product_id=12345

Returns:

{
  "product_id": "12345",
  "competitor_prices": [
    {"store": "Amazon", "price": 99.99},
    {"store": "Walmart", "price": 97.49},
    {"store": "Target", "price": 98.75}
  ]
}
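
The pricing endpoint above is a mock, so the helper below (used in the steps that follow) is a hedged sketch of how it could be wrapped with the requests library; the function name fetch_competitor_prices and the absence of authentication are assumptions for illustration.

import requests

PRICING_API_URL = "https://api.pricingintel.com/v1/competitor-price"

def fetch_competitor_prices(product_id):
    """Call the (mock) external pricing API for one product and return the parsed JSON."""
    response = requests.get(PRICING_API_URL, params={"product_id": product_id}, timeout=10)
    response.raise_for_status()
    return response.json()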

Step-by-Step Implementation in Databricks

Step 1: Simulate the Internal Product Catalog

from pyspark.sql import SparkSession

# Sample internal product catalog
products = [
    {"product_id": "12345", "product_name": "Wireless Mouse", "our_price": 100.00},
    {"product_id": "67890", "product_name": "Bluetooth Headphones", "our_price": 150.00},
]

df_products = spark.createDataFrame(products)
df_products.display()

Step 2: Call External Pricing API for Each Product

import pandas as pd

# Convert Spark to Pandas for row-wise API calls
product_list = df_products.toPandas()

# Call the external pricing API once per product and keep the raw responses
external_prices = {}
for _, row in product_list.iterrows():
    external_prices[row["product_id"]] = fetch_competitor_prices(row["product_id"])

Step 3: Combine External and Internal Data:

# Merge the internal catalog fields with each competitor price returned by the API
pricing_records = []
for _, row in product_list.iterrows():
    ext_data = external_prices[row["product_id"]]
    for competitor in ext_data["competitor_prices"]:
        pricing_records.append({
            "product_id": row["product_id"],
            "product_name": row["product_name"],
            "our_price": row["our_price"],
            "competitor": competitor["store"],
            "competitor_price": competitor["price"]
        })

# Create a Spark DataFrame from the combined results
df_pricing = spark.createDataFrame(pd.DataFrame(pricing_records))
df_pricing.display()

Step 4: Analyze & Recommend Pricing Strategy

from pyspark.sql.functions import avg, col

# Calculate the average competitor price and recommend undercutting it by 50 cents
df_avg_price = df_pricing.groupBy("product_id", "product_name", "our_price") \
    .agg(avg("competitor_price").alias("avg_competitor_price")) \
    .withColumn("recommended_price", col("avg_competitor_price") - 0.50)

df_avg_price.display()

Step 5: Push Results to Downstream Dashboard or Table

# Save results as Delta Table
df_avg_price.write.format("delta").mode("overwrite").saveAsTable("recommended_product_pricing")

Secure API Key Using Databricks Secrets:

API_KEY = dbutils.secrets.get(scope="openapi-secrets", key="pricing-api-key")
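
The retrieved key can then be attached to the API call. How the key is passed (header versus query parameter) depends on the provider, so the Bearer header below is an assumption.

import requests

# Assumed: the pricing API expects the key as a Bearer token in the Authorization header
response = requests.get(
    "https://api.pricingintel.com/v1/competitor-price",
    params={"product_id": "12345"},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
print(response.status_code)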

Performance Optimization (Predictive, Statistics, Liquid Clustering)

Predictive Optimization:

Predictive optimization continuously learns in the background using the following inputs:

  • Telemetry: information about the table, the most popular queries, the columns used in those queries, and the common predicates.
  • Once telemetry is collected, model evaluation takes place to decide which optimizations will improve performance.
  • Once model evaluation is done, the optimizations are run on serverless infrastructure.

Below is a pictorial representation of the entire cycle of how predictive optimization works.

Supported optimizations (GA)

  • Compaction: merges small files into larger ones to reduce scanning.
  • Clustering of newly arriving data.
  • VACUUM: clears unneeded files from older table versions.
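
Predictive optimization is enabled at the account level and can also be controlled per catalog or schema. A minimal sketch, with placeholder names:

# Enable predictive optimization for a whole catalog (name is a placeholder)
spark.sql("ALTER CATALOG my_catalog ENABLE PREDICTIVE OPTIMIZATION")

# Or control it at the schema level; DISABLE and INHERIT are also supported
spark.sql("ALTER SCHEMA my_catalog.my_schema ENABLE PREDICTIVE OPTIMIZATION")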

Automated Statistics (public preview)

Without automated statistics, the challenges are:

  • Users must manually run ANALYZE to collect query-optimization statistics.
  • Users also need to ensure that columns requiring data skipping are among the first 32 columns.

With automatic statistics:

  • Query-optimization statistics are collected and maintained automatically.
  • With automated, intelligent collection of statistics on the required columns, clients should see around a 33% improvement in query execution.
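
For comparison, the manual approach that automatic statistics replaces looks like the following (table name is a placeholder); with automatic statistics there is no need to schedule this yourself.

# Manually compute query-optimization statistics for a table
spark.sql("ANALYZE TABLE my_catalog.my_schema.my_table COMPUTE STATISTICS FOR ALL COLUMNS")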

Liquid Clustering

Delta Lake liquid clustering replaces table partitioning and ZORDER, simplifying data layout decisions and optimizing query performance.

  • Liquid clustering provides the flexibility to redefine clustering keys without rewriting existing data, allowing the data layout to evolve alongside analytic needs over time.
  • Liquid clustering applies to both streaming tables and materialized views.
  • It is widely adopted, with 3,000+ customers using liquid clustering monthly.
  • 170+ PB of data is written into liquid tables monthly.
  • The largest active liquid tables today exceed 2 PB.
  • It is available for both managed and external tables.
  • It is much easier to use with UC Managed Tables.

With external tables, the columns to cluster by must be specified explicitly.

Ex: CREATE TABLE TBL1 (date DATE, product_id STRING, price DOUBLE) CLUSTER BY (date, product_id);

Whereas for UC Managed Tables, automatic liquid clustering is available:

CREATE TABLE TBL1 (date DATE, product_id STRING, price DOUBLE) CLUSTER BY AUTO;

This will take care of:

  • Clustering key selection
  • Clustering new data on write
  • Background clustering
  • No need to select which columns to cluster by, as this is handled automatically based on access patterns
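
Automatic liquid clustering can also be switched on for an existing UC managed table, and a manual OPTIMIZE run clusters data that is already written. A minimal sketch, with a placeholder table name:

# Switch an existing UC managed table to automatic liquid clustering
spark.sql("ALTER TABLE my_catalog.my_schema.tbl1 CLUSTER BY AUTO")

# Optionally trigger clustering of existing data right away
spark.sql("OPTIMIZE my_catalog.my_schema.tbl1")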

Low Latency Queries (GA)

  • Caches Delta metadata for UC managed tables
  • Reduces expensive cloud I/O

Migration Path & Myths Debunked

Myth

Customer objection: External tables are more open since I can point directly to the storage path, while managed tables are more closed because their paths are GUIDs.

Fact

  • Path-based access is insecure, since access is granted directly to the path, bypassing governance.
  • It is also error prone: if two different tables use the same path and are loaded at the same time, the data can be corrupted.
  • On the other hand, UC Managed Tables are completely open in terms of secure access via Delta Sharing, open APIs, and more.
  • The path for a UC Managed Table is managed within customer storage accounts, with a single copy of data governed by UC.

Do you want to know how to migrate external tables to UC Managed Tables to get all the native benefits?

Databricks has a new capability in Private Preview that allows seamless migration from external to managed tables with minimal to zero downtime.

Run a single ALTER command as below:

ALTER TABLE catalog.schema.external_table SET MANAGED

Characteristics:

  • No data movement or duplication
  • Path-based code from DBR continues to work; it is simply redirected automatically to UC metadata
  • Minimal downtime (<1 min) while the Delta log is created in the UC-managed location (see the verification sketch below)
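
Once the conversion has run, the table type can be verified directly; DESCRIBE TABLE EXTENDED should now report the table as MANAGED.

# Verify the table type after conversion; the "Type" row should show MANAGED
spark.sql("DESCRIBE TABLE EXTENDED catalog.schema.external_table").display()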