Building a Zero Trust Architecture with System Tables on Unity Catalog

Data Engineering

Date : 06/25/2024

Learn how to build a Zero Trust Architecture using System Tables on Unity Catalog. Includes operational auditing, tracking sensitive data, and monitoring billing.

Priyanshu Kumar
Senior Manager, Data Engineering


Last year, around May 2023, Databricks launched a much-awaited centralized operational data store that quickly garnered attention for all the right reasons. While everyone was excited to finally capture the end-to-end entity relationships for all objects (tables, notebooks, workflows, DLT pipelines, dashboards, and models), it is the auditing capabilities that truly make it a backend store for monitoring all operational aspects. We can now track every action a user performs on these objects, whether from inside the workspace or outside. That is the bigger breakthrough for creating a near-perfect Zero Trust Architecture (ZTA).

Before we delve deep into ZTA, let us first understand the overall capabilities of system tables. As of this writing, the following tables [1] are available in the "system" catalog:

| Category | Description | Table Name |
| --- | --- | --- |
| Audit logs | Captures all audit events from workspaces. | system.access.audit |
| Clusters | Captures the full history of configuration changes over time for any cluster. | system.compute.clusters |
| SQL warehouse events | Captures events related to SQL warehouses. | system.compute.warehouse_events |
| Billable usage | Captures DBU consumption across your account. | system.billing.usage |
| Pricing | Captures a historical log of SKU pricing. | system.billing.list_prices |
| Table lineage | Captures read or write events on a Unity Catalog table or path. | system.access.table_lineage |
| Column lineage | Captures read or write events on a column. | system.access.column_lineage |
| Node types | Captures the currently available node types with their basic hardware information. | system.compute.node_types |
| Marketplace funnel events | Includes consumer impression and funnel data for your listings. | system.marketplace.listing_funnel_events |
| Marketplace listing access | Includes consumer info for completed request data or get data events on your listings. | system.marketplace.listing_access_events |
| Predictive optimization | Tracks the operation history of the predictive optimization feature. | system.storage.predictive_optimization_operations_history |
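
If you want to check which of these tables are already enabled in your own environment, one option is to list them directly. This is a minimal sketch, assuming the metastore-level information schema in the system catalog is accessible to you:

-- List the tables currently visible in the system catalog
SELECT table_schema, table_name
FROM system.information_schema.tables
WHERE table_catalog = 'system'
ORDER BY table_schema, table_name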

For this blog, we will restrict our analysis to the five tables below, which serve as the base for operational auditing.

  1. system.access.audit
  2. system.compute.clusters
  3. system.compute.warehouse_events
  4. system.billing.usage
  5. system.billing.list_prices

Operational Auditing

The "audit" table in the "access" schema holds the entire universe of operations logged at the workspace level; any interaction with any object is tracked in this table. As of today, Databricks tracks 409 distinct actions (like login, downloads, command execution, etc.) across thirty-five unique services (like Unity Catalog, DBSQL, notebooks, jobs, etc.). Let us look at some interesting analyses this table can help with.
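
To get a feel for this breadth in your own account, you can enumerate the services and actions that have been logged so far. A quick sketch (it only reflects events already captured in your audit table, not every action Databricks can emit):

-- Count the distinct audited actions observed per service
SELECT service_name, count(DISTINCT action_name) AS distinct_actions
FROM system.access.audit
GROUP BY service_name
ORDER BY distinct_actions DESC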

Tracking sensitive data usage in the past 90 days using table name

1. From Job


select event_time, user_identity.email,action_name, "Jobs" as Entity, request_params["jobId"] as job_id
from system.access.audit where event_date > now() - interval '90 day' and service_name='jobs' and action_name="runCommand"
and request_params["commandText"] like '%<>%'

2. From Notebook


select event_time, user_identity.email,action_name, "Notebook" as Entity, request_params["jobId"] as job_id
from system.access.audit where event_date > now() - interval '90 day' and service_name='notebook' and action_name="runCommand"
and request_params["commandText"] like '%<>%'

3. From DBSQL


select event_time, user_identity.email,action_name, "Databricks SQL" as Entity, request_params["jobId"] as job_id
from system.access.audit where event_date > now() - interval '90 day' and service_name='databrickssql' and action_name="commandSubmit"
and request_params["commandText"] like '%<>%'

4. From Catalog Explorer


select event_time, user_identity.email,action_name, "Unity Catalog Explorer" as Entity, request_params["jobId"] as job_id
from `system`.access.audit
where event_date > now() - interval '90 day' and 
action_name IN ('createTable','getTable','deleteTable')
and request_params["full_name_arg"] =  '<>'


Tracking sensitive data usage in the past 90 days using tags

If tables or columns carry tag information that classifies them as sensitive, monitoring your entire lakehouse becomes even simpler.

Example: If a table has been marked with the tag "sensitive:yes", or a column with the tag "pii:yes", then this tag information can be used to see who has accessed all such tables by slightly modifying the previous queries, as shown below.


select action_name, event_time, user_identity.email
from system.access.audit
where event_date > now() - interval '90 day'
  and action_name in ('createTable', 'getTable', 'deleteTable')
  and request_params['full_name_arg'] in (
    select concat_ws('.', catalog_name, schema_name, table_name)
    from system.information_schema.column_tags
    where tag_name = 'pii' and tag_value = 'yes'
    union
    select concat_ws('.', catalog_name, schema_name, table_name)
    from system.information_schema.table_tags
    where tag_name = 'sensitive' and tag_value = 'yes'
  )

Currently, there is no built-in way to automatically tag tables or columns by scanning sample data, as some governance tools in the market do. However, open-source libraries like Presidio, or Databricks' ai_classify AI function, can be used to automate this programmatically, as sketched below.
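
For example, ai_classify can label sampled values from a column and flag candidates for a pii tag. This is an illustrative sketch only, assuming the ai_classify function is enabled in your workspace; the catalog, schema, table, and column names are placeholders, and predictions on sampled data should be reviewed before any tags are applied:

-- Illustrative sketch: classify sampled values from a candidate column
-- <catalog>, <schema>, <table>, <column> are placeholders
select value_sample,
       ai_classify(value_sample, array('pii', 'not_pii')) as predicted_label
from (
  select cast(<column> as string) as value_sample
  from <catalog>.<schema>.<table>
  limit 100
)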

Tracking privileged users/groups

More often than not, we grant users or groups elevated access to support a POC or an interim use case. However, such grants often go untracked in the long run, leaving some users with more privileged access than they need. There is a very easy way to track that at all securable object levels: metastore, catalog, schema, table, external location, storage credential, volume, and Delta share.

A sample query to track privileges across all catalogs and schemas, sorted by the users/groups holding the most privileges:


SELECT grantee, 'catalog' AS securable_type, count(*) AS total
FROM system.information_schema.catalog_privileges
GROUP BY 1, 2
UNION ALL
SELECT grantee, 'schema' AS securable_type, count(*) AS total
FROM system.information_schema.schema_privileges
GROUP BY 1, 2
ORDER BY 2, 3 DESC

[ This can be further drilled down to exact privileges using the 'privilege_type' column, as shown below. ]
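
For example, a sketch of that drill-down over catalog privileges (the same pattern applies to the other *_privileges views):

SELECT grantee, privilege_type, count(*) AS total
FROM system.information_schema.catalog_privileges
GROUP BY grantee, privilege_type
ORDER BY total DESC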

Tracking unauthorized logins per user


SELECT event_date,
       ifnull(user_identity.email, request_params.user) AS email,
       collect_set(action_name) AS action_names,
       collect_set(response.error_message) AS error_messages,
       collect_set(response.status_code) AS response_codes,
       count(*) AS total
FROM system.access.audit
WHERE action_name IN ('aadBrowserLogin', 'aadTokenLogin', 'certLogin', 'jwtLogin', 'login', 'oidcBrowserLogin', 'samlLogin', 'tokenLogin')
  AND response.status_code IN (401, 403)
GROUP BY 1, 2
ORDER BY total DESC

Tracking unauthorized data access requests


SELECT event_date, user_identity.email, source_ip_address, response.error_message, audit_level, action_name
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND response.status_code IN (401, 403)

Tracking hard-coded passwords written in notebooks using a regex pattern


select event_time, user_identity.email, service_name,
       request_params.commandLanguage as Language,
       request_params.commandText as command,
       request_params.notebookId as Notebook_id
from system.access.audit
where action_name = 'runCommand'
  and request_params['commandText'] rlike '(?i)(password|pswd|psswd)'

Tracking downloads from the workspace


SELECT event_date, user_identity.email,
       collect_set(service_name) AS service_names,
       collect_set(action_name) AS action_names,
       count(*) AS total
FROM system.access.audit
WHERE action_name IN ('downloadPreviewResults', 'downloadLargeResults', 'filesGet', 'getModelVersionDownloadUri', 'getModelVersionSignedDownloadUri', 'workspaceExport', 'downloadQueryResult')
GROUP BY 1, 2

Billing Monitoring

The "billing" schema has two tables as of now: "usage" and "list_prices".

The usage table provides vital information for tracking your compute consumption patterns. Illustrative examples are shown below.

Track scheduled job costs


select YearMonth, sku_name, sum(usage_quantity) as Total_DBU_Consumed, sum(list_cost) as Total_List_Cost
from (
  select
    usage.sku_name,
    date_format(usage_date, 'yyyy-MM') as YearMonth,
    usage_quantity,
    list_prices.pricing.default * usage_quantity as list_cost
  from system.billing.usage
  left join system.billing.list_prices
    on usage.sku_name = list_prices.sku_name
  -- usage rows carrying a job_id correspond to scheduled job runs
  where usage_metadata.job_id is not null
)
group by YearMonth, sku_name


Track scheduled job costs on all-purpose compute

Interactive (all-purpose) compute is meant to be used during the development cycle. Once a solution is developed, it should be moved to job clusters as a best practice to reduce costs. The query below can help estimate the cost incurred by jobs running on interactive clusters.


with created_jobs as (
  -- jobs successfully created, with the job_id taken from the API response
  select get_json_object(response.result, '$.job_id') as job_id
  from system.access.audit
  where action_name = 'create'
    and service_name = 'jobs'
    and response.status_code = 200
),
deleted_jobs as (
  -- jobs that have since been deleted
  select request_params.job_id
  from system.access.audit
  where action_name = 'delete'
    and service_name = 'jobs'
    and response.status_code = 200
)
select sku_name, sum(dbus) as total_dbus, sum(list_cost) as total_list_cost
from (
  select a.sku_name,
         a.usage_quantity as dbus,
         a.usage_quantity * d.pricing.default as list_cost
  from system.billing.usage a
  left join (
    select * from created_jobs b
    where b.job_id not in (select job_id from deleted_jobs)
  ) active_jobs
    on a.usage_metadata.job_id = active_jobs.job_id
  left join system.billing.list_prices d
    on a.sku_name = d.sku_name
  where contains(a.sku_name, 'ALL_PURPOSE')
    -- keep only usage that is attributed to a job run
    and a.usage_metadata.job_id is not null
)
group by all


DBU pricing 

To view the standard rates offered across all regions and for all SKUs (all-purpose clusters, job clusters, SQL warehouses, DLT, and model training), you can use the query below and substitute the SKU you are interested in.


SELECT price_start_time, sku_name, currency_code, pricing.default AS cost
FROM system.billing.list_prices
WHERE sku_name = 'PREMIUM_JOBS_COMPUTE'
  AND price_end_time IS NULL

You can also track historical price changes using the price_start_time and price_end_time columns, as in the sketch below.
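
For example, a sketch listing the full price history for a single SKU, ordered by when each price took effect:

SELECT sku_name, currency_code, pricing.default AS list_price, price_start_time, price_end_time
FROM system.billing.list_prices
WHERE sku_name = 'PREMIUM_JOBS_COMPUTE'
ORDER BY price_start_time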

Be aware that the data does not reflect any discounts applied to your account. We hope Databricks comes up with a way of addressing this in the future.
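
Until then, one workaround is to apply your negotiated discount manually on top of the list price. The sketch below assumes a hypothetical flat 20% discount; substitute the rate that actually applies to your contract:

-- Estimated net cost using a hypothetical flat 20% discount (adjust 0.20 to your contract)
SELECT usage.sku_name,
       sum(usage.usage_quantity * list_prices.pricing.default) AS total_list_cost,
       sum(usage.usage_quantity * list_prices.pricing.default) * (1 - 0.20) AS estimated_net_cost
FROM system.billing.usage
LEFT JOIN system.billing.list_prices
  ON usage.sku_name = list_prices.sku_name
 AND list_prices.price_end_time IS NULL  -- join only to the current list price
GROUP BY usage.sku_name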

Features to look out for

While system tables have created good momentum for auditing, monitoring, and lineage capabilities, we are curious to see whether Databricks will bring features around:

  1. Tracking lineage information for hive_metastore-based tables
  2. Automated classification of sensitive data at the table and column level
  3. Tracking pricing as per the discounts applied to a customer's subscription
  4. Tracking Delta Live Tables clusters

References:

  1. https://docs.databricks.com/en/admin/system-tables/index.html