Around May 2023, Databricks launched its much-awaited centralized operational data store, system tables, which quickly garnered attention for the right reasons. While everyone was excited to finally capture end-to-end entity relationships for all objects (tables, notebooks, workflows, DLT pipelines, dashboards, and models), the auditing capabilities truly made it a backend store for monitoring all operational aspects. We can now track every action a user performs on the objects mentioned above, whether from inside the workspace or outside. That is the bigger breakthrough for building a near-perfect Zero Trust Architecture (ZTA).
Before we delve deeper into ZTA, let us first understand the overall capabilities of system tables. As of this writing, the following tables are available in the “system” catalog (a quick way to check which of them are enabled in your metastore is shown right after the table):
| Category | Description | Table Name |
| --- | --- | --- |
| Audit logs | Captures all audit events from workspaces. | system.access.audit |
| Clusters | Captures the full history of configuration changes over time for any cluster. | system.compute.clusters |
| SQL warehouse events | Captures events related to SQL warehouses. | system.compute.warehouse_events |
| Billable usage | Captures DBU consumption across your account. | system.billing.usage |
| Pricing | Captures a historical log of SKU pricing. | system.billing.list_prices |
| Table lineage | Captures read or write events on a Unity Catalog table or path. | system.access.table_lineage |
| Column lineage | Captures read or write events on a column. | system.access.column_lineage |
| Node types | Captures the currently available node types with their basic hardware information. | system.compute.node_types |
| Marketplace funnel events | Includes consumer impression and funnel data for your listings. | system.marketplace.listing_funnel_events |
| Marketplace listing access | Includes consumer info for completed request data or get data events on your listings. | system.marketplace.listing_access_events |
| Predictive optimization | Tracks the operation history of the predictive optimization feature. | system.storage.predictive_optimization_operations_history |
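System table schemas are enabled per metastore, so some of the schemas above may not yet be visible in your environment. A minimal check, assuming you have been granted access to the system catalog, is simply to list its schemas and the tables inside one of them:

-- List the system schemas visible in this metastore
SHOW SCHEMAS IN system;

-- List the tables exposed by a specific schema, e.g. the access schema
SHOW TABLES IN system.access;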
For this blog, we will restrict our analysis to the five tables below, which serve as the base for operational auditing.
- system.access.audit
- system.compute.clusters
- system.compute.warehouse_events
- system.billing.usage
- system.billing.list_prices
Operational Auditing
The “audit” table in the “access” schema holds the entire universe of operations logged at the workspace level; any interaction with any object is tracked in this table. As of today, Databricks tracks 409 distinct actions (such as login, downloads, command execution, etc.) across thirty-five unique services (such as Unity Catalog, DBSQL, notebooks, jobs, etc.). Let us look at some interesting analyses this table can help with.
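As a quick sanity check, you can list the services and the number of distinct actions each has logged in your own account (this reflects only the events actually captured for your workspaces, not the full catalog of possible actions):

select service_name, count(distinct action_name) as distinct_actions
from system.access.audit
group by service_name
order by distinct_actions desc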
Tracking sensitive data usage in the past 90 days using table name
1. From Job
select event_time, user_identity.email, action_name, "Jobs" as Entity, request_params["jobId"] as job_id
from system.access.audit
where event_date > now() - interval '90 day' and service_name = 'jobs' and action_name = "runCommand"
and request_params["commandText"] like '%<>%'
2. From Notebook
select event_time, user_identity.email, action_name, "Notebook" as Entity, request_params["notebookId"] as notebook_id
from system.access.audit
where event_date > now() - interval '90 day' and service_name = 'notebook' and action_name = "runCommand"
and request_params["commandText"] like '%<>%'
3. From DBSQL
-- Note: warehouseId is assumed to be available in request_params for commandSubmit events
select event_time, user_identity.email, action_name, "Databricks SQL" as Entity, request_params["warehouseId"] as warehouse_id
from system.access.audit
where event_date > now() - interval '90 day' and service_name = 'databrickssql' and action_name = "commandSubmit"
and request_params["commandText"] like '%<>%'
4. From Catalog Explorer
select event_time, user_identity.email, action_name, "Unity Catalog Explorer" as Entity, request_params["full_name_arg"] as table_name
from `system`.access.audit
where event_date > now() - interval '90 day' and
action_name IN ('createTable','getTable','deleteTable')
and request_params["full_name_arg"] = '<>'
Tracking sensitive data usage in the past 90 days using tags
If the tables or columns carry tag information that classifies them as sensitive, monitoring your entire lakehouse becomes even simpler.
Example: If a table has been tagged “sensitive:yes”, or a column tagged “pii:yes”, then this tag information can be used to see who has accessed all such tables by slightly modifying the previous queries, as shown below.
select action_name, event_time, user_identity.email
from `system`.access.audit
where event_date > now() - interval '90 day' and
action_name IN ('createTable','getTable','deleteTable')
and request_params["full_name_arg"] IN (
select concat_ws('.',catalog_name,schema_name,table_name) from system.information_schema.column_tags where tag_name = "pii" and tag_value ="yes"
UNION
select concat_ws('.',catalog_name,schema_name,table_name) from system.information_schema.table_tags where tag_name = "sensitive" and tag_value ="yes"
)
Currently, there is no built-in way to automatically tag tables/columns by scanning sample data, as some governance tools in the market do. However, open-source libraries such as Presidio, or Databricks' built-in AI function "ai_classify", can be used to automate this programmatically.
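As a minimal sketch, assuming a hypothetical table main.default.customers with a free-text notes column, ai_classify can label sample values, and the output can then drive a tagging job:

-- Illustrative only: the table and column names below are hypothetical
select notes, ai_classify(notes, ARRAY('pii', 'non_pii')) as classification
from main.default.customers
limit 10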
Tracking privileged users/groups
More often than not, we grant users or groups elevated access to support a POC or an interim use case. However, such grants often go untracked in the long run, leaving some users with more privileged access than they need. There is a very easy way to track this at every securable object level: metastore, catalog, schema, table, external location, storage credential, volume, and Delta share.
Sample query to track privileges across all catalogs and schemas and sort them by the users/groups holding the most privileges –
SELECT grantee, 'catalog' AS securable_type, count(*) AS total FROM system.information_schema.catalog_privileges GROUP BY 1, 2
union ALL
SELECT grantee, 'schema' AS securable_type, count(*) AS total FROM system.information_schema.schema_privileges GROUP BY 1, 2
order by 2,3 desc
[ This can be further drilled down to exact privileges by using the column ‘privilege_type’ ]
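For example, a drill-down on catalog-level grants using the privilege_type column could look like the following (the same pattern applies to the other *_privileges views):

SELECT grantee, privilege_type, count(*) AS total
FROM system.information_schema.catalog_privileges
GROUP BY 1, 2
ORDER BY total DESC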
Tracking unauthorized logins per user
SELECT event_date, ifnull(user_identity.email, request_params.user) AS email,
collect_set(action_name) AS action_names,
collect_set(response.error_message) AS error_messages,
collect_set(response.status_code) AS response_codes,
count(*) AS total
FROM system.access.audit
WHERE action_name IN ('aadBrowserLogin', 'aadTokenLogin', 'certLogin', 'jwtLogin', 'login', 'oidcBrowserLogin', 'samlLogin', 'tokenLogin')
AND response.status_code IN (401, 403)
GROUP BY 1, 2
ORDER BY total DESC
Tracking unauthorized data access requests
select event_date, user_identity.email, source_ip_address, response.error_message, audit_level, action_name
FROM system.access.audit
WHERE service_name = 'unityCatalog' AND response.status_code IN (401, 403)
Tracking hard-coded passwords written in notebooks using a regex pattern.
select event_time, user_identity.email, service_name, request_params.commandLanguage as Language, request_params.commandText as command, request_params.notebookId as Notebook_id
from system.access.audit
where action_name = "runCommand"
and request_params["commandText"] rlike '(?i)(password|pswd|psswd)'
Tracking downloads from the workspace
SELECT event_date, user_identity.email,
collect_set(service_name) AS service_names,
collect_set(action_name) AS action_names,
count(*) AS total
FROM system.access.audit
WHERE action_name IN ('downloadPreviewResults', 'downloadLargeResults', 'filesGet', 'getModelVersionDownloadUri', 'getModelVersionSignedDownloadUri', 'workspaceExport', 'downloadQueryResult')
GROUP BY 1, 2
Billing Monitoring
The “billing” schema has two tables as of now: “usage” and “list_prices.”
The usage table provides vital information for tracking your compute consumption patterns. Illustrative examples are shown below.
Track scheduled jobs cost
select
YearMonth, sku_name, sum(usage_quantity) as Total_DBU_Consumed, sum(list_cost) as Total_List_Cost
from
(
select
usage.sku_name,
date_format(usage_date, 'yyyy-MM') as YearMonth,
usage_quantity,
list_prices.pricing.default * usage_quantity as list_cost
from
system.billing.usage
left join system.billing.list_prices on usage.sku_name = list_prices.sku_name
where
usage_metadata.job_id is not Null)
group by YearMonth, sku_name
Track scheduled jobs cost on all-purpose compute
Interactive (all-purpose) compute is meant to be used during the development cycle. Once a solution is developed, it should be moved to job clusters as a best practice to reduce costs. The query below can help estimate the cost incurred by jobs running on interactive clusters.
with created_jobs as (
select
get_json_object(response.result, '$.job_id') as job_id
from
system.access.audit
where
action_name = 'create' and service_name = 'jobs' and response.status_code = 200
),
deleted_jobs as (
select request_params.job_id
from system.access.audit
where
action_name = 'delete' and service_name = 'jobs' and response.status_code = 200
)
select sku_name, sum(dbus) as total_dbus, sum(list_cost) as total_list_cost from
(select a.sku_name, d.pricing.default,
a.usage_quantity as dbus,
a.usage_quantity * d.pricing.default as list_cost
from
system.billing.usage a
left join
(select * from created_jobs b where b.job_id not in (select job_id from deleted_jobs )) active_jobs
on a.usage_metadata.job_id = active_jobs.job_id
left join system.billing.list_prices d on a.sku_name = d.sku_name
WHERE contains(a.sku_name, 'ALL_PURPOSE')
) GROUP BY all
DBU pricing
To view the standard rates offered across all regions and for all SKUs, namely all-purpose cluster, job cluster, SQL warehouses, DLT, and model training, you can use the below query and substitute the SKU you are interested in.
SELECT sku_name, price_start_time, currency_code, pricing.default as cost FROM system.billing.list_prices where sku_name = "PREMIUM_JOBS_COMPUTE" and price_end_time IS NULL
You can also track the historical price changes using the price_start_time and price_end_time columns.
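For example, a quick sketch of that historical view for the same SKU, ordered by when each price took effect:

SELECT sku_name, price_start_time, price_end_time, pricing.default as list_price
FROM system.billing.list_prices
WHERE sku_name = "PREMIUM_JOBS_COMPUTE"
ORDER BY price_start_time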
Be aware that this data does not reflect the discounts applied to your account. We hope Databricks comes up with a way to address this in the future.
Features to look out for
While system tables have created good momentum for auditing, monitoring, and lineage capabilities, we are curious to see whether Databricks will bring in features such as:
- Tracking lineage information for hive_metastore-based tables.
- Automated classification of sensitive data at the table and column level.
- Pricing that reflects the discounts applied to a customer's subscription.
- Tracking Delta Live Tables clusters.
AUTHOR
Priyanshu Kumar
Senior Manager, Data Engineering