A Unity Catalog-enabled workspace offers three cluster access modes for running data pipelines in Databricks:
- Single User : This is the most secure way of accessing data in a Unity Catalog-enabled workspace. A single user cluster is designed for use by one user, and the permissions that user holds on external locations and files apply on this cluster type.
- Shared : This cluster type is shared across users and works with Unity Catalog. It has some limitations that are explained later in this blog.
- No Isolation Shared : This cluster type works only with the legacy Hive metastore, enabling legacy data access and processing for objects in the local Hive metastore. Permissions set on Unity Catalog objects do not apply in this case.
For details, see: Create a cluster - Azure Databricks | Microsoft Learn
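The access mode is chosen when the cluster is created. As a minimal sketch, the choice maps to the data_security_mode field of the cluster spec in the Databricks Clusters API; the cluster name, node type, runtime version, and user email below are placeholders, not values from our project:

```python
# Minimal cluster-spec sketch for the Databricks Clusters API.
# Names, node type, and runtime version are illustrative placeholders.
single_user_cluster = {
    "cluster_name": "ingest-single-user",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    # "SINGLE_USER" = Single User, "USER_ISOLATION" = Shared,
    # "NONE" = No Isolation Shared
    "data_security_mode": "SINGLE_USER",
    "single_user_name": "data.engineer@example.com",  # who may attach
}
```

Submitting this spec (for example via the Clusters API or a Terraform provider) pins the cluster to one user, which is what makes that user's external-location permissions apply.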
Challenges and Limitations with Shared Clusters
From a Unity Catalog point of view, single user and shared are the only cluster modes that can be used. The shared cluster mode has some important limitations, and in several cases a single user cluster is the workaround.
Below are a few limitations we faced with shared cluster mode:
- Creating a dataframe on an external location in Unity Catalog and then creating a view from it fails with an error that the user does not have SELECT permission on the file
- Saving files to an external location using dataframe.write.partitionBy(*partition_by).format("delta").save(delta_file_path) causes failures
- Functions like input_file_name(), which tags the source filename against each record, returned null results
- Storing JSON files to the data lake using json.dumps() did not work
- UDFs and Fernet encryption did not work with shared cluster mode
- File system commands like %fs ls do not work
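As an illustration of the json.dumps() case, the pattern looks roughly like the sketch below. The record contents and target path are hypothetical, and the dbutils.fs.put call in the comment is the usual way to land such a payload; for us this write failed under shared mode but worked on a single user cluster:

```python
import json

def to_json_payload(record: dict) -> str:
    # Serialize a record to a compact JSON string before landing it
    # in the data lake.
    return json.dumps(record, separators=(",", ":"))

payload = to_json_payload({"id": 1, "status": "ok"})

# On a single user cluster the payload can then be written out, e.g.:
# dbutils.fs.put("abfss://raw@<account>.dfs.core.windows.net/out/rec.json",
#                payload, overwrite=True)
```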
These limitations surfaced when interacting with external locations and the data lake. To overcome them, we initially switched to single user cluster mode to create the external Delta tables. Once the Delta tables (Bronze tables) were created, we were able to use shared cluster mode for building the Silver and Gold layers.
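The Bronze-first workaround above can be sketched as follows. The table and storage location names are hypothetical; the DDL is standard Delta Lake syntax, and the spark.sql() call shown in the comment is what would run once, on the single user cluster:

```python
def external_delta_ddl(table: str, location: str) -> str:
    # Registers an external Delta table over files that the cluster's
    # user is permitted to read.
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"USING DELTA LOCATION '{location}'"
    )

ddl = external_delta_ddl(
    "bronze.events",
    "abfss://bronze@<account>.dfs.core.windows.net/events",
)
# On the single user cluster: spark.sql(ddl)
# Afterwards, Silver/Gold jobs on a shared cluster can read bronze.events
# through Unity Catalog permissions instead of direct file access.
```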
For this project, we did the following:
- For the data ingestion team, we created a personal cluster with very small compute for each member and let them use it.
- We also used a single user cluster to orchestrate our jobs from Azure Data Factory and Databricks Workflows, creating an application user and adding it to the metastore admin group.
- To meet SLAs for small jobs, we used interactive clusters. Since an interactive cluster does not support a service principal as its user, we created a generic user and used it to run the jobs in single user mode.
We understand that the future Databricks roadmap includes new cluster types that remove the limitations imposed by the shared cluster mode. In the meantime, single user clusters run orchestrated data pipelines in production without any issues.
This is a short blog, but we felt it was important to explain the cluster types, what works in which scenario, and how to work around the limitations.
Watch this space to learn about Unity Catalog's data governance capabilities in Chapter 5 of this blog series.
Author: Maulik Divakar Dixit
Director, Data Engineering, Databricks Champion