Before Unity Catalog, each Databricks workspace had its own Hive metastore where metadata was stored and was accessible only from within that workspace. Objects created in a workspace were therefore confined to it.
One could argue that we can create external tables in each Databricks workspace that point to common data in the data lake. Although this approach works, someone still needs to know exactly where the data lives in the data lake (the full path) and create an external table in the local metastore to make it available in the workspace. Furthermore, ACLs defined on objects at the workspace level apply only within that workspace and cannot be managed centrally.
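To make the workaround concrete, here is a minimal sketch of what each workspace had to do before Unity Catalog: register its own external table against the same lake path. All names and the storage path below are hypothetical, and the DDL is only assembled as a string for illustration.

```python
# Sketch of the pre-Unity Catalog workaround: every workspace that needs the
# shared sales data must know the exact data lake path and register its own
# external table in its local Hive metastore. All names are hypothetical.

def external_table_ddl(database: str, table: str, lake_path: str) -> str:
    """Build the CREATE TABLE statement a workspace would run locally."""
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} "
        f"USING DELTA LOCATION '{lake_path}'"
    )

# The same physical data, registered separately in two workspaces:
path = "abfss://sales@examplelake.dfs.core.windows.net/cleansed/internal_sales"
marketing_ddl = external_table_ddl("marketing_db", "internal_sales", path)
supply_chain_ddl = external_table_ddl("supply_chain_db", "internal_sales", path)
```

Note that nothing ties these two definitions together: each workspace's metastore holds an independent copy of the metadata, which is exactly the duplication Unity Catalog removes.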
Let’s understand this with a practical example. An organization typically has different functions and divisions; let’s use Marketing and Supply Chain. Each has its own workspace, so teams can work independently and development, deployment, and objects stay separated between the functions. Each may also have its own data lake account.
Both functions need internal sales data to derive business-specific KPIs. If the Marketing function has built pipelines to pull data from the source and cleanse it, the Supply Chain function may not even know such an object exists; and even if it does, there are the added complications of creating a mount point to the Marketing data lake and granting the right level of access through ACLs. This is a challenge.
Databricks Unity Catalog enables the creation of a centralized metastore where objects are stored and are accessible from every workspace attached to it.
The way objects are accessed changes as well. Before Unity Catalog, data was addressed through a two-level structure: database and object name. With Unity Catalog, there are three levels (catalog, database/schema, and object), which not only makes access easier to organize but also adds a level at which governance can be applied.
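As a rough illustration of the change in addressing (all catalog, schema, and table names here are made up), the two naming schemes can be sketched like this:

```python
# Two-level (Hive metastore) vs. three-level (Unity Catalog) object names.
# Unity Catalog addresses every table as <catalog>.<schema>.<table>.

def hive_name(database: str, table: str) -> str:
    """Legacy two-level name, scoped to one workspace's metastore."""
    return f"{database}.{table}"

def uc_name(catalog: str, schema: str, table: str) -> str:
    """Unity Catalog three-level name, visible to all attached workspaces."""
    return f"{catalog}.{schema}.{table}"

# Before: only database.table.
legacy = hive_name("sales_db", "internal_sales")

# After: the catalog level lets each function own its own namespace.
marketing = uc_name("marketing", "sales", "internal_sales")
supply_chain = uc_name("supply_chain", "sales", "internal_sales")
```

The extra level is what allows a catalog per division or function, with multiple databases organized underneath it.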
With this capability, users can create a catalog per division or function and organize objects within it across multiple databases.
Additionally, a user who has browse-level access (the USE CATALOG and USE SCHEMA privileges) can see all objects in the hierarchy without having access to the underlying data.
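A hedged sketch of how that browse-style access could be granted follows; the principal and object names are invented, and in a real workspace these would be SQL GRANT statements run against Unity Catalog rather than strings built in Python.

```python
# Sketch of the grants that allow browsing without data access.
# USE CATALOG / USE SCHEMA let a principal see object metadata; only
# SELECT (deliberately not granted here) would expose the data itself.

def grant(privilege: str, securable: str, principal: str) -> str:
    """Build a GRANT statement for the given privilege and securable."""
    return f"GRANT {privilege} ON {securable} TO `{principal}`"

statements = [
    grant("USE CATALOG", "CATALOG marketing", "supply_chain_analysts"),
    grant("USE SCHEMA", "SCHEMA marketing.sales", "supply_chain_analysts"),
]
```

With only these two grants in place, the hypothetical `supply_chain_analysts` group can discover what exists under the `marketing` catalog without being able to query it.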
Great. So now we know what Unity Catalog’s capabilities are, but how do we attach our workspaces (new and existing) to it? Databricks provides a centralized account console for managing users, groups, and service principals, attaching workspaces to the Unity Catalog metastore, and assigning workspace access.
This Is Significant for Two Reasons
- You can have all your users, groups, and service principals synced up to the centralized account console through a SCIM connector.
- Unity Catalog brings governance to all workspaces attached to the metastore, and users, groups, and service principals are granted workspace access from the account console. Previously, this had to be managed per workspace, creating significant administrative overhead.
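For illustration, a user entry pushed to the account console by a SCIM connector might look roughly like this. The payload shape follows the standard SCIM 2.0 core user schema (RFC 7643); the user details and the validation helper are invented for this sketch.

```python
# Rough sketch of a SCIM 2.0 user payload, such as an identity provider's
# SCIM connector would push to the centralized Databricks account console.
# The user details below are invented for illustration.

scim_user = {
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
    "userName": "jane.doe@example.com",
    "displayName": "Jane Doe",
    "active": True,
}

def is_valid_scim_user(payload: dict) -> bool:
    """Minimal sanity check: core user schema URN present and userName set."""
    return (
        "urn:ietf:params:scim:schemas:core:2.0:User" in payload.get("schemas", [])
        and bool(payload.get("userName"))
    )
```

Because identities live in the account console rather than in each workspace, the same synced user can be assigned to any attached workspace without re-provisioning.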
Summarizing Unity Catalog
- Central metastore to store all Databricks objects
- Ability to attach multiple workspaces to central metastore
- A three-level object hierarchy, with access grantable at any level of the hierarchy
- A centralized account console to sync users, groups, and service principals from an identity management solution
- Ability to attach workspaces to Unity Catalog and assign users, groups, and service principals from the central account console
The Above Features Enable the Following Benefits
- Centralized governance for data
- Built-in data search and discovery
- Fine-grained access control of data at the catalog, database, and object levels
- Automated lineage for objects, notebooks, and workflows
Now that you know what Unity Catalog is and what benefits it offers, it is time to deep-dive into its setup.
So stay tuned for the next chapter to learn how to organize objects in Unity Catalog.
AUTHOR
Maulik Divakar Dixit
Director, Data Engineering, Databricks Champion