Databricks Serverless SQL is a robust, fully managed platform capable of powering next-generation analytics and data applications. It provides a unified platform, especially for collaborating with data analysts and data scientists.
Before delving into the serverless architecture, let’s first understand the control pane and compute pane, which is the operational foundation of Databricks.
- Control Pane: This encompasses all the backend services managed by Databricks within its Azure subscription. Additionally, it stores Databricks SQL queries, notebooks, and various other workspace configurations, all of which are encrypted at rest to enhance security.
- Compute Pane: It contains the various clusters of compute resources available to process the data.
Serverless Compute Compared with Classic Compute
The above image highlights the workings of the Serverless and Classic compute planes.
Features | Serverless compute plane | Classic compute plane |
Location of Compute Resources | The compute layer stays within the Databricks account |
The compute layer stays within the customer’s Azure |
Public IP address for nodes | No | No, if enabled with secure cluster connectivity otherwise there is public IP for each VM |
Unallocated VMs costs are covered by | Databricks | Customer pays Azure |
Data Protection | Operates within network boundary of workspace, bolstered by extra layers of security to ensure isolation | Natural isolation as compute stays in customer environment |
What are the primary hurdles we often encounter with the classic compute plane?
- Reduced Productivity: Analysts, data engineers, and data scientists are unproductive because they must wait for the cluster start-up.
- Always-on: Customers often maintain the warehouse as active with a fixed cluster to avoid start-up delays. However, this approach is inefficient and costly.
- Limited Scalability: Fixed Cluster sizes pose a bottleneck as data volume and processing needs grow. Dynamic scaling can be challenging and may lead to over or under-provisioning.
- Infrastructure Overhead: Require platform administrators to provision, configure, and manage clusters. However, this process can be time-consuming and error-prone, mainly due to the diverse user requirements and fluctuating workloads.
- Added Cost: The warehouse remains operational with fixed clusters even during low activity or no usage, leading to cost inefficiency.
- Resource Contention: Sharing the same cluster among multiple users or applications can lead to resource contention, resulting in poor performance or longer query execution time.
Why should we consider transitioning to serverless compute?
- Increased Productivity: Serverless SQL provides on-demand instant compute to perform SQL workloads. This approach is efficient as there is no wait time for cluster start-up.
- Improved Performance: Serverless compute incorporates optimization techniques like query caching, automatic indexing, and adaptive query execution, ultimately reducing latency and improving productivity.
- Auto Scaling: In serverless, scaling is automatic based on the workload. Customers need not be concerned about over- or under-provisioning compute resources.
- Managed Infrastructure: Databricks fully manages the infrastructure, handling tasks such as provisioning, scaling clusters, performance monitoring, and ensuring high availability.
- Reduced Cost: Customer only pay for the compute resources utilized in their workload. This approach is cost-efficient as running an idle cluster is unnecessary.
What are the essential requirements for implementing Serverless SQL?
- Databricks workspace must be the premium plan.
- The workspace must be in the supported region (Azure Databricks regions).
- If Azure storage data is used as a source and is protected by Azure Storage firewall, then additional configuration should be set to permit access from a Serverless compute node.
Guidelines to Implement Serverless SQL
Step 1: Create a Serverless warehouse
- Login to the workspace and click the SQL Warehouse option in the sidebar.
- Click Create SQL Warehouse and complete the following details.
- Name
- Cluster Size
- Auto Stop Time Limit
- Scaling Option
- Type Serverless
- The advanced option is optional with details such as tags and unity catalog (if enabled in the workspace).
- Provide access to the serverless SQL
- Search for the AD group or the user and add
- Select the level of permission required (can use and can manage)
- The successful creation of a Serverless SQL workspace will look as follows:
Step 2: Use the Serverless option for data analysis
- Click on the SQL Editor option in the sidebar
- In the New query window, select the serverless compute (users can see other options as well based on the permission)
- Analysts can start running their queries, and the access is instantaneous.
Step 3: Use Serverless SQL in Databricks Workflows
- Click on the Workflows in the sidebar.
- Click on Create Job and complete the following details
- Task Name
- Type – (Notebook, SQL, Python Script, etc.)
- Based on the type, options vary
- To start with, I have selected SQL as the type
- SQL task – (Query, Dashboard, Alert, File)
- SQL query
- SQL warehouse
- Other optional things are – Parameters, Notifications, Retries, and Duration threshold.
Serverless SQL does come with the following limitations:
- Policies: There is no support for establishing cluster policies or spot instance policies.
- Metastores: Using external hive metastores is not supported.
- Pubic IP: Serverless SQL has no public IP address and does not require or use secure cluster connectivity.
- Disk Storage: Disks in serverless SQL are short-lived and do not require customer-managed keys.
Conclusion
Databricks Serverless SQL provides an efficient, scalable, cost-effective, and reliable platform to perform data analysis and handle analytical workloads. By eliminating the need to manage infrastructure coupled with its robust security and governance features, it stands out as a compelling choice for data teams. The serverless plane outperforms the classic plane in various aspects, making it a desirable option for modern analytics.
AUTHOR - FOLLOW
Srinivasan Shanmugasundaram
Senior Manager, Data Engineering, Tredence Inc.
Topic Tags