The Business of Intelligence: Streamlining Data Processes with Databricks SQL Serverless

Data Engineering

Date: 01/16/2025

Discover how Databricks SQL Serverless simplifies data warehousing, enhances BI integration, and optimizes query performance for efficient data analytics.

Maulik Divakar Dixit
Director, Data Engineering
Databricks Champion
Databricks MVP

Our first blog explored how companies can leverage Databricks SQL Serverless to unify data, analytics, and AI processes in a data lakehouse while improving user experience, performance, and cost. In this blog, we explore the unique features of the SQL data warehouse that enable teams to do more with their data.

Using Databricks SQL Serverless provides teams with multiple efficiencies, such as streamlined integration with popular business intelligence tools, scalable configurations, and simplified cluster selection. However, teams must also optimize their data models to get the most out of query performance. Users will want to follow common data modeling best practices, such as star schemas and aggregations, and optimize data structures so that large datasets can be filtered efficiently and only the necessary data is processed.

Databricks offers several optimizations for data structures (a combined SQL sketch follows this list):

  1. Table partitioning: Similar to database or Hive partitioning, table partitioning divides a data folder into multiple subfolders based on a selected partitioning key. However, partitioning should be limited to large tables (over 1 TB) with partitions around 1 GB. Ideally, the number of partitions should not exceed 10,000, and partitions should be of relatively uniform size. Small tables do not benefit from partitioning and may experience performance degradation.
    Partitioning allows SQL queries to skip entire partitions whose values fall outside the WHERE clause criteria; however, this only works when the partition key is included in the filter criteria.

  2. Optimize and Z-Order commands: With incremental data ETL and small volume workloads, numerous small files can be created, which are not optimal for reading. The optimize command in Databricks rewrites data files for Delta tables, compacting files to around 1 GB (the recommended size for optimized reads). The Z-order command can then be used to physically sort data by specified columns, enabling faster reads by skipping data files when filter criteria match Z-ordered columns.

  3. Liquid clustering: Databricks recommends liquid clustering for all new Delta tables, as it offers a more flexible approach than partitioning and Z-ordering. Liquid clustering allows for the dynamic redefinition of clustering keys without rewriting existing data, enabling data layout to adapt as analytical needs evolve.

Examples of use cases for liquid clustering include tables that are frequently filtered by high-cardinality columns, have skewed data distributions, grow quickly and require ongoing maintenance, have concurrent write requirements, have access patterns that change over time, or would end up with too many or too few partitions under traditional partitioning.
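
The sketch below pulls these three options together. Table and column names (main.sales.transactions, main.sales.orders) are hypothetical, and the thresholds in the comments follow the guidance above; adapt both to your own schema.

  -- 1. Partitioning: reserve for very large tables (1 TB+), and make sure
  --    queries filter on the partition key so partitions can be skipped.
  CREATE TABLE main.sales.transactions (
    transaction_id   BIGINT,
    customer_id      BIGINT,
    amount           DECIMAL(18, 2),
    transaction_date DATE
  )
  PARTITIONED BY (transaction_date);

  -- 2. Compact small files to ~1 GB and physically co-locate data by a
  --    frequently filtered column so file-level statistics can skip data.
  OPTIMIZE main.sales.transactions
  ZORDER BY (customer_id);

  -- 3. Liquid clustering (recommended for new Delta tables): clustering
  --    keys can be redefined later without rewriting existing data.
  CREATE TABLE main.sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP
  )
  CLUSTER BY (customer_id);

  -- Adjust the clustering keys as access patterns change.
  ALTER TABLE main.sales.orders CLUSTER BY (order_ts);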

Governance gets granular:
Using Databricks SQL Warehouse for access controls, auditability, and more

Effective governance becomes even more critical as teams use more data for analytics and reporting, increasing the risk of improper access. Governance in Databricks SQL Warehouse involves controlling cluster usage and data access through Unity Catalog and Databricks account groups.

Unity Catalog is a centralized solution for access control, auditing, lineage, and data discovery across Databricks workspaces. It enables fine-grained governance for data and AI on the Databricks platform. Access to both Unity Catalog objects and clusters is managed through Databricks groups, which can be created by a Databricks administrator or synced with an identity provider via SCIM provisioning.

Best practices for managing groups include:

  1. Syncing groups from an identity provider to Databricks, using SCIM provisioning to streamline user management.
  2. Assigning these groups access to Databricks workspaces and compute resources, ensuring only authorized users can utilize them.
  3. Enabling data access within Unity Catalog by setting permissions at the catalog, schema, and object levels, empowering only authorized users to access sensitive data.

With SQL Warehouse, any user in a group with appropriate access can explore data using SQL queries or a BI tool. Unity Catalog also provides advanced governance capabilities, such as logging metadata on SQL query executions, allowing administrators to track which users accessed specific objects and executed specific queries.
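
As a minimal sketch of this flow, the statements below grant an assumed `analysts` group access at the catalog, schema, and table levels, then query the Databricks audit system table (assuming system tables are enabled in the workspace) to review recent Unity Catalog activity; all object names are illustrative.

  -- Grant access at catalog, schema, and table level to an assumed group.
  GRANT USE CATALOG ON CATALOG main TO `analysts`;
  GRANT USE SCHEMA ON SCHEMA main.finance TO `analysts`;
  GRANT SELECT ON TABLE main.finance.revenue TO `analysts`;

  -- Review recent Unity Catalog activity from the audit system table
  -- (availability depends on system tables being enabled).
  SELECT event_time, user_identity.email, action_name, request_params
  FROM system.access.audit
  WHERE service_name = 'unityCatalog'
  ORDER BY event_time DESC
  LIMIT 100;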

Tap these tools to democratize data and analytics use

Databricks is expanding its capabilities in analytics to provide an end-to-end ecosystem for data engineering, AI/ML, and analytics. The primary analytics tools available in Databricks include AI/BI dashboards, AI/BI Genie, and Databricks Apps.

An AI/BI dashboard is built from two main components:

  • Data tab: Allows users to define datasets in the dashboard. These datasets are packaged with the dashboard when sharing, importing, or exporting through the UI or API.
  • Canvas tab: Lets dashboard editors build and configure multi-page dashboards using widgets like visualizations, filters, text, and images.

Key features include:

  • Visualization: Add visualization widgets to the canvas, with options like area, bar, box, combo, counter, heatmap, histogram, line, pie, pivot, scatter, and table charts.
  • Text widgets: Use markdown to format text in plain text editors, insert links, and add images to dashboards.
  • Cross-filtering: Explore a selected subset of data and interactively examine relationships and patterns across multiple visualizations.
  • Filters: Narrow results and refine data in visualizations, like slicers in traditional BI tools.
  • Publish dashboard: Publish a clean, finalized copy of a dashboard. Any authorized user with access can view the published version.

Databricks AI/BI Genie is part of the Databricks Lakehouse Platform. It is designed to streamline access to AI and BI insights directly from data lakes. Genie facilitates the translation of natural language queries into SQL, enabling users to derive meaningful insights without extensive SQL knowledge.

While Generative AI (GenAI) capabilities handle unstructured data well, challenges arise when teams work with structured data. That’s due to the need to understand data context and select appropriate tables, columns, joins, and aggregations to ensure that natural language questions accurately translate to SQL queries relevant to the data lake’s schema. AI/BI Genie addresses this requirement by pulling relevant names and descriptions from annotated tables and columns to convert natural language questions into precise SQL queries, returning the generated query and the results table when possible.
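
In practice, this means the quality of Genie's generated SQL depends on how well tables and columns are annotated. A minimal sketch, reusing the hypothetical main.sales.orders table from earlier:

  -- Table- and column-level comments give Genie business context for
  -- translating natural language questions into SQL.
  COMMENT ON TABLE main.sales.orders IS
    'One row per customer order placed on the e-commerce platform.';

  ALTER TABLE main.sales.orders
    ALTER COLUMN order_ts COMMENT 'Timestamp when the order was placed (UTC).';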

Core features include:

  • Instructions for context: By applying domain-specific instructions, Genie can better understand industry-specific jargon, logic, and unique concepts. This “prompt engineering” provides a seamless and intuitive user experience.
  • Parameterized SQL queries: Allows for dynamic queries, enhancing flexibility in data exploration (see the sketch after this list).
  • Trusted assets: Marks certain assets as “trusted,” giving users an added layer of confidence in the accuracy and relevance of results.
  • Chat history and response structuring: Genie includes features like chat history, natural language explanations of results, and a results table. Users can also provide feedback using upvote and downvote buttons.
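
To illustrate the parameterized-query feature, here is a sketch using named parameter markers (:start_date and :min_amount are placeholder names); values are substituted at run time, so one vetted query can answer many variations of a question.

  -- Parameter markers keep the query shape fixed while the values vary.
  SELECT DATE_TRUNC('MONTH', transaction_date) AS month,
         SUM(amount) AS total_sales
  FROM main.sales.transactions
  WHERE transaction_date >= :start_date
    AND amount >= :min_amount
  GROUP BY 1
  ORDER BY 1;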

Databricks continuously enhances AI/BI Genie with new features, making it one of the most exciting developments within the Databricks ecosystem.

Databricks Apps is a recently introduced feature that enables users to build, deploy, and manage applications directly on the Databricks Lakehouse Platform. This capability provides an environment for creating custom, interactive applications that leverage data integrated within Databricks.

Previously, building data and AI applications with data managed by Databricks required deploying additional infrastructure, ensuring governance and compliance, and managing application security, such as authentication and authorization. Now, Databricks Apps allows Databricks to host applications directly, eliminating the need for separate infrastructure configuration and deployment.

Databricks Apps is tailored for developers who can build applications in Python using frameworks like Dash, Streamlit, or Gradio. Developers can work locally using an integrated development environment (IDE), such as Visual Studio Code, run applications locally, and then seamlessly move them to the Databricks workspace for production.

Azure, Activated: How Power BI and Databricks integrate smoothly

Databricks offers a new feature that enhances its integration with Power BI on the Azure cloud. It supports seamless catalog integration and data model synchronization, allowing teams to publish datasets directly to Power BI Service without leaving the Databricks UI. The “Publish to Power BI” functionality transfers individual tables and synchronizes entire schemas, including table relationships.

If a user has built a well-defined semantic layer in Databricks Unity Catalog with primary and foreign key relationships, these relationships are preserved in Power BI. This reduces the effort required to recreate or manage relationships between tables in Power BI.
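
As a sketch of how such a semantic layer might be declared (all names hypothetical): primary and foreign key constraints in Unity Catalog are informational rather than enforced, but they carry exactly the relationship metadata that "Publish to Power BI" propagates.

  CREATE TABLE main.bi.dim_customer (
    customer_id   BIGINT NOT NULL,
    customer_name STRING,
    CONSTRAINT dim_customer_pk PRIMARY KEY (customer_id)
  );

  CREATE TABLE main.bi.fct_sales (
    sale_id     BIGINT NOT NULL,
    customer_id BIGINT,
    amount      DECIMAL(18, 2),
    CONSTRAINT fct_sales_pk PRIMARY KEY (sale_id),
    -- The FK below becomes a table relationship in the Power BI model.
    CONSTRAINT fct_sales_customer_fk FOREIGN KEY (customer_id)
      REFERENCES main.bi.dim_customer (customer_id)
  );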

Solution Benefits:

  • Enables direct integration: Publish datasets to Power BI directly from within Databricks.
  • Streamlines connection management: Automatically manage connections, eliminating the need for manual setup.
  • Is always up-to-date: Automatically synchronize changes to underlying tables or relationships with Power BI.
  • Provides a single source of truth: Entity relationships and objects defined in Unity Catalog are directly published to Power BI.

Migrations, Mastered:
How leading enterprises are harnessing Databricks capabilities

Here’s how two companies migrated reports from competing solutions to Databricks, and the results they achieved.

A large beverage bottler gains clarity, control and cost reductions by migrating to Databricks and Power BI

Challenge: A leading bottling company used a modern, in-memory database technology but experienced scalability challenges. The company managed 30 terabytes of data across five business domains and 300 billion records.

The company sought to transition to a data lake architecture with Power BI as a reporting tool. The business users were accustomed to the prior tool's real-time, fast-query responses, and the company wanted to preserve this experience. However, simply adding multiple premium Power BI licenses was cost-prohibitive.

Solution: The bottler worked with partners to migrate financial reporting to a data lake, building out the bronze, silver, gold, and aggregate layers in the lakehouse architecture. The partners implemented a composite model in Power BI, combining summary data imported into Power BI with detailed data stored in the data lake for drill-downs. The detailed data was accessed through Databricks SQL Warehouse serverless compute, using Large (t-shirt-size) clusters scaled out to 10 nodes.

Where users access summarized reports, Power BI uses in-memory data for reporting. On drilling down to detailed data, the query switches to Power BI's DirectQuery mode, reporting data from Databricks Delta tables using SQL Serverless compute.

When business scenarios require large-scale distinct counts (which can’t fit into Power BI’s memory), Databricks processes the data on the fly via DirectQuery mode. This adds approximately four seconds of latency compared to import mode, which the customer considered acceptable.
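
A query of the kind DirectQuery would push down to Databricks in this scenario might look like the following (table and column names are illustrative):

  -- Large-scale distinct count computed in Databricks rather than in
  -- Power BI's in-memory model.
  SELECT COUNT(DISTINCT customer_id) AS unique_customers
  FROM main.sales.transactions
  WHERE transaction_date >= DATE '2024-01-01';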

Business Impact:

The bottling company benefits with:

  • A scalable solution that easily handles data volume increases
  • 20,000 business users served
  • A 70% reduction in overall solution costs
  • Improved service level agreement (SLA) levels, as only a fraction of the data is imported into Power BI, reducing memory and processing demands
  • Deeper insights due to the integration with external data in the data lake

Company with $50B in annual revenues transfers tables: Using a single solution for data, analytics and AI

Challenge: A large global enterprise used AWS for data processing and a separate cloud data storage solution for report hosting. After being processed in the AWS data lake, some data was transferred to the cloud data storage solution for reporting purposes. However, the transfer process used most of the storage solution’s compute resources, negatively impacting performance. The company wanted to migrate to Databricks Lakehouse to reduce costs and improve reporting SLAs.

Solution: The company worked with partners to consolidate its data processing and report hosting environment into the Databricks Lakehouse. Key steps included:

  • Streamlining migration: Objects and code from the competitor solution were “lifted and shifted” to Databricks SQL Notebooks.
  • Optimizing data organization: Partitioning was implemented for very large tables (over 1 TB), while Liquid Clustering was used for other tables to enhance data organization and query efficiency.
  • Setting up compute: Multiple Databricks SQL Warehouse computes (per data product) were set up using medium and large capacity clusters.
  • Ensuring a seamless transition: The source solution's architecture for catalogs, schemas, and tables was lifted and shifted to Databricks, ensuring that reports were unaffected. The only required change was updating the connection strings.

Business Impact

The company achieved:

  • A 50% reduction in compute costs due to more granular compute selection
  • A 30% improvement in data availability SLAs
  • A 15-20% reporting performance improvement by using Databricks SQL Warehouse compute
  • No impact on end-user reporting during the migration, as only connection strings were updated

Accelerate analytics results with Databricks

With the launch of Databricks SQL Serverless, Databricks has solidified its leadership in data analytics alongside its established strengths in data engineering and AI/ML. It has become a go-to platform for building end-to-end analytics solutions directly within its environment. This unified approach allows organizations to consolidate on a single platform while ensuring comprehensive governance and data lineage. Companies also benefit from improved performance, reduced latency, and greater cost efficiency, while gaining a scalable solution that will grow with their business.

Want to learn more about Databricks SQL Serverless? Check out these references.

Dashboards | Databricks on AWS

What is Databricks Apps? | Databricks on AWS

Work with an AI/BI Genie space | Databricks on AWS

Announcing General Availability: Publish to Microsoft Power BI Service from Unity Catalog | Databricks Blog
