Databricks SQL Serverless: End-to-End Analytics on Data Lakes

From Data Engineering to Visualization: How Databricks Now Enables End-to-end Analytics Processes

Databricks has long been a preferred tool for data engineering and data science workloads. More than 60% of the Fortune 500 use the Databricks Data Intelligence Platform to unify and democratize data, analytics, and artificial intelligence (AI).

With the introduction of Databricks SQL Serverless, Databricks now enables end-to-end data and analytics processes, from data engineering to business intelligence (BI) and reporting. Previously, most organizations built cloud databases on top of Databricks, using solutions such as Redshift, Snowflake, or Synapse Analytics to load processed data from Databricks and support BI reporting tools like Power BI and Tableau. Databricks could not directly support BI reporting applications querying its data lake without creating significant reporting delays or performance bottlenecks.

Today, organizations leverage Databricks for data engineering, machine learning, generative AI use cases, and direct analytics and reporting using Databricks compute. They gain a flexible, scalable architecture and tools that let teams integrate diverse data types, stream data in real time, and develop predictive insights based on more of their data. Databricks SQL Serverless and the Databricks Lakehouse platform provide one source of truth for all organizational data, drive world-class analytics performance, and offer data lake economics.

Challenge accepted: How Databricks SQL Serverless overcomes traditional lakehouse analytics roadblocks.

Databricks SQL Serverless offers a premier experience for BI and SQL workloads directly on data lakes. Built on the Databricks Lakehouse Platform, it is designed to manage SQL workloads with exceptional scalability and cost-efficiency.

Here’s how Databricks SQL Serverless addresses traditional lakehouse analytics challenges:

Delivering a better user experience: Running BI tools directly on data lakes used to cause excessive latency and poor query performance. For example, using Parquet files for real-time reporting was not optimal because the available compute did not support efficient querying of Parquet files. As a result, organizations introduced databases built for large-scale analytical reporting, such as Synapse Analytics and Snowflake. This proliferation meant business users needed to understand and manage more tools.

SQL Serverless is compatible with a wide range of BI tools, enabling users to leverage their favorite tools to query the freshest data for analytics, visualization, and reporting.

Configuring the tool for data exploration and BI reporting is simple and user-friendly. Instead of configuring complex settings such as machine type, central processing unit (CPU), random access memory (RAM), timeouts, and the number of worker and driver nodes, business analysts select from “t-shirt sizes” ranging from 2X-small to 4X-large and specify the number of clusters the warehouse can scale across dynamically as workload demands change. This simplified approach to cluster management, with built-in security features, enables greater user autonomy.
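As a rough illustration of how little there is to configure, the sketch below builds a warehouse definition from just a name, a t-shirt size, and a cluster range. The field names are modeled on the Databricks SQL Warehouses REST API but should be treated as assumptions here, as should the exact spelling of the size labels; build_warehouse_spec is a hypothetical helper, not part of any Databricks library.

```python
# T-shirt sizes from 2X-small to 4X-large, as described in the text.
# The exact label spelling is an assumption for illustration.
T_SHIRT_SIZES = [
    "2X-Small", "X-Small", "Small", "Medium", "Large",
    "X-Large", "2X-Large", "3X-Large", "4X-Large",
]


def build_warehouse_spec(name, size, min_clusters=1, max_clusters=1,
                         auto_stop_mins=10):
    """Return a dict describing a serverless SQL warehouse (hypothetical helper)."""
    if size not in T_SHIRT_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {T_SHIRT_SIZES}")
    if not 1 <= min_clusters <= max_clusters:
        raise ValueError("cluster range must satisfy 1 <= min_clusters <= max_clusters")
    return {
        "name": name,
        "cluster_size": size,              # the t-shirt size
        "min_num_clusters": min_clusters,  # lower bound for scaling
        "max_num_clusters": max_clusters,  # upper bound for scaling
        "auto_stop_mins": auto_stop_mins,  # idle minutes before stopping
        "enable_serverless_compute": True,
    }


if __name__ == "__main__":
    print(build_warehouse_spec("bi-reporting", "Small", min_clusters=1, max_clusters=4))
```

Note what is absent: no machine types, CPU or RAM settings, or worker and driver node counts appear anywhere in the definition.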

Intuitive connections simplify BI tool integration

For example, clicking on Python provides out-of-the-box code, which can be pasted into a Python application to connect directly to SQL warehouses.
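The generated snippet is not reproduced here, but a minimal sketch of such a connection, assuming the open-source databricks-sql-connector package is installed, might look like the following. The environment variable names and the connection_params and run_query helpers are illustrative assumptions rather than anything the product prescribes.

```python
import os


def connection_params(env=os.environ):
    """Collect the values the connector needs from environment variables.

    The variable names are an assumption chosen for this sketch.
    """
    required = {
        "server_hostname": "DATABRICKS_SERVER_HOSTNAME",
        "http_path": "DATABRICKS_HTTP_PATH",
        "access_token": "DATABRICKS_TOKEN",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in required.items()}


def run_query(query, env=os.environ):
    """Run one SQL statement against a SQL warehouse and return all rows."""
    # Imported lazily so this module loads even without the connector installed.
    from databricks import sql  # pip install databricks-sql-connector

    with sql.connect(**connection_params(env)) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall()


if __name__ == "__main__":
    for row in run_query("SELECT 1 AS ok"):
        print(row)
```

Keeping the hostname, HTTP path, and token in environment variables rather than in the script is a design choice of this sketch: it avoids pasting credentials into application code.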

Improving performance: In the past, compute engines lacked the optimizations necessary to enable efficient querying of data stored in the data lake.

Databricks SQL Serverless is optimized for SQL workloads. It uses an open data format and integrates seamlessly with data lakes and the Databricks Lakehouse platform, providing direct connectivity and enabling a unified analytics platform. It offers lightning-fast performance: 2.2X faster than the previous world record in 100TB TPC-DS, the gold-standard performance benchmark for data warehousing, and 2.7X faster than a key competitor’s solution.

Reducing costs: Running BI tools on available cloud compute has historically been expensive because costs are driven by compute capacity and runtime. Moreover, the need to copy data into cloud databases and perform merges to keep data current further increases the overall solution cost.

Databricks SQL Serverless is engineered for cost-efficient performance. It provides a single version of truth and automatically scales for high concurrency needs. In the benchmark described above, Databricks delivered price performance that was 12X better than that of a key competitor. Databricks SQL Serverless also provides a pay-as-you-go model that helps manage and reduce costs.

Ensuring true scalability: While cloud databases are inherently scalable, scalability is limited when using external tables to access data lake files. These compute engines are optimized for data stored natively within the database, so simply increasing compute resources doesn’t lead to linear improvements in query performance on Delta tables unless the data is stored within the database itself.

Databricks SQL Serverless offers on-demand, serverless compute. It instantly provisions compute resources that scale up and down with workload requirements, and it efficiently queries the Delta format without the need to migrate data to cloud databases.

Enhancing governance: Implementing separate solutions for data engineering in Databricks and analytics on cloud databases creates multiple points where sensitive data security needs to be managed. For instance, in Azure, complex access control lists and carefully designed folder structures were required to control access to sensitive data. This complexity increased as fine-grained security controls became necessary to secure data across systems.

Databricks provides built-in governance on objects through Unity Catalog. Databricks SQL Serverless works with Unity Catalog objects and adheres to the Unity Catalog permission model.
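To make the permission model more concrete, the sketch below generates the SQL GRANT statements that would typically give a group read-only access to a single table governed by Unity Catalog. The catalog, schema, table, and group names are hypothetical, and read_only_grants is an illustrative helper, not a Databricks API; the statements would be run by a user with sufficient privileges on those objects.

```python
def _quote(identifier):
    """Backtick-quote an identifier, rejecting names that contain backticks."""
    if not identifier or "`" in identifier:
        raise ValueError(f"unsupported identifier: {identifier!r}")
    return f"`{identifier}`"


def read_only_grants(catalog, schema, table, principal):
    """Return the GRANT statements for read-only access to a single table.

    Reading a table requires access at every level of the three-part
    name: the catalog, the schema, and the table itself.
    """
    cat, sch, tbl, who = (_quote(x) for x in (catalog, schema, table, principal))
    return [
        f"GRANT USE CATALOG ON CATALOG {cat} TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA {cat}.{sch} TO {who}",
        f"GRANT SELECT ON TABLE {cat}.{sch}.{tbl} TO {who}",
    ]


if __name__ == "__main__":
    # All names below are hypothetical examples.
    for statement in read_only_grants("main", "sales", "orders", "analysts"):
        print(statement + ";")
```

Because the grants live in the catalog rather than in storage-level access control lists or folder layouts, queries against the same objects are subject to the same rules whether they come from a BI tool or from elsewhere on the platform.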

The power of ultra-fast processing: Why Photon Engine is a game-changer

Databricks SQL Serverless includes Photon Engine, an advanced query engine built to improve performance and efficiency in SQL-based analytics, particularly in a serverless environment.

Photon leverages modern cloud hardware, including multi-core CPUs and single instruction, multiple data (SIMD) capabilities. SIMD enables the parallel processing of multiple data points within a single CPU instruction, significantly speeding up complex calculations and aggregate operations.

Optimized for analytical and online analytical processing (OLAP) workloads, Photon is well-suited for handling large datasets, complex joins, aggregations, and window functions, reducing query latency. Photon’s SQL query optimization, autoscaling, and the ability to start clusters within five seconds after issuing queries make Databricks SQL Serverless ideal for BI reporting, analytics, and data exploration. Users can directly query Databricks for reporting needs: BI tools connect to the SQL warehouse and execute live queries, which is particularly advantageous for high-volume data and on-the-fly aggregations.

Learn more about Databricks SQL Serverless

Databricks SQL Serverless is generally available on AWS and Azure. Learn more here.

Stay tuned for our next blog on key features that make this data warehouse powerful – and real-world applications of what your industry peers are achieving with it.