Unleashing the Power of Serverless SQL: Migrating to Databricks with DBT

Data Engineering

Date: 05/13/2024

Explore the integration of DBT and Databricks Serverless SQL to streamline your data workflows. Discover how to leverage serverless environments for cost-effective, scalable data processing.

Pragathi Sharma
Manager, Data Engineering, Tredence Inc.

In the ever-evolving landscape of data engineering and analytics, several trends are reshaping how organizations handle their information assets. Key shifts include recognizing data as a strategic asset at the C-level and emphasizing cost-efficient systems. The concept of a lakehouse, which combines the best of data lakes and data warehouses, has gained market validation. It offers scalability, flexibility, and analytics capabilities while maintaining data governance.

However, the real transformation comes with the shift to Databricks, a platform that has recently empowered organizations with serverless SQL capabilities. Databricks Serverless SQL warehouses and DBT (Data Build Tool) play pivotal roles here, and we will learn how you can harness their power to create efficient, scalable, and cost-effective data pipelines.

DBT has changed how transformations are performed. Earlier, building data pipelines or workflows relied heavily on data engineers stitching together several notebooks; DBT now allows data analysts and engineers with SQL skills to apply software engineering best practices, from developing, documenting, and testing to deploying to the destination, with ease.
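To make that concrete, here is a minimal sketch of what this looks like in practice; the model, source, and column names below are hypothetical. A model is just a SQL SELECT, and its documentation and tests live alongside it in YAML:

```sql
-- models/staging/stg_orders.sql (hypothetical model and source names)
{{ config(materialized='view') }}

select
    order_id,
    customer_id,
    cast(order_ts as timestamp) as order_ts,
    amount
from {{ source('sales', 'raw_orders') }}
where order_id is not null
```

```yaml
# models/staging/schema.yml -- documentation and tests for the model above
version: 2
models:
  - name: stg_orders
    description: "Orders staged from the raw sales feed."
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```

With this in place, `dbt test` validates the data and `dbt docs generate` produces browsable documentation, which is the software-engineering workflow the tool brings to SQL users.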

The relationship between DBT and Databricks has a history, dating back to the 2020 Data and AI Summit (Europe), where discussions revolved around integrating DBT with Databricks and Delta. Recent improvements have made the integration simpler and more user-friendly, resulting in faster production-grade outcomes and better scalability.

Combining the power of DBT with a Serverless SQL warehouse for near real-time streaming

Ever since SQL warehouses were introduced in Databricks, query performance has noticeably improved when DBT pushes SQL commands to them, since they are optimized specifically for SQL workloads compared to regular Databricks compute. With the recent growth of Databricks Serverless SQL, there are further gains we can leverage, such as built-in predictive I/O and intelligent autoscaling.

You can learn more about the different types of SQL warehouses in the Databricks documentation.

Key differences between the Serverless SQL warehouse and the Classic & Pro SQL warehouses

Serverless SQL spins up much faster, typically within 5-10 seconds, and offers consistent execution times, whereas both Classic and Pro compute take a few minutes to start before execution can begin. This makes Serverless a fit for use cases like streaming data and near-real-time BI insights, without the additional cost of running clusters continuously.

Cost-effective - If each cluster type is already up, run times are almost comparable, but Classic and Pro incur the additional cost of keeping clusters running while idle.
Also, Classic and Pro compute run on your selected cloud provider, so two kinds of charges apply (one for Databricks Unit (DBU) usage and another for the running instances on the cloud provider), whereas the serverless option is fully managed by Databricks, and its pricing is attractive given there are no separate cloud infrastructure costs.

Pricing details for the Databricks SQL warehouse tiers are available in the Databricks documentation.

Let’s delve into the specifics related to DBT and Databricks. If you are working on the Databricks interface, DBT Cloud is available on Databricks Partner Connect, which is very easy to integrate. For someone working on DBT CLI, there is a dbt-databricks adapter, which builds upon the foundation of dbt-spark. Recent updates to this package leverage Databricks’ latest features, such as the Photon compute engine, instant scalability via Serverless SQL Warehouses, and the Unity Catalog governance model. Additionally, features like streaming tables and materialized views enhance the overall capabilities of this integration.
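To give a concrete picture of the CLI path, here is a minimal profiles.yml sketch for the dbt-databricks adapter pointing at a Serverless SQL warehouse. The profile name, host, HTTP path, catalog, and schema values are placeholders to replace with your workspace's details:

```yaml
# profiles.yml -- minimal sketch for dbt-databricks (placeholder values)
databricks_serverless:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: main              # Unity Catalog name (assumption)
      schema: analytics          # target schema for dbt models
      host: adb-1234567890123456.7.azuredatabricks.net  # placeholder workspace host
      http_path: /sql/1.0/warehouses/abcdef1234567890   # HTTP path of the Serverless SQL warehouse
      token: "{{ env_var('DATABRICKS_TOKEN') }}"        # personal access token via env var
      threads: 8
```

The http_path is what routes dbt's generated SQL to the warehouse, so switching between Classic, Pro, and Serverless is as simple as pointing at a different warehouse.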

Consider a streaming analytics use case workflow with DBT and Databricks using Serverless SQL. With data sharing becoming more common across platforms, DBT can combine data across different systems to gather valuable insights.
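As a sketch of what the streaming side can look like, the dbt-databricks adapter supports a streaming table materialization so that new source rows are processed incrementally. The model and source names here are hypothetical, and the exact stream-reading syntax can vary with the adapter and runtime version:

```sql
-- models/streaming/orders_stream.sql (hypothetical names)
-- Materializes as a Databricks streaming table via the dbt-databricks adapter.
{{ config(materialized='streaming_table') }}

select
    order_id,
    customer_id,
    amount,
    current_timestamp() as processed_at
from stream({{ source('sales', 'raw_orders') }})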

Let’s compare the runtimes for executing a DBT project on Serverless, Classic, and Pro SQL warehouses. For simplicity, we fixed the cluster scaling at a minimum of 1 and a maximum of 1. Note that each cluster starts from a terminated state so that the effect of start-up time on total run time can be observed.

To provide context for the dbt project used in this benchmark, it consisted of 18 models, 3 analyses, 56 tests, 1 seed, 12 sources, 3 exposures, 0 metrics, 738 macros, and 0 groups. Among these, 9 view models, 8 table models, and 1 incremental model were executed concurrently with 25 threads.
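For reference, a comparable run can be kicked off from the dbt CLI, with the thread count set per invocation; this is a sketch assuming a profile like the one shown earlier is configured:

```
dbt build --threads 25
```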

The Serverless cluster's markedly lower start-up time suggests it can efficiently handle workloads with low-latency requirements.

Next, let’s compare the provisioning time for different-sized clusters using the Serverless type.

From the results above, there are a few key takeaways. 

Regardless of the cluster size, the Serverless warehouse starts up exceptionally quickly. Interestingly, the provisioning time for a 4X-Large cluster is roughly the same as that for a 2X-Small cluster. This suggests that the cluster size does not significantly impact the start-up time.

However, when dealing with a simple project, opting for a very large cluster may lead to higher costs without substantial performance gains. It’s essential to consider the Databricks Units (DBUs) consumed by each cluster size. Evaluating the optimal cluster size ensures efficient utilization of worker nodes; striking the right balance between performance and cost is crucial.

What else is available in the transformation space where DBT is currently used

Databricks made Delta Live Tables (DLT) generally available in 2022. It works very similarly to DBT: transformations can be written in SQL or Python and executed in batch or continuous mode. DLT is relatively new in the market and is starting to gain momentum. If you are already using DBT, or your source systems sit outside Databricks, DBT works well; but if you are on Databricks and setting up transformations fresh, DLT can be a great option for building low-latency systems. Otherwise, the two are comparable across several features, such as data lineage graphs, documentation, the ability to test data, incremental tables, data quality and reliability, logging, and the path from development to production.
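For contrast, here is a minimal sketch of what a DLT definition with a data-quality expectation can look like in SQL; the table, source, and constraint names are hypothetical, and the exact syntax depends on your DLT runtime version:

```sql
-- A DLT streaming table with an expectation that drops bad rows (hypothetical names)
CREATE OR REFRESH STREAMING TABLE clean_orders (
  CONSTRAINT valid_order EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(sales.raw_orders);
```

The EXPECT clause is where DLT's built-in data-quality handling shows up, which is the kind of capability you would otherwise express as tests in a DBT project.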

[Figure: the DLT interface (source: Databricks) and the DBT interface (source: dbt Labs)]

Conclusion

In summary, DBT with Databricks Serverless SQL enables users to perform data transformations and run queries with minimal delay, making it suitable for real-time or near-real-time analytics. By leveraging serverless SQL capabilities, users avoid the need for dedicated infrastructure and pay only for the resources used during query execution. Overall, this combination delivers efficient data processing without incurring unnecessary costs.

If you want to explore further differences between serverless and classic compute, Microsoft also covers them in its Azure Databricks documentation.

Enjoy exploring Serverless SQL with DBT and Databricks!
