
Databricks recently introduced Test-time Adaptive Optimization (TAO), a groundbreaking approach to fine-tuning large language models that eliminates the need for labeled data. This innovative method leverages test-time compute and reinforcement learning so that enterprises can enhance AI model performance using only the unlabeled usage data (or minimal labeled data) they already possess. TAO can elevate open-source models to compete with more expensive proprietary alternatives while significantly reducing development costs and inference time.
Introducing Databricks AI Builder
Databricks AI Builder
Databricks AI Builder is a new beta service as of April 2025. It currently offers two use cases:
- Information Extraction - you provide an unstructured document and the service transforms its key information into JSON format. An example use case is extracting key information from legal documents (a hypothetical illustration follows this list).
- Model Specialization - the user interface behind TAO! We will take a deep dive into the four steps of TAO and how to carry them out in Model Specialization. In a nutshell, Model Specialization lets us fine-tune an LLM with reinforcement learning happening entirely behind the scenes, then exposes the tuned model as a serving endpoint for batch inference.
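To make the Information Extraction use case concrete, here is a purely hypothetical sketch of the kind of structured output it could produce from a legal document; the document snippet and field names are invented for illustration.

```python
# Hypothetical illustration only: the kind of structured JSON that Information
# Extraction might return for an unstructured legal document. Field names are invented.
contract_text = "This Services Agreement is made on 1 March 2025 between Acme Corp and Beta LLC ..."

extracted = {
    "agreement_type": "Services Agreement",
    "effective_date": "2025-03-01",
    "parties": ["Acme Corp", "Beta LLC"],
}
```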
The four steps of TAO
TAO pipeline
As shown in the diagram above, TAO involves the following four steps:
- Response Generation
- Response Scoring
- Reinforcement Learning Training
- Continuous Improvement
Experiment Setup
Model Specialization UI
There isn't a whole lot required to start with Model Specialization. In fact, we only need two simple things.
- A problem statement. It does not have to be long, but it should be precise about what you want to do. For example: "Summarize the chess games. Tell me how each player did in a simple and easy-to-understand way."
- Unlabeled data - the data for the problem statement. In our case, it's unannotated PGNs (Portable Game Notation records of chess games).
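For a concrete picture, here is a hypothetical example of a single unlabeled record: a raw PGN string with no summary or annotation attached. Only the problem statement and rows like this are needed to start Model Specialization.

```python
# Hypothetical example of one unlabeled record: a raw PGN with no summary attached.
unlabeled_game = """[Event "Casual Game"]
[White "Player A"]
[Black "Player B"]
[Result "1-0"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 1-0"""
```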
Note that, as with all reinforcement learning, the goal is to converge toward an optimum, so it is ideal if we know exactly what we want to get out of the fine-tuning. Hence, obtaining ~100 human-approved ground-truth examples will speed up the tuning process.
For example, AlphaZero was trained by simply playing against itself multiple times, using 5,000 first-generation TPUs to generate the games and 64 second-generation TPUs to train the neural networks. Training took several days, totaling about 41 TPU-years.
But rest assured, TAO can be done in just minutes and not days!
Response Generation
After experiment setup, we can start the tuning process. The first step is Response Generation.
This stage begins with collecting example input prompts or queries for a task. In Model Specialization, these are called Evaluation Criteria.
Evaluation Criteria
Each criterion is then used to generate a diverse set of candidate responses.
Besides collecting them manually via MLflow traces, Databricks will also recommend these prompts for you automatically. A rich spectrum of generation strategies is applied to produce the recommendations, ranging from simple chain-of-thought prompting to sophisticated reasoning and structured prompting techniques.
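As a rough sketch of the manual route, the snippet below pulls previously logged requests out of MLflow traces so they can be reviewed and promoted into Evaluation Criteria. It assumes your agent already logs traces to an MLflow experiment; the experiment path is a placeholder.

```python
# A minimal sketch, assuming an MLflow experiment that already contains traces
# logged by your agent. The experiment path below is a placeholder.
import mlflow

mlflow.set_experiment("/Users/you@example.com/chess-summarizer")  # hypothetical path
traces = mlflow.search_traces(max_results=100)  # returns a pandas DataFrame of traces

# Each row holds the original request sent to the model; these can be reviewed
# and promoted into Evaluation Criteria for Model Specialization.
prompts = traces["request"].tolist()
print(prompts[:3])
```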
Response Scoring
If you are familiar with Agent Evaluation, this is where you will find a familiar interface. In this stage, generated responses are evaluated against the prompts. Scoring methodologies include a variety of strategies, such as reward modeling, preference-based scoring, or task-specific verification using LLM judges or custom rules. This stage ensures each generated response is quantitatively assessed for quality and alignment with the criteria. Rest assured, all of this is done for you automatically.
Agent Evaluation
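This resembles what you could do yourself with Mosaic AI Agent Evaluation. The minimal sketch below, run from a Databricks notebook with the databricks-agents package installed and using made-up data, scores responses with the built-in LLM judges; Model Specialization runs this kind of scoring for you automatically.

```python
# A minimal sketch of LLM-judge scoring with Mosaic AI Agent Evaluation on Databricks.
# The example data is made up; Model Specialization does this scoring automatically.
import mlflow
import pandas as pd

eval_df = pd.DataFrame(
    {
        "request": ["Summarize this chess game: 1. e4 e5 2. Nf3 Nc6 ..."],
        "response": ["White opened with e4, won a pawn in the middlegame, and converted the endgame."],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(data=eval_df, model_type="databricks-agent")

# Per-response judge scores and rationales land in the evaluation results table.
display(results.tables["eval_results"])
```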
Reinforcement Learning Training
In this stage, an RL-based approach is applied to update the LLM, guiding the model to produce outputs closely aligned with the high-scoring responses identified in the previous step. Once again, this is a no-code solution: Update agent will do the heavy lifting for you. Everything happens behind the scenes and is serverless.
Update agent will kick off reinforcement learning
Continuous Improvement
As usual, the model is deployed to a serving endpoint, and you can begin generating training data for the next round of TAO using batch inference (ai_query). On Databricks, your LLM can get better the more you use it, thanks to TAO.
model serving endpoint
Serverless batch inference is a ground-breaking feature: you do not need to run any infrastructure yourself for inference, and you can use the endpoint immediately without any delay. As seen below, the model is able to annotate 20 unannotated games in 2 minutes and 51 seconds.
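For reference, here is a minimal sketch of what that batch inference could look like from a Databricks notebook (where spark and display are predefined); the endpoint and table names are placeholders.

```python
# A minimal sketch of serverless batch inference with the ai_query SQL function.
# Endpoint and table names are placeholders.
annotated = spark.sql("""
    SELECT
      pgn,
      ai_query(
        'chess-summarizer-endpoint',
        CONCAT('Summarize this chess game: ', pgn)
      ) AS summary
    FROM main.default.unlabeled_chess_games
""")
display(annotated)
```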
Worried about being boxed in? These endpoints can be used anywhere via the OpenAI SDK, similar to Databricks' FMAPI or any other model APIs on the market.
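For example, a minimal sketch of calling the fine-tuned endpoint through the OpenAI SDK might look like this; the workspace host, token, and endpoint name are placeholders.

```python
# A minimal sketch of calling a Databricks serving endpoint via the OpenAI SDK.
# Workspace host, token, and endpoint name are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="<databricks-personal-access-token>",
    base_url="https://<workspace-host>/serving-endpoints",
)

response = client.chat.completions.create(
    model="chess-summarizer-endpoint",  # the Model Specialization serving endpoint
    messages=[{"role": "user", "content": "Summarize this chess game: 1. e4 e5 2. Nf3 Nc6 ..."}],
)
print(response.choices[0].message.content)
```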
Chess specialization fine-tuned model with AI Builder
Conclusion
Databricks' Test-time Adaptive Optimization (TAO) represents a significant breakthrough in how language models can be customized for enterprise applications. By eliminating the need for labeled data while achieving impressive performance improvements, TAO addresses one of the most persistent barriers to enterprise AI adoption.
AI Builder combines an end-to-end serverless API, the latest research techniques, user-friendly reinforcement learning, agent evaluation, model serving, and batch inference. Everything is out of the box and ready to use in a no-code or low-code fashion. AI Builder has the potential to accelerate AI implementation across industries by making high-quality, domain-specific language models more accessible and cost-effective. For organizations seeking to leverage the power of language models without committing extensive resources to data labeling, TAO offers a promising path forward that could fundamentally change how specialized AI systems are developed and deployed.

AUTHOR
Jason Yip
Director of Data and AI, Tredence Inc.