
Databricks recently introduced Test-time Adaptive Optimization (TAO), a groundbreaking approach to fine-tuning large language models that eliminates the need for labeled data. This innovative method leverages test-time compute and reinforcement learning so that enterprises can enhance AI model performance using only the unlabeled usage data (or minimal labeled data) they already possess. TAO can elevate open-source models to compete with more expensive proprietary alternatives while significantly reducing development costs and inference time.
Introducing Databricks AI Builder
Databricks AI Builder
Databricks AI Builder is a new beta service as of April 2025. It currently offers two use cases:
- Information Extraction - you provide an unstructured document and the service transforms its key information into JSON format. An example use case is extracting key information from legal documents (a hypothetical illustration follows this list).
- Model Specialization - the user interface behind TAO! We will take a deep dive into the four steps of TAO and how to carry them out in Model Specialization. In a nutshell, Model Specialization lets us fine-tune an LLM with reinforcement learning happening entirely behind the scenes, then exposes the tuned model as a serving endpoint for batch inference.
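To make the Information Extraction use case concrete, here is a purely hypothetical sketch of the kind of structured output it could produce from a legal document; the document snippet and field names are invented for illustration.

```python
# Hypothetical illustration only: the kind of structured JSON that Information
# Extraction might return for an unstructured legal document. Field names are invented.
contract_text = "This Services Agreement is made on 1 March 2025 between Acme Corp and Beta LLC ..."

extracted = {
    "agreement_type": "Services Agreement",
    "effective_date": "2025-03-01",
    "parties": ["Acme Corp", "Beta LLC"],
}
```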
The four steps of TAO
TAO pipeline
As shown in the diagram above, TAO involves the following four steps:
- Response Generation
- Response Scoring
- Reinforcement Learning Training
- Continuous Improvement
Experiment Setup
Model Specialization UI
There isn't a whole lot required to start with Model Specialization. In fact, we only need two simple things.
- A problem statement. It does not have to be long, but it should be precise about what you want to do. For example: "Summarize the chess games. Tell me how each player did in a simple and easy-to-understand way."
- Unlabeled data - the data for the problem statement. In our case, it's unannotated PGNs (Portable Game Notation records of chess games).
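For a concrete picture, here is a hypothetical example of a single unlabeled record: a raw PGN string with no summary or annotation attached. Only the problem statement and rows like this are needed to start Model Specialization.

```python
# Hypothetical example of one unlabeled record: a raw PGN with no summary attached.
unlabeled_game = """[Event "Casual Game"]
[White "Player A"]
[Black "Player B"]
[Result "1-0"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 1-0"""
```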
Note that, as with all reinforcement learning, the goal is to converge toward an optimum, so it is ideal if we know exactly what we want to get out of the fine-tuning. Hence, obtaining ~100 human-approved ground-truth examples will speed up the tuning process.
For example, AlphaZero was trained by simply playing against itself multiple times, using 5,000 first-generation TPUs to generate the games and 64 second-generation TPUs to train the neural networks. Training took several days, totaling about 41 TPU-years.
But rest assured, TAO can be done in just minutes and not days!
Response Generation
After experiment setup, we can start the tuning process. The first step is Response Generation.
This stage begins with collecting example input prompts or queries for a task. In Model Specialization, these are called Evaluation Criteria.
Evaluation Criteria
Each criterion is then used to generate a diverse set of candidate responses.
Besides collecting them manually via MLflow traces, Databricks will also recommend these prompts for you automatically. A rich spectrum of generation strategies is applied to produce the recommendations, ranging from simple chain-of-thought prompting to sophisticated reasoning and structured prompting techniques.
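As a rough sketch of the manual route, the snippet below pulls previously logged requests out of MLflow traces so they can be reviewed and promoted into Evaluation Criteria. It assumes your agent already logs traces to an MLflow experiment; the experiment path is a placeholder.

```python
# A minimal sketch, assuming an MLflow experiment that already contains traces
# logged by your agent. The experiment path below is a placeholder.
import mlflow

mlflow.set_experiment("/Users/you@example.com/chess-summarizer")  # hypothetical path
traces = mlflow.search_traces(max_results=100)  # returns a pandas DataFrame of traces

# Each row holds the original request sent to the model; these can be reviewed
# and promoted into Evaluation Criteria for Model Specialization.
prompts = traces["request"].tolist()
print(prompts[:3])
```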
Response Scoring
If you are familiar with Agent Evaluation, this is where you will find a familiar interface. In this stage, generated responses are evaluated against the prompts. Scoring methodologies include a variety of strategies, such as reward modeling, preference-based scoring, or task-specific verification using LLM judges or custom rules. This stage ensures each generated response is quantitatively assessed for quality and alignment with the criteria. Rest assured, all of this is done for you automatically.
Agent Evaluation
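This resembles what you could do yourself with Mosaic AI Agent Evaluation. The minimal sketch below, run from a Databricks notebook with the databricks-agents package installed and using made-up data, scores responses with the built-in LLM judges; Model Specialization runs this kind of scoring for you automatically.

```python
# A minimal sketch of LLM-judge scoring with Mosaic AI Agent Evaluation on Databricks.
# The example data is made up; Model Specialization does this scoring automatically.
import mlflow
import pandas as pd

eval_df = pd.DataFrame(
    {
        "request": ["Summarize this chess game: 1. e4 e5 2. Nf3 Nc6 ..."],
        "response": ["White opened with e4, won a pawn in the middlegame, and converted the endgame."],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(data=eval_df, model_type="databricks-agent")

# Per-response judge scores and rationales land in the evaluation results table.
display(results.tables["eval_results"])
```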
Reinforcement Learning Training
In this stage, an RL-based approach is applied to update the LLM, guiding the model to produce outputs closely aligned with the high-scoring responses identified in the previous step. Once again, this is a no-code solution: Update agent will do the heavy lifting for you. Everything happens behind the scenes and is serverless.
Update agent will kick off reinforcement learning
Continuous Improvement
As usual, the model is deployed to a serving endpoint, and you can begin generating training data for the next round of TAO using batch inference (ai_query). On Databricks, your LLM can get better the more you use it, thanks to TAO.
model serving endpoint
Serverless batch inference is a ground-breaking feature: you do not need to run any infrastructure yourself for inference, and you can use the endpoint immediately without any delay. As seen below, the model is able to annotate 20 unannotated games in 2 minutes and 51 seconds.
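For reference, here is a minimal sketch of what that batch inference could look like from a Databricks notebook (where spark and display are predefined); the endpoint and table names are placeholders.

```python
# A minimal sketch of serverless batch inference with the ai_query SQL function.
# Endpoint and table names are placeholders.
annotated = spark.sql("""
    SELECT
      pgn,
      ai_query(
        'chess-summarizer-endpoint',
        CONCAT('Summarize this chess game: ', pgn)
      ) AS summary
    FROM main.default.unlabeled_chess_games
""")
display(annotated)
```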
Worried about being boxed in? These endpoints can be used anywhere via the OpenAI SDK, similar to Databricks' FMAPI or any other model APIs on the market.
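For example, a minimal sketch of calling the fine-tuned endpoint through the OpenAI SDK might look like this; the workspace host, token, and endpoint name are placeholders.

```python
# A minimal sketch of calling a Databricks serving endpoint via the OpenAI SDK.
# Workspace host, token, and endpoint name are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="<databricks-personal-access-token>",
    base_url="https://<workspace-host>/serving-endpoints",
)

response = client.chat.completions.create(
    model="chess-summarizer-endpoint",  # the Model Specialization serving endpoint
    messages=[{"role": "user", "content": "Summarize this chess game: 1. e4 e5 2. Nf3 Nc6 ..."}],
)
print(response.choices[0].message.content)
```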
Chess specialization fine-tuned model with AI Builder
Conclusion
Databricks' Test-time Adaptive Optimization (TAO) represents a significant breakthrough in how language models can be customized for enterprise applications. By eliminating the need for labeled data while achieving impressive performance improvements, TAO addresses one of the most persistent barriers to enterprise AI adoption.
AI Builder combines an end-to-end serverless API, the latest research techniques, user-friendly reinforcement learning, agent evaluation, model serving, and batch inference. Everything is out of the box and ready to use in a no-code or low-code fashion. AI Builder has the potential to accelerate AI implementation across industries by making high-quality, domain-specific language models more accessible and cost-effective. For organizations seeking to leverage the power of language models without committing extensive resources to data labeling, TAO offers a promising path forward that could fundamentally change how specialized AI systems are developed and deployed.

AUTHOR
Jason Yip
Director of Data and AI, Tredence Inc.