
In the previous post, we discussed the three types of fine-tuning techniques provided by Databricks:
- Chat Completion: Fine-tune your model on chat logs between a user and an AI assistant
- Continued Pretraining: Train your model with additional text data to add new knowledge to a model
- Instruction Fine-tuning: Fine-tune your model on structured prompt-response data to adapt the model to a new task
In this post, we will attempt to fine-tune Llama 3.3 70B Instruct to do a better job at chess commentary. The best chess engines in the world can beat the best human players because they can evaluate a virtually infinite number of moves per second and pick the best one. There are also battles between AI engines and search-based engines. Unfortunately, without human-like commentary, it is hard to learn from these wonderful games. This is one reason to fine-tune an LLM for better chess commentary using Chat Completion.
Why does Databricks recommend Chat Completion?
1. Standardization
- It enforces a clean separation between different speaker roles (system, user, assistant).
- This aligns well with the format that popular chat APIs require (e.g., the OpenAI Chat Completions API); a sample record in this format is shown after this list.
2. Better Alignment for Fine-Tuning
- If you’re training or fine-tuning an LLM on Databricks, chat-format data helps align the model with human interaction patterns, improving downstream performance for agents, assistants, copilots, etc.
3. Improved Model Behavior
- When trained in chat format, models tend to follow instructions better and are easier to steer with a system prompt.
- System prompts (e.g., “You are a chess master”) have a powerful effect and are only possible in this structure.
4. Tooling and Ecosystem Support
- Libraries like OpenAI’s API, LangChain, DSPy, and LangGraph are optimized for chat-format interaction.
- Databricks’ foundation model training API also provides utilities that automatically convert the data into the appropriate format for the specific model (e.g., the Llama 4 prompt format).
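To make the data format concrete, here is an illustrative chat-format training record as a minimal Python/JSONL sketch. The system prompt, move encoding, commentary text, and file path are placeholders for illustration; the actual template we used lives in the data prep notebook linked at the end of this post.

```python
import json

# Illustrative chat-format (Chat Completion) training record. The system prompt,
# move encoding, and commentary text are placeholders, not the real training data.
example_record = {
    "messages": [
        {
            "role": "system",
            "content": "You are a chess master providing lively, human commentary on each move.",
        },
        {
            "role": "user",
            "content": (
                "Position (FEN): r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3\n"
                "Move played: 3. Bb5"
            ),
        },
        {
            "role": "assistant",
            "content": "White develops the bishop to b5, immediately asking questions of the knight that guards e5.",
        },
    ]
}

# Fine-tuning data is typically stored as one JSON object per line (JSONL).
# The Unity Catalog Volume path below is a placeholder.
with open("/Volumes/main/chess/train/chess_commentary.jsonl", "a") as f:
    f.write(json.dumps(example_record) + "\n")
```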
In this post, we strive to do more (coding) with less (writing), so it’s time to get started. Throughout this experiment, we will leverage two repos (linked under Source code at the end of this post).
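For reference, below is a minimal sketch of launching a Chat Completion fine-tuning run with the foundation model training API mentioned above, assuming the databricks_genai package is installed. The model identifier, data path, catalog/schema, and training duration are placeholders, and the exact parameters may differ in your workspace.

```python
# Minimal sketch of a Chat Completion fine-tuning run using the Databricks
# foundation model training API (databricks_genai). All names below are placeholders.
from databricks.model_training import foundation_model as fm

run = fm.create(
    model="meta-llama/Llama-3.3-70B-Instruct",   # base model (check the supported-model list)
    train_data_path="/Volumes/main/chess/train/chess_commentary.jsonl",  # JSONL of chat-format records
    task_type="CHAT_COMPLETION",                 # the fine-tuning technique discussed above
    register_to="main.chess",                    # Unity Catalog catalog.schema for the tuned model
    training_duration="3ep",                     # e.g., three epochs
)

print(run.name)  # track progress in the Experiments UI or via fm.get_events(run)
```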
Experiment setup
1. In our data prep notebook, we created a prompt template and a special structured format to give the LLM as many hints as possible about every move (a sketch of what this looks like appears after this list).
2. We compared the commentary from a few different LLMs, including:
- Claude 3.7 Sonnet (reasoning mode)
- Meta Llama 4 Maverick
- Our new chess commentary model
3. We evaluated a few games between Stockfish and the above LLMs.
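To illustrate the kind of per-move structure described in step 1, here is a simplified sketch of a prompt template. The field names, FEN strings, evaluation, and continuation are illustrative placeholders; the actual template and hints are in the data prep notebook.

```python
# Simplified sketch of a per-move prompt: pack every hint the model can use
# (position before/after, the move, an engine evaluation, the engine's preferred
# continuation) into one structured block. Field names and values are placeholders.
MOVE_PROMPT_TEMPLATE = """You are commentating move {move_number} ({side_to_move} to move).
Position before the move (FEN): {fen_before}
Move played (SAN): {san}
Position after the move (FEN): {fen_after}
Engine evaluation after the move (centipawns): {eval_cp}
Engine's preferred continuation: {best_line}

Write one short paragraph of lively, human commentary on this move."""

prompt = MOVE_PROMPT_TEMPLATE.format(
    move_number=2,
    side_to_move="Black",
    fen_before="rnbqkbnr/pppp1ppp/8/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2",
    san="Nc6",
    fen_after="r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3",
    eval_cp=30,
    best_line="3. Bb5 a6 4. Ba4",
)
print(prompt)
```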
Conclusion
We found that the fine-tuned commentary model sounds more human because it brings emotion to the game itself, and that is exactly what a human commentator should care about!

Take this comment as an example from the commentary model:
"Black voluntarily transposes the king from its moderately safe position in the back rank to a more dangerous and vulnerable stay in the middle of the board."
Compare that to the analysis from the base LLM, which sounds like a robot:
"Black moves the king to e7, trying to escape the check and reorganize the position. However, this might weaken the king’s safety further and create long-term issues."
Below, you can check out the games yourself along with the chess engine analysis and the commentary!

Source code:
- https://cdn.gisthostfor.me/rwforest-m4hNpkiUD1-chess_gpt_data_preprocessing.html
- https://cdn.gisthostfor.me/rwforest-yCKVL0xbe3-chat_completion_inference.html

Author
Jason Yip
Director of Data and AI, Tredence Inc.