
In the previous post, we discussed the three types of fine-tuning techniques provided by Databricks:
- Chat Completion: Fine-tune your model on chat logs between a user and an AI assistant
- Continued Pretraining: Train your model with additional text data to add new knowledge to a model
- Instruction Fine-tuning: Fine-tune your model on structured prompt-response data to adapt the model to a new task
In this post, we will attempt to fine-tune Llama 3.3 70B Instruct to do a better job at chess commentary. The best chess engines in the world can beat the best human players because they can evaluate a virtually infinite number of moves per second and pick the best one. There are also battles between AI engines and search-based engines. Unfortunately, without human-like commentary, it is hard to learn from these wonderful games. This is one reason to fine-tune an LLM for better chess commentary using Chat Completion.
Why does Databricks recommend Chat Completion?
1. Standardization
- It enforces a clean separation between different speaker roles (system, user, assistant).
- This aligns well with the format that popular chat APIs require (e.g., the OpenAI Chat Completions API); a sample record in this format is shown after this list.
2. Better Alignment for Fine-Tuning
- If you’re training or fine-tuning an LLM on Databricks, chat-format data helps align the model with human interaction patterns, improving downstream performance for agents, assistants, copilots, etc.
3. Improved Model Behavior
- When trained in chat format, models tend to follow instructions better and are easier to steer with a system prompt.
- System prompts (e.g., “You are a chess master”) have a powerful effect and are only possible in this structure.
4. Tooling and Ecosystem Support
- Libraries like OpenAI’s API, LangChain, DSPy, and LangGraph are optimized for chat-format interaction.
- Databricks’ foundation model training API also provides utilities that automatically convert the data into the appropriate format for the specific model (e.g., the Llama 4 prompt format).
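To make the data format concrete, here is an illustrative chat-format training record as a minimal Python/JSONL sketch. The system prompt, move encoding, commentary text, and file path are placeholders for illustration; the actual template we used lives in the data prep notebook linked at the end of this post.

```python
import json

# Illustrative chat-format (Chat Completion) training record. The system prompt,
# move encoding, and commentary text are placeholders, not the real training data.
example_record = {
    "messages": [
        {
            "role": "system",
            "content": "You are a chess master providing lively, human commentary on each move.",
        },
        {
            "role": "user",
            "content": (
                "Position (FEN): r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3\n"
                "Move played: 3. Bb5"
            ),
        },
        {
            "role": "assistant",
            "content": "White develops the bishop to b5, immediately asking questions of the knight that guards e5.",
        },
    ]
}

# Fine-tuning data is typically stored as one JSON object per line (JSONL).
# The Unity Catalog Volume path below is a placeholder.
with open("/Volumes/main/chess/train/chess_commentary.jsonl", "a") as f:
    f.write(json.dumps(example_record) + "\n")
```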
In this post, we strive to do more (coding) with less (writing), so it’s time to get started. Throughout this experiment, we will leverage two repos (linked under Source code at the end of this post).
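For reference, below is a minimal sketch of launching a Chat Completion fine-tuning run with the foundation model training API mentioned above, assuming the databricks_genai package is installed. The model identifier, data path, catalog/schema, and training duration are placeholders, and the exact parameters may differ in your workspace.

```python
# Minimal sketch of a Chat Completion fine-tuning run using the Databricks
# foundation model training API (databricks_genai). All names below are placeholders.
from databricks.model_training import foundation_model as fm

run = fm.create(
    model="meta-llama/Llama-3.3-70B-Instruct",   # base model (check the supported-model list)
    train_data_path="/Volumes/main/chess/train/chess_commentary.jsonl",  # JSONL of chat-format records
    task_type="CHAT_COMPLETION",                 # the fine-tuning technique discussed above
    register_to="main.chess",                    # Unity Catalog catalog.schema for the tuned model
    training_duration="3ep",                     # e.g., three epochs
)

print(run.name)  # track progress in the Experiments UI or via fm.get_events(run)
```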
Experiment setup
1. In our data prep notebook, we created a prompt template and a special structured format to give the LLM as many hints as possible about every move (a sketch of what this looks like appears after this list).
2. We compared the commentary from a few different LLMs, including:
- Claude 3.7 Sonnet (reasoning mode)
- Meta Llama 4 Maverick
- Our new chess commentary model
3. We evaluated a few games between Stockfish and the above LLMs.
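To illustrate the kind of per-move structure described in step 1, here is a simplified sketch of a prompt template. The field names, FEN strings, evaluation, and continuation are illustrative placeholders; the actual template and hints are in the data prep notebook.

```python
# Simplified sketch of a per-move prompt: pack every hint the model can use
# (position before/after, the move, an engine evaluation, the engine's preferred
# continuation) into one structured block. Field names and values are placeholders.
MOVE_PROMPT_TEMPLATE = """You are commentating move {move_number} ({side_to_move} to move).
Position before the move (FEN): {fen_before}
Move played (SAN): {san}
Position after the move (FEN): {fen_after}
Engine evaluation after the move (centipawns): {eval_cp}
Engine's preferred continuation: {best_line}

Write one short paragraph of lively, human commentary on this move."""

prompt = MOVE_PROMPT_TEMPLATE.format(
    move_number=2,
    side_to_move="Black",
    fen_before="rnbqkbnr/pppp1ppp/8/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2",
    san="Nc6",
    fen_after="r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3",
    eval_cp=30,
    best_line="3. Bb5 a6 4. Ba4",
)
print(prompt)
```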
Conclusion
We found that the fine-tuned commentary model sounds more human because it brings emotion to the game itself, and that is exactly what a human commentator should care about!

Take this comment as an example from the commentary model:
"Black voluntarily transposes the king from its moderately safe position in the back rank to a more dangerous and vulnerable stay in the middle of the board."
Compare that to the analysis from the base LLM, which sounds like a robot:
"Black moves the king to e7, trying to escape the check and reorganize the position. However, this might weaken the king’s safety further and create long-term issues."
Below, you can check out the games yourself along with the chess engine analysis and the commentary!

Source code:
- https://cdn.gisthostfor.me/rwforest-m4hNpkiUD1-chess_gpt_data_preprocessing.html
- https://cdn.gisthostfor.me/rwforest-yCKVL0xbe3-chat_completion_inference.html

Author
Jason Yip
Director of Data and AI, Tredence Inc.