Fine-tuning a Chess Commentary Model

Databricks

Date: 04/22/2025

Explore how fine-tuning the Llama 3.3 70B model with Databricks' Chat Completion technique enhances AI-generated chess commentary. Learn about the process, benefits, and comparison to other LLMs like Claude Sonnet and Meta Llama, with examples from top chess engines.

Jason Yip
Director of Data and AI, Tredence Inc.

Table of contents

  • Why does Databricks recommend Chat Completion?
  • Experiment setup
  • Conclusion

In the previous post, we discussed the three fine-tuning techniques that Databricks provides:

  • Chat Completion: Fine-tune your model on chat logs between a user and an AI assistant
  • Continued Pretraining: Train your model with additional text data to add new knowledge to a model
  • Instruction Fine-tuning: Fine-tune your model on structured prompt-response data to adapt the model to a new task

In this post, we will fine-tune Llama 3.3 70B Instruct to do a better job at chess commentary. The best chess engines in the world can beat the best human players because they can evaluate a virtually unlimited number of moves per second and pick the best continuation. There are also fascinating battles between AI-based engines and search-based engines. Unfortunately, without human-like commentary, it is hard to learn from these wonderful games. That is why we fine-tune an LLM for better chess commentary with Chat Completion.
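To make the format concrete, here is a minimal sketch of what one chat-format training record could look like for chess commentary. The prompt wording, FEN, and commentary below are illustrative assumptions rather than the exact template from our data prep notebook; the key point is the JSONL structure with role-separated messages.

    # A hypothetical chat-format training record for chess commentary.
    # The prompt template and content are illustrative, not the exact ones from the notebook.
    import json

    record = {
        "messages": [
            {"role": "system", "content": "You are a chess master providing lively commentary."},
            {"role": "user", "content": "Position (FEN): r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3\nMove played: 3. Bb5"},
            {"role": "assistant", "content": "The Ruy Lopez! White immediately puts the question to the c6 knight and fights for the center."},
        ]
    }

    # The training file is JSONL: one such object per line.
    print(json.dumps(record))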

Why does Databricks recommend Chat Completion?

1. Standardization

  • It enforces a clean separation between different speaker roles (system, user, assistant).
  • This aligns well with the format that popular chat APIs require (e.g., the OpenAI Chat Completions API).

2. Better Alignment for Fine-Tuning

  • If you’re training or fine-tuning an LLM on Databricks, chat-format data helps align the model with human interaction patterns, improving downstream performance for agents, assistants, copilots, etc.

3. Improved Model Behavior

  • When trained in chat format, models tend to follow instructions better and are easier to steer with a system prompt.
  • System prompts (e.g., “You are a chess master”) have a powerful effect and are only possible in this structure.

4. Tooling and Ecosystem Support

  • Libraries like OpenAI’s API, LangChain, DSPy, and LangGraph are optimized for chat-format interaction.
  • Databricks’ foundation model training API also provides utilities that automatically convert the data into the appropriate format for the specific model (e.g., the Llama 4 prompt format). A rough sketch of invoking this API is shown below.
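For orientation, here is a rough sketch of launching a Chat Completion fine-tuning run with that API. The model name, data path, and catalog/schema names are placeholders, and the parameters follow the public databricks_genai client; consult the Databricks fine-tuning example repo for the exact call.

    # Rough sketch: launch a Chat Completion fine-tuning run on Databricks.
    # Model name, data path, and catalog/schema names are placeholders.
    from databricks.model_training import foundation_model as fm

    run = fm.create(
        model="meta-llama/Llama-3.3-70B-Instruct",          # base model (check the supported-models list)
        train_data_path="catalog.schema.chess_commentary",  # UC table or JSONL file with chat-format records
        register_to="catalog.schema",                       # where the fine-tuned model gets registered
        task_type="CHAT_COMPLETION",                        # the technique discussed in this post
        training_duration="3ep",                            # e.g., three epochs
    )

    print(run.name)  # track progress in the Databricks UI or via fm.get_events(run)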

In this post we strive to do more (coding) with less (writing), so it’s time to get started. Throughout this experiment, we will leverage two repos:

  1. Databricks fine-tuning example
  2. ChessGPT training data (annotated PGN)

Experiment setup

1. In our data prep notebook, we created a prompt template and a special structured format to give the LLM as many hints as possible on every move (a simplified sketch of this step follows the list below).

2. We compared the commentary from a few different LLMs, including:

  • Claude 3.7 Sonnet (reasoning mode)
  • Meta Llama 4 Maverick 
  • Our new chess commentary model 

3. We evaluated a few games played between Stockfish and the LLMs above.
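For illustration, here is a simplified sketch of the data prep idea using the python-chess library: walk an annotated PGN, record the position and move, and keep the human comment as the target commentary. The file path is a placeholder, and the real notebook builds a much richer structured prompt per move.

    # Simplified sketch of turning one annotated PGN game into (prompt, commentary) pairs.
    # The path is a placeholder; the actual notebook uses a richer structured format per move.
    import chess.pgn

    with open("annotated_games.pgn") as f:   # e.g., a file from the ChessGPT training data
        game = chess.pgn.read_game(f)        # parse the first game in the file

    pairs = []
    board = game.board()
    for node in game.mainline():
        san = board.san(node.move)           # the move in standard algebraic notation
        prompt = f"Position (FEN): {board.fen()}\nMove played: {san}"
        if node.comment:                     # keep only moves that carry human commentary
            pairs.append({"prompt": prompt, "commentary": node.comment.strip()})
        board.push(node.move)

    print(f"Collected {len(pairs)} (move, commentary) pairs")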

Conclusion

We found that the fine-tuned commentary model sounds more human: it brings emotion to the game itself, the way a human commentator who cares about the game would!

World Champion Magnus Carlsen (credit: chess.com)

Take this comment as an example from the commentary model:

"Black voluntarily transposes the king from its moderately safe position in the back rank to a more dangerous and vulnerable stay in the middle of the board."

Compare that with the analysis from the off-the-shelf LLM, which sounds like a robot:

"Black moves the king to e7, trying to escape the check and reorganize the position. However, this might weaken the king’s safety further and create long-term issues."

Below you can check out the games yourself along with chess engine analysis and the comments!

Stockfish vs Claude Sonnet chess tournament

Source code:
