Databricks Agents: A Chess Story

Databricks

Date: 04/22/2025

Explore how Databricks agents interact with chess engines like Stockfish, showcasing the challenges of Large Language Models (LLMs) in chess. Learn how agent tracing and Databricks’ GenAI platform enhance AI performance in this experiment.

Jason Yip
Director of Data and AI, Tredence Inc.

Table of contents

Databricks Agents: A Chess Story

  • Experiment Setup
  • Why Do LLMs Struggle with Chess?

Databricks Agents vs. Chess Engines

In the previous blog, we explored how function calling enables large language models (LLMs) to interact with external tools. While LLMs excel at generating human-like text and reasoning through natural language, they struggle with tasks requiring precise computation, real-time data, or domain-specific expertise. Chess is a good example: it is a game of strategic depth and near-infinite possibilities, and despite the advancement of Large Language Models, they are still not good at playing it. Purpose-built AI, however, has long mastered the game: in 2017, DeepMind’s AlphaZero defeated the world-champion chess engine Stockfish and the shogi engine Elmo (https://en.wikipedia.org/wiki/AlphaZero), and the creators of AlphaFold at DeepMind went on to win the Nobel Prize in Chemistry in 2024.

In this article, we want to determine whether an LLM stands a chance against the strong chess engine Stockfish, so we will build an AI agent on Databricks and let the two play each other. We’ll also leverage Databricks and MLflow agent tracing to log the interactions between the prompt and the LLM response. To demonstrate the flexibility of Databricks’ platform, we will use an open-source codebase as a drop-in framework for Databricks’ agent tracing, showing that there is no migration requirement to onboard to Databricks’ GenAI platform.

Experiment Setup

Databricks has added agent tracing to all the popular agent frameworks out there. A full list can be found here: https://docs.databricks.com/aws/en/mlflow/mlflow-tracing

As mentioned before, the advantage of Databricks is that there is no vendor lock-in: the platform is compatible with any code out there. We can easily bring in some code from GitHub and add MLflow autologging on top. Databricks also provides the Foundation Model API (FMAPI) out of the box, so we can easily let an LLM play chess within the Databricks environment. However, agent tracing is also supported when using an external API.
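For instance, a minimal sketch of tracing calls to an external OpenAI-compatible endpoint might look like the following; the provider URL, token, and model name are placeholders, and mlflow.openai.autolog() patches the OpenAI client so that any call it makes is traced regardless of which endpoint it targets.

import mlflow
from openai import OpenAI

# Trace every call made through the OpenAI client, whichever endpoint it targets.
mlflow.openai.autolog()

# Hypothetical external provider exposing an OpenAI-compatible API.
client = OpenAI(
    api_key="your-provider-token",
    base_url="https://api.example-provider.com/v1",
)

response = client.chat.completions.create(
    model="some-external-model",
    messages=[{"role": "user", "content": "Suggest an opening move for White."}],
)
print(response.choices[0].message.content)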

We will leverage the repo from Franck S. Ndzomga, “What happens when LLMs play chess?,” 2024.

GitHub repository:
https://github.com/fsndzomga/chess_tournament_nebius_dspy

To leverage some of the built-in Databricks functionalities, we will make the following changes:

  1. Download the Stockfish Ubuntu build from https://stockfishchess.org/download/ and place the binary in ADLS
  2. Add an additional provider called “databricks” in chess_model.py by leveraging the OpenAI client, as in the snippet below; a fuller sketch of how this client could request a move follows it. Remember, traces are also supported with external APIs.
if self.provider == 'databricks':
    client = OpenAI(
        api_key="dapi-your-databricks-token",
        base_url="https://example.staging.cloud.databricks.com/serving-endpoints"
    )
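
Databricks model serving endpoints expose an OpenAI-compatible chat completions API, so once the client points at the serving endpoint, the rest of the move-generation logic can stay unchanged. Below is a minimal, hypothetical sketch of how such a client could be asked for a move; the get_move helper, the prompt wording, and the FEN/legal-move inputs are our own illustration rather than the repo’s actual interface.

from openai import OpenAI

# Hypothetical helper: ask a Databricks-served model for its next move.
# The token, workspace URL, and prompt format are placeholders for illustration.
def get_move(board_fen, legal_moves):
    client = OpenAI(
        api_key="dapi-your-databricks-token",
        base_url="https://example.staging.cloud.databricks.com/serving-endpoints",
    )
    response = client.chat.completions.create(
        model="databricks-meta-llama-3-3-70b-instruct",
        messages=[
            {"role": "system",
             "content": "You are a chess grandmaster. Reply with exactly one move in UCI notation."},
            {"role": "user",
             "content": f"Position (FEN): {board_fen}\nLegal moves: {', '.join(legal_moves)}\nYour move:"},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()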

3. In models.py, we can easily add the following endpoint entry:

{
    'name': 'databricks-meta-llama-3-3-70b-instruct',
    'provider': 'databricks',
    'model_id': 'databricks-meta-llama-3-3-70b-instruct',
    'rating': 1500
},

4. The only code we are “migrating” to a notebook is main.py, because we want to see Databricks playing chess. Here we will also add one line of code (excluding the import):

mlflow.dspy.autolog()

The purpose of this line of code is to capture the traces; a sketch of the resulting notebook cell follows step 5 below.

5. And now we have a chess-playing agent on Databricks!
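
Putting it together, the notebook cell might look roughly like the sketch below; run_tournament is a hypothetical stand-in for whatever entry point main.py exposes, and the experiment path and Stockfish location are placeholders.

import mlflow
import mlflow.dspy

# Capture a trace for every DSPy call the chess agents make.
mlflow.dspy.autolog()

mlflow.set_experiment("/Users/you@example.com/llm-vs-stockfish")

# Hypothetical entry point copied over from main.py; call it once it is defined in the notebook.
# run_tournament(
#     llm_model="databricks-meta-llama-3-3-70b-instruct",
#     stockfish_path="/Volumes/main/chess/stockfish/stockfish-ubuntu-x86-64",
# )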

The Results?

Stockfish won! Despite the disruption DeepSeek caused in the AI industry and the stock market, it still couldn’t win against Stockfish (remember, AlphaZero did win in 2017). But how bad is it? We ran a game analysis using chess.com. As you can see below, DeepSeek failed to gain momentum after three moves.

We can find the traces in the Experiments tab; as the trace durations show, reasoning models can take time to think.

Why Do LLMs Struggle with Chess?

There are a few reasons why LLMs fail at chess:

1. Training Data Limitations

LLMs are trained on vast text corpora, including books, articles, and online content. While they may encounter chess notation (e.g., “Qh5” or “Nf3”) or game analyses during training, this data is fragmented and incomplete. LLMs lack the structured, iterative practice required to internalize chess strategy. For example, they might recognize that “castling” is a defensive move but fail to execute it optimally in a live game.

2. No Internal Game State Representation

Chess requires maintaining and updating a dynamic board state. LLMs, however, process inputs as static sequences of tokens. They cannot natively track piece positions, legal moves, or game history. Without tools, an LLM must rely on verbose textual descriptions of the board (e.g., “White’s pawn is on e4, Black’s knight is on c6”), which are error-prone and inefficient.
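
To make this concrete, a dedicated board object (for example from the python-chess library) tracks piece positions, legal moves, and move history exactly, whereas an LLM only ever sees a stream of tokens. A minimal sketch:

import chess

board = chess.Board()   # standard starting position
board.push_san("e4")    # White plays 1. e4
board.push_san("c5")    # Black replies 1... c5

print(board.fen())                                   # exact, machine-readable game state
print(board.is_legal(chess.Move.from_uci("g1f3")))   # True: Nf3 is a legal reply
print(len(list(board.legal_moves)))                  # number of legal moves available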

3. Lack of Search and Evaluation

Chess engines like Stockfish or AlphaZero rely on tree search (alpha-beta pruning in Stockfish, Monte Carlo Tree Search in AlphaZero) and evaluation functions to assess millions of positions per second. LLMs, by contrast, generate responses based on probabilistic patterns in their training data. They cannot simulate future moves or quantify positional advantages (e.g., “controlling the center” or “pawn structure”).
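
For comparison, here is a minimal sketch of querying an engine’s search and evaluation from Python over the UCI protocol, assuming the Stockfish binary downloaded earlier is available locally (the path is a placeholder):

import chess
import chess.engine

# Point this at the Stockfish binary from the setup step.
engine = chess.engine.SimpleEngine.popen_uci("/path/to/stockfish")

board = chess.Board()
board.push_san("e4")

# Search the position to depth 20 and report the evaluation and expected line.
info = engine.analyse(board, chess.engine.Limit(depth=20))
print(info["score"])   # evaluation from the side to move's perspective
print(info.get("pv"))  # principal variation: the line the engine expects

engine.quit()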

Human in the Loop

We can use a game of chess to simulate a business objective. In our case, we have a very simple objective (to win the game) but unlimited possibilities. In the next post, we will examine how we can use the Databricks Review App UI to provide “ground truth” for the agent.


Next Topic

Human Agents in the Loop: How to Use Databricks’ Review App


