Databricks Agents: A Chess Story

Databricks

Date: 04/22/2025

Explore how Databricks agents interact with chess engines like Stockfish, showcasing the challenges of Large Language Models (LLMs) in chess. Learn how agent tracing and Databricks’ GenAI platform enhance AI performance in this experiment.

Jason Yip
Director of Data and AI, Tredence Inc.

Table of contents

Databricks Agents: A Chess Story

  • Experiment Setup
  • Why Do LLMs Struggle with Chess?

Databricks Agents vs. Chess Engines

In the previous blog, we explored how function calling enables large language models (LLMs) to interact with external tools. While LLMs excel at generating human-like text and reasoning through natural language, they struggle with tasks requiring precise computation, real-time data, or domain-specific expertise. Chess is a good example: it is a game of strategic depth and near-infinite possibilities, and despite the advancement of Large Language Models, they are still not good at playing it. Purpose-built AI, however, has long mastered the game: in 2017, DeepMind’s AlphaZero defeated the world-champion chess engine Stockfish and the shogi engine Elmo (https://en.wikipedia.org/wiki/AlphaZero), and the creators of AlphaFold at DeepMind went on to win the Nobel Prize in Chemistry in 2024.

In this article, we want to determine whether an LLM stands a chance against the strong chess engine Stockfish, so we will build an AI agent on Databricks and let the two play each other. We’ll also leverage Databricks and MLflow agent tracing to log the interactions between the prompt and the LLM response. To demonstrate the flexibility of Databricks’ platform, we will use an open-source codebase as a drop-in framework for Databricks’ agent tracing, showing that there is no migration requirement to onboard to Databricks’ GenAI platform.

Experiment Setup

Databricks has added agent tracing to all the popular agent frameworks out there. A full list can be found here: https://docs.databricks.com/aws/en/mlflow/mlflow-tracing

As mentioned before, the advantage of Databricks is that there is no vendor lock-in: the platform is compatible with any code out there. We can easily bring in some code from GitHub and add MLflow autologging on top. Databricks also provides the Foundation Model API (FMAPI) out of the box, so we can easily let an LLM play chess within the Databricks environment. However, agent tracing is also supported when using an external API.
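For instance, a minimal sketch of tracing calls to an external OpenAI-compatible endpoint might look like the following; the provider URL, token, and model name are placeholders, and mlflow.openai.autolog() patches the OpenAI client so that any call it makes is traced regardless of which endpoint it targets.

import mlflow
from openai import OpenAI

# Trace every call made through the OpenAI client, whichever endpoint it targets.
mlflow.openai.autolog()

# Hypothetical external provider exposing an OpenAI-compatible API.
client = OpenAI(
    api_key="your-provider-token",
    base_url="https://api.example-provider.com/v1",
)

response = client.chat.completions.create(
    model="some-external-model",
    messages=[{"role": "user", "content": "Suggest an opening move for White."}],
)
print(response.choices[0].message.content)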

We will leverage the repo from Franck S. Ndzomga, “What happens when LLMs play chess?,” 2024.

GitHub repository:
https://github.com/fsndzomga/chess_tournament_nebius_dspy

To leverage some of the built-in Databricks functionalities, we will make the following changes:

  1. Download the Stockfish Ubuntu build from https://stockfishchess.org/download/ and place the binary in ADLS
  2. Add an additional provider called “databricks” in chess_model.py by leveraging the OpenAI client, as in the snippet below; a fuller sketch of how this client could request a move follows it. Remember, traces are also supported with external APIs.
if self.provider == 'databricks':
    client = OpenAI(
        api_key="dapi-your-databricks-token",
        base_url="https://example.staging.cloud.databricks.com/serving-endpoints"
    )
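
Databricks model serving endpoints expose an OpenAI-compatible chat completions API, so once the client points at the serving endpoint, the rest of the move-generation logic can stay unchanged. Below is a minimal, hypothetical sketch of how such a client could be asked for a move; the get_move helper, the prompt wording, and the FEN/legal-move inputs are our own illustration rather than the repo’s actual interface.

from openai import OpenAI

# Hypothetical helper: ask a Databricks-served model for its next move.
# The token, workspace URL, and prompt format are placeholders for illustration.
def get_move(board_fen, legal_moves):
    client = OpenAI(
        api_key="dapi-your-databricks-token",
        base_url="https://example.staging.cloud.databricks.com/serving-endpoints",
    )
    response = client.chat.completions.create(
        model="databricks-meta-llama-3-3-70b-instruct",
        messages=[
            {"role": "system",
             "content": "You are a chess grandmaster. Reply with exactly one move in UCI notation."},
            {"role": "user",
             "content": f"Position (FEN): {board_fen}\nLegal moves: {', '.join(legal_moves)}\nYour move:"},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()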

3. In models.py, we can easily add the following endpoint entry:

{
    'name': 'databricks-meta-llama-3-3-70b-instruct',
    'provider': 'databricks',
    'model_id': 'databricks-meta-llama-3-3-70b-instruct',
    'rating': 1500
},

4. The only code we are “migrating” to a notebook is main.py, because we want to see Databricks playing chess. Here we will also add one line of code (excluding the import):

mlflow.dspy.autolog()

The purpose of this line of code is to capture the traces; a sketch of the resulting notebook cell follows step 5 below.

5. And now we have a chess-playing agent on Databricks!
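
Putting it together, the notebook cell might look roughly like the sketch below; run_tournament is a hypothetical stand-in for whatever entry point main.py exposes, and the experiment path and Stockfish location are placeholders.

import mlflow
import mlflow.dspy

# Capture a trace for every DSPy call the chess agents make.
mlflow.dspy.autolog()

mlflow.set_experiment("/Users/you@example.com/llm-vs-stockfish")

# Hypothetical entry point copied over from main.py; call it once it is defined in the notebook.
# run_tournament(
#     llm_model="databricks-meta-llama-3-3-70b-instruct",
#     stockfish_path="/Volumes/main/chess/stockfish/stockfish-ubuntu-x86-64",
# )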

The Results?

Stockfish won! Despite the disruption DeepSeek caused in the AI industry and the stock market, it still couldn’t win against Stockfish (remember, AlphaZero did win in 2017). But how bad is it? We ran a game analysis using chess.com. As you can see below, DeepSeek failed to gain momentum after three moves.

We can find the traces in the Experiments tab; as the trace durations show, reasoning models can take time to think.

Why Do LLMs Struggle with Chess?

There are a few reasons why LLMs fail at chess:

1. Training Data Limitations

LLMs are trained on vast text corpora, including books, articles, and online content. While they may encounter chess notation (e.g., “Qh5” or “Nf3”) or game analyses during training, this data is fragmented and incomplete. LLMs lack the structured, iterative practice required to internalize chess strategy. For example, they might recognize that “castling” is a defensive move but fail to execute it optimally in a live game.

2. No Internal Game State Representation

Chess requires maintaining and updating a dynamic board state. LLMs, however, process inputs as static sequences of tokens. They cannot natively track piece positions, legal moves, or game history. Without tools, an LLM must rely on verbose textual descriptions of the board (e.g., “White’s pawn is on e4, Black’s knight is on c6”), which are error-prone and inefficient.
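
To make this concrete, a dedicated board object (for example from the python-chess library) tracks piece positions, legal moves, and move history exactly, whereas an LLM only ever sees a stream of tokens. A minimal sketch:

import chess

board = chess.Board()   # standard starting position
board.push_san("e4")    # White plays 1. e4
board.push_san("c5")    # Black replies 1... c5

print(board.fen())                                   # exact, machine-readable game state
print(board.is_legal(chess.Move.from_uci("g1f3")))   # True: Nf3 is a legal reply
print(len(list(board.legal_moves)))                  # number of legal moves available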

3. Lack of Search and Evaluation

Chess engines like Stockfish or AlphaZero rely on tree search (alpha-beta pruning in Stockfish, Monte Carlo Tree Search in AlphaZero) and evaluation functions to assess millions of positions per second. LLMs, by contrast, generate responses based on probabilistic patterns in their training data. They cannot simulate future moves or quantify positional advantages (e.g., “controlling the center” or “pawn structure”).
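
For comparison, here is a minimal sketch of querying an engine’s search and evaluation from Python over the UCI protocol, assuming the Stockfish binary downloaded earlier is available locally (the path is a placeholder):

import chess
import chess.engine

# Point this at the Stockfish binary from the setup step.
engine = chess.engine.SimpleEngine.popen_uci("/path/to/stockfish")

board = chess.Board()
board.push_san("e4")

# Search the position to depth 20 and report the evaluation and expected line.
info = engine.analyse(board, chess.engine.Limit(depth=20))
print(info["score"])   # evaluation from the side to move's perspective
print(info.get("pv"))  # principal variation: the line the engine expects

engine.quit()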

Human in the Loop

We can use a game of chess to simulate a business objective. In our case, we have a very simple objective (to win the game) but unlimited possibilities. In the next post, we will examine how we can use the Databricks Review App UI to provide “ground truth” for the agent.


Next Topic

Human Agents in the Loop: How to Use Databricks’ Review App


