LLM Games: Structured Outputs
As much as I love chess, I will not be evaluating the agent on it. Chess is unique in that it has already been heavily modeled, with large datasets available all over the internet. LLMs are trained on chess data from these datasets, which skews the results and does not tell us whether the LLM can reason about its next move without relying on its pretrained memory of the game. Example: 1. e4 e5 2. ... (PGN chess notation; LLMs are great at predicting the next token here)
Game Analysis
The performance of LLMs across different games reveals interesting patterns in their reasoning capabilities. While simple games like Tic-Tac-Toe show near-perfect performance, more complex games like Connect 4 expose the limitations of current models.
This analysis focuses on structured outputs and how they enable better game-playing performance through clear formatting and rule adherence.
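As a minimal sketch of what a structured move output could look like, the snippet below parses and validates a JSON move for Tic-Tac-Toe. The field names (`reasoning`, `row`, `col`) and the bounds check are illustrative assumptions, not a format defined in this article:

```python
import json

def parse_move(raw: str) -> tuple[int, int]:
    """Parse and validate a model's JSON move; raises on malformed output.

    Expected (hypothetical) shape:
    {"reasoning": "...", "row": 0-2, "col": 0-2}
    """
    move = json.loads(raw)          # raises ValueError on invalid JSON
    row, col = move["row"], move["col"]
    if not (0 <= row <= 2 and 0 <= col <= 2):
        raise ValueError(f"move out of bounds: {row}, {col}")
    return row, col

# Example: a well-formed model response
print(parse_move('{"reasoning": "take the center", "row": 1, "col": 1}'))
```

Forcing the model to emit a fixed schema like this is what makes rule adherence checkable: any response that fails to parse or validate can be counted as an illegal move rather than silently accepted.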
The Elo rating system calculates the relative skill levels of players in competitive games. In our LLM game evaluation, we use Elo ratings to measure how well different models perform against each other.
The expected score formula is:

E_A = 1 / (1 + 10^((R_B − R_A) / 400))

Where R_A and R_B are the current ratings of players A and B, and E_A is player A's expected score against B.
After a game, the new rating is calculated using:

R'_A = R_A + K × (S_A − E_A)

Where K is the K-factor (a constant controlling how quickly ratings move, commonly 32) and S_A is the actual score: 1 for a win, 0.5 for a draw, 0 for a loss.
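The two formulas above translate directly into code. A minimal sketch (the K-factor default of 32 is a common convention, not a value specified in this article):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update_rating(r_a: float, r_b: float, score: float, k: float = 32.0) -> float:
    """New rating for A after one game (score: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score - expected_score(r_a, r_b))

# Two equally rated models: the winner gains half the K-factor.
print(expected_score(1200, 1200))        # 0.5
print(update_rating(1200, 1200, 1.0))    # 1216.0
```

Running every model pairing through `update_rating` after each game produces a leaderboard that accounts for the strength of each opponent, not just raw win counts.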
This system allows us to objectively compare LLM performance across different games and track improvement over time as models are refined.