LLM Games: Structured Outputs
As much as I love chess, I will not be evaluating the agent on it. Chess is unique in that it has already been heavily modeled, with large datasets available all over the internet. LLMs are trained on chess data from these datasets, which skews the results and does not tell us whether the LLM can reason about its next move without relying on its pretrained memory of the game. Example: 1. e4 e5 2. ... (PGN chess notation; LLMs are great at predicting the next token here)
Game Analysis
The performance of LLMs across different games reveals interesting patterns in their reasoning capabilities. While simple games like Tic-Tac-Toe show near-perfect performance, more complex games like Connect 4 expose the limitations of current models.
This analysis focuses on structured outputs and how they enable better game-playing performance through clear formatting and rule adherence.
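As a minimal sketch of what a structured move output could look like, the snippet below parses and validates a JSON move for Tic-Tac-Toe. The field names (`reasoning`, `row`, `col`) and the bounds check are illustrative assumptions, not a format defined in this article:

```python
import json

def parse_move(raw: str) -> tuple[int, int]:
    """Parse and validate a model's JSON move; raises on malformed output.

    Expected (hypothetical) shape:
    {"reasoning": "...", "row": 0-2, "col": 0-2}
    """
    move = json.loads(raw)          # raises ValueError on invalid JSON
    row, col = move["row"], move["col"]
    if not (0 <= row <= 2 and 0 <= col <= 2):
        raise ValueError(f"move out of bounds: {row}, {col}")
    return row, col

# Example: a well-formed model response
print(parse_move('{"reasoning": "take the center", "row": 1, "col": 1}'))
```

Forcing the model to emit a fixed schema like this is what makes rule adherence checkable: any response that fails to parse or validate can be counted as an illegal move rather than silently accepted.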
The Elo rating system calculates the relative skill levels of players in competitive games. In our LLM game evaluation, we use Elo ratings to measure how well different models perform against each other.
The expected score formula is:

E_A = 1 / (1 + 10^((R_B − R_A) / 400))

Where R_A and R_B are the current ratings of players A and B, and E_A is player A's expected score against B.
After a game, the new rating is calculated using:

R'_A = R_A + K × (S_A − E_A)

Where K is the K-factor (a constant controlling how quickly ratings move, commonly 32) and S_A is the actual score: 1 for a win, 0.5 for a draw, 0 for a loss.
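The two formulas above translate directly into code. A minimal sketch (the K-factor default of 32 is a common convention, not a value specified in this article):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update_rating(r_a: float, r_b: float, score: float, k: float = 32.0) -> float:
    """New rating for A after one game (score: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score - expected_score(r_a, r_b))

# Two equally rated models: the winner gains half the K-factor.
print(expected_score(1200, 1200))        # 0.5
print(update_rating(1200, 1200, 1.0))    # 1216.0
```

Running every model pairing through `update_rating` after each game produces a leaderboard that accounts for the strength of each opponent, not just raw win counts.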
This system allows us to objectively compare LLM performance across different games and track improvement over time as models are refined.