How LLMs Learn to Play Games
Games have been the benchmark for artificial intelligence since the beginning. From Deep Blue to AlphaGo, the question "can a machine play this?" has pushed the field forward for decades. But large language models are different: they don't see pixels or evaluate board positions. They read and write text.
So how do you get a text model to play a visual game? The answer turns out to be surprisingly elegant: structured outputs.
The Key Insight
Instead of asking an LLM to freestyle its response, which leads to hallucinated moves, invalid coordinates, and verbose explanations, you give it a JSON schema that defines exactly what a valid move looks like.
The schema acts as a constraint. The model must respond with a row, a column, and optionally its reasoning. Nothing else. No "As an AI language model, I'd like to play...", just the move.
Click between the tabs above. The schema defines the shape; the response fills it. Because the output is constrained, the model can't hallucinate an invalid move: every response is a row, a column, and nothing else.
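As a concrete sketch, here is what such a move schema and a validator for it might look like in Python. The schema shape and the field names (`row`, `col`, `reasoning`) are illustrative assumptions for this sketch, not taken from any particular provider's API:

```python
import json

# Illustrative move schema; the field names are assumptions for this sketch.
MOVE_SCHEMA = {
    "type": "object",
    "properties": {
        "row": {"type": "integer", "minimum": 0, "maximum": 2},
        "col": {"type": "integer", "minimum": 0, "maximum": 2},
        "reasoning": {"type": "string"},
    },
    "required": ["row", "col"],
    "additionalProperties": False,
}

def is_valid_move(raw: str) -> bool:
    """Enforce the schema's core constraints on a raw model response."""
    try:
        move = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(move, dict):
        return False
    if set(move) - {"row", "col", "reasoning"}:  # additionalProperties: false
        return False
    if not all(k in move for k in ("row", "col")):  # required fields
        return False
    return all(isinstance(move[k], int) and 0 <= move[k] <= 2
               for k in ("row", "col"))

print(is_valid_move('{"row": 1, "col": 2, "reasoning": "block the column"}'))  # True
print(is_valid_move('{"row": 5, "col": 0}'))  # False: row out of range
```

In practice the provider enforces the schema at generation time; a validator like this is still useful as a backstop before applying the move to the board.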
Reading the Board
The model also needs to see the game state. For a grid-based game like tic-tac-toe, we encode the board as a 2D array of strings: "X", "O", or "_" for empty.
Try it yourself: click any cell to place an X. Watch how the game state updates in the JSON on the right. This is exactly what the LLM sees as input and produces as output.
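The demo's encoding takes only a few lines to reproduce. A minimal sketch in plain Python, where the `empty_cells` helper is a hypothetical name added for illustration:

```python
import json

# A 3x3 position in the article's encoding: "X", "O", or "_" for empty.
board = [
    ["X", "O", "_"],
    ["_", "X", "_"],
    ["_", "_", "O"],
]

def empty_cells(board):
    """The legal moves are exactly the cells still holding '_'."""
    return [(r, c) for r, row in enumerate(board)
                   for c, cell in enumerate(row) if cell == "_"]

# This JSON string is what the model receives as the game state.
state = json.dumps({"board": board})
print(state)
print(empty_cells(board))
```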
The board representation is deliberately simple. LLMs excel at reasoning over structured text, and a 3×3 grid of single characters fits easily within a prompt. For more complex games such as Connect 4, chess, or Go, the same principle applies, just with larger grids and more piece types.
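To see how the encoding generalizes: the same JSON serialization works for any rectangular grid, only the dimensions and the piece vocabulary change. The `board_prompt` helper below is a hypothetical sketch, not an API from the article:

```python
import json

def board_prompt(board, pieces):
    """Serialize any rectangular grid into a compact JSON prompt fragment.
    `pieces` documents the legal symbols so the model knows the vocabulary."""
    return json.dumps({
        "pieces": pieces,
        "rows": len(board),
        "cols": len(board[0]),
        "board": board,
    })

# Same encoding, Connect 4-sized grid (6 rows x 7 columns), just more cells.
connect4 = [["_"] * 7 for _ in range(6)]
connect4[5][3] = "R"  # a red piece dropped into the middle column
print(board_prompt(connect4, ["R", "Y", "_"]))
```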
What Actually Works
Not all models are equal. We tested six models on 200 games of tic-tac-toe against a random opponent, all using identical prompts and schemas. The pattern is clear: reasoning models significantly outperform completion models. Claude 3.5 Sonnet and GPT-4o both achieve near-optimal play, while smaller models like GPT-3.5 and Llama 3 struggle to maintain a consistent strategy.
The interesting finding isn't that bigger models play better; it's how they play better. The reasoning field in the schema gives models a place to think out loud. The best models use it to evaluate threats, plan forks, and reason about their opponent's strategy. Smaller models tend to describe the board state without analyzing it.
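To make the division of labor concrete, here is a hypothetical response from a stronger model: the `reasoning` field carries the threat analysis, while only `row` and `col` affect the game. The response text is invented for illustration:

```python
import json

# Hypothetical structured response; the reasoning string is invented.
response = json.loads(
    '{"row": 2, "col": 0, '
    '"reasoning": "O threatens the left column; blocking at (2,0) '
    'also opens a diagonal fork."}'
)

def apply_move(board, move, piece):
    """Play the move on the board; reasoning is kept only for analysis."""
    board[move["row"]][move["col"]] = piece
    return board

board = [["X", "O", "_"], ["_", "X", "_"], ["_", "_", "O"]]
apply_move(board, response, "X")
print(response["reasoning"])  # logged for analysis, never replayed as a move
```

Keeping the reasoning out of the move-application path means a verbose or confused rationale can never corrupt the game state; it is purely observational.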
What's Next
Tic-tac-toe is a solved game; the real test is scaling this approach to games with larger state spaces. Connect 4, Reversi, and card games each introduce new challenges: deeper search trees, hidden information, and probabilistic outcomes.
The structured output approach scales surprisingly well. The schema gets more complex, but the fundamental pattern (constrain the output, encode the state, let the model reason) remains the same.
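As a sketch of how little changes, here is the same constrain-the-output pattern adapted to Connect 4, where a move is just a column choice. The schema and the `is_legal` helper are illustrative assumptions:

```python
# Sketch: the same pattern for Connect 4; only the legal-move shape changes.
CONNECT4_MOVE_SCHEMA = {
    "type": "object",
    "properties": {
        "column": {"type": "integer", "minimum": 0, "maximum": 6},
        "reasoning": {"type": "string"},
    },
    "required": ["column"],
    "additionalProperties": False,
}

def is_legal(move, board):
    """A column is playable only while its top cell is still empty."""
    return 0 <= move["column"] <= 6 and board[0][move["column"]] == "_"

board = [["_"] * 7 for _ in range(6)]
print(is_legal({"column": 3}, board))  # True: the middle column is open
```

The state encoding, the reasoning field, and the validation loop all carry over unchanged; only the move shape is game-specific.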
More on this as the experiments continue.