How LLMs Learn to Play Games
Games have been the benchmark for artificial intelligence since the beginning. From Deep Blue to AlphaGo, the question "can a machine play this?" has pushed the field forward for decades. But large language models are different: they don't see pixels or evaluate board positions. They read and write text.
So how do you get a text model to play a visual game? The answer turns out to be surprisingly elegant: structured outputs.
The Key Insight
Instead of asking an LLM to freestyle its response, which leads to hallucinated moves, invalid coordinates, and verbose explanations, you give it a JSON schema that defines exactly what a valid move looks like.
The schema acts as a constraint. The model must respond with a row, a column, and optionally its reasoning. Nothing else. No "As an AI language model, I'd like to play...", just the move.
Click between the tabs above. The schema defines the shape; the response fills it. Because the output is constrained, the model can't hallucinate an invalid move: every response is a row, a column, and nothing else.
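As a concrete sketch, here is what such a move schema and a validator for it might look like in Python. The schema shape and the field names (`row`, `col`, `reasoning`) are illustrative assumptions for this sketch, not taken from any particular provider's API:

```python
import json

# Illustrative move schema; the field names are assumptions for this sketch.
MOVE_SCHEMA = {
    "type": "object",
    "properties": {
        "row": {"type": "integer", "minimum": 0, "maximum": 2},
        "col": {"type": "integer", "minimum": 0, "maximum": 2},
        "reasoning": {"type": "string"},
    },
    "required": ["row", "col"],
    "additionalProperties": False,
}

def is_valid_move(raw: str) -> bool:
    """Enforce the schema's core constraints on a raw model response."""
    try:
        move = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(move, dict):
        return False
    if set(move) - {"row", "col", "reasoning"}:  # additionalProperties: false
        return False
    if not all(k in move for k in ("row", "col")):  # required fields
        return False
    return all(isinstance(move[k], int) and 0 <= move[k] <= 2
               for k in ("row", "col"))

print(is_valid_move('{"row": 1, "col": 2, "reasoning": "block the column"}'))  # True
print(is_valid_move('{"row": 5, "col": 0}'))  # False: row out of range
```

In practice the provider enforces the schema at generation time; a validator like this is still useful as a backstop before applying the move to the board.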
Reading the Board
The model also needs to see the game state. For a grid-based game like tic-tac-toe, we encode the board as a 2D array of strings: "X", "O", or "_" for empty.
Try it yourself: click any cell to place an X. Watch how the game state updates in the JSON on the right. This is exactly what the LLM sees as input and produces as output.
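The demo's encoding takes only a few lines to reproduce. A minimal sketch in plain Python, where the `empty_cells` helper is a hypothetical name added for illustration:

```python
import json

# A 3x3 position in the article's encoding: "X", "O", or "_" for empty.
board = [
    ["X", "O", "_"],
    ["_", "X", "_"],
    ["_", "_", "O"],
]

def empty_cells(board):
    """The legal moves are exactly the cells still holding '_'."""
    return [(r, c) for r, row in enumerate(board)
                   for c, cell in enumerate(row) if cell == "_"]

# This JSON string is what the model receives as the game state.
state = json.dumps({"board": board})
print(state)
print(empty_cells(board))
```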
The board representation is deliberately simple. LLMs excel at reasoning over structured text, and a 3×3 grid of single characters fits easily within a prompt. For more complex games such as Connect 4, chess, or Go, the same principle applies, just with larger grids and more piece types.
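To see how the encoding generalizes: the same JSON serialization works for any rectangular grid, only the dimensions and the piece vocabulary change. The `board_prompt` helper below is a hypothetical sketch, not an API from the article:

```python
import json

def board_prompt(board, pieces):
    """Serialize any rectangular grid into a compact JSON prompt fragment.
    `pieces` documents the legal symbols so the model knows the vocabulary."""
    return json.dumps({
        "pieces": pieces,
        "rows": len(board),
        "cols": len(board[0]),
        "board": board,
    })

# Same encoding, Connect 4-sized grid (6 rows x 7 columns), just more cells.
connect4 = [["_"] * 7 for _ in range(6)]
connect4[5][3] = "R"  # a red piece dropped into the middle column
print(board_prompt(connect4, ["R", "Y", "_"]))
```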
What Actually Works
Not all models are equal. We tested six models on 200 games of tic-tac-toe against a random opponent, all using identical prompts and schemas. The pattern is clear: reasoning models significantly outperform completion models. Claude 3.5 Sonnet and GPT-4o both achieve near-optimal play, while smaller models like GPT-3.5 and Llama 3 struggle to maintain a consistent strategy.
The interesting finding isn't that bigger models play better; it's how they play better. The reasoning field in the schema gives models a place to think out loud. The best models use it to evaluate threats, plan forks, and reason about their opponent's strategy. Smaller models tend to describe the board state without analyzing it.
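To make the division of labor concrete, here is a hypothetical response from a stronger model: the `reasoning` field carries the threat analysis, while only `row` and `col` affect the game. The response text is invented for illustration:

```python
import json

# Hypothetical structured response; the reasoning string is invented.
response = json.loads(
    '{"row": 2, "col": 0, '
    '"reasoning": "O threatens the left column; blocking at (2,0) '
    'also opens a diagonal fork."}'
)

def apply_move(board, move, piece):
    """Play the move on the board; reasoning is kept only for analysis."""
    board[move["row"]][move["col"]] = piece
    return board

board = [["X", "O", "_"], ["_", "X", "_"], ["_", "_", "O"]]
apply_move(board, response, "X")
print(response["reasoning"])  # logged for analysis, never replayed as a move
```

Keeping the reasoning out of the move-application path means a verbose or confused rationale can never corrupt the game state; it is purely observational.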
What's Next
Tic-tac-toe is a solved game; the real test is scaling this approach to games with larger state spaces. Connect 4, Reversi, and card games each introduce new challenges: deeper search trees, hidden information, and probabilistic outcomes.
The structured output approach scales surprisingly well. The schema gets more complex, but the fundamental pattern (constrain the output, encode the state, let the model reason) remains the same.
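As a sketch of how little changes, here is the same constrain-the-output pattern adapted to Connect 4, where a move is just a column choice. The schema and the `is_legal` helper are illustrative assumptions:

```python
# Sketch: the same pattern for Connect 4; only the legal-move shape changes.
CONNECT4_MOVE_SCHEMA = {
    "type": "object",
    "properties": {
        "column": {"type": "integer", "minimum": 0, "maximum": 6},
        "reasoning": {"type": "string"},
    },
    "required": ["column"],
    "additionalProperties": False,
}

def is_legal(move, board):
    """A column is playable only while its top cell is still empty."""
    return 0 <= move["column"] <= 6 and board[0][move["column"]] == "_"

board = [["_"] * 7 for _ in range(6)]
print(is_legal({"column": 3}, board))  # True: the middle column is open
```

The state encoding, the reasoning field, and the validation loop all carry over unchanged; only the move shape is game-specific.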
More on this as the experiments continue.