Can a smaller model beat a larger one, just by writing its own rules?
Google DeepMind
March 5, 2026
Large language models (LLMs) are remarkably capable at coding and math, but when acting as game-playing agents they often make illegal moves. In a recent Kaggle chess competition, 78% of Gemini 2.5 Flash's losses were due to rule violations, not bad strategy. People normally write "harnesses" (wrapper code that validates moves) by hand. AutoHarness shows that Gemini 2.5 Flash can write its own harness automatically, using a small number of iterative code-refinement rounds. The resulting harness prevents all illegal moves across 145 TextArena games, enabling the smaller Flash model to outperform the larger Gemini 2.5 Pro while also being more cost-effective.
AutoHarness achieved 100% legal action rate across all 145 TextArena games (both 1-player and 2-player), verified on 1,000 test rollouts per game.
Gemini 2.5 Flash + Harness wins 9/16 two-player games against the much larger Gemini 2.5 Pro (overall win rate 56.3% vs. Pro's 38.2%).
Harness-as-Policy generates a pure Python policy: no LLM is needed at test time. It reaches an average reward of 0.870 on 16 one-player games, beating GPT-5.2-High (0.844) at near-zero compute cost.
LLMs have shown remarkable ability at coding and solving math problems. However, their planning and reasoning performance as agents can be brittle. In the recent Kaggle GameArena chess competition, 78% of losses by Gemini 2.5 Flash were attributed not to bad strategy, but to simple illegal moves: moves strictly prohibited by the rules of chess.
Traditional fixes include manually writing "harness" code to filter invalid moves, or fine-tuning on game trajectories. But manual harnesses are brittle and labor-intensive, requiring new work for every game, and fine-tuning flagship-scale models is expensive and can degrade performance on other tasks. AutoHarness takes a different approach: use the LLM's own code-generation capability to write and refine the harness automatically.
An agent is the combination of an LLM and a harness that acts as "glue" between the model and the task. In AutoHarness, the LLM completes the agent by coding its own harness. The harness has two key functions: propose_action(obs) generates candidate moves, and is_legal_action(obs, action) verifies legality. This turns the model into a rejection sampler: it keeps proposing until a legal action is found.
```python
def propose_action(obs):
    # Returns the agent's next move
    ...

def is_legal_action(obs, action):
    # Returns True if the action is legal
    ...
```
propose_action(obs) acts as the proposal distribution: it generates a candidate move. is_legal_action(obs, action) is the acceptance criterion: it checks whether the move satisfies the game's rules. If the move is rejected, the LLM is re-prompted and proposes again until a legal move is found.

AutoHarness formulates harness generation as a search over program space, guided by Thompson sampling. The LLM acts as a mutation operator, proposing code refinements based on feedback from environment execution. A tree search balances exploration (trying distinct logic structures) and exploitation (refining a partially working harness).
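This propose-and-check loop is plain rejection sampling. A minimal sketch, where the proposer and legality rule are toy stand-ins for game-specific logic or an LLM call:

```python
import random

def propose_action(obs):
    # Toy proposer: pick a column in [0, 4]. A real harness would use
    # game-specific move generation or an LLM call here.
    return random.randint(0, 4)

def is_legal_action(obs, action):
    # Toy acceptance rule: only columns 0-2 exist on this board.
    return 0 <= action <= 2

def act(obs, max_retries=100):
    # Rejection sampling: keep proposing until a legal action is accepted.
    for _ in range(max_retries):
        action = propose_action(obs)
        if is_legal_action(obs, action):
            return action
    raise RuntimeError("no legal action found within retry budget")
```

In the real system the retry step re-prompts the LLM with feedback rather than resampling blindly, but the control flow is the same.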
AutoHarness maintains multiple code hypotheses in a tree structure, using Thompson sampling to choose which node to refine next. The heuristic value for each node is the average legal-move accuracy achieved by that code version. When the code has a bug (is_legal_action() returns True but the move is actually illegal), both propose_action() and is_legal_action() are refined. When is_legal_action() correctly returns False on an illegal proposal, only propose_action() is fixed.
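One common way to implement this node selection is Thompson sampling with a Beta posterior per node, built from its legal-move successes and failures. This is an illustrative sketch under that assumption; the paper's exact heuristic weighting may differ:

```python
import random

def select_node(nodes):
    # Thompson sampling over harness candidates: each node tracks how many
    # rollout steps were legal (successes) vs. illegal (failures). Sampling
    # from each node's Beta posterior and picking the argmax balances
    # exploring new code branches against exploiting the best harness so far.
    best, best_draw = None, -1.0
    for node in nodes:
        draw = random.betavariate(node["successes"] + 1, node["failures"] + 1)
        if draw > best_draw:
            best, best_draw = node, draw
    return best
```

A node with 90 legal and 10 illegal steps is selected far more often than one with the reverse counts, yet the weaker node still receives occasional exploration.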
Why two functions, propose and is_legal?

- propose_action(obs) handles generation: produce a candidate move. It can be as simple as enumerating all legal board positions, or as complex as a heuristic policy.
- is_legal_action(obs, action) handles verification: return True only if the move is allowed by the game rules.

The split also localizes bugs for refinement:

- If is_legal_action returns True for an actually illegal move (a false positive), the checker doesn't know the rules; both functions are likely wrong, so both are refined.
- If is_legal_action correctly returns False, only the proposer generated a bad move, so only propose_action is fixed.
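The error-attribution rule can be written as a small dispatch function. This is a sketch of the rule as described, not the paper's implementation:

```python
def refinement_targets(env_says_legal, checker_says_legal):
    # Decide which harness functions to refine after a rollout step.
    if checker_says_legal and not env_says_legal:
        # False positive: the checker accepted a move the environment
        # rejected, so the checker doesn't know the rules. Refine both.
        return ["propose_action", "is_legal_action"]
    if not checker_says_legal:
        # The checker correctly caught a bad proposal: only the proposer
        # is at fault.
        return ["propose_action"]
    # Legal move, correctly accepted: nothing to refine.
    return []
```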
AutoHarness supports three modes of operation, from lightweight action filtering to a full code-only policy:
| Mode | LLM at test time? | Harness role |
|---|---|---|
| Action Filter | Yes (selects from legal set) | Enumerates legal moves only |
| Action Verifier | Yes (proposes, then retries) | Validates each proposed move |
| Policy | No | Directly outputs the action |
Action Filter: propose_action() generates the set of legal moves. The LLM then ranks them and selects the best one using chain-of-thought reasoning.
Action Verifier: the LLM proposes a move and is_legal_action() verifies it. If the move is invalid, the LLM is re-prompted with an "illegal action" warning message. This is the primary approach evaluated in the paper.
Policy: the most extreme case, in which code directly selects the next action with no LLM call needed at test time. Pure Python, near-zero inference cost, and the highest average performance.
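The three modes can be contrasted in a single dispatcher. Here `harness` and `llm` are hypothetical stand-in objects for illustration, not APIs from the paper:

```python
def run_step(mode, obs, harness, llm):
    # One environment step under each harness mode (sketch).
    if mode == "filter":
        # Action Filter: the harness enumerates legal moves; the LLM only
        # ranks them, so an illegal move is impossible by construction.
        legal = harness.enumerate_legal(obs)
        return llm.choose(obs, legal)
    if mode == "verifier":
        # Action Verifier: propose, check, re-prompt on rejection.
        action = llm.propose(obs)
        while not harness.is_legal_action(obs, action):
            action = llm.propose(obs, warning="illegal action")
        return action
    if mode == "policy":
        # Harness-as-Policy: pure code, no LLM call at test time.
        return harness.propose_action(obs)
    raise ValueError(f"unknown mode: {mode}")
```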
Training uses 10 parallel environments per iteration, rolling out up to 1,000 steps. Rollout terminates whenever an illegal move is made or code execution fails. The Critic consolidates up to 5 failed steps and feeds them to the Refiner to generate improved code. Heuristic weight is set to 1.0 for Thompson sampling. Training ends when the legal action success rate reaches 1.0, or on timeout. Gemini-2.5-Flash is used for all training.
On average, training ends after 14.5 tree-search iterations, and 19 of the 32 evaluation games converge in fewer than 10 iterations. The games requiring the most LLM calls to learn are GermanWhist-v0 (43 steps), Cryptarithm-v0 (45 steps), Othello-v0 (62 steps), and Chess-v0 (64 steps). AutoHarness achieves a 100% legal action success rate on all 145 games, as shown in Appendix Table 1.
Breakthrough-v0-small (a simplified variant) required 136 steps, more than full Chess. "Small" does not mean simpler rules; the variant's modified capture mechanics created unexpected edge cases.

Evaluation focuses on 16 one-player (1P) and 16 two-player (2P) games from TextArena. Three agents are compared: Gemini-2.5-Flash, Gemini-2.5-Pro, and Gemini-2.5-Flash+Harness (our method). The same optimized prompt is used in all experiments. For 1P games, 20 matches are run and average reward is used as the metric. For 2P games, 40 matches are run (split evenly between first and second player), with win/draw/loss rate as the metric.
AutoHarness enables the smaller Gemini-2.5-Flash to win 9/16 two-player games against the much larger Gemini-2.5-Pro (overall win rate 56.3% vs. Pro's 38.2%). Against vanilla Gemini-2.5-Flash (no harness), the win rate rises to 64.8% (12/16 games).
Our approach achieves a higher reward than Gemini-2.5-Pro in 8/16 games, and ties in 5/16 games. Average reward: AutoHarness 0.745 vs. Gemini-2.5-Pro 0.707 vs. Gemini-2.5-Flash 0.673. Notable improvements on Cryptarithm-v0, RushHour-v0, PegJump-v0, and FifteenPuzzle-v0.
As an extreme case, AutoHarness can learn the entire policy as code, dispensing with the LLM at test time entirely. The policy code uses primitive Python functions and standard libraries (e.g., numpy); no LLM calls are needed during gameplay. This is evaluated on the 16 one-player games (2P games require strategic opponent modeling, which is much harder to encode as pure code).
For training, the heuristic value is modified to include task reward: \(H = 0\) if an illegal action is taken, and \(H = 0.5 + 0.5r\) otherwise, where \(r \in [0, 1]\) is the environment reward available at the end of a trajectory. Training uses Gemini-2.5-Flash with up to 256 iterations. On average, training takes 89.4 iterations and achieves a heuristic value of 0.939.
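The modified heuristic follows directly from the formula above:

```python
def heuristic_value(illegal_action_taken, reward):
    # H = 0 if any illegal action was taken during the trajectory;
    # otherwise H = 0.5 + 0.5 * r, where r in [0, 1] is the environment
    # reward available at the end of the trajectory.
    if illegal_action_taken:
        return 0.0
    return 0.5 + 0.5 * reward
```

This keeps any fully legal trajectory (H >= 0.5) strictly above any trajectory containing an illegal move (H = 0), so legality is never traded away for reward.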
| Agent | Avg. Reward | Test Cost |
|---|---|---|
| Gemini-2.5-Flash | 0.673 | n/a |
| Gemini-2.5-Pro | 0.707 | n/a |
| Gemini-2.5-Flash+Harness (Ours) | 0.745 | ~$0 |
| GPT-5.2 | 0.635 | ~$640 |
| GPT-5.2-High | 0.844 | ~$640 |
| Harness-as-Policy (Ours) | 0.870 | ~$0 |
Table 1: All 145 TextArena games, with number of LLM calls needed to learn the harness, and the resulting legal action accuracy. Games marked with * are used for end-to-end agent evaluation.
| Index | Game | # Players | # Learning Steps | Legal Action Rate |
|---|---|---|---|---|
| 0 | 2048-v0 * | 1 | 27 | 1.0 |
| 1 | 2048-v0-easy | 1 | 4 | 1.0 |
| 2 | 2048-v0-extreme | 1 | 44 | 1.0 |
| 3 | 2048-v0-hard | 1 | 47 | 1.0 |
| 4 | 2048-v0-mega-easy | 1 | 31 | 1.0 |
| 5 | 2048-v0-super-easy | 1 | 6 | 1.0 |
| 6 | 2048-v0-ultra-easy | 1 | 2 | 1.0 |
| 7 | 2048-v0-very-easy | 1 | 57 | 1.0 |
| 8 | 2048-v0-very-hard | 1 | 7 | 1.0 |
| 9 | Alquerque-v0 * | 2 | 4 | 1.0 |
| 10 | Bandit-v0 * | 1 | 2 | 1.0 |
| 11 | Bandit-v0-hard | 1 | 1 | 1.0 |
| 12 | Battleship-v0 | 2 | 4 | 1.0 |
| 13 | Battleship-v0-extreme | 2 | 32 | 1.0 |
| 14 | Battleship-v0-large | 2 | 9 | 1.0 |
| 15 | Battleship-v0-standard | 2 | 6 | 1.0 |
| 16 | Blackjack-v0 * | 1 | 2 | 1.0 |
| 17 | Blackjack-v0-long | 1 | 1 | 1.0 |
| 18 | Breakthrough-v0 * | 2 | 2 | 1.0 |
| 19 | Breakthrough-v0-blind | 2 | 20 | 1.0 |
| 20 | Breakthrough-v0-large | 2 | 9 | 1.0 |
| 21 | Breakthrough-v0-long | 2 | 7 | 1.0 |
| 22 | Breakthrough-v0-small | 2 | 136 | 1.0 |
| 23 | Breakthrough-v0-tiny | 2 | 5 | 1.0 |
| 24 | Briscola-v0 | 2 | 2 | 1.0 |
| 25 | Checkers-v0 * | 2 | 7 | 1.0 |
| 26 | Checkers-v0-long | 2 | 3 | 1.0 |
| 27 | Chess-v0 * | 2 | 64 | 1.0 |
| 28 | Chess-v0-blind | 2 | 19 | 1.0 |
| 29 | Chess-v0-long | 2 | 16 | 1.0 |
| 30 | Chopsticks-v0 * | 2 | 15 | 1.0 |
| 31 | Chopsticks-v0-long | 2 | 7 | 1.0 |
| 32 | Chopsticks-v0-medium | 2 | 15 | 1.0 |
| 33 | ColonelBlotto-v0 | 2 | 1 | 1.0 |
| 34 | ColonelBlotto-v0-extreme | 2 | 1 | 1.0 |
| 35 | ColonelBlotto-v0-large | 2 | 1 | 1.0 |
| 36 | ColonelBlotto-v0-small | 2 | 1 | 1.0 |
| 37 | ConnectFour-v0 | 2 | 10 | 1.0 |
| 38 | ConnectFour-v0-blind | 2 | 2 | 1.0 |
| 39 | ConnectFour-v0-large | 2 | 1 | 1.0 |
| 40 | Crusade-v0 * | 2 | 4 | 1.0 |
| 41 | Cryptarithm-v0 * | 1 | 45 | 1.0 |
| 42 | FifteenPuzzle-v0 * | 1 | 3 | 1.0 |
| 43 | FrozenLake-v0 * | 1 | 19 | 1.0 |
| 44 | FrozenLake-v0-hardcore | 1 | 4 | 1.0 |
| 45 | FrozenLake-v0-random | 1 | 22 | 1.0 |
| 46 | GameOfPureStrategy-v0 | 2 | 3 | 1.0 |
| 47 | GermanWhist-v0 * | 2 | 43 | 1.0 |
| 48 | Golf-v0 * | 2 | 8 | 1.0 |
| 49 | Golf-v0-medium | 2 | 9 | 1.0 |
| 50 | GuessTheNumber-v0 * | 1 | 2 | 1.0 |
| 51 | GuessTheNumber-v0-hardcore | 1 | 2 | 1.0 |
| 52 | HighSociety-v0 | 2 | 3 | 1.0 |
| 53 | IndianPoker-v0 | 2 | 11 | 1.0 |
| 54 | IndianPoker-v0-extreme | 2 | 2 | 1.0 |
| 55 | IndianPoker-v0-long | 2 | 26 | 1.0 |
| 56 | IndianPoker-v0-medium | 2 | 7 | 1.0 |
| 57 | IndianPoker-v0-short | 2 | 2 | 1.0 |
| 58 | IteratedMatchingPennies-v0 | 2 | 1 | 1.0 |
| 59 | IteratedRockPaperScissors-v0 | 2 | 1 | 1.0 |
| 60 | IteratedTwoThirdsAverage-v0 | 2 | 1 | 1.0 |
| 61 | KuhnPoker-v0 | 2 | 5 | 1.0 |
| 62 | KuhnPoker-v0-extreme | 2 | 3 | 1.0 |
| 63 | KuhnPoker-v0-long | 2 | 2 | 1.0 |
| 64 | KuhnPoker-v0-medium | 2 | 2 | 1.0 |
| 65 | KuhnPoker-v0-short | 2 | 3 | 1.0 |
| 66 | LiarsDice-v0 * | 2 | 4 | 1.0 |
| 67 | LiarsDice-v0-large | 2 | 6 | 1.0 |
| 68 | LiarsDice-v0-small | 2 | 5 | 1.0 |
| 69 | LightsOut-v0 * | 1 | 1 | 1.0 |
| 70 | LinesOfAction-v0 * | 2 | 23 | 1.0 |
| 71 | Mastermind-v0 * | 1 | 2 | 1.0 |
| 72 | Mastermind-v0-extreme | 1 | 1 | 1.0 |
| 73 | Mastermind-v0-hard | 1 | 2 | 1.0 |
| 74 | MemoryGame-v0 | 2 | 3 | 1.0 |
| 75 | MemoryGame-v0-hard | 2 | 2 | 1.0 |
| 76 | MemoryGame-v0-medium | 2 | 2 | 1.0 |
| 77 | Minesweeper-v0 * | 1 | 11 | 1.0 |
| 78 | Minesweeper-v0-hard | 1 | 6 | 1.0 |
| 79 | Minesweeper-v0-medium | 1 | 10 | 1.0 |
| 80 | Minesweeper-v0-small | 1 | 2 | 1.0 |
| 81 | NewRecruit-v0 * | 2 | 2 | 1.0 |
| 82 | Nim-v0 | 2 | 1 | 1.0 |
| 83 | Nim-v0-large | 2 | 2 | 1.0 |
| 84 | Nim-v0-medium | 2 | 2 | 1.0 |
| 85 | Othello-v0 * | 2 | 62 | 1.0 |
| 86 | Othello-v0-big | 2 | 2 | 1.0 |
| 87 | Othello-v0-hard | 2 | 30 | 1.0 |
| 88 | Othello-v0-huge | 2 | 12 | 1.0 |
| 89 | Othello-v0-small | 2 | 5 | 1.0 |
| 90 | Othello-v0-tiny | 2 | 13 | 1.0 |
| 91 | PegJump-v0 * | 1 | 1 | 1.0 |
| 92 | PigDice-v0 | 2 | 1 | 1.0 |
| 93 | PigDice-v0-100 | 2 | 1 | 1.0 |
| 94 | PigDice-v0-150 | 2 | 1 | 1.0 |
| 95 | PigDice-v0-200 | 2 | 1 | 1.0 |
| 96 | PigDice-v0-250 | 2 | 1 | 1.0 |
| 97 | PigDice-v0-300 | 2 | 1 | 1.0 |
| 98 | PigDice-v0-350 | 2 | 1 | 1.0 |
| 99 | PigDice-v0-400 | 2 | 1 | 1.0 |
| 100 | PigDice-v0-450 | 2 | 1 | 1.0 |
| 101 | PigDice-v0-50 | 2 | 1 | 1.0 |
| 102 | PigDice-v0-500 | 2 | 1 | 1.0 |
| 103 | PigDice-v0-long | 2 | 1 | 1.0 |
| 104 | PigDice-v0-short | 2 | 1 | 1.0 |
| 105 | Poker-v0 | 2 | 17 | 1.0 |
| 106 | Poker-v0-extreme | 2 | 7 | 1.0 |
| 107 | Poker-v0-long | 2 | 5 | 1.0 |
| 108 | Poker-v0-small | 2 | 29 | 1.0 |
| 109 | QuantumTicTacToe-v0 | 2 | 12 | 1.0 |
| 110 | ReverseTicTacToe-v0 | 2 | 3 | 1.0 |
| 111 | RushHour-v0 * | 1 | 3 | 1.0 |
| 112 | SantoriniBaseFixed-v0 | 2 | 30 | 1.0 |
| 113 | Secretary-v0 * | 1 | 1 | 1.0 |
| 114 | Secretary-v0-long | 1 | 1 | 1.0 |
| 115 | SimpleTak-v0 | 2 | 4 | 1.0 |
| 116 | SimpleTak-v0-extreme | 2 | 8 | 1.0 |
| 117 | SimpleTak-v0-large | 2 | 12 | 1.0 |
| 118 | SimpleTak-v0-medium | 2 | 5 | 1.0 |
| 119 | Snake-v0 | 2 | 1 | 1.0 |
| 120 | Snake-v0-large | 2 | 1 | 1.0 |
| 121 | Snake-v0-standard | 2 | 1 | 1.0 |
| 122 | Sokoban-v0 * | 1 | 5 | 1.0 |
| 123 | Sokoban-v0-medium | 1 | 1 | 1.0 |
| 124 | SpiteAndMalice-v0 * | 2 | 33 | 1.0 |
| 125 | Stratego-v0 * | 2 | 23 | 1.0 |
| 126 | Sudoku-v0 * | 1 | 5 | 1.0 |
| 127 | Sudoku-v0-easy | 1 | 5 | 1.0 |
| 128 | Sudoku-v0-hard | 1 | 9 | 1.0 |
| 129 | Sudoku-v0-medium | 1 | 4 | 1.0 |
| 130 | Sudoku-v0-very-easy | 1 | 4 | 1.0 |
| 131 | Surround-v0 | 2 | 1 | 1.0 |
| 132 | Surround-v0-large | 2 | 1 | 1.0 |
| 133 | Surround-v0-standard | 2 | 1 | 1.0 |
| 134 | Tak-v0 * | 2 | 21 | 1.0 |
| 135 | Tak-v0-hard | 2 | 53 | 1.0 |
| 136 | Tak-v0-medium | 2 | 6 | 1.0 |
| 137 | TicTacToe-v0 | 2 | 4 | 1.0 |
| 138 | TowerOfHanoi-v0 * | 1 | 7 | 1.0 |
| 139 | TowerOfHanoi-v0-extreme | 1 | 44 | 1.0 |
| 140 | TowerOfHanoi-v0-hard | 1 | 7 | 1.0 |
| 141 | TowerOfHanoi-v0-hardcore | 1 | 2 | 1.0 |
| 142 | TowerOfHanoi-v0-medium | 1 | 7 | 1.0 |
| 143 | UltimateTicTacToe-v0 * | 2 | 13 | 1.0 |
| 144 | WildTicTacToe-v0 | 2 | 10 | 1.0 |
* Games used for end-to-end evaluation. All 145 games achieve Legal Action Rate = 1.0.
Per-game average reward on the 16 one-player TextArena games:

| Game | Gemini-2.5-Flash | Gemini-2.5-Pro | Flash+Harness (Ours) | GPT-5.2 | GPT-5.2-High | Harness-as-Policy (Ours) |
|---|---|---|---|---|---|---|
| 2048-v0 | 0.215 | 0.378 | 0.308 | 0.212 | 0.745 | 0.912 |
| Bandit-v0 | 0.398 | 0.201 | 0.208 | 0.350 | 1.000 | 0.459 |
| Blackjack-v0 | 0.410 | 0.330 | 0.480 | 0.460 | 0.480 | 0.410 |
| Cryptarithm-v0 | 1.000 | 0.950 | 1.000 | 0.600 | 1.000 | 1.000 |
| FifteenPuzzle-v0 | 0.107 | 0.103 | 0.162 | 0.035 | 0.183 | 0.597 |
| FrozenLake-v0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| GuessTheNumber-v0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| LightsOut-v0 | 0.730 | 0.802 | 0.840 | 0.691 | 1.000 | 1.000 |
| Mastermind-v0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Minesweeper-v0 | 0.637 | 0.586 | 0.686 | 0.593 | 1.000 | 0.940 |
| PegJump-v0 | 0.325 | 0.682 | 0.782 | 0.221 | 0.429 | 1.000 |
| RushHour-v0 | 0.688 | 0.887 | 1.000 | 1.000 | 1.000 | 1.000 |
| Secretary-v0 | 0.550 | 0.700 | 0.650 | 0.600 | 0.800 | 0.750 |
| Sokoban-v0 | 0.700 | 0.700 | 0.800 | 0.600 | 0.867 | 0.850 |
| Sudoku-v0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| TowerOfHanoi-v0 | 1.000 | 1.000 | 1.000 | 0.800 | 1.000 | 1.000 |
Per-game legal action rate on the same 16 one-player games:

| Game | Gemini-2.5-Flash | Gemini-2.5-Pro | Flash+Harness (Ours) | GPT-5.2 | GPT-5.2-High | Harness-as-Policy (Ours) |
|---|---|---|---|---|---|---|
| 2048-v0 | 96.57% | 98.36% | 99.86% | 96.05% | 99.94% | 100.00% |
| Bandit-v0 | 99.76% | 96.39% | 99.77% | 100.00% | 100.00% | 100.00% |
| Blackjack-v0 | 99.38% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Cryptarithm-v0 | 96.97% | 98.70% | 100.00% | 88.44% | 100.00% | 100.00% |
| FifteenPuzzle-v0 | 84.70% | 88.14% | 96.59% | 87.18% | 100.00% | 100.00% |
| FrozenLake-v0 | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| GuessTheNumber-v0 | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| LightsOut-v0 | 100.00% | 100.00% | 99.76% | 100.00% | 100.00% | 100.00% |
| Mastermind-v0 | 100.00% | 100.00% | 100.00% | 98.57% | 100.00% | 100.00% |
| Minesweeper-v0 | 88.69% | 81.20% | 100.00% | 81.10% | 100.00% | 100.00% |
| PegJump-v0 | 67.97% | 83.10% | 98.25% | 60.17% | 77.78% | 100.00% |
| RushHour-v0 | 82.17% | 95.36% | 97.24% | 94.51% | 100.00% | 100.00% |
| Secretary-v0 | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Sokoban-v0 | 91.89% | 97.11% | 98.48% | 95.88% | 100.00% | 100.00% |
| Sudoku-v0 | 96.77% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| TowerOfHanoi-v0 | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
We developed a novel approach for improving LLM agent performance by automatically synthesizing a code harness. Using a small number of iterative refinement rounds guided by Thompson sampling and environment feedback, Gemini-2.5-Flash can generate a robust harness for any given game environment, without any manual engineering.
100% legal action rate achieved across all 145 TextArena games
Smaller Flash model beats larger Pro model (56.3% win rate in 2P games)
Harness-as-Policy achieves reward 0.870, exceeding GPT-5.2-High at near-zero inference cost