Google DeepMind · March 2026 · arXiv:2603.03329

AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness

Can a smaller model beat a larger one, just by writing its own rules?

Xinghua Lou, Miguel LΓ‘zaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy

Google DeepMind

March 5, 2026

Abstract

Large language models (LLMs) are remarkably capable at coding and math, but when acting as game-playing agents they often make illegal moves. In a recent Kaggle chess competition, 78% of Gemini 2.5 Flash's losses were due to rule violations, not bad strategy. "Harnesses" — in the loose sense, wrapper code that validates moves — are normally written by hand. AutoHarness shows that Gemini 2.5 Flash can write its own harness automatically, using a small number of iterative code-refinement rounds. The resulting harness prevents all illegal moves across 145 TextArena games, enabling the smaller Flash model to outperform the larger Gemini 2.5 Pro while also being more cost-effective.

145

Games Covered

AutoHarness achieved 100% legal action rate across all 145 TextArena games (both 1-player and 2-player), verified on 1,000 test rollouts per game.

56.3%

Smaller Model Wins

Gemini 2.5 Flash + Harness wins 9/16 two-player games against the much larger Gemini 2.5 Pro (overall win rate 56.3% vs. Pro's 38.2%).

0.870

Near-Zero Inference Cost

Harness-as-Policy generates a pure Python policy: no LLM is needed at test time. Average reward 0.870 on 16 one-player games, beating GPT-5.2-High (0.844) at near-zero compute cost.

Background & Motivation

LLMs have shown remarkable ability at coding and solving math problems. However, their planning and reasoning performance as agents can be brittle. In the recent Kaggle GameArena chess competition, 78% of losses by Gemini 2.5 Flash were attributed not to bad strategy but to simple illegal moves: moves strictly prohibited by the rules of chess.

Traditional fixes include manually writing "harness" code to filter invalid moves, or fine-tuning on game trajectories. But manual harnesses are brittle and labor-intensive, requiring new work for every game. Fine-tuning flagship-scale models is expensive and can degrade performance on other tasks. AutoHarness takes a different approach: use the LLM's own code-generation capability to write and refine the harness automatically.

Code as Harness: The Core Idea

An agent is the combination of an LLM and a harness that acts as "glue" between the model and the task. In AutoHarness, the LLM completes the agent by coding its own harness. The harness has two key functions: propose_action(obs) generates candidate moves, and is_legal_action(obs, action) verifies legality. This turns the model into a rejection sampler: it keeps proposing until a legal action is found.

def propose_action(obs):
    # Returns the agent's next move
    ...

def is_legal_action(obs, action):
    # Returns True if the action is legal
    ...
Rejection Sampling with a Harness
The phrase "turns the model into a rejection sampler" refers to a classical probabilistic technique. In standard rejection sampling, you draw candidates from a proposal distribution and accept only those that satisfy a constraint; the rest are discarded and you draw again.

propose_action(obs) acts as the proposal distribution: it generates a candidate move. is_legal_action(obs, action) is the acceptance criterion: it checks whether the move satisfies the game's rules. If the move is rejected, the LLM is re-prompted and proposes again until a legal move is found.

Why this matters: this design guarantees that every move actually executed is legal, regardless of how often the underlying model makes mistakes. The harness handles correctness enforcement; the LLM handles strategic quality.
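The loop above can be sketched in a few lines. This is a minimal toy instantiation, not the paper's code: the "game" is a 3x3 board with one free cell, and propose_action stands in for an LLM call by drawing random cells.

```python
import random

def propose_action(obs):
    """Proposal distribution: a toy stand-in for the LLM call,
    drawing a random cell on a 3x3 board."""
    return random.randrange(9)

def is_legal_action(obs, action):
    """Acceptance criterion: a cell is legal iff it has not been played yet."""
    return action not in obs["taken"]

def act(obs, max_tries=1000):
    """Rejection sampling: keep proposing until a legal action is accepted."""
    for _ in range(max_tries):
        action = propose_action(obs)
        if is_legal_action(obs, action):
            return action
    raise RuntimeError("no legal action found within the retry budget")

obs = {"taken": {0, 1, 2, 3, 4, 5, 6, 7}}  # only cell 8 remains free
print(act(obs))  # -> 8 (the only move the checker will accept)
```

However unreliable the proposer is, every action that leaves act() has passed the checker, which is exactly the correctness guarantee described above.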

AutoHarness formulates harness generation as a search over program space, guided by Thompson sampling. The LLM acts as a mutation operator, proposing code refinements based on feedback from environment execution. A tree search balances exploration (trying distinct logic structures) and exploitation (refining a partially working harness).

Thompson Sampling in Program Space
Thompson sampling is a Bayesian exploration strategy originally designed for multi-armed bandit problems. The key idea: maintain a probability distribution over the value of each option, sample one outcome per option, and commit to the option with the highest sampled value. This naturally balances exploration (trying uncertain options) and exploitation (sticking with known-good options).

Applied to code synthesis here: each node in the search tree is a candidate harness version (a program). The "value" of a node is its heuristic score, i.e., the fraction of legal moves the code achieves on rollout. Thompson sampling selects which node to refine next: nodes with high uncertainty get selected occasionally even if their mean score is low (exploration), while consistently high-scoring nodes are selected more often (exploitation).

Why not greedy search? Pure greedy search would always refine the currently best-scoring code, missing qualitatively different program structures that might score poorly at first but converge to a better solution. Thompson sampling avoids this local-optimum trap.
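A minimal sketch of this selection step, under one labeled assumption: the paper describes Thompson sampling over heuristic values, and here we model each node's legal-move rate with a Beta posterior over its rollout counts (a standard choice, not stated in the source).

```python
import random

def thompson_select(nodes):
    """Thompson sampling over candidate harnesses: sample a plausible
    legal-move rate for each node from a Beta posterior built from its
    rollout statistics, then pick the node with the highest sample."""
    draws = {n["name"]: random.betavariate(n["legal"] + 1, n["illegal"] + 1)
             for n in nodes}
    return max(nodes, key=lambda n: draws[n["name"]])

# Toy tree: a well-tested strong candidate vs. an uncertain newcomer.
nodes = [
    {"name": "v3", "legal": 90, "illegal": 10},  # high mean, narrow posterior
    {"name": "v7", "legal": 1,  "illegal": 1},   # few rollouts, wide posterior
]
counts = {"v3": 0, "v7": 0}
for _ in range(1000):
    counts[thompson_select(nodes)["name"]] += 1
# v3 is exploited most of the time; v7 is still explored occasionally,
# which is precisely the behavior greedy selection would lack.
```

A greedy rule would pick v3 every single time; the Beta draws give v7 a small but persistent chance, which is the local-optimum escape hatch described above.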

How AutoHarness Works

AutoHarness maintains multiple code hypotheses in a tree structure, using Thompson sampling to choose which node to refine next. The heuristic value for each node is the average legal move accuracy achieved by that code version. When the code has a bug (is_legal_action() returns True but the move is actually illegal), both propose_action() and is_legal_action() are refined. When is_legal_action() correctly returns False, detecting an illegal move, only propose_action() is fixed.

The Two-Function Split: Why Separate propose and is_legal?
  • propose_action(obs) (generation): produce a candidate move. Can be as simple as enumerating all legal board positions, or as complex as a heuristic policy.
  • is_legal_action(obs, action) (verification): return True only if the move is allowed by the game rules.
Targeted repair logic:
  • If is_legal_action returns True for an actually illegal move (a false positive), the checker doesn't know the rules; both functions are likely wrong, so both are refined.
  • If is_legal_action correctly returns False, only the proposer generated a bad move, so only propose_action is fixed.

This asymmetric repair avoids "fixing" code that is already correct, reducing unnecessary churn in later iterations.
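The decision rule can be written as a tiny pure function. This is a sketch of the logic described above, with hypothetical boolean inputs (the paper does not specify this exact interface):

```python
def functions_to_refine(checker_says_legal, env_accepted):
    """Asymmetric repair rule: given whether is_legal_action() approved the
    move and whether the environment actually accepted it, decide which
    harness functions to hand to the Refiner."""
    if checker_says_legal and not env_accepted:
        # False positive: the checker does not know the rules, so both
        # the proposer and the checker are suspect.
        return ["propose_action", "is_legal_action"]
    if not checker_says_legal:
        # The checker correctly rejected the move; only the proposer erred.
        return ["propose_action"]
    return []  # legal move, accepted by the environment: nothing to repair

print(functions_to_refine(True, False))   # -> ['propose_action', 'is_legal_action']
print(functions_to_refine(False, False))  # -> ['propose_action']
print(functions_to_refine(True, True))    # -> []
```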
AutoHarness code-as-harness learning process diagram
Figure 1: The code-as-harness learning process. Left: Thompson sampling search tree with heuristic values. Right: iterative refinement loop (Old Code → Refiner ← Critic ← Evaluator ← Rollout → New Code → New H).

AutoHarness supports three modes of operation, from lightweight action filtering to a full code-only policy:

Three Modes: A Spectrum from "Safety Net" to "Full Autonomy"
The three AutoHarness operating modes differ in how much the LLM is involved at test time (after the harness is learned):

Mode | LLM at test time? | Harness role
Action Filter | Yes (selects from the legal set) | Enumerates legal moves only
Action Verifier | Yes (proposes, then retries) | Validates each proposed move
Policy | No | Directly outputs the action

The key distinction is inference cost: Action Verifier may call the LLM multiple times per turn, while Harness-as-Policy never calls the LLM at game time. The "near-zero inference cost" headline refers exclusively to the Policy mode.

Harness-as-Action-Filter

propose_action() generates a set of legal moves. The LLM then ranks and selects the best one using chain-of-thought reasoning.
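In filter mode, legality holds by construction because the model can only pick from what the harness enumerated. A toy sketch, with a hypothetical llm_rank callable and a 3x3-board enumerator standing in for the game-specific code:

```python
def legal_actions(obs):
    """Toy enumerator standing in for a propose_action() that returns the
    full legal set: here, the free cells on a 3x3 board."""
    return [i for i in range(9) if i not in obs["taken"]]

def action_filter_turn(obs, llm_rank):
    """Harness-as-Action-Filter: the harness enumerates, the LLM only
    chooses among the enumerated moves (e.g., via chain-of-thought)."""
    candidates = legal_actions(obs)
    choice = llm_rank(obs, candidates)
    assert choice in candidates  # legality is guaranteed by construction
    return choice

# Stub "LLM" that prefers the centre cell when it is available.
pick = lambda obs, moves: 4 if 4 in moves else moves[0]
print(action_filter_turn({"taken": {0, 4}}, pick))  # -> 1 (centre is taken)
```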

Harness-as-Policy

The most extreme case: code directly selects the next action, with no LLM call needed at test time. Pure Python, near-zero inference cost, highest average performance.

Training: Verified on 145 Games

Training uses 10 parallel environments per iteration, rolling out up to 1,000 steps. Rollout terminates whenever an illegal move is made or code execution fails. The Critic consolidates up to 5 failed steps and feeds them to the Refiner to generate improved code. Heuristic weight is set to 1.0 for Thompson sampling. Training ends when the legal action success rate reaches 1.0, or on timeout. Gemini-2.5-Flash is used for all training.

Critic and Refiner: The Inner Feedback Loop
The training procedure uses three distinct roles:
  • Evaluator: runs the current code in 10 parallel game environments for up to 1,000 steps each, collecting failure traces (illegal moves, exceptions).
  • Critic: reads up to 5 failed steps and writes a structured diagnosis: which rule is being violated, which function is at fault, and what the expected behavior should be.
  • Refiner: reads the Critic's diagnosis alongside the current code, and writes a new candidate version.
Why consolidate failures before the Refiner sees them? The Critic compresses the signal: it identifies the root-cause pattern across multiple failures rather than forwarding raw rollout noise. Parallel environments: 10 games run simultaneously to collect diverse failure modes in a single iteration.
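One iteration of the outer loop can be sketched as below. This is a schematic, not the authors' implementation: rollout, critic, and refiner are hypothetical callables standing in for the Evaluator, Critic, and Refiner roles, and node selection uses a Beta-posterior Thompson draw as an illustrative choice.

```python
import random

def training_iteration(nodes, rollout, critic, refiner):
    """One tree-search iteration (sketch): select a harness via Thompson
    sampling, evaluate it, diagnose failures, and add a refined child."""
    node = max(nodes, key=lambda n: random.betavariate(n["legal"] + 1,
                                                       n["illegal"] + 1))
    failures = rollout(node)                 # Evaluator: parallel rollouts
    if not failures:
        return node                          # 100% legal actions: done
    diagnosis = critic(failures[:5])         # Critic: consolidate <= 5 failures
    nodes.append(refiner(node, diagnosis))   # Refiner: improved child version
    return None

# A failing iteration grows the tree; a clean rollout ends training.
nodes = [{"code": "v0", "legal": 0, "illegal": 5}]
child = {"code": "v1", "legal": 0, "illegal": 0}
assert training_iteration(nodes, lambda n: ["illegal move"] * 7,
                          lambda f: "diagnosis", lambda n, d: child) is None
assert len(nodes) == 2                       # refined child was added
assert training_iteration(nodes, lambda n: [], None, None) is not None
```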
14.5 Avg. iterations
145 Games covered
100% Legal action rate
19/32 Converge < 10 iter
Learning convergence curves for 6 games
Figure 2: Fraction of legal moves vs. number of code synthesis iterations for a selection of 6 games. Most games converge rapidly; complex games like Chess and Othello require more iterations.

On average, training ends after 14.5 tree-search iterations, and 19/32 evaluation games converge in fewer than 10 iterations. The games requiring the most LLM calls to learn are GermanWhist-v0 (43 steps), Cryptarithm-v0 (45 steps), Othello-v0 (62 steps), and Chess-v0 (64 steps). AutoHarness achieves a 100% legal action success rate on all 145 games, as shown in Appendix Table 1.

Why Do Some Games Require Many More Iterations?
The number of LLM calls varies enormously (1 call for simple games vs. 64 for Chess). Three factors drive difficulty:
  1. Rule complexity: Chess has ~30 move types with state dependencies (castling, en passant, check detection). GermanWhist requires tracking partial information about opponents' hands.
  2. Hidden state: Games with private hands require the harness to reason about unobserved state, harder to encode in pure Python.
  3. Counter-intuitive case: Breakthrough-v0-small (a simplified variant) required 136 steps, more than full Chess. "Small" does not mean simpler rules; the variant's modified capture mechanics created unexpected edge cases.
Metric note: "# Learning Steps" = number of tree search iterations = number of LLM refinement calls. Each step = one Critic call + one Refiner call.

Evaluation: A Smaller Model Beats a Larger One

Evaluation focuses on 16 one-player (1P) and 16 two-player (2P) games from TextArena. Three agents are compared: Gemini-2.5-Flash, Gemini-2.5-Pro, and Gemini-2.5-Flash+Harness (our method). The same optimized prompt is used in all experiments. For 1P games, 20 matches are run and average reward is used as the metric. For 2P games, 40 matches are run (split evenly between first/second player), with win/draw/loss rate as the metric.

Evaluation Protocol: What "Win Rate" and "Average Reward" Actually Mean
Two-player (2P) games (win/draw/loss rate):
40 matches per game (20 as first player + 20 as second player) to control for first-mover advantage. "Win rate" is the fraction of matches where Flash+Harness wins against Gemini-2.5-Pro. The headline "9/16 games" means Flash+Harness achieves a positive win margin in 9 of the 16 game titles.

One-player (1P) games (average reward):
20 matches per game. Reward r ∈ [0, 1] is a normalized score from the TextArena environment: r = 1.0 means perfect completion (e.g., a solved puzzle), and r = 0.0 means failure. Games where all agents score 1.0 (GuessTheNumber, FrozenLake, etc.) are saturated benchmarks; they don't differentiate agent quality.

Two-Player Games

Win/lose/draw rate vs Gemini-2.5-Pro for 16 2P games
Figure 3: Win/lose/draw rate of AutoHarness (Gemini-2.5-Flash+Harness) vs. Gemini-2.5-Pro for each of the 16 two-player games. Green = wins, gray = draws, red = losses.

AutoHarness enables the smaller Gemini-2.5-Flash to win 9/16 two-player games against the much larger Gemini-2.5-Pro (overall win rate 56.3% vs. Pro's 38.2%). Against vanilla Gemini-2.5-Flash (no harness), the win rate rises to 64.8% (12/16 games).

One-Player Games

Average reward vs Gemini-2.5-Pro for 16 1P games
Figure 4: Average reward of AutoHarness (orange) vs. Gemini-2.5-Pro (blue) for each of the 16 one-player games. AutoHarness exceeds Pro in 8/16 games and ties in 5/16.

Our approach achieves a higher reward than Gemini-2.5-Pro in 8/16 games, and ties in 5/16 games. Average reward: AutoHarness 0.745 vs. Gemini-2.5-Pro 0.707 vs. Gemini-2.5-Flash 0.673. Notable improvements on Cryptarithm-v0, RushHour-v0, PegJump-v0, and FifteenPuzzle-v0.

Harness-as-Policy: Zero Inference Cost, Highest Performance

As an extreme case, AutoHarness can learn the entire policy as code, dispensing with the LLM at test time entirely. The policy code uses primitive Python functions and standard libraries (e.g., numpy); no LLM calls are needed during gameplay. This is evaluated on 16 one-player games (2P games require strategic opponent modeling, which is much harder to encode as pure code).

For training, the heuristic value is modified to include task reward: H = 0 if an illegal action is taken, and H = 0.5 + 0.5r otherwise, where r ∈ [0, 1] is the environment reward available at the end of a trajectory. Training uses Gemini-2.5-Flash with up to 256 iterations. On average, training takes 89.4 iterations and achieves a heuristic value of 0.939.

Heuristic Value Design for Harness-as-Policy
The heuristic H combines two objectives:

H = 0            if an illegal action is taken
H = 0.5 + 0.5r   otherwise (r ∈ [0, 1] is the environment reward)

Why this specific formula?
  • Hard zero for illegal moves: an illegal action is an absolute failure. H = 0 ensures these nodes are never selected by Thompson sampling for exploitation.
  • 0.5 floor for legal-but-low-reward: even a policy scoring r = 0 gets H = 0.5, keeping it above zero. Thompson sampling will still occasionally explore from this node, since the code structure might still be on the right track.
  • Linear scaling: H ranges from 0.5 (legal, zero reward) to 1.0 (legal, perfect reward). The search simultaneously optimizes legality and task performance.
Contrast with Action Verifier training: there, H was just the legal-action accuracy. Including r here is necessary because the policy code is the strategy; no LLM handles the strategic part separately.
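The formula translates directly into code. A one-line sketch of the heuristic as defined above (the function name is ours, not the paper's):

```python
def policy_heuristic(illegal, reward):
    """Heuristic value for Harness-as-Policy training:
    H = 0 for any illegal action, otherwise H = 0.5 + 0.5 * r, r in [0, 1]."""
    return 0.0 if illegal else 0.5 + 0.5 * reward

print(policy_heuristic(True, 0.9))   # -> 0.0  (illegal overrides any reward)
print(policy_heuristic(False, 0.0))  # -> 0.5  (legal floor)
print(policy_heuristic(False, 1.0))  # -> 1.0  (legal, perfect reward)
```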
Average reward of different agents across 16 TextArena 1P games
Figure 5: Average reward of different agents across 16 TextArena 1P games. Our two methods (bold labels) achieve the highest performance.
Reading Figure 5: What the Comparison Reveals
Figure 5 compares average reward across 16 one-player games for six agents. Key observations:
  • Harness-as-Policy (0.870) > GPT-5.2-High (0.844): a pure Python script with no LLM call at test time outperforms one of the largest available models. This is the central counter-intuitive result.
  • Flash+Harness (0.745) > Gemini-2.5-Pro (0.707): the Action Verifier also beats the larger model, though the margin is smaller since the LLM still handles strategy.
  • GPT-5.2 (0.635) < Gemini-2.5-Flash (0.673): GPT-5.2 without a harness underperforms baseline Flash, likely due to higher illegal-move rates on complex games.
  • Cost: GPT-5.2-High costs ~$640 for the evaluation run. Harness-as-Policy costs ~$0 at test time (training cost is amortized once per game).

Performance vs. Cost Comparison

Agent | Avg. Reward | Test Cost
Gemini-2.5-Flash | 0.673 | not reported
Gemini-2.5-Pro | 0.707 | not reported
Gemini-2.5-Flash+Harness (Ours) | 0.745 | ~$0
GPT-5.2 | 0.635 | ~$640
GPT-5.2-High | 0.844 | ~$640
Harness-as-Policy (Ours) | 0.870 | ~$0

Appendix: Full Results

All 145 TextArena Games: Learning Steps & Legal Action Rate

Table 1: All 145 TextArena games, with number of LLM calls needed to learn the harness, and the resulting legal action accuracy. Games marked with * are used for end-to-end agent evaluation.

Index | Game | # Players | # Learning Steps | Legal Action Rate
0 | 2048-v0 * | 1 | 27 | 1.0
1 | 2048-v0-easy | 1 | 4 | 1.0
2 | 2048-v0-extreme | 1 | 44 | 1.0
3 | 2048-v0-hard | 1 | 47 | 1.0
4 | 2048-v0-mega-easy | 1 | 31 | 1.0
5 | 2048-v0-super-easy | 1 | 6 | 1.0
6 | 2048-v0-ultra-easy | 1 | 2 | 1.0
7 | 2048-v0-very-easy | 1 | 57 | 1.0
8 | 2048-v0-very-hard | 1 | 7 | 1.0
9 | Alquerque-v0 * | 2 | 4 | 1.0
10 | Bandit-v0 * | 1 | 2 | 1.0
11 | Bandit-v0-hard | 1 | 1 | 1.0
12 | Battleship-v0 | 2 | 4 | 1.0
13 | Battleship-v0-extreme | 2 | 32 | 1.0
14 | Battleship-v0-large | 2 | 9 | 1.0
15 | Battleship-v0-standard | 2 | 6 | 1.0
16 | Blackjack-v0 * | 1 | 2 | 1.0
17 | Blackjack-v0-long | 1 | 1 | 1.0
18 | Breakthrough-v0 * | 2 | 2 | 1.0
19 | Breakthrough-v0-blind | 2 | 20 | 1.0
20 | Breakthrough-v0-large | 2 | 9 | 1.0
21 | Breakthrough-v0-long | 2 | 7 | 1.0
22 | Breakthrough-v0-small | 2 | 136 | 1.0
23 | Breakthrough-v0-tiny | 2 | 5 | 1.0
24 | Briscola-v0 | 2 | 2 | 1.0
25 | Checkers-v0 * | 2 | 7 | 1.0
26 | Checkers-v0-long | 2 | 3 | 1.0
27 | Chess-v0 * | 2 | 64 | 1.0
28 | Chess-v0-blind | 2 | 19 | 1.0
29 | Chess-v0-long | 2 | 16 | 1.0
30 | Chopsticks-v0 * | 2 | 15 | 1.0
31 | Chopsticks-v0-long | 2 | 7 | 1.0
32 | Chopsticks-v0-medium | 2 | 15 | 1.0
33 | ColonelBlotto-v0 | 2 | 1 | 1.0
34 | ColonelBlotto-v0-extreme | 2 | 1 | 1.0
35 | ColonelBlotto-v0-large | 2 | 1 | 1.0
36 | ColonelBlotto-v0-small | 2 | 1 | 1.0
37 | ConnectFour-v0 | 2 | 10 | 1.0
38 | ConnectFour-v0-blind | 2 | 2 | 1.0
39 | ConnectFour-v0-large | 2 | 1 | 1.0
40 | Crusade-v0 * | 2 | 4 | 1.0
41 | Cryptarithm-v0 * | 1 | 45 | 1.0
42 | FifteenPuzzle-v0 * | 1 | 3 | 1.0
43 | FrozenLake-v0 * | 1 | 19 | 1.0
44 | FrozenLake-v0-hardcore | 1 | 4 | 1.0
45 | FrozenLake-v0-random | 1 | 22 | 1.0
46 | GameOfPureStrategy-v0 | 2 | 3 | 1.0
47 | GermanWhist-v0 * | 2 | 43 | 1.0
48 | Golf-v0 * | 2 | 8 | 1.0
49 | Golf-v0-medium | 2 | 9 | 1.0
50 | GuessTheNumber-v0 * | 1 | 2 | 1.0
51 | GuessTheNumber-v0-hardcore | 1 | 2 | 1.0
52 | HighSociety-v0 | 2 | 3 | 1.0
53 | IndianPoker-v0 | 2 | 11 | 1.0
54 | IndianPoker-v0-extreme | 2 | 2 | 1.0
55 | IndianPoker-v0-long | 2 | 26 | 1.0
56 | IndianPoker-v0-medium | 2 | 7 | 1.0
57 | IndianPoker-v0-short | 2 | 2 | 1.0
58 | IteratedMatchingPennies-v0 | 2 | 1 | 1.0
59 | IteratedRockPaperScissors-v0 | 2 | 1 | 1.0
60 | IteratedTwoThirdsAverage-v0 | 2 | 1 | 1.0
61 | KuhnPoker-v0 | 2 | 5 | 1.0
62 | KuhnPoker-v0-extreme | 2 | 3 | 1.0
63 | KuhnPoker-v0-long | 2 | 2 | 1.0
64 | KuhnPoker-v0-medium | 2 | 2 | 1.0
65 | KuhnPoker-v0-short | 2 | 3 | 1.0
66 | LiarsDice-v0 * | 2 | 4 | 1.0
67 | LiarsDice-v0-large | 2 | 6 | 1.0
68 | LiarsDice-v0-small | 2 | 5 | 1.0
69 | LightsOut-v0 * | 1 | 1 | 1.0
70 | LinesOfAction-v0 * | 2 | 23 | 1.0
71 | Mastermind-v0 * | 1 | 2 | 1.0
72 | Mastermind-v0-extreme | 1 | 1 | 1.0
73 | Mastermind-v0-hard | 1 | 2 | 1.0
74 | MemoryGame-v0 | 2 | 3 | 1.0
75 | MemoryGame-v0-hard | 2 | 2 | 1.0
76 | MemoryGame-v0-medium | 2 | 2 | 1.0
77 | Minesweeper-v0 * | 1 | 11 | 1.0
78 | Minesweeper-v0-hard | 1 | 6 | 1.0
79 | Minesweeper-v0-medium | 1 | 10 | 1.0
80 | Minesweeper-v0-small | 1 | 2 | 1.0
81 | NewRecruit-v0 * | 2 | 2 | 1.0
82 | Nim-v0 | 2 | 1 | 1.0
83 | Nim-v0-large | 2 | 2 | 1.0
84 | Nim-v0-medium | 2 | 2 | 1.0
85 | Othello-v0 * | 2 | 62 | 1.0
86 | Othello-v0-big | 2 | 2 | 1.0
87 | Othello-v0-hard | 2 | 30 | 1.0
88 | Othello-v0-huge | 2 | 12 | 1.0
89 | Othello-v0-small | 2 | 5 | 1.0
90 | Othello-v0-tiny | 2 | 13 | 1.0
91 | PegJump-v0 * | 1 | 1 | 1.0
92 | PigDice-v0 | 2 | 1 | 1.0
93 | PigDice-v0-100 | 2 | 1 | 1.0
94 | PigDice-v0-150 | 2 | 1 | 1.0
95 | PigDice-v0-200 | 2 | 1 | 1.0
96 | PigDice-v0-250 | 2 | 1 | 1.0
97 | PigDice-v0-300 | 2 | 1 | 1.0
98 | PigDice-v0-350 | 2 | 1 | 1.0
99 | PigDice-v0-400 | 2 | 1 | 1.0
100 | PigDice-v0-450 | 2 | 1 | 1.0
101 | PigDice-v0-50 | 2 | 1 | 1.0
102 | PigDice-v0-500 | 2 | 1 | 1.0
103 | PigDice-v0-long | 2 | 1 | 1.0
104 | PigDice-v0-short | 2 | 1 | 1.0
105 | Poker-v0 | 2 | 17 | 1.0
106 | Poker-v0-extreme | 2 | 7 | 1.0
107 | Poker-v0-long | 2 | 5 | 1.0
108 | Poker-v0-small | 2 | 29 | 1.0
109 | QuantumTicTacToe-v0 | 2 | 12 | 1.0
110 | ReverseTicTacToe-v0 | 2 | 3 | 1.0
111 | RushHour-v0 * | 1 | 3 | 1.0
112 | SantoriniBaseFixed-v0 | 2 | 30 | 1.0
113 | Secretary-v0 * | 1 | 1 | 1.0
114 | Secretary-v0-long | 1 | 1 | 1.0
115 | SimpleTak-v0 | 2 | 4 | 1.0
116 | SimpleTak-v0-extreme | 2 | 8 | 1.0
117 | SimpleTak-v0-large | 2 | 12 | 1.0
118 | SimpleTak-v0-medium | 2 | 5 | 1.0
119 | Snake-v0 | 2 | 1 | 1.0
120 | Snake-v0-large | 2 | 1 | 1.0
121 | Snake-v0-standard | 2 | 1 | 1.0
122 | Sokoban-v0 * | 1 | 5 | 1.0
123 | Sokoban-v0-medium | 1 | 1 | 1.0
124 | SpiteAndMalice-v0 * | 2 | 33 | 1.0
125 | Stratego-v0 * | 2 | 23 | 1.0
126 | Sudoku-v0 * | 1 | 5 | 1.0
127 | Sudoku-v0-easy | 1 | 5 | 1.0
128 | Sudoku-v0-hard | 1 | 9 | 1.0
129 | Sudoku-v0-medium | 1 | 4 | 1.0
130 | Sudoku-v0-very-easy | 1 | 4 | 1.0
131 | Surround-v0 | 2 | 1 | 1.0
132 | Surround-v0-large | 2 | 1 | 1.0
133 | Surround-v0-standard | 2 | 1 | 1.0
134 | Tak-v0 * | 2 | 21 | 1.0
135 | Tak-v0-hard | 2 | 53 | 1.0
136 | Tak-v0-medium | 2 | 6 | 1.0
137 | TicTacToe-v0 | 2 | 4 | 1.0
138 | TowerOfHanoi-v0 * | 1 | 7 | 1.0
139 | TowerOfHanoi-v0-extreme | 1 | 44 | 1.0
140 | TowerOfHanoi-v0-hard | 1 | 7 | 1.0
141 | TowerOfHanoi-v0-hardcore | 1 | 2 | 1.0
142 | TowerOfHanoi-v0-medium | 1 | 7 | 1.0
143 | UltimateTicTacToe-v0 * | 2 | 13 | 1.0
144 | WildTicTacToe-v0 | 2 | 10 | 1.0

* Games used for end-to-end evaluation. All 145 games achieve Legal Action Rate = 1.0.

Per-Game Average Reward (1P Games)

Game | Gemini-2.5-Flash | Gemini-2.5-Pro | Flash+Harness (Ours) | GPT-5.2 | GPT-5.2-High | Harness-as-Policy (Ours)
2048-v0 | 0.215 | 0.378 | 0.308 | 0.212 | 0.745 | 0.912
Bandit-v0 | 0.398 | 0.201 | 0.208 | 0.350 | 1.000 | 0.459
Blackjack-v0 | 0.410 | 0.330 | 0.480 | 0.460 | 0.480 | 0.410
Cryptarithm-v0 | 1.000 | 0.950 | 1.000 | 0.600 | 1.000 | 1.000
FifteenPuzzle-v0 | 0.107 | 0.103 | 0.162 | 0.035 | 0.183 | 0.597
FrozenLake-v0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000
GuessTheNumber-v0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000
LightsOut-v0 | 0.730 | 0.802 | 0.840 | 0.691 | 1.000 | 1.000
Mastermind-v0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000
Minesweeper-v0 | 0.637 | 0.586 | 0.686 | 0.593 | 1.000 | 0.940
PegJump-v0 | 0.325 | 0.682 | 0.782 | 0.221 | 0.429 | 1.000
RushHour-v0 | 0.688 | 0.887 | 1.000 | 1.000 | 1.000 | 1.000
Secretary-v0 | 0.550 | 0.700 | 0.650 | 0.600 | 0.800 | 0.750
Sokoban-v0 | 0.700 | 0.700 | 0.800 | 0.600 | 0.867 | 0.850
Sudoku-v0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000
TowerOfHanoi-v0 | 1.000 | 1.000 | 1.000 | 0.800 | 1.000 | 1.000

Per-Game Legal Action Rate (1P Games)

Game | Gemini-2.5-Flash | Gemini-2.5-Pro | Flash+Harness (Ours) | GPT-5.2 | GPT-5.2-High | Harness-as-Policy (Ours)
2048-v0 | 96.57% | 98.36% | 99.86% | 96.05% | 99.94% | 100.00%
Bandit-v0 | 99.76% | 96.39% | 99.77% | 100.00% | 100.00% | 100.00%
Blackjack-v0 | 99.38% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00%
Cryptarithm-v0 | 96.97% | 98.70% | 100.00% | 88.44% | 100.00% | 100.00%
FifteenPuzzle-v0 | 84.70% | 88.14% | 96.59% | 87.18% | 100.00% | 100.00%
FrozenLake-v0 | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00%
GuessTheNumber-v0 | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00%
LightsOut-v0 | 100.00% | 100.00% | 99.76% | 100.00% | 100.00% | 100.00%
Mastermind-v0 | 100.00% | 100.00% | 100.00% | 98.57% | 100.00% | 100.00%
Minesweeper-v0 | 88.69% | 81.20% | 100.00% | 81.10% | 100.00% | 100.00%
PegJump-v0 | 67.97% | 83.10% | 98.25% | 60.17% | 77.78% | 100.00%
RushHour-v0 | 82.17% | 95.36% | 97.24% | 94.51% | 100.00% | 100.00%
Secretary-v0 | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00%
Sokoban-v0 | 91.89% | 97.11% | 98.48% | 95.88% | 100.00% | 100.00%
Sudoku-v0 | 96.77% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00%
TowerOfHanoi-v0 | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00%

Conclusion & Future Work

We developed a novel approach for improving LLM agent performance by automatically synthesizing a code harness. Using a small number of iterative refinement rounds guided by Thompson sampling and environment feedback, Gemini-2.5-Flash can generate a robust harness for any given game environment, without any manual engineering.

Quick Recap: Key Terms Used Throughout
  • Code harness: wrapper code around an LLM agent that enforces game rules by filtering or verifying proposed moves.
  • Thompson sampling: a Bayesian exploration-exploitation strategy used to select which code candidate to refine next.
  • Rejection sampler: the architectural pattern where the LLM proposes moves and the harness rejects illegal ones until a valid move is found.
  • Harness-as-Policy: the variant where the synthesized Python code replaces the LLM entirely at test time.
  • TextArena: the open-source multi-game text-based environment used for all experiments (Guertler et al., 2025).
✅ 100% legal action rate achieved across all 145 TextArena games
✅ Smaller Flash model beats larger Pro model: 56.3% win rate in 2P games
✅ Harness-as-Policy achieves reward 0.870, exceeding GPT-5.2-High at near-zero inference cost

Future Directions

  • Distill resulting domain-specific expert agents back into the base LLM, making the whole system recursively self-improving
  • Build a library of reusable harnesses that can be shared across related game environments
  • Apply the method to more challenging multimodal games such as Craftax and Terra Nova
References
  1. Chervonyi et al. (2025). Gold-medalist performance in solving olympiad geometry with AlphaGeometry2. JMLR, 26(241):1–39.
  2. Duan et al. (2024). GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations. arXiv [cs.CL].
  3. Guertler et al. (2025). TextArena. arXiv:2504.11442.
  4. Huang & Yang (2025). Winning gold at IMO 2025 with a model-agnostic verification-and-refinement pipeline. arXiv:2507.15855.
  5. Kaggle (2025). Kaggle Game Arena: A benchmarking platform for AI models. kaggle.com/game-arena.
  6. Kokel et al. (2025). ACPBench Hard: Unrestrained reasoning about action, change, and planning. AAAI 2025 Workshop LM4Plan.
  7. Lehrach et al. (2025). Code world models for general game playing. arXiv:2510.04542.
  8. Li et al. (2022). Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.
  9. Liang et al. (2023). Code as policies: Language model programs for embodied control. In ICRA, pp. 9493–9500.
  10. Ma et al. (2024). Eureka: Human-level reward design via coding large language models. In ICLR 2024.
  11. Novikov et al. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131.
  12. Petrov et al. (2025). Proof or bluff? Evaluating LLMs on 2025 USA math olympiad. arXiv:2503.21934.
  13. Ruoss et al. (2024). LMAct: A benchmark for in-context imitation learning with long multimodal demonstrations. arXiv:2412.01441.
  14. Shinn et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS, 36:8634–8652.
  15. Tang et al. (2024). Code repair with LLMs gives an exploration-exploitation tradeoff. NeurIPS, 37:117954–117996.
  16. Valmeekam et al. (2023a). On the planning abilities of large language models: a critical investigation. In NeurIPS.
  17. Valmeekam et al. (2023b). On the planning abilities of large language models: a critical investigation. NeurIPS, 36:75993–76005.
  18. Wang et al. (2023). Voyager: An open-ended embodied agent with large language models. arXiv:2305.16291.
  19. Wei et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837.
