---
arxiv_id: 2603.03329
title: "AutoHarness: improving LLM agents by automatically synthesizing a code harness"
authors:
  - Xinghua Lou
  - Miguel Lázaro-Gredilla
  - Antoine Dedieu
  - Carter Wendelken
  - Wolfgang Lehrach
  - Kevin P. Murphy
difficulty: Intermediate
tags:
  - Agent
  - LLM
  - Reasoning
  - Benchmark
published_at: 2026-02-10
flecto_url: https://flecto.zer0ai.dev/papers/2603.03329/
lang: en
---

> AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness

*Can a smaller model beat a larger one — just by writing its own rules?*

## Abstract

### Abstract

Large language models (LLMs) are remarkably capable at coding and math — but when acting as game-playing agents, they often make illegal moves. In a recent Kaggle chess competition, 78% of Gemini 2.5 Flash's losses were due to rule violations, not bad strategy. Harnesses — wrapper code that validates moves — are normally written by hand. AutoHarness shows that Gemini 2.5 Flash can write its own harness automatically, using a small number of iterative code-refinement rounds. The resulting harness prevents all illegal moves across 145 TextArena games, enabling the smaller Flash model to outperform the larger Gemini 2.5 Pro — while also being more cost-effective.

## Introduction

### Background & Motivation

## Conclusion

### Conclusion & Future Work

## References

### References (19)


## Abstract, Card=1

### Games Covered

AutoHarness achieved 100% legal action rate across all 145 TextArena games (both 1-player and 2-player), verified on 1,000 test rollouts per game.

## Abstract, Card=2

### Smaller Model Wins

Gemini 2.5 Flash + Harness wins 9/16 two-player games against the much larger Gemini 2.5 Pro (overall win rate 56.3% vs. Pro's 38.2%).

## Abstract, Card=3

### Near-Zero Inference Cost

Harness-as-Policy generates a pure Python policy — no LLM needed at test time. Average reward 0.870 on 16 one-player games, beating GPT-5.2-High (0.844) at near-zero compute cost.

## Introduction, Para=1

LLMs have shown remarkable ability at coding and solving math problems. However, their planning and reasoning performance as agents can be brittle. In the recent Kaggle GameArena chess competition, 78% of losses by Gemini 2.5 Flash were attributed not to bad strategy, but to simple illegal moves — moves strictly prohibited by the rules of chess.

## Introduction, Para=2

Traditional fixes include manually writing "harness" code to filter invalid moves, or fine-tuning on game trajectories. But manual harnesses are brittle and labor-intensive — requiring new work for every game. Fine-tuning flagship-scale models is expensive and can degrade performance on other tasks. AutoHarness takes a different approach: use the LLM's own code-generation capability to write and refine the harness automatically.

## Introduction, Callout=Insight

### Code as Harness — The Core Idea

An agent is the combination of an LLM and a harness that acts as "glue" between the model and the task. In AutoHarness, the LLM completes the agent by coding its own harness. The harness has two key functions: `propose_action(obs)` generates candidate moves, and `is_legal_action(obs, action)` verifies legality. This turns the model into a rejection sampler — it keeps proposing until a legal action is found.
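
A minimal sketch of this rejection-sampling loop, using a toy tic-tac-toe-like game (all code here is illustrative, not the paper's synthesized harness; the naive random proposer stands in for the LLM):

```python
import random

# Toy stand-in for a synthesized harness: obs is the set of occupied cells
# on a 3x3 board; a legal action is any unoccupied cell index in 0..8.
def propose_action(obs):
    """Propose a candidate move (a naive random proposer, for illustration)."""
    return random.randrange(9)

def is_legal_action(obs, action):
    """Verify legality: the target cell must not already be occupied."""
    return action not in obs

def act(obs, max_tries=100):
    """Rejection sampling: keep proposing until a legal action is found."""
    for _ in range(max_tries):
        action = propose_action(obs)
        if is_legal_action(obs, action):
            return action
    raise RuntimeError("no legal action found")

print(act({0, 4, 8}))  # prints some unoccupied cell; never 0, 4, or 8
```

The split into a proposer and a verifier is what lets the two functions be refined independently during training, as described in the Method section.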

## Introduction, Para=3

AutoHarness formulates harness generation as a search over program space, guided by Thompson sampling. The LLM acts as a mutation operator, proposing code refinements based on feedback from environment execution. A tree search balances exploration (trying distinct logic structures) and exploitation (refining a partially working harness).

## Method

### How AutoHarness Works

## Method, Para=1

AutoHarness maintains multiple code hypotheses in a tree structure, using Thompson sampling to choose which node to refine next. The heuristic value for each node is the average legal-move accuracy achieved by that code version. When the code has a bug — `is_legal_action()` returns True but the move is actually illegal — both `propose_action()` and `is_legal_action()` are refined. When `is_legal_action()` correctly returns False (detecting an illegal proposal), only `propose_action()` is fixed.
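
One common way to realize Thompson sampling over such hypotheses is a Beta posterior on each node's legal-move rate; the sketch below assumes that formulation (the paper does not spell out its exact posterior, and tree expansion/refinement is omitted):

```python
import random

# Hypothetical search node: one code hypothesis with its rollout statistics.
class Node:
    def __init__(self, name):
        self.name = name
        self.legal = 1    # Beta prior pseudo-count: legal moves + 1
        self.illegal = 1  # Beta prior pseudo-count: illegal moves + 1

    def sample(self):
        # Thompson sampling: draw from the Beta posterior over legal-move rate.
        return random.betavariate(self.legal, self.illegal)

def select(nodes):
    """Pick the hypothesis with the highest posterior sample to refine next."""
    return max(nodes, key=lambda n: n.sample())

# A strong hypothesis (90/100 legal rollouts) should almost always be
# selected over a weak one (30/100 legal rollouts).
strong, weak = Node("v2"), Node("v1")
strong.legal, strong.illegal = 91, 11
weak.legal, weak.illegal = 31, 71
picks = sum(select([strong, weak]) is strong for _ in range(1000))
print(picks)  # close to 1000
```

Because each node is sampled rather than greedily ranked, a weak-looking hypothesis still occasionally gets picked, which is what gives the search its exploration behavior.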

## Method, Figure=1

Figure 1: The code-as-harness learning process. Left: Thompson sampling search tree with heuristic values. Right: the iterative refinement loop (Old Code → Rollout → Evaluator → Critic → Refiner → New Code → New H).

## Method, Para=2

AutoHarness supports three modes of operation, from lightweight action filtering to a full code-only policy:

## Method, Card=Filter

### Harness-as-Action-Filter

`propose_action()` generates a set of legal moves. The LLM then ranks and selects the best one using chain-of-thought reasoning.

## Method, Card=Verifier

### Harness-as-Action-Verifier (Main Method)

The LLM proposes a move. `is_legal_action()` verifies it. If invalid, the LLM is re-prompted with an "illegal action" warning message. This is the primary approach evaluated in the paper.
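
The verifier loop can be sketched as follows; `llm_propose` is a hypothetical callable standing in for the LLM, and the prompt wording is illustrative, not the paper's:

```python
# Sketch of the verifier loop. llm_propose stands in for the LLM call and
# is_legal_action for the synthesized checker; both are illustrative.
def is_legal_action(obs, action):
    return action in obs["legal_moves"]

def verified_act(obs, llm_propose, max_retries=5):
    prompt = f"Board: {obs['board']}. Choose a move."
    for _ in range(max_retries):
        action = llm_propose(prompt)
        if is_legal_action(obs, action):
            return action
        # Re-prompt with an explicit illegal-action warning, as in the paper.
        prompt = (f"'{action}' is an illegal action. "
                  f"Board: {obs['board']}. Choose a legal move.")
    raise RuntimeError("no legal action proposed")

# Fake LLM that answers wrongly first, then correctly after the warning.
answers = iter(["e9", "e4"])
obs = {"board": "start", "legal_moves": {"e4", "d4"}}
print(verified_act(obs, lambda p: next(answers)))  # prints e4
```

Compared with the action-filter mode, the verifier only pays the cost of extra LLM calls when the first proposal is actually illegal.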

## Method, Card=Policy

### Harness-as-Policy

The most extreme case: code directly selects the next action — no LLM call needed at test time. Pure Python, near-zero inference cost, highest average performance.

## Training

### Training: Verified on 145 Games

## Training, Para=1

Training uses 10 parallel environments per iteration, rolling out up to 1,000 steps. Rollout terminates whenever an illegal move is made or code execution fails. The Critic consolidates up to 5 failed steps and feeds them to the Refiner to generate improved code. Heuristic weight is set to 1.0 for Thompson sampling. Training ends when the legal action success rate reaches 1.0, or on timeout. Gemini-2.5-Flash is used for all training.

## Training, Metric=Avg_Iter

### Avg. iterations

14.5

## Training, Metric=Games

### Games covered

145

## Training, Metric=Legal_Rate

### Legal action rate

100%

## Training, Metric=Fast_Conv

### Converge < 10 iter

19/32 games

## Training, Figure=2

Figure 2: Fraction of legal moves vs. number of code synthesis iterations for a selection of 6 games. Most games converge rapidly; complex games like Chess and Othello require more iterations.

## Training, Para=2

On average, training ends after 14.5 tree search iterations, while 19/32 evaluation games end in less than 10 iterations. The games requiring the most LLM calls to learn are GermanWhist-v0 (43 steps), Cryptarithm-v0 (45 steps), Chess-v0 (64 steps), and Othello-v0 (62 steps). AutoHarness achieves a 100% legal action success rate on all 145 games, as shown in Appendix Table 1.

## Evaluation

### Evaluation: A Smaller Model Beats a Larger One

## Evaluation, Para=1

Evaluation focuses on 16 one-player (1P) and 16 two-player (2P) games from TextArena. Three agents are compared: Gemini-2.5-Flash, Gemini-2.5-Pro, and Gemini-2.5-Flash+Harness (our method). The same optimized prompt is used in all experiments. For 1P games, 20 matches are run and average reward is used as the metric. For 2P games, 40 matches are run (split evenly between first/second player), with win/draw/loss rate as the metric.

## Evaluation, Sub=2P

### Two-Player Games

## Evaluation, Figure=3

Figure 3: Win/lose/draw rate of AutoHarness (Gemini-2.5-Flash+Harness) vs. Gemini-2.5-Pro for each of the 16 two-player games. Green = wins, gray = draws, red = losses.

## Evaluation, Note=2P

AutoHarness enables the smaller Gemini-2.5-Flash to win 9/16 two-player games against the much larger Gemini-2.5-Pro (overall win rate 56.3% vs. Pro's 38.2%). Against vanilla Gemini-2.5-Flash (no harness), the win rate rises to 64.8% (12/16 games).

## Evaluation, Sub=1P

### One-Player Games

## Evaluation, Figure=4

Figure 4: Average reward of AutoHarness (orange) vs. Gemini-2.5-Pro (blue) for each of the 16 one-player games. AutoHarness exceeds Pro in 8/16 games and ties in 5/16.

## Evaluation, Note=1P

Our approach achieves a higher reward than Gemini-2.5-Pro in 8/16 games, and ties in 5/16 games. Average reward: AutoHarness 0.745 vs. Gemini-2.5-Pro 0.707 vs. Gemini-2.5-Flash 0.673. Notable improvements on Cryptarithm-v0, RushHour-v0, PegJump-v0, and FifteenPuzzle-v0.

## Policy

### Harness-as-Policy: Zero Inference Cost, Highest Performance

## Policy, Para=1

As an extreme case, AutoHarness can learn the entire policy as code, dispensing with the need to use an LLM at test time entirely. The policy code uses primitive Python functions and standard libraries (e.g., numpy) — no LLM calls are needed during gameplay. This is evaluated on 16 one-player games (2P games require strategic opponent modeling, which is much harder to encode as pure code).

## Policy, Para=2

For training, the heuristic value is modified to include task reward: \(H = 0\) if an illegal action is taken, and \(H = 0.5 + 0.5r\) otherwise, where \(r \in [0, 1]\) is the environment reward available at the end of a trajectory. Training uses Gemini-2.5-Flash with up to 256 iterations. On average, training takes 89.4 iterations and achieves a heuristic value of 0.939.
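
In code form, the modified heuristic is a one-liner; note how any legal trajectory (H ≥ 0.5) always scores above any trajectory that took an illegal action (H = 0):

```python
def heuristic_value(illegal_action, reward):
    """H = 0 if an illegal action was taken, else H = 0.5 + 0.5 * r,
    where r in [0, 1] is the end-of-trajectory environment reward."""
    if illegal_action:
        return 0.0
    assert 0.0 <= reward <= 1.0
    return 0.5 + 0.5 * reward

print(heuristic_value(True, 0.8))   # 0.0  (illegal action dominates)
print(heuristic_value(False, 0.0))  # 0.5  (legal play, zero reward)
print(heuristic_value(False, 1.0))  # 1.0  (legal play, maximal reward)
```

The 0.5 floor for legal trajectories means the search first learns to stay legal, then optimizes task reward within the legal region.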

## Policy, Figure=5

Figure 5: Average reward of different agents across 16 TextArena 1P games. Our two methods (bold labels) achieve the highest performance.

## Policy, Panel=Cost

### Performance vs. Cost Comparison

## Appendix

### Appendix: Full Results

## Appendix, Accordion=Games

### All 145 TextArena Games — Learning Steps & Legal Action Rate

## Appendix, Table=Games, Intro

Table 1: All 145 TextArena games, with number of LLM calls needed to learn the harness, and the resulting legal action accuracy. Games marked with * are used for end-to-end agent evaluation.

## Appendix, Table=Games

\* Games used for end-to-end evaluation. All 145 games achieve a legal action rate of 1.0.

## Appendix, Sub=Reward

### Per-Game Average Reward (1P Games)

## Appendix, Sub=Legal

### Per-Game Legal Action Rate (1P Games)

## Conclusion, Para=1

We developed a novel approach for improving LLM agent performance by automatically synthesizing a code harness. Using a small number of iterative refinement rounds guided by Thompson sampling and environment feedback, Gemini-2.5-Flash can generate a robust harness for any given game environment — without any manual engineering.

## Conclusion, Point=1

### 100% legal action rate achieved across all 145 TextArena games

## Conclusion, Point=2

### Smaller Flash model beats larger Pro model — 56.3% win rate in 2P games

## Conclusion, Point=3

### Harness-as-Policy achieves reward 0.870, exceeding GPT-5.2-High at near-zero inference cost

## Conclusion, Future=1

### Future Directions

## Conclusion, Future=Item1

### Distill resulting domain-specific expert agents back into the base LLM, making the whole system recursively self-improving

## Conclusion, Future=Item2

### Build a library of reusable harnesses that can be shared across related game environments

## Conclusion, Future=Item3

### Apply the method to more challenging multimodal games such as Craftax and Terra Nova

## References, Ref=1

Chervonyi et al. (2025). Gold-medalist performance in solving olympiad geometry with AlphaGeometry2. JMLR, 26(241):1–39.

## References, Ref=2

Duan et al. (2024). GTBench: Uncovering the strategic reasoning limitations of LLMs via game-theoretic evaluations. arXiv [cs.CL].

## References, Ref=3

Guertler et al. (2025). TextArena. arXiv:2504.11442.

## References, Ref=4

Huang & Yang (2025). Winning gold at IMO 2025 with a model-agnostic verification-and-refinement pipeline. arXiv:2507.15855.

## References, Ref=5

Kaggle (2025). Kaggle Game Arena: A benchmarking platform for AI models. kaggle.com/game-arena.

## References, Ref=6

Kokel et al. (2025). ACPBench hard: Unrestrained reasoning about action, change, and planning. AAAI 2025 Workshop LM4Plan.

## References, Ref=7

Lehrach et al. (2025). Code world models for general game playing. arXiv:2510.04542.

## References, Ref=8

Li et al. (2022). Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.

## References, Ref=9

Liang et al. (2023). Code as policies: Language model programs for embodied control. In ICRA, pp. 9493–9500.

## References, Ref=10

Ma et al. (2024). Eureka: Human-level reward design via coding large language models. In ICLR 2024.

## References, Ref=11

Novikov et al. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131.

## References, Ref=12

Petrov et al. (2025). Proof or bluff? Evaluating LLMs on 2025 USA math olympiad. arXiv:2503.21934.

## References, Ref=13

Ruoss et al. (2024). LMAct: A benchmark for in-context imitation learning with long multimodal demonstrations. arxiv.org/abs/2412.01441.

## References, Ref=14

Shinn et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS, 36:8634–8652.

## References, Ref=15

Tang et al. (2024). Code repair with LLMs gives an exploration-exploitation tradeoff. NeurIPS, 37:117954–117996.

## References, Ref=16

Valmeekam et al. (2023a). On the planning abilities of large language models — a critical investigation. In NeurIPS.

## References, Ref=17

Valmeekam et al. (2023b). On the planning abilities of large language models — a critical investigation. NeurIPS, 36:75993–76005.

## References, Ref=18

Wang et al. (2023). Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.

## References, Ref=19

Wei et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837.
