Abstract
The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively—they are memoryless, condition only on scalar scores, or restrict feedback to short templates or summaries. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4× fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five evaluation models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2.
Introduction
What is a "Harness"?
In AI systems, a harness is the wrapper code around an LLM that controls what the model sees. It's not the model weights themselves—it's the plumbing: how you retrieve relevant documents, format the prompt, manage memory across turns, and structure the model's output.
Example: Imagine an AI customer support bot. The "harness" decides: do we look up the customer's order history before responding? How many previous messages do we include? Do we add a system prompt like "Be concise"? Changing these harness decisions can radically change how well the bot performs—even with the exact same underlying model.
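The support-bot example can be made concrete as code. The sketch below is purely illustrative (the class and field names are our assumptions, not the paper's interface): each field is one harness decision, and `build_prompt` shows how those decisions shape what the fixed model actually sees.

```python
from dataclasses import dataclass

# Hypothetical illustration: the "knobs" a support-bot harness controls,
# independent of the underlying model weights.
@dataclass
class SupportBotHarness:
    lookup_order_history: bool = True   # retrieve customer data first?
    history_turns: int = 5              # how many prior messages to include
    system_prompt: str = "Be concise."  # framing instructions

    def build_prompt(self, customer_msg: str, past_turns: list[str]) -> str:
        context = past_turns[-self.history_turns:] if self.history_turns else []
        parts = [self.system_prompt]
        if self.lookup_order_history:
            parts.append("[order history would be retrieved here]")
        parts.extend(context)
        parts.append(customer_msg)
        return "\n".join(parts)
```

Changing any one field changes the model's input distribution on every request, which is why two harnesses around the same weights can behave so differently.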
The paper shows that a 6× performance gap exists just from harness differences on the same model. Meta-Harness automates the search for the best harness.
Changing the harness around a fixed large language model can produce a 6× performance gap on the same benchmark. The harness—the code that determines what to store, retrieve, and show to the model—often matters as much as the model itself. Despite its importance, harness engineering remains largely manual: practitioners inspect failures, adjust heuristics, and iterate on a small number of designs.
Why Existing Methods Fail: The "Compressed Feedback" Problem
Most AI optimization methods send feedback as a brief summary—like a score or a short text note. This works for optimizing a single prompt, but fails for harness engineering because:
- Harnesses have long-range effects. A decision about what to retrieve at step 1 can affect whether the model succeeds at step 20. A score saying "62.9% pass rate" tells you the outcome but not why—and why is what you need to fix the harness.
- Execution traces contain the diagnosis. The full log of what the model saw, what it tried, and where it got stuck is often millions of tokens—too large for compressed summaries to capture faithfully.
Meta-Harness's insight: give the optimizer direct filesystem access to all this raw evidence, and let it decide what to read. This is why it uses a coding agent (which can run grep/cat commands) rather than a simple LLM prompt.
A natural starting point is recent work on text optimization, since harness engineering also involves iteratively improving text and code artifacts using feedback from prior attempts. However, these methods are poorly matched to harness engineering because they operate with short-horizon or heavily compressed feedback: some condition only on the current candidate, others rely primarily on scalar scores, and others restrict feedback to short templates or LLM-generated summaries. Across representative text optimizers, the available context per optimization step ranges from only 100 to 30,000 tokens (Table 1)—far below the diagnostic footprint of harness search.
We address this limitation with Meta-Harness, an agentic harness for optimizing harnesses via end-to-end search. Its proposer is a coding agent—a language-model-based system that can invoke developer tools and modify code. Its key design choice is to expose full history through a filesystem, enabling selective diagnosis of raw prior code and execution traces rather than optimization from compressed per-candidate summaries. In practice, the proposer reads a median of 82 files per iteration, referencing over 20 prior candidates per step. A single evaluation can produce up to 10,000,000 tokens of diagnostic information—roughly three orders of magnitude beyond prior text optimization settings.
Table 1: Text Optimization Method Comparison
| Method | History | Log Content | MTok/iter |
|---|---|---|---|
| OPRO | Window | past (solution, score) pairs | 0.002 |
| TextGrad | Last | textual feedback on current artifact | 0.015 |
| AlphaEvolve | Window | program database + eval. scores | 0.022 |
| GEPA | Summary | reflective feedback from rollout traces | 0.008 |
| Feedback Descent | Summary | comparison + textual feedback | 0.012 |
| TTT-Discover | Window | prev. solution fragment | 0.026 |
| Meta-Harness | Full | all logs and scores | 10.0 |
We evaluate Meta-Harness on online text classification, mathematical reasoning, and agentic coding—demonstrating that richer access to prior experience enables automated harness engineering across diverse domains.
Meta-Harness: A Harness for Optimizing Harnesses
A harness is a stateful program that wraps a language model and determines what context the model sees at each step. For a harness \(H\) and task instance \(x \sim \mathcal{X}'\), we execute a rollout trajectory \(T \sim P_M(H, x)\). The harness constructs prompts for \(M\), the model responds, and the harness updates its state after each interaction.
The objective is to find the harness that maximizes the expected final reward: \(H^* = \arg\max_H \mathbb{E}_{x \sim \mathcal{X}', T \sim P_M(H,x)} r(T, x)\). When multiple objectives are relevant—such as accuracy and context cost—we evaluate candidates under Pareto dominance and report the resulting frontier.
The Formal Objective — Plain Language
The mathematical objective H* = argmax E[r(T, x)] is saying: find the harness code H that makes average task performance as high as possible.
- H = harness (the code being optimized)
- M = the fixed LLM (e.g., GPT-OSS-120B). Its weights never change.
- x ~ X' = a random task (e.g., "classify this legal document")
- T ~ P_M(H, x) = the trajectory (prompts, model responses, tool calls) from running harness H on task x
- r(T, x) = the reward/score for that trajectory (e.g., did the classification match ground truth?)
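The symbols above map onto a simple control loop. This sketch of one rollout T ~ P_M(H, x) uses method names (`init_state`, `build_prompt`, `update`, `final_answer`) that are our assumptions for illustration, not the paper's actual harness interface:

```python
# Minimal sketch of one rollout T ~ P_M(H, x). The harness H builds each
# prompt, the fixed model M responds, and H updates its state; the reward
# r(T, x) would then be computed on the returned trajectory and answer.
def rollout(harness, model, task, max_steps=20):
    trajectory = []
    state = harness.init_state(task)
    for _ in range(max_steps):
        prompt = harness.build_prompt(state)   # what the model sees this step
        response = model(prompt)               # fixed weights M
        trajectory.append((prompt, response))
        state, done = harness.update(state, response)
        if done:
            break
    return trajectory, harness.final_answer(state)
```

Note that the model is called as a black box: only `harness` is searched over, which is exactly what makes the objective an optimization over code rather than over weights.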
Pareto dominance: When optimizing two objectives simultaneously (accuracy AND low context token usage), there's no single "best" solution—just a frontier where you can't improve one metric without worsening the other. Meta-Harness maps this entire frontier automatically.
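Pareto dominance is easy to state in code. The helper below (an illustrative sketch, not the paper's implementation) keeps every harness for which no other harness is at least as accurate *and* at least as cheap, with one strict improvement:

```python
def pareto_frontier(candidates):
    """Keep candidates not dominated on (accuracy up, context cost down).

    `candidates` maps name -> (accuracy, context_tokens).
    """
    frontier = {}
    for name, (acc, ctx) in candidates.items():
        dominated = any(
            a >= acc and c <= ctx and (a > acc or c < ctx)
            for other, (a, c) in candidates.items()
            if other != name
        )
        if not dominated:
            frontier[name] = (acc, ctx)
    return frontier
```

Run on the Table 2 numbers, for instance, ACE and MCE drop out (Meta-Harness is both more accurate and cheaper), while the cheap few-shot points survive as low-cost frontier members.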
The Search Loop
Meta-Harness uses a single coding-agent proposer (Claude Code with Opus-4.6) with access to a growing filesystem that serves as its feedback channel. Unlike prior systems that externalize the improvement logic in a hand-designed search loop, Meta-Harness delegates diagnosis and proposal to the coding agent itself: it decides which prior artifacts to inspect, which failure modes to address, and whether to make a local edit or a more substantial rewrite.
Each evaluated harness contributes a directory containing its source code, scores, and execution traces. The filesystem is typically far larger than the proposer's context window, so the proposer queries it through terminal tools such as grep and cat rather than ingesting it as a single prompt. In our most demanding setting, the proposer reads a median of 82 files per iteration, referencing over 20 prior candidates per step.
Why Filesystem Access Is the Key Innovation
Most optimization systems have a fixed "window" of what the optimizer can see. Meta-Harness stores everything in a filesystem and lets the coding agent decide what to read.
What the filesystem contains per harness:
- The source code of the harness
- All execution traces—the complete log of every LLM call, tool invocation, and state update during evaluation
- The evaluation score
This enables real causal reasoning: "My last two harnesses both regressed. Let me check their traces... both had a cleanup directive in the prompt. Let me isolate that variable." This is exactly what a human engineer would do—inspect failure cases, form a hypothesis, test it. Meta-Harness automates that process.
```
# Input: tasks X', LLM M, proposer P, iterations N
Initialize population H_0           # initial set of valid harnesses
Initialize filesystem D ← ∅         # stores code, scores, traces

for H ∈ H_0:
    E_H ← Evaluate(H, M, X')
    D ← D ∪ {(H, E_H)}

for t = 1 ... N:
    # Proposer P queries filesystem D,
    # inspecting prior harnesses, scores, and traces
    {H_1, ..., H_k} ← P.propose(D)
    for H ∈ {H_1, ..., H_k}:
        if H passes interface validation:
            D ← D ∪ {(H, Evaluate(H, M, X'))}

return Pareto frontier of harnesses in D
```
In practice: proposer P = Claude Code with Opus-4.6; typical run ≈ 60 harnesses over 20 iterations; single evaluation can produce up to 10,000,000 tokens of diagnostic information.
Filesystem as Feedback Channel
Instead of compressed summaries, the proposer accesses raw code, execution traces, and scores via grep/cat. The agent decides what to inspect—enabling selective diagnosis of root causes rather than optimization from lossy summaries.
Code-Space Search
Each harness is a full Python program. Small changes to retrieval, memory, or prompt-construction logic can affect behavior many reasoning steps later. Coding models naturally bias toward coherent algorithms rather than brittle hard-coded solutions.
Emergent Strategy
No fixed scaffold, archive, or persistent memory mechanism. The proposer often starts from a strong prior harness—an emergent strategy, not a hardcoded rule. The search automatically improves as coding agents become more capable.
Experiment 1: Online Text Classification
We follow the online text classification setup: an LLM receives labeled examples one at a time, updates its memory, and is evaluated on a held-out test set. Using GPT-OSS-120B as the classifier, we run 20 evolution iterations with 2 candidates per iteration (40 harnesses total), initialized from zero-shot, few-shot, ACE, and MCE baselines.
Meta-Harness Outperforms All Baselines
Table 2: Test-Set Accuracy Across Datasets
| Harness | USPTO | S2D | LawBench | Avg Acc | Ctx (K) |
|---|---|---|---|---|---|
| Zero-Shot | 12.0 | 63.2 | 7.0 | 27.4 | 0 |
| Few-Shot (8) | 14.0 | 67.9 | 21.0 | 34.3 | 2.0 |
| Few-Shot (32) | 13.0 | 72.2 | 21.0 | 35.4 | 7.9 |
| Few-Shot (all) | 15.0 | 78.3 | 29.0 | 40.8 | 12.3 |
| MCE | 14.0 | 83.0 | 23.0 | 40.0 | 28.5 |
| ACE | 16.0 | 77.8 | 29.0 | 40.9 | 50.8 |
| Meta-Harness | 14.0 | 86.8 | 45.0 | 48.6 | 11.4 |
Why the Ablation Result Is Surprising
Table 3 shows something striking: adding LLM-generated summaries of execution traces (Scores + Summary) actually performs worse than scores alone on Best Acc (38.7% vs 41.3%). Why?
The likely explanation: summarization loses diagnostic information. When an LLM summarizes "the harness failed because of poor retrieval", it compresses away the specific failure patterns visible in raw traces. The proposer can no longer see which particular inputs caused failures, what the model actually output, or whether the failure mode was consistent.
This is counterintuitive: adding information in compressed form can be worse than adding none at all, while the same information left raw helps, at least for systematic optimization tasks like harness engineering.
Table 3: Ablation — What Information Matters?
| Method | Scores | Code | Summaries | Traces | Median Acc | Best Acc | >ZS |
|---|---|---|---|---|---|---|---|
| Scores Only | ✓ | ✓ | ✗ | ✗ | 34.6 | 41.3 | 26 |
| Scores + Summary | ✓ | ✓ | ✓ | ✗ | 34.9 | 38.7 | 23 |
| Meta-Harness (full) | ✓ | ✓ | — | ✓ | 50.0 | 56.7 | 39 |
Accuracy–Context Trade-Off
Because Meta-Harness performs free-form optimization over harness code, it can express a joint preference for both accuracy and context cost. The proposer discovers harnesses across a broad range of the Pareto frontier—yielding a smooth accuracy-context curve. This allows trading additional context for higher accuracy in a controlled way, rather than committing to a single hand-designed operating point.
Table 4: Comparison vs. Text Optimizers (Search Set)
| Method | Median | Best |
|---|---|---|
| GEPA | 32.6 | 40.2 |
| Best-of-N | 34.0 | 44.2 |
| OpenEvolve | 39.1 | 43.3 |
| TTT-Discover | 34.1 | 45.6 |
| Meta-Harness | 50.0 | 56.7 |
What "OOD Generalization" Tells Us
OOD = Out-of-Distribution: these 9 datasets were never seen during the search process. Meta-Harness scoring 73.1% average (vs ACE's 70.2%) on these unseen tasks confirms the discovered harness learned generally effective strategies, not search-set-specific tricks. Notably, adding more few-shot examples beyond 32 hurts performance on 7/9 tasks—suggesting naive context scaling is counterproductive without a smart retrieval strategy.
Table 5: Out-of-Distribution Generalization (9 Unseen Datasets)
| Harness | SciC | FINER | Amz5 | FPB | GoEmo | Bank77 | News | SciT | TwHate | Avg Acc | Ctx ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 32.7 | 56.0 | 52.7 | 90.0 | 42.0 | 80.7 | 84.7 | 89.3 | 75.3 | 67.0 | — |
| Few-shot (8) | 34.0 | 63.0 | 54.0 | 90.0 | 44.0 | 82.7 | 84.7 | 91.3 | 76.7 | 68.9 | 2.2 |
| Few-shot (32) | 38.7 | 62.0 | 53.3 | 90.7 | 43.3 | 86.0 | 85.3 | 90.7 | 76.7 | 69.6 | 5.2 |
| Few-shot (all) | 35.3 | 61.0 | 50.0 | 93.3 | 42.7 | 80.7 | 84.0 | 90.0 | 76.7 | 68.2 | 7.4 |
| ACE | 40.7 | 74.0 | 48.0 | 96.7 | 44.0 | 83.3 | 86.0 | 90.7 | 68.7 | 70.2 | 11.7 |
| Meta-Harness | 53.3 | 67.0 | 60.0 | 94.0 | 46.0 | 82.7 | 86.7 | 91.3 | 77.3 | 73.1 | 7.3 |
Experiment 2: Retrieval-Augmented Math Reasoning
We study olympiad-level math solving augmented with retrieval from a corpus of 500,000+ solved problems. Naive retrieval rarely works on math benchmarks; success depends on discovering the right retrieval policy. Rather than hand-designing that policy, we give Meta-Harness a hard set of olympiad problems and allow the retrieval behavior to emerge from search.
We run 40 iterations over a 250-problem search set (OlympiadBench + Omni-MATH hard), producing 109 candidate retrieval harnesses. A single harness is selected and evaluated on 200 previously unseen IMO-level problems from IMO-AnswerBench, IMO-ProofBench, and ArXivMath—plus four held-out models never seen during search.
BM25 Retrieval and pass@1 Explained
BM25 is a classic keyword-based search algorithm (backbone of many search engines). For math problems, it finds similar previously-solved problems by matching mathematical keywords—terms like "combinatorics", "modular arithmetic", "convex polygon".
Naive BM25 often retrieves the wrong problems. Meta-Harness's solution: a 4-route lexical router that identifies the problem type and applies different BM25 parameters per route—different k-values, deduplication thresholds, and reranking rules. All discovered automatically across 40 search iterations.
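For reference, here is a minimal self-contained BM25 scorer in the standard Okapi form (with the usual k1 and b defaults). It illustrates the kind of lexical matching the discovered router builds on; it is not the paper's code, and the discovered harness layers routing, deduplication, and reranking on top of this:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `query` against each document in `docs`."""
    docs_tok = [d.lower().split() for d in docs]
    n = len(docs_tok)
    avgdl = sum(len(d) for d in docs_tok) / n  # average document length
    scores = []
    for toks in docs_tok:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs_tok if term in d)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            f = tf[term]
            # term frequency saturation (k1) and length normalization (b)
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores
```

On math problems, the "documents" would be previously solved problems, and keyword overlap (e.g. "pigeonhole", "modular") drives retrieval, which is why per-route parameter tuning matters.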
pass@1: the probability that the model solves a problem on the first attempt (averaged over 3 samples). Standard evaluation metric for competition math.
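The pass@1 estimate from multiple samples is just an average of per-problem solve rates, mirroring the 3-sample setup described above:

```python
def pass_at_1(results):
    """pass@1 from repeated sampling.

    `results`: one list per problem, each a list of booleans
    (did sample i solve the problem?). Returns the mean per-problem
    solve rate, an unbiased estimate of single-attempt accuracy.
    """
    per_problem = [sum(r) / len(r) for r in results]
    return sum(per_problem) / len(per_problem)
```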
Table 6: Retrieval-Augmented Math Results (200 IMO-Level Problems)
| Method | GPT-5.4n | GPT-5.4m | Gem-3.1FL | Gem-3F | GPT-20B | Avg. |
|---|---|---|---|---|---|---|
| No Retriever | 23.0 | 28.8 | 28.6 | 42.6 | 47.6 | 34.1 |
| Dense Retrieval (k=1) | 27.1 (+4.1) | 24.5 (-4.3) | 31.3 (+2.7) | 42.3 (-0.3) | 46.9 (-0.7) | 34.4 (+0.3) |
| Dense Retrieval (k=5) | 31.1 (+8.1) | 28.3 (-0.5) | 37.1 (+8.5) | 47.2 (+4.6) | 46.7 (-0.9) | 38.1 (+4.0) |
| Random Few-shot | 23.1 (+0.1) | 24.5 (-4.3) | 31.0 (+2.4) | 40.4 (-2.2) | 41.8 (-5.8) | 32.2 (-1.9) |
| BM25 Retrieval | 30.2 (+7.2) | 29.2 (+0.4) | 32.8 (+4.2) | 46.6 (+4.0) | 48.9 (+1.3) | 37.5 (+3.4) |
| Meta-Harness | 31.7 (+8.7) | 30.4 (+1.6) | 34.9 (+6.3) | 46.3 (+3.7) | 50.6 (+3.0) | 38.8 (+4.7) |
Experiment 3: Agentic Coding on TerminalBench-2
TerminalBench-2 evaluates LLM agents on 89 challenging tasks requiring long-horizon, fully autonomous execution under complex dependencies. Harness choice has a large effect on performance. We initialize search from two strong open baselines—Terminus 2 and Terminus-KIRA—and run 10 search iterations. We manually verified that evolved harnesses contain no task-specific string leakage.
Table 7: TerminalBench-2 Leaderboard
| Agent | Auto | Pass Rate (%) |
|---|---|---|
| **Claude Opus 4.6** | | |
| Claude Code | ✗ | 58.0% |
| Terminus 2 | ✗ | 62.9% |
| Mux | ✗ | 66.5% |
| Droid | ✗ | 69.9% |
| TongAgents | ✗ | 71.9% |
| MAYA-V2 | ✗ | 72.1% |
| Terminus-KIRA | ✗ | 74.7% |
| Capy | ✗ | 75.3% |
| Meta-Harness AUTO | ✓ | 76.4% |
| ForgeCode | ✗ | 81.8% |
| **Claude Haiku 4.5** | | |
| OpenHands | ✗ | 13.9% |
| Claude Code | ✗ | 27.5% |
| Terminus 2 | ✗ | 28.3% |
| Mini-SWE-Agent | ✗ | 29.8% |
| Terminus-KIRA | ✗ | 33.7% |
| Goose | ✗ | 35.5% |
| Meta-Harness AUTO 🏆 #1 | ✓ | 37.6% |
What Is TerminalBench-2?
TerminalBench-2 is a benchmark where an AI agent must complete 89 challenging real-world software tasks autonomously in a command-line environment: compiling code with complex dependencies, setting up services, debugging multi-file projects, and other long-horizon operations requiring domain knowledge and multi-step reasoning.
It's actively contested—multiple industry teams directly optimize their systems for it. That an automated search method can rank #1 among Haiku 4.5 agents is notable because it demonstrates Meta-Harness can find improvements even in a highly competitive frontier.
"Auto" column (✓/✗): whether the harness was discovered automatically (Meta-Harness) or hand-engineered by human practitioners.
Causal Reasoning from Search History
The search trajectory reveals how Meta-Harness achieves its gains. Early iterations combined structural fixes with prompt-template edits and both regressed. By iteration 3, the proposer explicitly hypothesized that regressions were confounded by the shared prompt intervention, isolated the structural changes, and tested them separately. After six regressions, it pivoted to a purely additive approach—adding environment information before the first LLM call without touching the completion flow.
> "All 6 prior iterations regressed from the 64.4% baseline because they modified the completion flow, prompt template, or observation processing. evo_env_bootstrap takes a different approach—purely additive. It gathers an environment snapshot before the first LLM call and appends it to the initial prompt. No other methods are changed."
Causal Reasoning in Action — Iteration 7
The search trajectory shows systematic debugging, not random search:
- Iterations 1–2: Both candidates bundled structural fixes with prompt changes. Both regressed.
- Iteration 3: Proposer examined both failure traces, noticed the shared prompt modification as the confound, tested only the structural fixes.
- Iterations 4–6: Still unable to fix completion logic safely. Lesson learned: "touching the completion flow is high-risk."
- Iteration 7: Shifted strategy—don't modify anything, just add an environment snapshot before the first LLM call. Purely additive. No regression risk. This won.
This mirrors how expert engineers debug: form hypotheses, isolate variables, accumulate evidence, pivot when a class of interventions proves fragile.
Inside the Discovered Harnesses
Meta-Harness discovers executable inference-time procedures—structured, domain-specific policies with nontrivial control flow. Here we examine the two text classification harness variants that represent the Pareto frontier extremes, plus generalization evidence.
Table 9: Pareto Frontier of Discovered Text Classification Harnesses
| Variant | USPTO ↑ | Symptom ↑ | LawBench ↑ | Avg ↑ | Ctx ↓ |
|---|---|---|---|---|---|
| Draft Verification | 18.0 | 85.4 | 17.0 | 40.1 | 5.4 |
| Error-Annotated | 9.0 | 87.7 | 24.0 | 40.2 | 22.3 |
| CoT Replay | 13.0 | 88.2 | 25.0 | 42.1 | 23.3 |
| Cluster Coverage | 12.0 | 86.8 | 33.0 | 43.9 | 31.2 |
| Cascade Retrieval | 12.0 | 86.8 | 36.0 | 44.9 | 39.2 |
| Label-Primed Query | 14.0 | 86.8 | 45.0 | 48.6 | 11.4 |
Key Findings
10× Faster, 10+ Points Better
On text classification, Meta-Harness matches the best prior text optimizers (OpenEvolve, TTT-Discover) with 10× fewer evaluations, then surpasses their final accuracy by more than 10 points. Its median candidate outperforms the best candidate found by either ablation.
Cross-Model Transfer on IMO Math
A single discovered retrieval harness improves accuracy by 4.7 points on average across five evaluation models on 200 IMO-level problems. The harness was selected based only on GPT-OSS-20B performance, yet transfers to four held-out models: GPT-5.4-nano, GPT-5.4-mini, Gemini-3.1-Flash-Lite, and Gemini-3-Flash.
#1 Agent on TerminalBench-2 (Haiku 4.5)
Meta-Harness automatically discovers harnesses that rank #1 among all Haiku 4.5 agents (37.6%) and #2 among all Opus 4.6 agents (76.4%) on TerminalBench-2—an actively contested benchmark where multiple teams directly optimize for it.
Discussion
Out-of-Distribution Generalization
Discovered harnesses generalize to unseen classification datasets (+2.9 pts avg on 9 OOD tasks) and to unseen base models in math (+4.7 pts across 5 held-out models). This suggests the discovered strategies capture generally effective context-management principles.
Fast Wall-Clock Time
A search run completes in a few hours, yet produces readable, transferable strategies reusable across current and future models. The harnesses are full Python programs—interpretable and modifiable by engineers.
Inspectable Overfitting
Code-space overfitting (brittle if-chains, hard-coded class mappings) is visible on inspection—unlike weight-space overfitting. This makes it easier to audit whether a discovered harness is genuinely general or merely memorizing.
Richer Prior Experience Is the Key
The main advantage is not just search over code, but search with selective access to prior diagnostic experience. The proposer can inspect raw code, execution traces, and prior failures, then form and test causal hypotheses about what to change.
Our findings reflect a recurring pattern in machine learning: once a search space becomes accessible, stronger general-purpose agents can outperform hand-engineered solutions. A natural next step is to co-evolve harness and model weights—letting the strategy shape what the model learns and vice versa.
The "Bitter Lesson" Connection
The paper references Rich Sutton's "Bitter Lesson" (2019): the recurring pattern in AI where general methods leveraging computation eventually outperform hand-crafted solutions. Chess engines → Go programs → protein folding → now harness engineering. Meta-Harness fits this pattern: automated search + strong coding agents outperforms years of human harness engineering expertise. The key enabling factor is the recent maturation of coding agents capable of navigating large codebases autonomously.
Limitation: our experiments demonstrate harness search with one particularly strong coding-agent proposer (Claude Code with Opus-4.6). How the effect varies across proposer agents and weaker models remains for future work.
Conclusion
Meta-Harness shows that automated harness engineering is practical and effective across diverse domains. By giving a coding-agent proposer selective access to the source code, execution traces, and evaluation scores of all prior candidates through a shared filesystem, Meta-Harness can discover harnesses that outperform hand-engineered baselines on text classification, math reasoning, and agentic coding—while remaining readable, transferable, and efficiently discovered.
Together, these results show that richer access to prior experience can enable automated harness engineering.
Acknowledgements
We thank KRAFTON AI for providing API credit support. This work is supported by OpenAI, KFAS, and Schmidt Sciences AI2050. We thank Anikait Singh and Jubayer Ibn Hamid for their valuable feedback and suggestions, and Sienna J. Lee for patiently listening to YL's half-formed thoughts during the early stages of this work.
References
- [1] Agrawal et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv:2507.19457, 2025.
- [2] Akyürek et al. What learning algorithm is in-context learning? Investigations with linear models. arXiv:2211.15661, 2023.
- [3] Andrychowicz et al. Learning to learn by gradient descent by gradient descent. NeurIPS, 2016.
- [4] Anthropic. Claude code: An agentic coding tool. https://www.anthropic.com/claude-code, 2025.
- [5] Anthropic and community contributors. agentskills/agentskills. GitHub repository, 2026.
- [6] Balunović et al. Matharena: Evaluating LLMs on uncontaminated math competitions. 2025.
- [7] Barbieri et al. TweetEval: Unified benchmark and comparative evaluation for tweet classification. 2020.
- [8] Beurer-Kellner et al. Prompting is programming: A query language for LLMs. PLDI, 2023.
- [9] Böckeler. Harness engineering. martinfowler.com, March 2026.
- [10] Bölük. I improved 15 LLMs at coding in one afternoon. only the harness changed. 2026.
- [11] Casanueva et al. Efficient intent detection with dual sentence encoders. arXiv:2003.04807, 2020.
- [12] Cemri et al. AdaEvolve: Adaptive LLM driven zeroth-order optimization. arXiv:2602.20133, 2026.
- [13] Chase. LangChain. GitHub, 2022.
- [14] Cohan et al. Structural scaffolds for citation intent classification. arXiv:1904.01608, 2019.
- [15] Demszky et al. GoEmotions: A dataset of fine-grained emotions. arXiv:2005.00547, 2020.
- [16] Fei et al. LawBench: Benchmarking legal knowledge of LLMs. EMNLP, 2024.
- [17] Finn et al. Model-agnostic meta-learning for fast adaptation of deep networks. ICML, 2017.
- [18] ForgeCode. Benchmarks don't matter. 2025.
- [19] Gretel AI. Symptom to diagnosis dataset. HuggingFace, 2023.
- [20] Hu et al. Automated design of agentic systems. ICLR, 2025.
- [21] Young. Effective harnesses for long-running agents. Anthropic Engineering Blog, 2025.
- [22] Keung et al. The multilingual Amazon reviews corpus. arXiv:2010.02573, 2020.
- [23] Khattab et al. DSPy: Compiling declarative LM calls into self-improving pipelines. arXiv:2310.03714, 2023.
- [24] Khot et al. SciTail: A textual entailment dataset from science question answering. AAAI, 2018.
- [25] KRAFTON AI and Ludo Robotics. Terminus-KIRA: Boosting frontier model performance on TerminalBench. GitHub, 2026.
- [26] Lee et al. Feedback Descent: Open-ended text optimization via pairwise comparison. arXiv:2511.07919, 2025.
- [27] Lehman et al. Evolution through large models. arXiv:2206.08896, 2022.
- [28] Lewis et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 2020.
- [29] Loukas et al. FINER: Financial numeric entity recognition for XBRL tagging. ACL, 2022.
- [30] Luong et al. Towards robust mathematical reasoning. EMNLP, 2025.
- [31] Madaan et al. Self-Refine: Iterative refinement with self-feedback. NeurIPS, 2023.
- [32] Malo et al. Good debt or bad debt: Detecting semantic orientations in economic texts. arXiv:1307.5336, 2013.
- [33] Merrill et al. TerminalBench: Benchmarking agents on hard, realistic tasks in CLI. arXiv:2601.11868, 2026.
- [34] Nichols. How we scored #1 on terminal-bench (52%). Warp blog, 2025.
- [35] Novikov et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131, 2025.
- [36] OpenAI. Harness engineering: leveraging Codex in an agent-first world. OpenAI Blog, 2026.
- [37] Packer et al. MemGPT: Towards LLMs as operating systems. 2023.
- [38] Pryzant et al. Automatic prompt optimization with "gradient descent" and beam search. arXiv:2305.03495, 2023.
- [39] Romera-Paredes et al. Mathematical discoveries from program search with LLMs. Nature, 2024.
- [40] Schmidhuber. A neural network that embeds its own meta-levels. IEEE ICNN, 1993.
- [41] Schneider et al. What's what: The (nearly) definitive guide to reaction role assignment. JCIM, 2016.
- [42] Shakya et al. Adaptive retrieval helps reasoning in LLMs – but mostly if it's not used. arXiv:2602.07213, 2026.
- [43] Sharma. OpenEvolve: an open-source evolutionary coding agent. GitHub, 2025.
- [44] Snell et al. Prototypical networks for few-shot learning. NeurIPS, 2017.
- [45] Sutton. The bitter lesson. 2019.
- [46] Thrun and Pratt. Learning to learn: Introduction and overview. Springer, 1998.
- [47] Tian et al. SWE-Bench Mobile: Can LLM agents develop industry-level mobile applications? arXiv, 2026.
- [48] Trivedi et al. Interleaving retrieval with chain-of-thought reasoning. arXiv:2212.10509, 2023.
- [49] Xiao et al. RAR-B: Reasoning as retrieval benchmark. arXiv:2404.06347, 2024.
- [50] Xiong et al. Learning to continually learn via meta-learning agentic memory designs. OpenReview, 2026.
- [51] Yang et al. Large language models as optimizers (OPRO). ICLR, 2023.
- [52] Ye et al. Meta context engineering via agentic skill evolution (MCE). arXiv:2601.21557, 2026.
- [53] Yuksekgonul et al. TextGrad: Automatic "differentiation" via text. arXiv:2406.07496, 2024.
- [54] Yuksekgonul et al. Learning to discover at test time (TTT-Discover). arXiv:2601.16175, 2026.
- [56] Zhang et al. Recursive language models. arXiv:2512.24601, 2026.
- [57] Zhang et al. MemEvolve: Meta-evolution of agent memory systems. arXiv:2512.18746, 2025.
- [58] Zhang et al. AFlow: Automating agentic workflow generation. arXiv:2410.10762, 2025.
- [59] Zhang et al. Agentic context engineering (ACE). arXiv:2510.04618, 2025.
- [60] Zhang et al. Character-level convolutional networks for text classification. arXiv:1509.01626, 2016.