Abstract
The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively—they are memoryless, condition only on scalar scores, or restrict feedback to short templates or summaries. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4× fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five evaluation models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2.
Introduction
What is a "Harness"?
In AI systems, a harness is the wrapper code around an LLM that controls what the model sees. It's not the model weights themselves—it's the plumbing: how you retrieve relevant documents, format the prompt, manage memory across turns, and structure the model's output.
Example: Imagine an AI customer support bot. The "harness" decides: do we look up the customer's order history before responding? How many previous messages do we include? Do we add a system prompt like "Be concise"? Changing these harness decisions can radically change how well the bot performs—even with the exact same underlying model.
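The support-bot example can be made concrete as code. The sketch below is purely illustrative (the class and field names are our assumptions, not the paper's interface): each field is one harness decision, and `build_prompt` shows how those decisions shape what the fixed model actually sees.

```python
from dataclasses import dataclass

# Hypothetical illustration: the "knobs" a support-bot harness controls,
# independent of the underlying model weights.
@dataclass
class SupportBotHarness:
    lookup_order_history: bool = True   # retrieve customer data first?
    history_turns: int = 5              # how many prior messages to include
    system_prompt: str = "Be concise."  # framing instructions

    def build_prompt(self, customer_msg: str, past_turns: list[str]) -> str:
        context = past_turns[-self.history_turns:] if self.history_turns else []
        parts = [self.system_prompt]
        if self.lookup_order_history:
            parts.append("[order history would be retrieved here]")
        parts.extend(context)
        parts.append(customer_msg)
        return "\n".join(parts)
```

Changing any one field changes the model's input distribution on every request, which is why two harnesses around the same weights can behave so differently.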
The paper shows that a 6× performance gap exists just from harness differences on the same model. Meta-Harness automates the search for the best harness.
Changing the harness around a fixed large language model can produce a 6× performance gap on the same benchmark. The harness—the code that determines what to store, retrieve, and show to the model—often matters as much as the model itself. Despite its importance, harness engineering remains largely manual: practitioners inspect failures, adjust heuristics, and iterate on a small number of designs.
Why Existing Methods Fail: The "Compressed Feedback" Problem
Most AI optimization methods send feedback as a brief summary—like a score or a short text note. This works for optimizing a single prompt, but fails for harness engineering because:
- Harnesses have long-range effects. A decision about what to retrieve at step 1 can affect whether the model succeeds at step 20. A score saying "62.9% pass rate" tells you the outcome but not why—and why is what you need to fix the harness.
- Execution traces contain the diagnosis. The full log of what the model saw, what it tried, and where it got stuck is often millions of tokens—too large for compressed summaries to capture faithfully.
Meta-Harness's insight: give the optimizer direct filesystem access to all this raw evidence, and let it decide what to read. This is why it uses a coding agent (which can run grep/cat commands) rather than a simple LLM prompt.
A natural starting point is recent work on text optimization, since harness engineering also involves iteratively improving text and code artifacts using feedback from prior attempts. However, these methods are poorly matched to harness engineering because they operate with short-horizon or heavily compressed feedback: some condition only on the current candidate, others rely primarily on scalar scores, and others restrict feedback to short templates or LLM-generated summaries. Across representative text optimizers, the available context per optimization step ranges from only 100 to 30,000 tokens (Table 1)—far below the diagnostic footprint of harness search.
We address this limitation with Meta-Harness, an agentic harness for optimizing harnesses via end-to-end search. Its proposer is a coding agent—a language-model-based system that can invoke developer tools and modify code. Its key design choice is to expose full history through a filesystem, enabling selective diagnosis of raw prior code and execution traces rather than optimization from compressed per-candidate summaries. In practice, the proposer reads a median of 82 files per iteration, referencing over 20 prior candidates per step. A single evaluation can produce up to 10,000,000 tokens of diagnostic information—roughly three orders of magnitude beyond prior text optimization settings.
Table 1: Text Optimization Method Comparison
| Method | History | Log Content | MTok/iter |
|---|---|---|---|
| OPRO | Window | past (solution, score) pairs | 0.002 |
| TextGrad | Last | textual feedback on current artifact | 0.015 |
| AlphaEvolve | Window | program database + eval. scores | 0.022 |
| GEPA | Summary | reflective feedback from rollout traces | 0.008 |
| Feedback Descent | Summary | comparison + textual feedback | 0.012 |
| TTT-Discover | Window | prev. solution fragment | 0.026 |
| Meta-Harness | Full | all logs and scores | 10.0 |
We evaluate Meta-Harness on online text classification, mathematical reasoning, and agentic coding—demonstrating that richer access to prior experience enables automated harness engineering across diverse domains.
Meta-Harness: A Harness for Optimizing Harnesses
A harness is a stateful program that wraps a language model and determines what context the model sees at each step. For a harness \(H\) and task instance \(x \sim \mathcal{X}'\), we execute a rollout trajectory \(T \sim P_M(H, x)\). The harness constructs prompts for \(M\), the model responds, and the harness updates its state after each interaction.
The objective is to find the harness that maximizes the expected final reward: \(H^* = \arg\max_H \mathbb{E}_{x \sim \mathcal{X}', T \sim P_M(H,x)} r(T, x)\). When multiple objectives are relevant—such as accuracy and context cost—we evaluate candidates under Pareto dominance and report the resulting frontier.
The Formal Objective — Plain Language
The mathematical objective H* = argmax E[r(T, x)] is saying: find the harness code H that makes average task performance as high as possible.
- H = harness (the code being optimized)
- M = the fixed LLM (e.g., GPT-OSS-120B). Its weights never change.
- x ~ X' = a random task (e.g., "classify this legal document")
- T ~ P_M(H, x) = the trajectory (prompts, model responses, tool calls) from running harness H on task x
- r(T, x) = the reward/score for that trajectory (e.g., did the classification match ground truth?)
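The symbols above map onto a simple control loop. This sketch of one rollout T ~ P_M(H, x) uses method names (`init_state`, `build_prompt`, `update`, `final_answer`) that are our assumptions for illustration, not the paper's actual harness interface:

```python
# Minimal sketch of one rollout T ~ P_M(H, x). The harness H builds each
# prompt, the fixed model M responds, and H updates its state; the reward
# r(T, x) would then be computed on the returned trajectory and answer.
def rollout(harness, model, task, max_steps=20):
    trajectory = []
    state = harness.init_state(task)
    for _ in range(max_steps):
        prompt = harness.build_prompt(state)   # what the model sees this step
        response = model(prompt)               # fixed weights M
        trajectory.append((prompt, response))
        state, done = harness.update(state, response)
        if done:
            break
    return trajectory, harness.final_answer(state)
```

Note that the model is called as a black box: only `harness` is searched over, which is exactly what makes the objective an optimization over code rather than over weights.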
Pareto dominance: When optimizing two objectives simultaneously (accuracy AND low context token usage), there's no single "best" solution—just a frontier where you can't improve one metric without worsening the other. Meta-Harness maps this entire frontier automatically.
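Pareto dominance is easy to state in code. The helper below (an illustrative sketch, not the paper's implementation) keeps every harness for which no other harness is at least as accurate *and* at least as cheap, with one strict improvement:

```python
def pareto_frontier(candidates):
    """Keep candidates not dominated on (accuracy up, context cost down).

    `candidates` maps name -> (accuracy, context_tokens).
    """
    frontier = {}
    for name, (acc, ctx) in candidates.items():
        dominated = any(
            a >= acc and c <= ctx and (a > acc or c < ctx)
            for other, (a, c) in candidates.items()
            if other != name
        )
        if not dominated:
            frontier[name] = (acc, ctx)
    return frontier
```

Run on the Table 2 numbers, for instance, ACE and MCE drop out (Meta-Harness is both more accurate and cheaper), while the cheap few-shot points survive as low-cost frontier members.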
The Search Loop
Meta-Harness uses a single coding-agent proposer (Claude Code with Opus-4.6) with access to a growing filesystem that serves as its feedback channel. Unlike prior systems that externalize the improvement logic in a hand-designed search loop, Meta-Harness delegates diagnosis and proposal to the coding agent itself: it decides which prior artifacts to inspect, which failure modes to address, and whether to make a local edit or a more substantial rewrite.
Each evaluated harness contributes a directory containing its source code, scores, and execution traces. The filesystem is typically far larger than the proposer's context window, so the proposer queries it through terminal tools such as grep and cat rather than ingesting it as a single prompt. In our most demanding setting, the proposer reads a median of 82 files per iteration, referencing over 20 prior candidates per step.
Why Filesystem Access Is the Key Innovation
Most optimization systems have a fixed "window" of what the optimizer can see. Meta-Harness stores everything in a filesystem and lets the coding agent decide what to read.
What the filesystem contains per harness:
- The source code of the harness
- All execution traces—the complete log of every LLM call, tool invocation, and state update during evaluation
- The evaluation score
This enables real causal reasoning: "My last two harnesses both regressed. Let me check their traces... both had a cleanup directive in the prompt. Let me isolate that variable." This is exactly what a human engineer would do—inspect failure cases, form a hypothesis, test it. Meta-Harness automates that process.
```
# Input: tasks X', LLM M, proposer P, iterations N
Initialize population H_0           # initial set of valid harnesses
Initialize filesystem D ← ∅         # stores code, scores, traces

for H ∈ H_0:
    E_H ← Evaluate(H, M, X')
    D ← D ∪ {(H, E_H)}

for t = 1 ... N:
    # Proposer P queries filesystem D,
    # inspecting prior harnesses, scores, and traces
    {H_1, ..., H_k} ← P.propose(D)
    for H ∈ {H_1, ..., H_k}:
        if H passes interface validation:
            D ← D ∪ {(H, Evaluate(H, M, X'))}

return Pareto frontier of harnesses in D
```
In practice: proposer P = Claude Code with Opus-4.6; typical run ≈ 60 harnesses over 20 iterations; single evaluation can produce up to 10,000,000 tokens of diagnostic information.
Filesystem as Feedback Channel
Instead of compressed summaries, the proposer accesses raw code, execution traces, and scores via grep/cat. The agent decides what to inspect—enabling selective diagnosis of root causes rather than optimization from lossy summaries.
Code-Space Search
Each harness is a full Python program. Small changes to retrieval, memory, or prompt-construction logic can affect behavior many reasoning steps later. Coding models naturally bias toward coherent algorithms rather than brittle hard-coded solutions.
Emergent Strategy
No fixed scaffold, archive, or persistent memory mechanism. The proposer often starts from a strong prior harness—an emergent strategy, not a hardcoded rule. The search automatically improves as coding agents become more capable.
Experiment 1: Online Text Classification
We follow the online text classification setup: an LLM receives labeled examples one at a time, updates its memory, and is evaluated on a held-out test set. Using GPT-OSS-120B as the classifier, we run 20 evolution iterations with 2 candidates per iteration (40 harnesses total), initialized from zero-shot, few-shot, ACE, and MCE baselines.
Meta-Harness Outperforms All Baselines
Table 2: Test-Set Accuracy Across Datasets
| Harness | USPTO | S2D | LawBench | Avg Acc | Ctx (K) |
|---|---|---|---|---|---|
| Zero-Shot | 12.0 | 63.2 | 7.0 | 27.4 | 0 |
| Few-Shot (8) | 14.0 | 67.9 | 21.0 | 34.3 | 2.0 |
| Few-Shot (32) | 13.0 | 72.2 | 21.0 | 35.4 | 7.9 |
| Few-Shot (all) | 15.0 | 78.3 | 29.0 | 40.8 | 12.3 |
| MCE | 14.0 | 83.0 | 23.0 | 40.0 | 28.5 |
| ACE | 16.0 | 77.8 | 29.0 | 40.9 | 50.8 |
| Meta-Harness | 14.0 | 86.8 | 45.0 | 48.6 | 11.4 |
Why the Ablation Result Is Surprising
Table 3 shows something striking: adding LLM-generated summaries of execution traces (Scores + Summary) actually performs worse than scores alone on Best Acc (38.7% vs 41.3%). Why?
The likely explanation: summarization loses diagnostic information. When an LLM summarizes "the harness failed because of poor retrieval", it compresses away the specific failure patterns visible in raw traces. The proposer can no longer see which particular inputs caused failures, what the model actually output, or whether the failure mode was consistent.
This is counterintuitive: adding information in compressed form can be worse than adding none at all, while the same information left raw helps, at least for systematic optimization tasks like harness engineering.
Table 3: Ablation — What Information Matters?
| Method | Scores | Code | Summaries | Traces | Median Acc | Best Acc | >ZS |
|---|---|---|---|---|---|---|---|
| Scores Only | ✓ | ✓ | ✗ | ✗ | 34.6 | 41.3 | 26 |
| Scores + Summary | ✓ | ✓ | ✓ | ✗ | 34.9 | 38.7 | 23 |
| Meta-Harness (full) | ✓ | ✓ | — | ✓ | 50.0 | 56.7 | 39 |
Accuracy–Context Trade-Off
Because Meta-Harness performs free-form optimization over harness code, it can express a joint preference for both accuracy and context cost. The proposer discovers harnesses across a broad range of the Pareto frontier—yielding a smooth accuracy-context curve. This allows trading additional context for higher accuracy in a controlled way, rather than committing to a single hand-designed operating point.
Table 4: Comparison vs. Text Optimizers (Search Set)
| Method | Median | Best |
|---|---|---|
| GEPA | 32.6 | 40.2 |
| Best-of-N | 34.0 | 44.2 |
| OpenEvolve | 39.1 | 43.3 |
| TTT-Discover | 34.1 | 45.6 |
| Meta-Harness | 50.0 | 56.7 |
What "OOD Generalization" Tells Us
OOD = Out-of-Distribution: these 9 datasets were never seen during the search process. Meta-Harness scoring 73.1% average (vs ACE's 70.2%) on these unseen tasks confirms the discovered harness learned generally effective strategies, not search-set-specific tricks. Notably, adding more few-shot examples beyond 32 hurts performance on 7/9 tasks—suggesting naive context scaling is counterproductive without a smart retrieval strategy.
Table 5: Out-of-Distribution Generalization (9 Unseen Datasets)
| Harness | SciC | FINER | Amz5 | FPB | GoEmo | Bank77 | News | SciT | TwHate | Avg Acc | Ctx ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 32.7 | 56.0 | 52.7 | 90.0 | 42.0 | 80.7 | 84.7 | 89.3 | 75.3 | 67.0 | — |
| Few-shot (8) | 34.0 | 63.0 | 54.0 | 90.0 | 44.0 | 82.7 | 84.7 | 91.3 | 76.7 | 68.9 | 2.2 |
| Few-shot (32) | 38.7 | 62.0 | 53.3 | 90.7 | 43.3 | 86.0 | 85.3 | 90.7 | 76.7 | 69.6 | 5.2 |
| Few-shot (all) | 35.3 | 61.0 | 50.0 | 93.3 | 42.7 | 80.7 | 84.0 | 90.0 | 76.7 | 68.2 | 7.4 |
| ACE | 40.7 | 74.0 | 48.0 | 96.7 | 44.0 | 83.3 | 86.0 | 90.7 | 68.7 | 70.2 | 11.7 |
| Meta-Harness | 53.3 | 67.0 | 60.0 | 94.0 | 46.0 | 82.7 | 86.7 | 91.3 | 77.3 | 73.1 | 7.3 |
Experiment 2: Retrieval-Augmented Math Reasoning
We study olympiad-level math solving augmented with retrieval from a corpus of 500,000+ solved problems. Naive retrieval rarely works on math benchmarks; success depends on discovering the right retrieval policy. Rather than hand-designing that policy, we give Meta-Harness a hard set of olympiad problems and allow the retrieval behavior to emerge from search.
We run 40 iterations over a 250-problem search set (OlympiadBench + Omni-MATH hard), producing 109 candidate retrieval harnesses. A single harness is selected and evaluated on 200 previously unseen IMO-level problems from IMO-AnswerBench, IMO-ProofBench, and ArXivMath—plus four held-out models never seen during search.
BM25 Retrieval and pass@1 Explained
BM25 is a classic keyword-based search algorithm (backbone of many search engines). For math problems, it finds similar previously-solved problems by matching mathematical keywords—terms like "combinatorics", "modular arithmetic", "convex polygon".
Naive BM25 often retrieves the wrong problems. Meta-Harness's solution: a 4-route lexical router that identifies the problem type and applies different BM25 parameters per route—different k-values, deduplication thresholds, and reranking rules. All discovered automatically across 40 search iterations.
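For reference, here is a minimal self-contained BM25 scorer in the standard Okapi form (with the usual k1 and b defaults). It illustrates the kind of lexical matching the discovered router builds on; it is not the paper's code, and the discovered harness layers routing, deduplication, and reranking on top of this:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `query` against each document in `docs`."""
    docs_tok = [d.lower().split() for d in docs]
    n = len(docs_tok)
    avgdl = sum(len(d) for d in docs_tok) / n  # average document length
    scores = []
    for toks in docs_tok:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs_tok if term in d)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            f = tf[term]
            # term frequency saturation (k1) and length normalization (b)
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores
```

On math problems, the "documents" would be previously solved problems, and keyword overlap (e.g. "pigeonhole", "modular") drives retrieval, which is why per-route parameter tuning matters.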
pass@1: the probability that the model solves a problem on the first attempt (averaged over 3 samples). Standard evaluation metric for competition math.
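The pass@1 estimate from multiple samples is just an average of per-problem solve rates, mirroring the 3-sample setup described above:

```python
def pass_at_1(results):
    """pass@1 from repeated sampling.

    `results`: one list per problem, each a list of booleans
    (did sample i solve the problem?). Returns the mean per-problem
    solve rate, an unbiased estimate of single-attempt accuracy.
    """
    per_problem = [sum(r) / len(r) for r in results]
    return sum(per_problem) / len(per_problem)
```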
Table 6: Retrieval-Augmented Math Results (200 IMO-Level Problems)
| Method | GPT-5.4n | GPT-5.4m | Gem-3.1FL | Gem-3F | GPT-20B | Avg. |
|---|---|---|---|---|---|---|
| No Retriever | 23.0 | 28.8 | 28.6 | 42.6 | 47.6 | 34.1 |
| Dense Retrieval (k=1) | 27.1 (+4.1) | 24.5 (-4.3) | 31.3 (+2.7) | 42.3 (-0.3) | 46.9 (-0.7) | 34.4 (+0.3) |
| Dense Retrieval (k=5) | 31.1 (+8.1) | 28.3 (-0.5) | 37.1 (+8.5) | 47.2 (+4.6) | 46.7 (-0.9) | 38.1 (+4.0) |
| Random Few-shot | 23.1 (+0.1) | 24.5 (-4.3) | 31.0 (+2.4) | 40.4 (-2.2) | 41.8 (-5.8) | 32.2 (-1.9) |
| BM25 Retrieval | 30.2 (+7.2) | 29.2 (+0.4) | 32.8 (+4.2) | 46.6 (+4.0) | 48.9 (+1.3) | 37.5 (+3.4) |
| Meta-Harness | 31.7 (+8.7) | 30.4 (+1.6) | 34.9 (+6.3) | 46.3 (+3.7) | 50.6 (+3.0) | 38.8 (+4.7) |
Experiment 3: Agentic Coding on TerminalBench-2
TerminalBench-2 evaluates LLM agents on 89 challenging tasks requiring long-horizon, fully autonomous execution under complex dependencies. Harness choice has a large effect on performance. We initialize search from two strong open baselines—Terminus 2 and Terminus-KIRA—and run 10 search iterations. We manually verified that evolved harnesses contain no task-specific string leakage.
Table 7: TerminalBench-2 Leaderboard
| Agent | Auto | Pass Rate (%) |
|---|---|---|
| **Claude Opus 4.6** | | |
| Claude Code | ✗ | 58.0% |
| Terminus 2 | ✗ | 62.9% |
| Mux | ✗ | 66.5% |
| Droid | ✗ | 69.9% |
| TongAgents | ✗ | 71.9% |
| MAYA-V2 | ✗ | 72.1% |
| Terminus-KIRA | ✗ | 74.7% |
| Capy | ✗ | 75.3% |
| Meta-Harness AUTO | ✓ | 76.4% |
| ForgeCode | ✗ | 81.8% |
| **Claude Haiku 4.5** | | |
| OpenHands | ✗ | 13.9% |
| Claude Code | ✗ | 27.5% |
| Terminus 2 | ✗ | 28.3% |
| Mini-SWE-Agent | ✗ | 29.8% |
| Terminus-KIRA | ✗ | 33.7% |
| Goose | ✗ | 35.5% |
| Meta-Harness AUTO 🏆 #1 | ✓ | 37.6% |
What Is TerminalBench-2?
TerminalBench-2 is a benchmark where an AI agent must complete 89 challenging real-world software tasks autonomously in a command-line environment: compiling code with complex dependencies, setting up services, debugging multi-file projects, and other long-horizon operations requiring domain knowledge and multi-step reasoning.
It's actively contested—multiple industry teams directly optimize their systems for it. That an automated search method can rank #1 among Haiku 4.5 agents is notable because it demonstrates Meta-Harness can find improvements even in a highly competitive frontier.
"Auto" column (✓/✗): whether the harness was discovered automatically (Meta-Harness) or hand-engineered by human practitioners.
Causal Reasoning from Search History
The search trajectory reveals how Meta-Harness achieves its gains. Early iterations combined structural fixes with prompt-template edits and both regressed. By iteration 3, the proposer explicitly hypothesized that regressions were confounded by the shared prompt intervention, isolated the structural changes, and tested them separately. After six regressions, it pivoted to a purely additive approach—adding environment information before the first LLM call without touching the completion flow.
> "All 6 prior iterations regressed from the 64.4% baseline because they modified the completion flow, prompt template, or observation processing. evo_env_bootstrap takes a different approach—purely additive. It gathers an environment snapshot before the first LLM call and appends it to the initial prompt. No other methods are changed."
Causal Reasoning in Action — Iteration 7
The search trajectory shows systematic debugging, not random search:
- Iterations 1–2: Both candidates bundled structural fixes with prompt changes. Both regressed.
- Iteration 3: Proposer examined both failure traces, noticed the shared prompt modification as the confound, tested only the structural fixes.
- Iterations 4–6: Still unable to fix completion logic safely. Lesson learned: "touching the completion flow is high-risk."
- Iteration 7: Shifted strategy—don't modify anything, just add an environment snapshot before the first LLM call. Purely additive. No regression risk. This won.
This mirrors how expert engineers debug: form hypotheses, isolate variables, accumulate evidence, pivot when a class of interventions proves fragile.
Inside the Discovered Harnesses
Meta-Harness discovers executable inference-time procedures—structured, domain-specific policies with nontrivial control flow. Here we examine the two text classification harness variants that represent the Pareto frontier extremes, plus generalization evidence.
Table 9: Pareto Frontier of Discovered Text Classification Harnesses
| Variant | USPTO ↑ | Symptom ↑ | LawBench ↑ | Avg ↑ | Ctx ↓ |
|---|---|---|---|---|---|
| Draft Verification | 18.0 | 85.4 | 17.0 | 40.1 | 5.4 |
| Error-Annotated | 9.0 | 87.7 | 24.0 | 40.2 | 22.3 |
| CoT Replay | 13.0 | 88.2 | 25.0 | 42.1 | 23.3 |
| Cluster Coverage | 12.0 | 86.8 | 33.0 | 43.9 | 31.2 |
| Cascade Retrieval | 12.0 | 86.8 | 36.0 | 44.9 | 39.2 |
| Label-Primed Query | 14.0 | 86.8 | 45.0 | 48.6 | 11.4 |
Key Findings
10× Faster, 10+ Points Better
On text classification, Meta-Harness matches the best prior text optimizers (OpenEvolve, TTT-Discover) with 10× fewer evaluations, then surpasses their final accuracy by more than 10 points. Its median candidate outperforms the best candidate found by either ablation.
Cross-Model Transfer on IMO Math
A single discovered retrieval harness improves accuracy by 4.7 points on average across five evaluation models on 200 IMO-level problems. The harness was selected based only on GPT-OSS-20B performance, yet transfers to four held-out models: GPT-5.4-nano, GPT-5.4-mini, Gemini-3.1-Flash-Lite, and Gemini-3-Flash.
#1 Agent on TerminalBench-2 (Haiku 4.5)
Meta-Harness automatically discovers harnesses that rank #1 among all Haiku 4.5 agents (37.6%) and #2 among all Opus 4.6 agents (76.4%) on TerminalBench-2—an actively contested benchmark where multiple teams directly optimize for it.
Discussion
Out-of-Distribution Generalization
Discovered harnesses generalize to unseen classification datasets (+2.9 pts avg on 9 OOD tasks) and to unseen base models in math (+4.7 pts across 5 held-out models). This suggests the discovered strategies capture generally effective context-management principles.
Fast Wall-Clock Time
A search run completes in a few hours, yet produces readable, transferable strategies reusable across current and future models. The harnesses are full Python programs—interpretable and modifiable by engineers.
Inspectable Overfitting
Code-space overfitting (brittle if-chains, hard-coded class mappings) is visible on inspection—unlike weight-space overfitting. This makes it easier to audit whether a discovered harness is genuinely general or merely memorizing.
Richer Prior Experience Is the Key
The main advantage is not just search over code, but search with selective access to prior diagnostic experience. The proposer can inspect raw code, execution traces, and prior failures, then form and test causal hypotheses about what to change.
Our findings reflect a recurring pattern in machine learning: once a search space becomes accessible, stronger general-purpose agents can outperform hand-engineered solutions. A natural next step is to co-evolve harness and model weights—letting the strategy shape what the model learns and vice versa.
The "Bitter Lesson" Connection
The paper references Rich Sutton's "Bitter Lesson" (2019): the recurring pattern in AI where general methods leveraging computation eventually outperform hand-crafted solutions. Chess engines → Go programs → protein folding → now harness engineering. Meta-Harness fits this pattern: automated search + strong coding agents outperforms years of human harness engineering expertise. The key enabling factor is the recent maturation of coding agents capable of navigating large codebases autonomously.
Limitation: our experiments demonstrate harness search with one particularly strong coding-agent proposer (Claude Code with Opus-4.6). How the effect varies across proposer agents and weaker models remains for future work.
Conclusion
Meta-Harness shows that automated harness engineering is practical and effective across diverse domains. By giving a coding-agent proposer selective access to the source code, execution traces, and evaluation scores of all prior candidates through a shared filesystem, Meta-Harness can discover harnesses that outperform hand-engineered baselines on text classification, math reasoning, and agentic coding—while remaining readable, transferable, and efficiently discovered.
Together, these results show that richer access to prior experience can enable automated harness engineering.
Acknowledgements
We thank KRAFTON AI for providing API credit support. This work is supported by OpenAI, KFAS, and Schmidt Sciences AI2050. We thank Anikait Singh and Jubayer Ibn Hamid for their valuable feedback and suggestions, and Sienna J. Lee for patiently listening to YL's half-formed thoughts during the early stages of this work.
References
- [1] Agrawal et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv:2507.19457, 2025.
- [2] Akyürek et al. What learning algorithm is in-context learning? Investigations with linear models. arXiv:2211.15661, 2023.
- [3] Andrychowicz et al. Learning to learn by gradient descent by gradient descent. NeurIPS, 2016.
- [4] Anthropic. Claude code: An agentic coding tool. https://www.anthropic.com/claude-code, 2025.
- [5] Anthropic and community contributors. agentskills/agentskills. GitHub repository, 2026.
- [6] Balunović et al. Matharena: Evaluating LLMs on uncontaminated math competitions. 2025.
- [7] Barbieri et al. TweetEval: Unified benchmark and comparative evaluation for tweet classification. 2020.
- [8] Beurer-Kellner et al. Prompting is programming: A query language for LLMs. PLDI, 2023.
- [9] Böckeler. Harness engineering. martinfowler.com, March 2026.
- [10] Bölük. I improved 15 LLMs at coding in one afternoon. only the harness changed. 2026.
- [11] Casanueva et al. Efficient intent detection with dual sentence encoders. arXiv:2003.04807, 2020.
- [12] Cemri et al. AdaEvolve: Adaptive LLM driven zeroth-order optimization. arXiv:2602.20133, 2026.
- [13] Chase. LangChain. GitHub, 2022.
- [14] Cohan et al. Structural scaffolds for citation intent classification. arXiv:1904.01608, 2019.
- [15] Demszky et al. GoEmotions: A dataset of fine-grained emotions. arXiv:2005.00547, 2020.
- [16] Fei et al. LawBench: Benchmarking legal knowledge of LLMs. EMNLP, 2024.
- [17] Finn et al. Model-agnostic meta-learning for fast adaptation of deep networks. ICML, 2017.
- [18] ForgeCode. Benchmarks don't matter. 2025.
- [19] Gretel AI. Symptom to diagnosis dataset. HuggingFace, 2023.
- [20] Hu et al. Automated design of agentic systems. ICLR, 2025.
- [21] Young. Effective harnesses for long-running agents. Anthropic Engineering Blog, 2025.
- [22] Keung et al. The multilingual Amazon reviews corpus. arXiv:2010.02573, 2020.
- [23] Khattab et al. DSPy: Compiling declarative LM calls into self-improving pipelines. arXiv:2310.03714, 2023.
- [24] Khot et al. SciTail: A textual entailment dataset from science question answering. AAAI, 2018.
- [25] KRAFTON AI and Ludo Robotics. Terminus-KIRA: Boosting frontier model performance on TerminalBench. GitHub, 2026.
- [26] Lee et al. Feedback Descent: Open-ended text optimization via pairwise comparison. arXiv:2511.07919, 2025.
- [27] Lehman et al. Evolution through large models. arXiv:2206.08896, 2022.
- [28] Lewis et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 2020.
- [29] Loukas et al. FINER: Financial numeric entity recognition for XBRL tagging. ACL, 2022.
- [30] Luong et al. Towards robust mathematical reasoning. EMNLP, 2025.
- [31] Madaan et al. Self-Refine: Iterative refinement with self-feedback. NeurIPS, 2023.
- [32] Malo et al. Good debt or bad debt: Detecting semantic orientations in economic texts. arXiv:1307.5336, 2013.
- [33] Merrill et al. TerminalBench: Benchmarking agents on hard, realistic tasks in CLI. arXiv:2601.11868, 2026.
- [34] Nichols. How we scored #1 on terminal-bench (52%). Warp blog, 2025.
- [35] Novikov et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131, 2025.
- [36] OpenAI. Harness engineering: leveraging Codex in an agent-first world. OpenAI Blog, 2026.
- [37] Packer et al. MemGPT: Towards LLMs as operating systems. 2023.
- [38] Pryzant et al. Automatic prompt optimization with "gradient descent" and beam search. arXiv:2305.03495, 2023.
- [39] Romera-Paredes et al. Mathematical discoveries from program search with LLMs. Nature, 2024.
- [40] Schmidhuber. A neural network that embeds its own meta-levels. IEEE ICNN, 1993.
- [41] Schneider et al. What's what: The (nearly) definitive guide to reaction role assignment. JCIM, 2016.
- [42] Shakya et al. Adaptive retrieval helps reasoning in LLMs – but mostly if it's not used. arXiv:2602.07213, 2026.
- [43] Sharma. OpenEvolve: an open-source evolutionary coding agent. GitHub, 2025.
- [44] Snell et al. Prototypical networks for few-shot learning. NeurIPS, 2017.
- [45] Sutton. The bitter lesson. 2019.
- [46] Thrun and Pratt. Learning to learn: Introduction and overview. Springer, 1998.
- [47] Tian et al. SWE-Bench Mobile: Can LLM agents develop industry-level mobile applications? arXiv, 2026.
- [48] Trivedi et al. Interleaving retrieval with chain-of-thought reasoning. arXiv:2212.10509, 2023.
- [49] Xiao et al. RAR-B: Reasoning as retrieval benchmark. arXiv:2404.06347, 2024.
- [50] Xiong et al. Learning to continually learn via meta-learning agentic memory designs. OpenReview, 2026.
- [51] Yang et al. Large language models as optimizers (OPRO). ICLR, 2023.
- [52] Ye et al. Meta context engineering via agentic skill evolution (MCE). arXiv:2601.21557, 2026.
- [53] Yuksekgonul et al. TextGrad: Automatic "differentiation" via text. arXiv:2406.07496, 2024.
- [54] Yuksekgonul et al. Learning to discover at test time (TTT-Discover). arXiv:2601.16175, 2026.
- [56] Zhang et al. Recursive language models. arXiv:2512.24601, 2026.
- [57] Zhang et al. MemEvolve: Meta-evolution of agent memory systems. arXiv:2512.18746, 2025.
- [58] Zhang et al. AFlow: Automating agentic workflow generation. arXiv:2410.10762, 2025.
- [59] Zhang et al. Agentic context engineering (ACE). arXiv:2510.04618, 2025.
- [60] Zhang et al. Character-level convolutional networks for text classification. arXiv:1509.01626, 2016.