Can agent control logic be externalized as a portable, executable natural-language artifact? This paper introduces NLAHs and a shared Intelligent Harness Runtime (IHR) that makes it possible — with controlled evidence across coding and computer-use benchmarks.
Formalizes the harness design-pattern layer as an explicit, portable representation object distinct from runtime policy.
Introduces IHR, an in-loop LLM runtime that interprets harness logic directly while cleanly separating runtime charter from task logic.
Three controlled experiments on behavioral effect (RQ1), module ablation (RQ2), and code-to-text migration (RQ3) across SWE-bench and OSWorld.
Modern agents increasingly succeed or fail because of the surrounding harness: the control stack that structures multi-step reasoning, tool use, memory, delegation, and stopping beyond any single model call. Research shows that externalized control patterns can be decisive — including reason-act loops (ReAct), retrieval-augmented generation (RAG), and explicit self-feedback (Reflexion). Recent work has expanded into explicit memory and self-evolution, workflow generation, multi-agent orchestration, and native tool execution.
Yet despite this growing importance, harness logic is rarely exposed as a coherent, portable artifact. In most agent systems, the effective harness is scattered across controller code, hidden framework defaults, tool adapters, and runtime-specific assumptions. As a result, harnesses are difficult to transfer across runtimes, hard to compare fairly, and hard to ablate cleanly. Making the harness explicit also reframes "prompt engineering" as the broader practice of context engineering: deciding what instructions, evidence, intermediate artifacts, and state should be available at each step of a long run.
Think of an AI agent like a skilled worker. The harness is the management system around that worker — it decides what tasks to assign, in what order, what tools are available, when to check results, and when to stop. For example, if you’re building an AI coding assistant, the harness might say: "First plan the approach, then write code, then run tests, and if tests fail, debug and retry." Today, this logic is typically buried deep inside code frameworks, making it nearly impossible to share, compare, or improve independently of the AI model itself.
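The workflow just described can be sketched as a small control loop. This is an illustrative sketch only, with invented names and toy stand-ins for the plan, execute, and verify stages; it is not the paper's implementation.

```python
# Hypothetical sketch of harness control logic: plan, produce an artifact,
# verify it (e.g. by running tests), and retry with feedback on failure.
def run_harness(plan, execute, verify, max_retries=3):
    """Drive one task through a plan -> execute -> verify -> retry loop."""
    steps = plan()                         # decompose the task into steps
    artifact = execute(steps)              # produce a candidate artifact (e.g. a patch)
    for attempt in range(max_retries):
        if verify(artifact):               # gate: did the checks pass?
            return artifact                # stopping rule satisfied
        # feed failure context back into the next attempt
        artifact = execute(steps, feedback=f"attempt {attempt} failed")
    return None                            # retry budget exhausted

# Toy usage: verification succeeds only once feedback has been incorporated.
calls = []
def plan():
    return ["reproduce bug", "patch", "test"]
def execute(steps, feedback=None):
    calls.append(feedback)
    return "patch-v2" if feedback else "patch-v1"
def verify(artifact):
    return artifact == "patch-v2"

result = run_harness(plan, execute, verify)  # -> "patch-v2"
```

The point of the sketch is that this loop, normally buried in framework code, is exactly the logic an NLAH would express in editable natural language.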
Natural-language artifacts such as AGENTS.md and skill bundles show that practical systems can package repository-local conventions and reusable procedures in portable text. However, they typically attach local instructions or reusable routines without making harness-wide contracts, role boundaries, state semantics, and runtime-facing adapters first-class and jointly executable.
Thesis: We ask whether the design-pattern layer inside agent harnesses can be made explicit as an executable natural-language object under shared runtime assumptions. We propose Natural-Language Agent Harnesses (NLAHs) — a structured natural-language representation of harness control bound to explicit contracts and artifact carriers — and an Intelligent Harness Runtime (IHR) that interprets NLAHs directly.
A harness denotes the orchestration layer that governs multiple model or agent calls for a task family. The boundary between harness and runtime is analytical: generic services (tool adapters, sandboxing, child lifecycle) live in the runtime, while task-family policy (stages, artifact contracts, verifiers) lives in the harness. This boundary is made explicit for study.
How work is decomposed and scheduled across multiple steps, tools, and agents.
What artifacts must be produced, what gates must be satisfied, and when the run should stop.
What must persist across steps, branches, and delegated workers throughout the agent’s execution.
IHR is a shared runtime that interprets NLAHs directly. It cleanly separates the runtime charter (generic services: tool adapters, sandboxing, child lifecycle management) from the harness logic (task-family policy: stages, artifact contracts, verifiers). About 90% of computation happens in delegated child agents, with the runtime parent consuming under 10% of total resources.
Imagine IHR as a universal operating system for AI agents. Just as Windows or macOS can run any application, IHR can run any harness written in natural language. The key insight is the separation: the "runtime charter" provides basic services (like an OS provides file systems and networking), while the "harness skill" defines the specific workflow (like an application defines its own logic). This means you can swap harness strategies without changing the runtime — like installing a new app without reinstalling your OS.
NLAHs express harness behavior in editable natural language. The representation must expose: contracts (input/output requirements, stopping rules, permission boundaries), roles (planner, executor, verifier), stage structure (plan → execute → verify), adapters (tool interfaces), scripts (reusable references), state semantics (workspaces, manifests, path-addressable objects), and a failure taxonomy. IHR consumes three inputs: a backend (here, Codex), the runtime charter (realized as the Runtime Skill), and the harness logic (realized as the Harness Skill).
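One way to picture the elements an NLAH must expose is as inspectable structured data. The following is a minimal sketch under our own assumptions; the field names and schema are illustrative, not the paper's actual representation (which is natural language, not a dataclass).

```python
# Illustrative container for the NLAH elements listed above.
# Schema and field names are assumptions made for this sketch.
from dataclasses import dataclass, field

@dataclass
class NLAH:
    contracts: dict                                 # I/O requirements, stopping rules, permissions
    roles: list                                     # e.g. planner, executor, verifier
    stages: list                                    # ordered control flow
    adapters: dict = field(default_factory=dict)    # tool interfaces
    scripts: dict = field(default_factory=dict)     # reusable references
    failure_taxonomy: list = field(default_factory=list)

harness = NLAH(
    contracts={"output": "RESPONSE.md", "stop": "verifier approves"},
    roles=["planner", "executor", "verifier"],
    stages=["plan", "execute", "verify"],
)
```

Because every element is explicit, a harness like this can be diffed, ablated (drop the verifier role), or migrated across runtimes, which is precisely what the paper's experiments exploit.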
File-backed state makes harness state persistent and inspectable. The canonical workspace includes TASK.md (run-level task statement), SKILL.md (normalized outcome), harness-skill/ directories (control logic and reusable scripts), history/ (session innovations), RESPONSE.md (child task output), and final artifacts. This approach turns opaque in-memory state into version-controlled, human-editable, cross-session persistent artifacts.
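The canonical workspace layout above can be scaffolded mechanically. The sketch below mirrors the files and directories named in the text; the scaffolding code itself is our illustration, not the paper's tooling.

```python
# Sketch: create the canonical file-backed workspace described above.
# File and directory names follow the text; helper names are assumptions.
import tempfile
from pathlib import Path

def scaffold_workspace(root: Path) -> Path:
    """Lay out the file-backed state for one harness run."""
    (root / "harness-skill").mkdir(parents=True, exist_ok=True)  # control logic + scripts
    (root / "history").mkdir(exist_ok=True)                      # per-session records
    (root / "TASK.md").write_text("# Run-level task statement\n")
    (root / "SKILL.md").write_text("# Normalized outcome\n")
    (root / "RESPONSE.md").write_text("")                        # child task output lands here
    return root

ws = scaffold_workspace(Path(tempfile.mkdtemp()))
```

Since every piece of state is an ordinary file, the whole run can be version-controlled, inspected mid-run, and handed to a delegated child agent by path rather than by opaque in-memory reference.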
Does sharing a runtime alter agent behavior relative to native code harnesses? We compare Full IHR against ablations removing Runtime Skill (RTS) and Harness Skill (HS).
How do individual harness pattern modules contribute when composed? We test six modules: file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, and dynamic orchestration.
Can a code-centric harness be faithfully migrated to NLAH form? We convert OS-Symphony from Python code to natural-language harness skills.
Experiments use three benchmarks: SWE-bench Verified (125-sample subset, software engineering), Live-SWE (real-world GitHub issues), and OSWorld (36-sample subset, desktop computer use). Harness families include TRAE (coding), Live-SWE (coding), and OS-Symphony (computer-use). All experiments use Codex CLI with GPT-5.4 as the base model.
Full IHR produces significant changes in process metrics (tokens, calls, runtime) while maintaining comparable resolved rates. The trajectory-level evidence shows that Full IHR is not a prompt wrapper — about 90% of all resources are consumed by delegated child agents. The added budget reflects multi-stage exploration, candidate comparison, artifact handoff, and extra verification. Most SWE instances (110+ of 125) do not flip between Full IHR and ablations, meaning differences are concentrated in a small frontier of component-sensitive cases.
| Benchmark | Harness | Setting | Perf. | Prompt Tokens | Completion Tokens | Tool Calls | LLM Calls | Runtime (min) |
|---|---|---|---|---|---|---|---|---|
| SWE Verified | TRAE | Full IHR | 74.4 | 16.3M | 211k | 642.6 | 414.3 | 32.5 |
| | | w/o RTS | 76.0 | 11.1M | 137k | 451.9 | 260.5 | 16.6 |
| | | w/o HS | 75.2 | 1.2M | 13.6k | 51.1 | 34.0 | 6.7 |
| Live-SWE | Live-SWE | Full IHR | 72.8 | 1.4M | 17.0k | 58.4 | 41.4 | 7.6 |
| | | w/o RTS | 76.0 | 1.1M | 11.7k | 41.0 | 28.2 | 5.5 |
| | | w/o HS | 75.2 | 1.2M | 13.6k | 51.1 | 34.0 | 6.7 |
The runtime overhead is minimal: the runtime-owned parent thread consumes less than 10% of total resources across all metrics, with over 90% going to delegated child agents performing actual task work.
| Metric | Runtime-owned parent | Delegated child agents |
|---|---|---|
| Prompt tokens | 8.5% | 91.5% |
| Completion tokens | 8.1% | 91.9% |
| Tool calls | 9.8% | 90.2% |
| LLM calls | 9.4% | 90.6% |
Starting from a benchmark-specific Basic harness, each module is added one at a time. Self-Evolution achieves the highest individual gain on SWE Verified (+4.8 points, reaching 80.0%), while File-Backed State dominates on OSWorld (+5.5 points). Multi-Candidate Search and Verifier show negative effects in some benchmarks, suggesting these patterns require careful integration. The cost-performance analysis reveals that Self-Evolution offers the best performance-to-cost ratio.
The negative results for Verifier (-0.8 on SWE, -8.4 on OSWorld) and Multi-Candidate Search (-2.4 on SWE) are actually informative. The Verifier adds extra checking that can reject valid solutions if the verification criteria are too strict. Multi-Candidate Search generates multiple solution candidates but the comparison overhead can exceed the benefit. Think of it like a team: sometimes adding more reviewers slows down a project more than it improves quality.
| Benchmark | Basic | File-Backed State | Evidence-Backed Answering | Verifier | Self-Evolution | Multi-Candidate Search | Dynamic Orchestration |
|---|---|---|---|---|---|---|---|
| SWE Verified | 75.2 | 76.8 (+1.6) | 76.8 (+1.6) | 74.4 (-0.8) | 80.0 (+4.8) | 72.8 (-2.4) | 75.2 (0.0) |
| OSWorld | 41.7 | 47.2 (+5.5) | 41.7 (0.0) | 33.3 (-8.4) | 44.4 (+2.7) | 36.1 (-5.6) | 44.4 (+2.7) |
Migrating OS-Symphony from a Python code harness to NLAH form yielded dramatic improvements. The NLAH version not only matched but substantially outperformed the original code harness, while simultaneously reducing runtime and agent calls. This suggests that natural-language harnesses allow the LLM to leverage instructions more effectively than rigid code-based control flow.
This is perhaps the paper’s most surprising finding. When OS-Symphony’s harness was rewritten from Python code to natural language, performance jumped from 30.4% to 47.2% — a 55% improvement. The authors suggest that LLMs can more effectively leverage natural-language instructions than rigid code-based control flow. It’s like the difference between giving a skilled chef a detailed recipe in their native language vs. a flowchart with numbered steps — the natural-language version conveys intent and allows adaptive interpretation.
| Benchmark | Harness | Realization | Perf. | Prompt Tokens | Completion Tokens | Agent Calls | Tool Calls | LLM Calls | Runtime (min) |
|---|---|---|---|---|---|---|---|---|---|
| OSWorld | OS-Symphony | Code | 30.4 | 11.4M | 147.2k | 99 | 651 | 1.2k | 361.5 |
| | | NLAH | 47.2 | 15.7M | 228.5k | 72 | 683 | 34 | 140.8 |
The authors emphasize that natural language should not replace code. Instead, natural language carries editable high-level harness logic, while code remains responsible for deterministic operations, tool interfaces, and sandbox enforcement. The scientific claim is about the unit of comparison: externalizing harness pattern logic as a readable, executable object under shared runtime semantics.
Externalizing harness logic as natural language turns harnesses from opaque code into inspectable, editable, and scientifically comparable objects.
A natural concern is whether stronger foundation models reduce the value of natural-language control. The results support a different interpretation: natural language remains important when used to specify harness-level control — roles, contracts, verification gates, durable state semantics, and delegation boundaries — rather than only one-shot prompt phrasing. This is consistent with practitioner accounts emphasizing context engineering and long-running harness design.
Once harnesses are explicit objects, they become a search space. Explicit harness modules can be manually designed, retrieved, migrated, recombined, and systematically ablated under shared assumptions. Longer term, this suggests automated search and optimization over harness representations, enabling harness engineering to become a more controlled scientific object.
This paper studied whether the harness design-pattern layer can be externalized as an executable, comparable, and ablatable object. Natural-Language Agent Harnesses and an Intelligent Harness Runtime were proposed, and three controlled experiments (RQ1–RQ3) provided supporting evidence.
These results suggest a path toward harness representation science, where harness modules become first-class research artifacts rather than incidental glue around models.