Natural-Language Agent Harnesses

Introduction

Modern agents increasingly succeed or fail because of the surrounding harness: the control stack that structures multi-step reasoning, tool use, memory, delegation, and stopping beyond any single model call. Research shows that externalized control patterns can be decisive — including reason-act loops (ReAct), retrieval-augmented generation (RAG), and explicit self-feedback (Reflexion). Recent work has expanded into explicit memory and self-evolution, workflow generation, multi-agent orchestration, and native tool execution.

Yet despite this growing importance, harness logic is rarely exposed as a coherent, portable artifact. In most agent systems, the effective harness is scattered across controller code, hidden framework defaults, tool adapters, and runtime-specific assumptions. As a result, harnesses are difficult to transfer across runtimes, hard to compare fairly, and hard to ablate cleanly. This shift reframes "prompt engineering" into the broader practice of context engineering: deciding what instructions, evidence, intermediate artifacts, and state should be available at each step of a long run.

What is "Harness Engineering"?

Think of an AI agent like a skilled worker. The harness is the management system around that worker — it decides what tasks to assign, in what order, what tools are available, when to check results, and when to stop. For example, if you’re building an AI coding assistant, the harness might say: "First plan the approach, then write code, then run tests, and if tests fail, debug and retry." Today, this logic is typically buried deep inside code frameworks, making it nearly impossible to share, compare, or improve independently of the AI model itself.

Figure 1: Harness design patterns — **Figure 1:** Examples of harness design patterns used by modern agents — planning, memory, flow, reflexion, RAG, ReAct, orchestration, native CLI, test-time scaling, self-evolving, and subagents.

Natural-language artifacts such as AGENTS.md and skill bundles show that practical systems can package repository-local conventions and reusable procedures in portable text. However, they typically attach local instructions or reusable routines without making harness-wide contracts, role boundaries, state semantics, and runtime-facing adapters first-class and jointly executable.

Thesis: We ask whether the design-pattern layer inside agent harnesses can be made explicit as an executable natural-language object under shared runtime assumptions. We propose Natural-Language Agent Harnesses (NLAHs) — a structured natural-language representation of harness control bound to explicit contracts and artifact carriers — and an Intelligent Harness Runtime (IHR) that interprets NLAHs directly.

Methodology

Harnesses and the Pattern Layer

A harness denotes the orchestration layer that governs multiple model or agent calls for a task family. The boundary between harness and runtime is analytical: generic services (tool adapters, sandboxing, child lifecycle) live in the runtime, while task-family policy (stages, artifact contracts, verifiers) lives in the harness. This boundary is made explicit for study.

01

Control

How work is decomposed and scheduled across multiple steps, tools, and agents.

02

Contracts

What artifacts must be produced, what gates must be satisfied, and when the run should stop.

03

State

What must persist across steps, branches, and delegated workers throughout the agent’s execution.

Harness vs. Context Engineering: Context engineering means crafting the right prompt for a single AI call. A harness goes further — it manages the entire multi-step workflow, like a project manager coordinating a team across days of work, not just writing a single email.

Intelligent Harness Runtime (IHR)

IHR is a shared runtime that interprets NLAHs directly. It cleanly separates the runtime charter (generic services: tool adapters, sandboxing, child lifecycle management) from the harness logic (task-family policy: stages, artifact contracts, verifiers). About 90% of computation happens in delegated child agents, with the runtime parent consuming under 10% of total resources.

Figure 2: Comparison of harness designs — **Figure 2:** Comparison of harness designs. *Left:* Traditional code-coupled harness with logic buried in Python. *Center:* Natural-Language Harness with explicit contracts, gates, and stages in editable text. *Right:* File-Backed State Module providing multi-addressable, cross-session, human-editable, versionable state.

How IHR Works in Practice

Imagine IHR as a universal operating system for AI agents. Just as Windows or macOS can run any application, IHR can run any harness written in natural language. The key insight is the separation: the "runtime charter" provides basic services (like an OS provides file systems and networking), while the "harness skill" defines the specific workflow (like an application defines its own logic). This means you can swap harness strategies without changing the runtime — like installing a new app without reinstalling your OS.

Natural-Language Agent Harnesses (NLAHs)

NLAHs express harness behavior in editable natural language. The representation must expose: contracts (input/output requirements, stopping rules, permission boundaries), roles (planner, executor, verifier), stage structure (plan → execute → verify), adapters (tool interfaces), scripts (reusable references), state semantics (workspaces, manifests, path-addressable objects), and a failure taxonomy. IHR transforms three inputs: Backend → Codex, Runtime Charter → Runtime Skill, Harness Logic → Harness Skill.

Figure 3: Three-layer transformation — **Figure 3:** The three-layer transformation in IHR. Backend maps to Codex (execution environment), Runtime Charter maps to Runtime Skill (shared services), and Harness Logic maps to Harness Skill (task-family control).

Three-Layer Transformation: The NLAH system converts three inputs into executable components: (1) the model backend becomes a Codex execution environment, (2) the runtime charter becomes reusable Runtime Skills, and (3) the harness logic becomes Harness Skills — all expressed in plain, editable text files rather than compiled code.

File-Backed State

File-backed state makes harness state persistent and inspectable. The canonical workspace includes TASK.md (run-level task statement), SKILL.md (normalized outcome), harness-skill/ directories (control logic and reusable scripts), history/ (session innovations), RESPONSE.md (child task output), and final artifacts. This approach turns opaque in-memory state into version-controlled, human-editable, cross-session persistent artifacts.

Multi-Addressable Cross-Session Human-Editable Versionable

Experimental Design

RQ1

Behavioral Effect

Does sharing a runtime alter agent behavior relative to native code harnesses? We compare Full IHR against ablations removing Runtime Skill (RTS) and Harness Skill (HS).

RQ2

Module Ablation

How do individual harness pattern modules contribute when composed? We test six modules: file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, and dynamic orchestration.

RQ3

Code-to-Text Migration

Can a code-centric harness be faithfully migrated to NLAH form? We convert OS-Symphony from Python code to natural-language harness skills.

Benchmarks & Setup

Experiments use three benchmarks: SWE-bench Verified (125-sample subset, software engineering), Live-SWE (real-world GitHub issues), and OSWorld (36-sample subset, desktop computer use). Harness families include TRAE (coding), Live-SWE (coding), and OS-Symphony (computer-use). All experiments use Codex CLI with GPT-5.4 as the base model.

Results

RQ1: Behavioral Effect

Full IHR produces significant changes in process metrics (tokens, calls, runtime) while maintaining comparable resolved rates. The trajectory-level evidence shows that Full IHR is not a prompt wrapper — about 90% of all resources are consumed by delegated child agents. The added budget reflects multi-stage exploration, candidate comparison, artifact handoff, and extra verification. Most SWE instances (110+ of 125) do not flip between Full IHR and ablations, meaning differences are concentrated in a small frontier of component-sensitive cases.

Table 1: RQ1 — Performance and process metrics across SWE-bench Verified and Live-SWE benchmarks
Benchmark	Harness	Setting	Perf.	Prompt Tokens	Completion Tokens	Tool Calls	LLM Calls	Runtime (min)
SWE Verified	TRAE	Full IHR	74.4	16.3M	211k	642.6	414.3	32.5
		w/o RTS	76.0	11.1M	137k	451.9	260.5	16.6
		w/o HS	75.2	1.2M	13.6k	51.1	34.0	6.7
Live-SWE	Live-SWE	Full IHR	72.8	1.4M	17.0k	58.4	41.4	7.6
		w/o RTS	76.0	1.1M	11.7k	41.0	28.2	5.5
		w/o HS	75.2	1.2M	13.6k	51.1	34.0	6.7

Why does Full IHR use more resources but achieve similar scores? The key insight is that Full IHR doesn’t just add overhead — it restructures how the agent works. It creates multi-stage exploration with candidate comparison and verification, like a thorough engineer who checks their work vs. a quick fixer. The extra cost buys quality in hard cases, even if the average score looks similar.

The runtime overhead is minimal: the runtime-owned parent thread consumes less than 10% of total resources across all metrics, with over 90% going to delegated child agents performing actual task work.

Table 4: Runtime overhead — Resource distribution between runtime parent and child agents
Metric	Runtime-owned parent	Delegated child agents
Prompt tokens	8.5%	91.5%
Completion tokens	8.1%	91.9%
Tool calls	9.8%	90.2%
LLM calls	9.4%	90.6%

RQ2: Harness Pattern Ablations

Starting from a benchmark-specific Basic harness, each module is added one at a time. Self-Evolution achieves the highest individual gain on SWE Verified (+4.8 points, reaching 80.0%), while File-Backed State dominates on OSWorld (+5.5 points). Multi-Candidate Search and Verifier show negative effects in some benchmarks, suggesting these patterns require careful integration. The cost-performance analysis reveals that Self-Evolution offers the best performance-to-cost ratio.

Why Do Some Modules Hurt Performance?

The negative results for Verifier (-0.8 on SWE, -8.4 on OSWorld) and Multi-Candidate Search (-2.4 on SWE) are actually informative. The Verifier adds extra checking that can reject valid solutions if the verification criteria are too strict. Multi-Candidate Search generates multiple solution candidates but the comparison overhead can exceed the benefit. Think of it like a team: sometimes adding more reviewers slows down a project more than it improves quality.

Table 3: RQ2 — Module composition and ablation. Each module is added to the Basic starting point independently.
Benchmark	Basic	File-Backed State	Evidence-Backed Answering	Verifier	Self-Evolution	Multi-Candidate Search	Dynamic Orchestration
SWE Verified	75.2	76.8_+1.6	76.8_+1.6	74.4_-0.8	80.0_+4.8	72.8_-2.4	75.2_0.0
OSWorld	41.7	47.2_+5.5	41.7_0.0	33.3_-8.4	44.4_+2.7	36.1_-5.6	44.4_+2.7

Figure 5: Score-cost view and complementarity analysis — **Figure 5:** *(a)* Score-Cost View showing each module’s resolved rate vs. estimated API cost per sample. Self-Evolution achieves ~80% at moderate cost. *(b)* Complementarity with Basic — bars show Basic+module performance, dots show module-alone performance. File-Backed State and Evidence-Backed Answering are the most complementary modules.

RQ3: Code-to-Text Migration

Migrating OS-Symphony from a Python code harness to NLAH form yielded dramatic improvements. The NLAH version not only matched but substantially outperformed the original code harness, while simultaneously reducing runtime and agent calls. This suggests that natural-language harnesses allow the LLM to leverage instructions more effectively than rigid code-based control flow.

Why Does the Natural-Language Version Outperform Code?

This is perhaps the paper’s most surprising finding. When OS-Symphony’s harness was rewritten from Python code to natural language, performance jumped from 30.4% to 47.2% — a 55% improvement. The authors suggest that LLMs can more effectively leverage natural-language instructions than rigid code-based control flow. It’s like the difference between giving a skilled chef a detailed recipe in their native language vs. a flowchart with numbered steps — the natural-language version conveys intent and allows adaptive interpretation.

0

%

Performance gain

0

%

Runtime reduction

0

%

Fewer agent calls

Table 5: RQ3 — Code vs. NLAH realization of OS-Symphony on OSWorld
Benchmark	Harness	Realization	Perf.	Prompt Tokens	Completion Tokens	Agent Calls	Tool Calls	LLM Calls	Runtime (min)
OSWorld	OS-Symphony	Code	30.4	11.4M	147.2k	99	651	1.2k	361.5
OSWorld	OS-Symphony	NLAH	47.2	15.7M	228.5k	72	683	34	140.8

Discussion

Code vs. Natural Language

The authors emphasize that natural language should not replace code. Instead, natural language carries editable high-level harness logic, while code remains responsible for deterministic operations, tool interfaces, and sandbox enforcement. The scientific claim is about the unit of comparison: externalizing harness pattern logic as a readable, executable object under shared runtime semantics.

Externalizing harness logic as natural language turns harnesses from opaque code into inspectable, editable, and scientifically comparable objects.

Why Natural Language Still Matters

A natural concern is whether stronger foundation models reduce the value of natural-language control. The results support a different interpretation: natural language remains important when used to specify harness-level control — roles, contracts, verification gates, durable state semantics, and delegation boundaries — rather than only one-shot prompt phrasing. This is consistent with practitioner accounts emphasizing context engineering and long-running harness design.

Searching Harness Representations

Once harnesses are explicit objects, they become a search space. Explicit harness modules can be manually designed, retrieved, migrated, recombined, and systematically ablated under shared assumptions. Longer term, this suggests automated search and optimization over harness representations, enabling harness engineering to become a more controlled scientific object.

The Harness Search Space: This is a forward-looking insight. If harnesses are explicit text objects, they could potentially be automatically optimized — similar to how neural architecture search finds optimal model structures, we could search for optimal harness configurations. This would move agent engineering from manual craft to systematic science.

Limitations

Natural language is less precise than code, and some harness mechanisms cannot be recovered faithfully from text, especially when they rely on hidden service-side state or proprietary schedulers.
Runtime contamination remains a real risk: a strong shared runtime charter may absorb part of the behavior attributed to harness text.
Module-level ablation is not strict causal identification; textual representations can introduce confounds such as instruction salience and prompt length.

Related Work

Conclusion

This paper studied whether the harness design-pattern layer can be externalized as an executable, comparable, and ablatable object. Natural-Language Agent Harnesses and an Intelligent Harness Runtime were proposed, with three controlled experiments providing evidence:

RQ1: The IHR stack is operationally viable — Full IHR matches native code harness performance while enabling richer multi-stage exploration.
RQ2: Individual modules compose effectively, with Self-Evolution (+4.8 on SWE Verified) and File-Backed State (+5.5 on OSWorld) as standout contributors.
RQ3: Code-to-text migration improves both performance (+55%) and efficiency (-61% runtime), demonstrating the practical viability of NLAH representations.

These results suggest a path toward harness representation science, where harness modules become first-class research artifacts rather than incidental glue around models.

References (53 citations)

AGENTS.md (2026). AGENTS.md specification.
AgentSkills (2026). AgentSkills: Portable skill bundles for agents.
An et al. (2025). Scaffold-aware agent evaluation.
Anthropic (2024, 2025a,b,c, 2026a,b). Context engineering and agent harness design.
Beurer-Kellner et al. (2023). LMQL: Constraints and control flow for prompting.
Bui (2026). Harness engineering for long-running agents.
Cao et al. (2024). Prompt engineering diminishing returns.
Chen et al. (2026a). PinchBench: Practical skill invocation benchmark.
Chen et al. (2026b). Promptware engineering.
Cheng et al. (2025). State sharing between prompts and programs.
Chivukula et al. (2025). Agint: Compiling SE agents into agentic graphs.
Chroma Research (2025). Context folding performance analysis.
Costa (2026). Multi-agent orchestration.
Ding et al. (2026). Scaffold-aware evaluation methodology.
Dong et al. (2025). APPL: Integrating prompts and Python programs.
Fourney et al. (2024). Multi-agent generalists.
Hao et al. (2026). Experience-driven skill creation.
HKUDS (2026). Native tool execution.
Ke et al. (2026). Dynamic topology routing.
Khattab et al. (2024). DSPy: Declarative LM pipelines.
Lewis et al. (2021). Retrieval-augmented generation.
Li (2026). Skills vs multi-agent communication.
Li et al. (2024). AutoFlow: Workflow generation.
Li et al. (2026a). AgentSkillOS.
Li et al. (2026b). SkillsBench and SkillCraft.
Liang et al. (2025). Prompts as programs.
Liu et al. (2024). Context folding bottlenecks.
Lou et al. (2026). AutoHarness: Automatic harness synthesis.
Mi et al. (2026). Reusable procedural memory.
Muennighoff et al. (2025). Interface-level test-time scaling.
OpenAI (2026a). Harness as first-class systems object.
OpenClaw (2026). Lobster: Workflow specification system.
OpenProse (2026). Natural-language workflow authoring.
PinchBench (2026). Practical skill evaluation.
Sharma (2026). ContextCov: Executable constraints from agent instructions.
Shi et al. (2025). FlowAgent: Compliance vs flexibility.
Shinn et al. (2023). Reflexion: Self-feedback.
Su et al. (2026). Long interaction history compression.
Sun et al. (2025). Context folding for long-horizon agents.
Tang et al. (2025, 2026a,b). Context engineering research.
Wang et al. (2024a). Prompt engineering brittleness.
Wang et al. (2024b). Native tool execution.
Wang et al. (2025a). AgentSpec: Runtime enforcement.
Wang et al. (2025b,c). Dynamic routing and multi-agent orchestration.
Xia et al. (2025, 2026). Memory and self-evolution.
Yao et al. (2023). ReAct: Reason-act loops.
Ye et al. (2026). Context-engineering skill evolution.
Yue et al. (2025). Dynamic topology routing.
Zhan et al. (2026a,b). Scaffold-aware evaluation.
Zhang et al. (2025). General Modular Harness.
Zhang et al. (2026). Memory and self-evolution.
Zheng et al. (2024). SGLang: Structured LM programs.
Zheng et al. (2025). Workflow generation.

Natural-Language Agent Harnesses

Formulation

Shared Runtime

Controlled Evidence

Introduction

What is "Harness Engineering"?

Methodology

Harnesses and the Pattern Layer

Control

Contracts

State

Intelligent Harness Runtime (IHR)

How IHR Works in Practice

Natural-Language Agent Harnesses (NLAHs)

File-Backed State

Experimental Design

Behavioral Effect

Module Ablation

Code-to-Text Migration

Benchmarks & Setup

Results

RQ1: Behavioral Effect

RQ2: Harness Pattern Ablations

Why Do Some Modules Hurt Performance?

RQ3: Code-to-Text Migration

Why Does the Natural-Language Version Outperform Code?

Discussion

Code vs. Natural Language

Why Natural Language Still Matters

Searching Harness Representations

Limitations

Conclusion

Natural-Language Agent Harnesses

Formulation

Shared Runtime

Controlled Evidence

Introduction

What is "Harness Engineering"?

Methodology

Harnesses and the Pattern Layer

Control

Contracts

State

Intelligent Harness Runtime (IHR)

How IHR Works in Practice

Natural-Language Agent Harnesses (NLAHs)

File-Backed State

Experimental Design

Behavioral Effect

Module Ablation

Code-to-Text Migration

Benchmarks & Setup

Results

RQ1: Behavioral Effect

RQ2: Harness Pattern Ablations

Why Do Some Modules Hurt Performance?

RQ3: Code-to-Text Migration

Why Does the Natural-Language Version Outperform Code?

Discussion

Code vs. Natural Language

Why Natural Language Still Matters

Searching Harness Representations

Limitations

Related Work

Prompts as Programs

Agent Control Patterns

NL to Workflows & Constraints

Reusable Skills & Harness Engineering

Conclusion