---
arxiv_id: 2603.25723
title: "Natural-Language Agent Harnesses"
authors:
  - Linyue Pan
  - Lexiao Zou
  - Shuo Guo
  - Jingchen Ni
  - Hai-Tao Zheng
difficulty: Advanced
tags:
  - Agent
  - LLM
published_at: 2026-03-26
flecto_url: https://flecto.zer0ai.dev/papers/2603.25723/
lang: en
---

Can agent control logic be externalized as a portable, executable natural-language artifact? This paper introduces Natural-Language Agent Harnesses (NLAHs) and a shared Intelligent Harness Runtime (IHR) that make this possible, with controlled evidence across coding and computer-use benchmarks.

### Formulation

Formalizes the harness design-pattern layer as an explicit, portable representation object distinct from runtime policy.

### Shared Runtime

Introduces the IHR, an in-loop LLM runtime that interprets harness logic directly while cleanly separating the runtime charter from task logic.

### Controlled Evidence

Three controlled experiments on behavioral effect (RQ1), module ablation (RQ2), and code-to-text migration (RQ3) across SWE-bench and OSWorld.

## Introduction

Modern agents increasingly succeed or fail because of the surrounding harness: the control stack that structures multi-step reasoning, tool use, memory, delegation, and stopping beyond any single model call. Research shows that externalized control patterns can be decisive, including reason-act loops (ReAct), retrieval-augmented generation (RAG), and explicit self-feedback (Reflexion). Recent work has expanded into explicit memory and self-evolution, workflow generation, multi-agent orchestration, and native tool execution.

Yet despite this growing importance, harness logic is rarely exposed as a coherent, portable artifact. In most agent systems, the effective harness is scattered across controller code, hidden framework defaults, tool adapters, and runtime-specific assumptions. As a result, harnesses are difficult to transfer across runtimes, hard to compare fairly, and hard to ablate cleanly. This shift toward harness-level control reframes "prompt engineering" as the broader practice of context engineering: deciding what instructions, evidence, intermediate artifacts, and state should be available at each step of a long run.

#### What is "Harness Engineering"?

Think of an AI agent like a skilled worker. The harness is the management system around that worker — it decides what tasks to assign, in what order, what tools are available, when to check results, and when to stop. For example, if you’re building an AI coding assistant, the harness might say: "First plan the approach, then write code, then run tests, and if tests fail, debug and retry." Today, this logic is typically buried deep inside code frameworks, making it nearly impossible to share, compare, or improve independently of the AI model itself.
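The plan–write–test–retry loop described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation; `call_model` is a stand-in for any LLM call, and the role names and retry budget are illustrative.

```python
def call_model(role, prompt):
    # Placeholder: a real harness would dispatch this to an LLM backend.
    return f"[{role}] response to: {prompt}"

def run_harness(task, max_retries=2):
    """A minimal harness: plan, then execute/verify with retry on failure."""
    trace = []
    plan = call_model("planner", f"Plan an approach for: {task}")
    trace.append(("plan", plan))
    for attempt in range(max_retries + 1):
        code = call_model("executor", f"Write code following: {plan}")
        trace.append(("execute", code))
        verdict = call_model("verifier", f"Run tests on: {code}")
        trace.append(("verify", verdict))
        if "fail" not in verdict:  # the stopping rule lives in the harness
            break
    return trace

trace = run_harness("fix the off-by-one bug")
```

The point of the sketch is that none of this logic belongs to the model: the stages, the retry budget, and the stopping rule are all harness decisions, which is exactly the layer NLAHs make explicit.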

Natural-language artifacts such as AGENTS.md and skill bundles show that practical systems can package repository-local conventions and reusable procedures in portable text. However, they typically attach local instructions or reusable routines without making harness-wide contracts, role boundaries, state semantics, and runtime-facing adapters first-class and jointly executable.

Thesis: We ask whether the design-pattern layer inside agent harnesses can be made explicit as an executable natural-language object under shared runtime assumptions. We propose Natural-Language Agent Harnesses (NLAHs) — a structured natural-language representation of harness control bound to explicit contracts and artifact carriers — and an Intelligent Harness Runtime (IHR) that interprets NLAHs directly.

## Methodology

### Harnesses and the Pattern Layer

A harness denotes the orchestration layer that governs multiple model or agent calls for a task family. The boundary between harness and runtime is analytical: generic services (tool adapters, sandboxing, child lifecycle) live in the runtime, while task-family policy (stages, artifact contracts, verifiers) lives in the harness. This boundary is made explicit for study.

The pattern layer specifies three things:

- **Decomposition and scheduling:** how work is decomposed and scheduled across multiple steps, tools, and agents.

- **Artifacts and stopping:** what artifacts must be produced, what gates must be satisfied, and when the run should stop.

- **Persistent state:** what must persist across steps, branches, and delegated workers throughout the agent's execution.
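Those three dimensions can be read as fields of a portable data object that exists independently of any runtime. The sketch below is an assumption-laden illustration; the field names and example values are invented here, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class HarnessPattern:
    """Illustrative pattern-layer object: the three dimensions above as data."""
    stages: list                # decomposition & scheduling across steps/agents
    artifact_contracts: dict    # required artifacts and their acceptance gates
    stop_rule: str              # when the run should stop
    persistent_state: list = field(default_factory=list)  # what survives steps

# A hypothetical coding-task harness expressed as an explicit object.
swe_harness = HarnessPattern(
    stages=["plan", "execute", "verify"],
    artifact_contracts={"RESPONSE.md": "child task output, verified by tests"},
    stop_rule="all gates satisfied or retry budget exhausted",
    persistent_state=["TASK.md", "history/"],
)
```

Once the pattern layer is an explicit object like this, it can be diffed, swapped, and ablated, which is what the paper's experimental design exploits.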

### Intelligent Harness Runtime (IHR)

IHR is a shared runtime that interprets NLAHs directly. It cleanly separates the runtime charter (generic services: tool adapters, sandboxing, child lifecycle management) from the harness logic (task-family policy: stages, artifact contracts, verifiers). About 90% of computation happens in delegated child agents, with the runtime parent consuming under 10% of total resources.

#### How IHR Works in Practice

Imagine IHR as a universal operating system for AI agents. Just as Windows or macOS can run any application, IHR can run any harness written in natural language. The key insight is the separation: the "runtime charter" provides basic services (like an OS provides file systems and networking), while the "harness skill" defines the specific workflow (like an application defines its own logic). This means you can swap harness strategies without changing the runtime — like installing a new app without reinstalling your OS.
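The OS analogy above can be made concrete with a toy sketch of the split: a runtime object that only provides generic services, and a harness that is nothing but swappable text the runtime interprets. Everything here is hypothetical scaffolding, not the IHR API; the "interpretation" is simulated as line-by-line delegation.

```python
class RuntimeCharter:
    """Generic services only -- the OS-like half of the split."""

    def spawn_child(self, subtask):
        # Placeholder for child-agent lifecycle management.
        return f"child handling: {subtask}"

    def call_tool(self, name, args):
        # Placeholder for a sandboxed tool adapter.
        return f"tool {name}({args})"

def run(runtime, harness_skill_text, task):
    # Simulated interpretation: each line of harness text becomes a delegated
    # child task. The paper reports ~90% of resources go to such children.
    steps = [line for line in harness_skill_text.splitlines() if line.strip()]
    return [runtime.spawn_child(f"{step} for {task}") for step in steps]

harness_skill = """Plan the approach.
Execute the plan.
Verify the result."""

results = run(RuntimeCharter(), harness_skill, "resolve the reported issue")
```

Swapping `harness_skill` for a different text changes the workflow without touching `RuntimeCharter`, which is the app-vs-OS separation the analogy describes.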

### Natural-Language Agent Harnesses (NLAHs)

NLAHs express harness behavior in editable natural language. The representation must expose: contracts (input/output requirements, stopping rules, permission boundaries), roles (planner, executor, verifier), stage structure (plan → execute → verify), adapters (tool interfaces), scripts (reusable references), state semantics (workspaces, manifests, path-addressable objects), and a failure taxonomy. IHR maps three inputs onto concrete components: the backend to Codex, the runtime charter to a Runtime Skill, and the harness logic to a Harness Skill.

### File-Backed State

File-backed state makes harness state persistent and inspectable. The canonical workspace includes TASK.md (run-level task statement), SKILL.md (normalized outcome), harness-skill/ directories (control logic and reusable scripts), history/ (session innovations), RESPONSE.md (child task output), and final artifacts. This approach turns opaque in-memory state into version-controlled, human-editable, cross-session persistent artifacts.
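The workspace layout above is simple enough to materialize directly. The sketch below follows the file and directory names the paper lists, but the initialization logic and placeholder contents are assumptions made for illustration.

```python
from pathlib import Path
import tempfile

def init_workspace(root, task_statement):
    """Create the canonical file-backed workspace described in the text."""
    root = Path(root)
    (root / "harness-skill").mkdir(parents=True, exist_ok=True)  # control logic & scripts
    (root / "history").mkdir(exist_ok=True)                      # per-session records
    (root / "TASK.md").write_text(f"# Task\n\n{task_statement}\n")
    (root / "SKILL.md").write_text("# Normalized outcome\n")
    (root / "RESPONSE.md").write_text("")  # filled in by the child agent
    return sorted(p.name for p in root.iterdir())

workspace = tempfile.mkdtemp()
entries = init_workspace(workspace, "Fix the failing test in the parser module.")
```

Because the state is just files, it can be inspected with ordinary tools, edited by hand, and committed to version control, which is the inspectability claim the section makes.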

## Experimental Design

#### Behavioral Effect

Does sharing a runtime alter agent behavior relative to native code harnesses? We compare Full IHR against ablations removing Runtime Skill (RTS) and Harness Skill (HS).

#### Module Ablation

How do individual harness pattern modules contribute when composed? We test six modules: file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, and dynamic orchestration.

#### Code-to-Text Migration

Can a code-centric harness be faithfully migrated to NLAH form? We convert OS-Symphony from Python code to natural-language harness skills.

### Benchmarks & Setup

Experiments use three benchmarks: SWE-bench Verified (125-sample subset, software engineering), Live-SWE (real-world GitHub issues), and OSWorld (36-sample subset, desktop computer use). Harness families include TRAE (coding), Live-SWE (coding), and OS-Symphony (computer-use). All experiments use Codex CLI with GPT-5.4 as the base model.

## Results

### RQ1: Behavioral Effect

Full IHR produces significant changes in process metrics (tokens, calls, runtime) while maintaining comparable resolved rates. The trajectory-level evidence shows that Full IHR is not a prompt wrapper — about 90% of all resources are consumed by delegated child agents. The added budget reflects multi-stage exploration, candidate comparison, artifact handoff, and extra verification. Most SWE instances (110+ of 125) do not flip between Full IHR and ablations, meaning differences are concentrated in a small frontier of component-sensitive cases.

This overhead pattern holds across all metrics: the runtime-owned parent thread stays under 10% of tokens, calls, and runtime, with the rest going to delegated child agents performing the actual task work.

### RQ2: Harness Pattern Ablations

Starting from a benchmark-specific Basic harness, each module is added one at a time. Self-Evolution achieves the highest individual gain on SWE Verified (+4.8 points, reaching 80.0%), while File-Backed State dominates on OSWorld (+5.5 points). Multi-Candidate Search and Verifier show negative effects in some benchmarks, suggesting these patterns require careful integration. The cost-performance analysis reveals that Self-Evolution offers the best performance-to-cost ratio.

#### Why Do Some Modules Hurt Performance?

The negative results for Verifier (-0.8 on SWE, -8.4 on OSWorld) and Multi-Candidate Search (-2.4 on SWE) are actually informative. The Verifier adds extra checking that can reject valid solutions if the verification criteria are too strict. Multi-Candidate Search generates multiple solution candidates but the comparison overhead can exceed the benefit. Think of it like a team: sometimes adding more reviewers slows down a project more than it improves quality.

### RQ3: Code-to-Text Migration

Migrating OS-Symphony from a Python code harness to NLAH form yielded dramatic improvements. The NLAH version not only matched but substantially outperformed the original code harness, while simultaneously reducing runtime and agent calls. This suggests that natural-language harnesses allow the LLM to leverage instructions more effectively than rigid code-based control flow.

#### Why Does the Natural-Language Version Outperform Code?

This is perhaps the paper’s most surprising finding. When OS-Symphony’s harness was rewritten from Python code to natural language, performance jumped from 30.4% to 47.2%, a 55% relative improvement. The authors suggest that LLMs can more effectively leverage natural-language instructions than rigid code-based control flow. It’s like the difference between giving a skilled chef a detailed recipe in their native language vs. a flowchart with numbered steps: the natural-language version conveys intent and allows adaptive interpretation.
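The 55% figure quoted above is the relative gain between the two scores reported in the text:

```python
# Relative improvement from the two OSWorld scores quoted above.
code_score = 30.4   # original Python-code harness
nlah_score = 47.2   # natural-language (NLAH) version
relative_gain = (nlah_score - code_score) / code_score * 100
# relative_gain is about 55.3, matching the ~55% improvement in the text
```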

## Discussion

### Code vs. Natural Language

The authors emphasize that natural language should not replace code. Instead, natural language carries editable high-level harness logic, while code remains responsible for deterministic operations, tool interfaces, and sandbox enforcement. The scientific claim is about the unit of comparison: externalizing harness pattern logic as a readable, executable object under shared runtime semantics.

Externalizing harness logic as natural language turns harnesses from opaque code into inspectable, editable, and scientifically comparable objects.

### Why Natural Language Still Matters

A natural concern is whether stronger foundation models reduce the value of natural-language control. The results support a different interpretation: natural language remains important when used to specify harness-level control — roles, contracts, verification gates, durable state semantics, and delegation boundaries — rather than only one-shot prompt phrasing. This is consistent with practitioner accounts emphasizing context engineering and long-running harness design.

### Searching Harness Representations

Once harnesses are explicit objects, they become a search space. Explicit harness modules can be manually designed, retrieved, migrated, recombined, and systematically ablated under shared assumptions. Longer term, this suggests automated search and optimization over harness representations, enabling harness engineering to become a more controlled scientific object.

### Limitations

- Natural language is less precise than code, and some harness mechanisms cannot be recovered faithfully from text, especially when they rely on hidden service-side state or proprietary schedulers.

- Runtime contamination remains a real risk: a strong shared runtime charter may absorb part of the behavior attributed to harness text.

- Module-level ablation is not strict causal identification; textual representations can introduce confounds such as instruction salience and prompt length.

## Related Work

#### Prompts as Programs

Several lines of work treat prompts and LLM calls as programmable objects — LMQL, DSPy, APPL, SGLang. These primarily program calls or pipelines; NLAHs focus on the harness layer governing multi-step agent calls, artifact contracts, and durable state.

#### Agent Control Patterns

Core patterns include ReAct, RAG, Reflexion, multi-agent generalists, workflow generation, and dynamic routing. NLAHs do not propose a new orchestration algorithm, but instead externalize the harness pattern logic as an executable representation.

#### NL to Workflows & Constraints

AutoFlow, FlowAgent, Agint, AgentSpec, and ContextCov translate natural language into workflows or constraints. Unlike compiling to a runtime-owned IR, IHR interprets harness logic directly with explicit contracts and durable artifacts.

#### Reusable Skills & Harness Engineering

AGENTS.md, AgentSkills, AgentSkillOS, SkillsBench, and AutoHarness treat skills and harnesses as first-class objects. NLAHs extend this from reusable local guidance to executable harness-level control under a shared runtime.

## Conclusion

This paper studied whether the harness design-pattern layer can be externalized as an executable, comparable, and ablatable object. Natural-Language Agent Harnesses and an Intelligent Harness Runtime were proposed, with three controlled experiments providing evidence:

- RQ1: The IHR stack is operationally viable — Full IHR matches native code harness performance while enabling richer multi-stage exploration.

- RQ2: Individual modules compose effectively, with Self-Evolution (+4.8 on SWE Verified) and File-Backed State (+5.5 on OSWorld) as standout contributors.

- RQ3: Code-to-text migration improves both performance (+55%) and efficiency (-61% runtime), demonstrating the practical viability of NLAH representations.

These results suggest a path toward harness representation science , where harness modules become first-class research artifacts rather than incidental glue around models.

## References

- AGENTS.md (2026). AGENTS.md specification.

- AgentSkills (2026). AgentSkills: Portable skill bundles for agents.

- An et al. (2025). Scaffold-aware agent evaluation.

- Anthropic (2024, 2025a,b,c, 2026a,b). Context engineering and agent harness design.

- Beurer-Kellner et al. (2023). LMQL: Constraints and control flow for prompting.

- Bui (2026). Harness engineering for long-running agents.

- Cao et al. (2024). Prompt engineering diminishing returns.

- Chen et al. (2026a). PinchBench: Practical skill invocation benchmark.

- Chen et al. (2026b). Promptware engineering.

- Cheng et al. (2025). State sharing between prompts and programs.

- Chivukula et al. (2025). Agint: Compiling SE agents into agentic graphs.

- Chroma Research (2025). Context folding performance analysis.

- Costa (2026). Multi-agent orchestration.

- Ding et al. (2026). Scaffold-aware evaluation methodology.

- Dong et al. (2025). APPL: Integrating prompts and Python programs.

- Fourney et al. (2024). Multi-agent generalists.

- Hao et al. (2026). Experience-driven skill creation.

- HKUDS (2026). Native tool execution.

- Ke et al. (2026). Dynamic topology routing.

- Khattab et al. (2024). DSPy: Declarative LM pipelines.

- Lewis et al. (2021). Retrieval-augmented generation.

- Li (2026). Skills vs multi-agent communication.

- Li et al. (2024). AutoFlow: Workflow generation.

- Li et al. (2026a). AgentSkillOS.

- Li et al. (2026b). SkillsBench and SkillCraft.

- Liang et al. (2025). Prompts as programs.

- Liu et al. (2024). Context folding bottlenecks.

- Lou et al. (2026). AutoHarness: Automatic harness synthesis.

- Mi et al. (2026). Reusable procedural memory.

- Muennighoff et al. (2025). Interface-level test-time scaling.

- OpenAI (2026a). Harness as first-class systems object.

- OpenClaw (2026). Lobster: Workflow specification system.

- OpenProse (2026). Natural-language workflow authoring.

- PinchBench (2026). Practical skill evaluation.

- Sharma (2026). ContextCov: Executable constraints from agent instructions.

- Shi et al. (2025). FlowAgent: Compliance vs flexibility.

- Shinn et al. (2023). Reflexion: Self-feedback.

- Su et al. (2026). Long interaction history compression.

- Sun et al. (2025). Context folding for long-horizon agents.

- Tang et al. (2025, 2026a,b). Context engineering research.

- Wang et al. (2024a). Prompt engineering brittleness.

- Wang et al. (2024b). Native tool execution.

- Wang et al. (2025a). AgentSpec: Runtime enforcement.

- Wang et al. (2025b,c). Dynamic routing and multi-agent orchestration.

- Xia et al. (2025, 2026). Memory and self-evolution.

- Yao et al. (2023). ReAct: Reason-act loops.

- Ye et al. (2026). Context-engineering skill evolution.

- Yue et al. (2025). Dynamic topology routing.

- Zhan et al. (2026a,b). Scaffold-aware evaluation.

- Zhang et al. (2025). General Modular Harness.

- Zhang et al. (2026). Memory and self-evolution.

- Zheng et al. (2024). SGLang: Structured LM programs.

- Zheng et al. (2025). Workflow generation.
