Toward Autonomous Long-Horizon Engineering for ML Research

Abstract

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system built on a simple principle — strong long-horizon performance requires both structured orchestration and durable state continuity. AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs — yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance: removing it reduces PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.

Three Ideas at a Glance

⚖️

Thin Control over Thick State

The Orchestrator keeps only a small, stable context across stage transitions. All the heavy project state — paper analyses, plans, code, logs, experiment records — lives outside the conversation, in a permission-scoped shared workspace.

🚌

File-as-Bus Coordination

Agents coordinate by reading and writing role-aligned artifact regions (paper_analysis/, submission/, append-only logs) rather than by chatting. Removing this bus drops MLE-Bench Lite Any Medal by 31.82 points.

📈

Benchmarks Back It Up

PaperBench improves by 10.54 points on average over the best matched baseline; MLE-Bench Lite reaches 81.82 Any Medal%. Gains hold under both Gemini-3-Flash and GLM-5 backbones.

Why this matters for building products: think of a team where a senior PM keeps only a short status dashboard (thin control) while engineers commit everything substantial to shared repos, design docs, and dashboards (thick state). AiScientist applies the same pattern to a long-running autonomous ML run — making it the difference between "got a first pipeline running" and "got a Gold medal after 20 rounds of refinement."

1. Introduction

Automating scientific research has emerged as one of the most ambitious goals in artificial intelligence. In ML specifically, progress could substantially accelerate discovery, improve reproducibility, and broaden access to high-quality research systems.

Within this broader agenda, we focus on a more operationally demanding setting: autonomous long-horizon engineering for ML research. An agent must own the end-to-end technical work of building, running, and iteratively improving ML research systems over hours or days. Concretely, the difficulty arises from underspecification, system setup burden, delayed and often confounded experimental feedback, and the need to maintain state continuity across iterations.

We address this challenge with AiScientist. Our central design principle is to treat long-horizon performance as a joint problem of orchestration and state continuity: agents must not only coordinate work across heterogeneous stages, but also preserve evolving project state with enough fidelity for later decisions to build on it safely.

Our Contributions

We introduce AiScientist, a system for autonomous long-horizon ML research engineering that supports the full loop from interpreting research specifications and setting up runnable systems to experimentation and iterative refinement.
We propose an artifact-mediated coordination design that combines hierarchical orchestration with a permission-scoped shared workspace. Through a File-as-Bus protocol, the system preserves state continuity via durable artifacts while keeping top-level control lightweight.
We evaluate AiScientist on PaperBench and MLE-Bench Lite, showing substantial gains over strong baselines and yielding empirical insights into long-horizon ML research engineering — especially the importance of durable state continuity.

2. Task Formulation

We formulate the long-horizon ML research engineering task through PaperBench. Given a research paper P, a bare Docker environment E with GPU access, and a time budget T, the agent must produce a runnable submission that reproduces the paper's target metrics.

Four challenges that make this hard

Underspecification

Research specifications are typically incomplete rather than complete blueprints. Important implementation details are often implicit, scattered across sections, or omitted entirely, so the agent must recover them from sparse cues.

System Setup Burden

Success depends on substantial setup work beyond algorithmic implementation: configuring environments, acquiring datasets and models from permitted sources, and integrating these resources into a runnable system.

Delayed Feedback

Meaningful evidence arrives only after experiments run. Discrepancies may stem from interpretation, implementation, data processing, or infrastructure, so the agent must reason from delayed and often confounded signals.

State Continuity

Each round produces code, configurations, logs, results, and diagnostic evidence that later decisions must correctly interpret and build on. Progress depends on maintaining continuity across iterations.

3. AiScientist: an Artifact-Mediated Research Lab

AiScientist is built around a simple systems view of long-horizon ML research engineering: strong performance depends not only on decomposing work into the right stages, but also on preserving evolving project state with enough fidelity that later decisions can build on it safely. A Tier-0 Orchestrator keeps thin stage-level control through concise summaries and a workspace map, while Tier-1 specialists and optional Tier-2 subagents operate over a shared, permission-scoped workspace.

Figure 2. Architecture of AiScientist, an artifact-mediated research lab. A Tier-0 Orchestrator keeps thin control through stage-level directives, concise summaries, and a workspace map, while Tier-1 specialists and optional Tier-2 subagents coordinate work by reading and writing durable artifacts in a permission-scoped workspace.

3.1. Thin Control over Thick State

The core systems principle of AiScientist is thin control over thick state. Here, thin refers to the small, stable context needed to make routine control decisions, whereas thick refers to the much larger body of externalized project state that must remain faithful across iterations.

Let W_t denote the workspace at step t. A lightweight workspace map m_t is maintained as a textual index of the major artifact regions and their roles.

Formal formulation

Workspace map: a textual index of artifact regions — a navigational interface to the workspace, not a lossy replacement.

$$m_t = \mathcal{M}(W_t)\tag{1}$$

Orchestrator action selection: the top-level controller chooses the next action using the stage-level context and the workspace map, optionally inspecting the workspace on demand.

$$a_t = \mathcal{T}_0(c_t, m_t, G; W_t),\ a_t \in \mathcal{T}_0 \cup \mathcal{A}_1 \tag{2}$$

Specialist invocation: a specialist runs its own local loop and returns only a concise summary together with workspace updates.

$$(s_t, \Delta W_t) = \tau_j(d_t, m_t; W_t),\quad c_{t+1} = c_t \oplus s_t,\quad W_{t+1} = U(W_t, \Delta W_t) \tag{3}$$

Reading the equations without a math background

Equations (1)–(3) look intimidating, but each one describes a very concrete everyday move:

(1) m = M(W): automatically generate a table of contents for the entire project folder. The table is short; the folder is huge.
(2) Orchestrator picks action: the boss looks at the table of contents and a short status note, then decides "run shell command" or "call the Paper Comprehension specialist." No long chat history needed.
(3) Specialist returns: the specialist does its internal work, writes or edits files, and hands back a two-line summary plus a list of changed files. The boss only absorbs the summary.

The notation hides it, but the takeaway is operational: the boss never reads a novel-length transcript; they read a TOC and a bullet summary.

This design realizes progressive disclosure at the level of control and handoff. The Orchestrator carries only a small, stable control interface across stage transitions, making decisions from summaries and the map rather than from accumulated chat history.

3.2. File-as-Bus Coordination

AiScientist implements artifact-mediated coordination through a File-as-Bus protocol. In long-horizon ML research engineering, the critical intermediate state is already naturally file-valued: paper analyses, environment-setup scripts, resource-download manifests, training code, checkpoints, evaluation metrics, and append-only execution logs. The workspace is treated as the system of record rather than passive storage.

Three role-aligned artifact regions

paper_analysis/

Structured paper understanding: target metrics, ambiguities, and implementation-relevant details produced by the Paper Comprehension specialist.

submission/

The runnable reproduction repository: code, training scripts, config, and a reproduce.sh entry point, written by Implementation and Experimentation specialists.

impl_log.md · exp_log.md

Append-only execution traces. Every attempt — failures, partial successes, metric gaps — is recorded so later decisions can inspect the exact history instead of relying on inherited conversational context.

Because state lives in durable artifacts rather than in a shared chat, agents can re-enter the project from the current workspace state, read task-relevant regions, and act without needing the full prior discourse.

Analogy: Slack vs. Git

Most multi-agent systems coordinate through something like Slack threads — messages and replies pile up until nobody can find anything. File-as-Bus coordinates more like Git + a shared repo: every agent reads the current files, makes its change, writes it back, and the next agent picks up from that file — not from 400 messages of chat history. That's why it scales to 24-hour autonomous runs where older agents would lose the plot.

3.3. Hierarchical Orchestration via Agent-as-Tool

If File-as-Bus coordination provides the state substrate, hierarchical orchestration provides the control mechanism. The Orchestrator routes the right expertise to the right stage — from comprehension to planning, implementation, experimentation, and debugging — while specialists run their own local loops and Tier-2 subagents handle focused leaf subtasks.

The key design choice is Agent-as-Tool: each specialist is exposed to the Orchestrator through the same callable interface as ordinary tools such as shell execution, file inspection, or web search. Delegation is an action within the Orchestrator's normal tool-use loop, and each specialist's private context is re-initialized at each invocation so detailed reasoning does not accumulate at the top level.

Tier-1 specialists

Paper Comprehension

Transforms the paper into implementation details, target metrics, and uncertainty notes. Can coordinate multiple subagents so that independent analytical dimensions are processed in parallel before synthesis.

Prioritization

Converts paper understanding into an ordered execution contract: identifies dependencies, ranks milestones by impact and feasibility, and writes the plan to prioritized_tasks.md.

Implementation

Turns plans and failure reports into code. Full mode builds the reproduction repo from analysis artifacts and the prioritized plan. Fix mode patches the existing codebase in response to directives.

Experimentation

Executes the end-to-end pipeline, compares produced metrics against the paper's targets, and records results and unresolved issues in exp_log.md.

Generic Helper

Spawns lightweight helpers for auxiliary subtasks such as exploration, planning, or one-off operations that do not warrant a dedicated specialist workflow.

Tier-2 subagents are tightly scoped leaf workers created within a specialist's local horizon — structure extraction, algorithm and baseline analysis, environment setup, resource download, and exploratory investigation.

3.4. Evidence-Driven Research-Engineering Loop

AiScientist runs an evidence-driven research-engineering loop over the evolving workspace rather than a rigid one-pass pipeline. Early in the trajectory, the Orchestrator emphasizes paper comprehension and prioritization so that implementation proceeds from a clear, ranked plan. Once a scaffold exists, the dominant pattern becomes iterative alternation between implementation and experimentation.

This loop is driven by executable evidence. Experimental runs produce failure traces, partial successes, metric gaps, and resource bottlenecks that determine what should be built, fixed, or tested next. These outputs are written back as durable artifacts so future iterations can reason about them precisely.

Validation AUC progression across ~80 autonomous experiments — Figure 1. Teaser. Across roughly 80 autonomous experiments on a single PaperBench task, AiScientist progressively raises Validation AUC from ~0.91 to ~0.98 through incremental choices — adding DistilBERT, Ridge meta-learner, multi-seed stacking, CatBoost with isotonic calibration, and pseudo-label augmentation — rather than a single lucky jump.

4.1. Experimental Setup

Benchmarks

PaperBench

From-scratch replication of top-tier conference papers. Evaluates whether an agent can go from an unfamiliar paper to a working reproduction inside a bare Docker environment.

MLE-Bench Lite

Kaggle-style ML competitions that emphasize sustained experiment improvement. Success requires not one submission but many rounds of iterative refinement.

Baselines

On PaperBench we compare against BasicAgent and IterativeAgent under the same evaluation protocol. On MLE-Bench Lite we report controlled comparisons against strong autonomous ML engineering systems (AIDE, LoongNow, ML-Master) and additionally include official leaderboard results as contextual reference.

Implementation Details

We instantiate AiScientist with two backbone LLMs: Gemini-3-Flash and GLM-5. Across both benchmarks, each run is allocated one H20 GPU and a 24-hour budget per task.

One H20 GPU, 24 hours per task is a very realistic budget — roughly what a solo ML engineer could dedicate to a single paper replication over a working day. The gains therefore aren't "buy infinity GPUs and hope" — they come from smarter system design at a fixed compute budget.

4.2. Main Results on PaperBench

Across both backbones, AiScientist improves over the best matched baseline by 9.92 and 11.15 points, respectively. It also narrows the gap to the reported human baseline of 41% after 48 hours of effort.

Compared with IterativeAgent, AiScientist attains substantially higher average scores at much lower cost per task: $15.67 vs $27.44 under Gemini-3-Flash, and $12.20 vs $54.90 under GLM-5. The gain is not coming from simply buying more interaction time.

Table 1 — PaperBench main results — Table 1. Main results on PaperBench. Δ shows AiScientist's gain over the best baseline under each backbone. Bold and underlined values denote best and second-best per LLM.

4.3. Main Results on MLE-Bench Lite

MLE-Bench Lite complements PaperBench by focusing on sustained experiment improvement on competition-style ML tasks, where success depends less on one-shot reproduction and more on whether an agent can iteratively refine a pipeline.

On the controlled evaluation, AiScientist delivers the strongest overall performance under both backbones. It reaches the same 81.82 Any Medal% with Gemini-3-Flash and GLM-5, improving over the strongest matched baseline by 4.55 and 18.18 points, respectively. This also exceeds all reported leaderboard Any Medal values listed in Table 2.

Table 2 — MLE-Bench Lite results — Table 2. Main results on MLE-Bench Lite (%). Upper block: official leaderboard as contextual reference. Lower block: matched controlled evaluation.

4.4. Mechanism Analysis

We analyze which mechanisms account for the gains of AiScientist. We focus on two questions: whether durable artifact-based continuity is a key driver, and whether simpler non-hierarchical agent organizations would suffice.

Figure 3 — Mechanism analysis — Figure 3. Mechanism analysis under GLM-5. Left: AiScientist outperforms both a simpler agent baseline and the variant without File-as-Bus on PaperBench. Right: File-as-Bus is especially important for later-round refinement (Above Median, Silver, Gold, Any Medal) on MLE-Bench Lite.

File-as-Bus Ablation

Removing File-as-Bus drops PaperBench by 6.41 points and MLE-Bench Lite Any Medal by 31.82 points. Valid Submission and Bronze are largely intact, but losses concentrate on the stronger outcome metrics (Above Median, Silver, Gold). Durable artifacts matter more for later-round refinement than for a minimally competitive first submission.

Simpler Agent Comparison

The gains are not merely from adding interaction volume. IterativeAgent already adds more interaction than BasicAgent yet still trails AiScientist without File-as-Bus — indicating that hierarchical orchestration adds value beyond longer chat sessions.

Why "Valid Submission" survives but "Gold" collapses without File-as-Bus

This asymmetry is the most interesting finding of the ablation. A first minimally-working submission only requires the agent to produce any runnable pipeline — one good round of implementation is often enough, so losing durable state doesn't hurt much. Gold medals, by contrast, require dozens of iterative refinements where each round must build correctly on all the tuning, failed experiments, and metric history from previous rounds. That's exactly where a reliable artifact record becomes load-bearing: without it, the agent keeps re-deriving decisions it already made and loses signal about what has and hasn't been tried.

Product takeaway: if you're building a long-running AI agent and it plateaus after "first working version," suspect a state-continuity problem before suspecting the LLM.

5. Related Work

5.1. Automating Scientific Research

Recent work has rapidly advanced the automation of scientific research across three broad directions: (1) automated scientific discovery, where agents generate ideas, synthesize literature, and run targeted experiments; (2) automated experimentation pipelines that chain planning, implementation, and evaluation; and (3) automated reproduction systems that replicate existing work from papers. AiScientist extends this line toward long-horizon engineering where reproduction and refinement are interleaved over many hours of autonomous work.

5.2. Multi-Agent Coordination and Long-Horizon Continuity

Multi-agent coordination has become a central paradigm — CAMEL, MetaGPT, ChatDev showed how role-playing, standardized procedures, and structured communication improve collaborative problem solving. Recent analyses further suggest that the gains of multi-agent systems are often limited not by local reasoning quality but by failures of coordination, misalignment, and verification across long trajectories. AiScientist addresses this by making the workspace itself — not the chat — the system of record.

6. Conclusion

Autonomous long-horizon ML research engineering is difficult because failures surface late, their causes are often confounded across interpretation, implementation, and infrastructure, and progress must remain coherent across repeated implementation–experimentation cycles. AiScientist reframes this as a systems problem — coordinating specialized work over durable project state — and demonstrates that the combination of hierarchical orchestration and File-as-Bus coordination yields substantial gains on both PaperBench and MLE-Bench Lite. We hope the thin control over thick state principle offers a useful lens for future long-horizon autonomous research systems.

arXiv: 2604.13018 · Date: Apr 15, 2026 · License: as on arXiv

Agent Reasoning Benchmark

Read the full paper on arXiv ↗