---
arxiv_id: 2510.04618
title: "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models"
authors:
  - Ahanaf Tazwar Shamim
  - Farhan Sadik
  - Taiyeong Lee
difficulty: Intermediate
tags:
  - Agent
  - LLM
  - Reasoning
published_at: 2025-10-06
flecto_url: https://flecto.zer0ai.dev/papers/2510.04618/
lang: en
---

## Abstract

Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation: modifying inputs with instructions, strategies, or evidence rather than updating weights. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights in favor of concise summaries, and from context collapse, where iterative rewriting erodes details over time.

We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (system prompts) and online (agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance.

#### What is Agentic Context Engineering?

Traditional LLM improvement requires retraining — changing the model's internal weights, which is expensive and slow. Agentic Context Engineering (ACE) takes a different approach: instead of changing the model, it continuously improves what you feed into the model (the context). Think of it as writing an ever-better instruction manual that the AI reads before each task, rather than retraining the AI itself. The "agentic" part means this improvement process is itself automated by AI agents.

## 1. Introduction — Overall Performance

Modern AI applications increasingly depend on context adaptation. Instead of modifying model weights, context adaptation improves performance after training by incorporating clarified instructions, structured reasoning steps, or domain-specific knowledge directly into the model's inputs.

#### Context Adaptation vs. Fine-Tuning

Fine-tuning bakes knowledge permanently into model weights — it is costly, requires labelled data, and produces a new model checkpoint for every update. Context adaptation instead prepends instructions, examples, or strategies to the model's input at inference time. No GPU training is needed, updates are instant, and the same base model can serve many different tasks by simply swapping the context. ACE automates and optimises that context over time.
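As a minimal sketch (function names and playbook contents here are illustrative, not from the paper), context adaptation amounts to prepending an evolving playbook to each query at inference time:

```python
def build_prompt(playbook: list[str], query: str) -> str:
    """Prepend accumulated strategies to the user query.

    No weights change: swapping `playbook` retargets the same
    base model to a different task instantly.
    """
    context = "\n".join(f"- {bullet}" for bullet in playbook)
    return f"Strategies learned so far:\n{context}\n\nTask: {query}"


finance_playbook = ["Normalise ticker symbols before lookup."]
prompt = build_prompt(finance_playbook, "Extract all entities from the filing.")
```

Serving a different domain would mean passing a different playbook to the same function, with no new checkpoint involved.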

Our key findings:

- ACE consistently outperforms strong baselines, yielding average gains of +10.6% on agents and +8.6% on domain-specific benchmarks.

- ACE constructs effective contexts without labeled supervision, leveraging execution feedback and environment signals.

- On AppWorld, ACE matches the top-ranked production-level agent IBM-CUGA (GPT-4.1) while using the smaller open-source model DeepSeek-V3.1.

- ACE requires significantly fewer rollouts and achieves lower adaptation latency than existing methods.

## 2. Background & Motivation

### 2.1 Context Adaptation

Context adaptation refers to methods that improve model behavior by constructing or modifying inputs to an LLM, rather than altering its weights. Representative methods include Reflexion, TextGrad, GEPA, and Dynamic Cheatsheet — all leveraging natural language feedback for iterative context improvement.

### 2.2 Limitations: Brevity Bias & Context Collapse

#### Why do these failure modes matter?

Imagine asking an AI to maintain a strategy guide for chess. Two problems emerge:

- Brevity bias: The AI keeps trimming the guide down to a short bullet list, discarding rare but important tactics ("zugzwang in king-and-pawn endgames").

- Context collapse: Every time the guide is rewritten in one pass, it loses more detail; after enough rewrites the guide is worse than having no guide at all, and performance actually drops below the no-context baseline.

ACE fixes both by adding new items rather than rewriting everything, and by pruning only entries that are redundant with what the playbook already contains.

## 3. Agentic Context Engineering (ACE)

ACE introduces a structured division of labor across three specialized LLM components, inspired by the agentic design of Dynamic Cheatsheet: a Generator that attempts tasks using the current context, a Reflector that distills lessons from the outcomes, and a Curator that integrates those lessons into the context.

#### The "Evolving Playbook" metaphor

A playbook in sports or business is a living document of proven strategies — it grows as teams learn from wins and losses. ACE applies this metaphor to AI: the context playbook starts empty and is updated after every task attempt. The Generator uses it, the Reflector extracts lessons from the outcome, and the Curator writes those lessons as new entries. Over dozens of iterations the playbook becomes a rich, task-specific knowledge base — without ever retraining the underlying model.
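The loop above can be sketched as three pluggable roles. In a real system each function would wrap an LLM call; the plain stubs here are entirely illustrative and only show the data flow:

```python
from typing import Callable


def ace_step(playbook: list[str], task: str,
             generator: Callable[[list[str], str], str],
             reflector: Callable[[str], list[str]],
             curator: Callable[[list[str], list[str]], list[str]]) -> list[str]:
    """One ACE iteration: attempt a task, extract lessons, merge them."""
    trajectory = generator(playbook, task)  # Generator acts using the playbook
    lessons = reflector(trajectory)         # Reflector distils insights from the outcome
    return curator(playbook, lessons)       # Curator writes lessons as new entries


# Toy stand-ins for the three LLM roles:
gen = lambda pb, t: f"attempted {t} with {len(pb)} hints"
ref = lambda traj: [f"lesson from: {traj}"]
cur = lambda pb, new: pb + [b for b in new if b not in pb]  # append, skip exact repeats

playbook: list[str] = []
for task in ["task-A", "task-B"]:
    playbook = ace_step(playbook, task, gen, ref, cur)
```

Because the roles are passed in as callables, swapping a stub for a real LLM-backed function changes nothing about the loop itself.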

### 3.1 Incremental Delta Updates

Rather than regenerating contexts in full, ACE incrementally produces compact delta contexts: small sets of candidate bullets distilled by the Reflector and integrated by the Curator. This preserves past knowledge and avoids the computational cost of full rewrites.

#### Delta updates — like Git for AI knowledge

Instead of rewriting the whole playbook after each task, ACE only appends the difference — new tips, corrected strategies, observed failure patterns. This is analogous to a Git commit: each run produces a small diff that is merged into the main branch. The full history is preserved, nothing is lost, and the total cost per update is much lower than regenerating the entire document.
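A delta update can be sketched as an append-only merge. The data layout below is hypothetical (the paper's curator works with structured bullets carrying metadata), but the commit-like behaviour is the point:

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    bullets: list[str] = field(default_factory=list)
    history: list[list[str]] = field(default_factory=list)  # one "commit" per update

    def apply_delta(self, delta: list[str]) -> None:
        """Merge a small set of candidate bullets; never rewrite existing ones."""
        new = [b for b in delta if b not in self.bullets]
        self.bullets.extend(new)
        self.history.append(new)


pb = Playbook()
pb.apply_delta(["Check API auth before calling.", "Retry on 429 responses."])
pb.apply_delta(["Retry on 429 responses.", "Log every tool call."])  # duplicate skipped
```

Existing bullets are never touched, so no rewrite pass can erode earlier knowledge, which is exactly the collapse mode full regeneration suffers from.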

### 3.2 Grow-and-Refine

Beyond incremental growth, ACE ensures contexts remain compact through periodic or lazy refinement. A de-duplication step prunes redundancy by comparing bullets via semantic embeddings, maintaining comprehensive but non-redundant playbooks.

#### Semantic de-duplication explained

As the playbook grows, some entries will say essentially the same thing in different words. ACE detects this using semantic embeddings: each bullet is converted into a numerical vector that captures its meaning. Bullets whose vectors are very close together (cosine similarity above a threshold) are considered duplicates and one is removed. This keeps the playbook lean without discarding genuinely distinct knowledge.
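A toy version of this step, using bag-of-words vectors in place of learned embeddings (purely illustrative; a real system would use a neural embedding model and a tuned threshold):

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word-count vector (a real system uses a neural model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def dedupe(bullets: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a bullet only if no previously kept bullet is semantically close."""
    kept: list[str] = []
    for b in bullets:
        if all(cosine(embed(b), embed(k)) < threshold for k in kept):
            kept.append(b)
    return kept


bullets = [
    "always validate the api token first",
    "always validate the api token first before calls",  # near-duplicate
    "cache repeated tool outputs",
]
lean = dedupe(bullets)
```

The first two bullets score well above the threshold and collapse into one entry, while the genuinely distinct third bullet survives.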

## 4. Results

### 4.3 Agent Benchmark: AppWorld (DeepSeek-V3.1-671B)

Table 1: AppWorld results. ACE outperforms all baselines by an average of +10.6%, and remains effective even without ground-truth labels.

#### Reading the AppWorld benchmark metrics

- TGC (Task Goal Completion): Did the agent fully accomplish the assigned goal? Strict binary pass/fail.

- SGC (Sub-Goal Completion): What fraction of intermediate steps were completed correctly? A softer measure.

- Test-Normal vs Test-Challenge: "Challenge" tasks are harder multi-step workflows; performance gaps are larger there.

- GT Labels (✓/✗): Whether the method used ground-truth outcome labels during adaptation. ACE without labels (✗) still beats supervised competitors (✓), showing it can self-improve from execution feedback alone.

- Offline vs Online: Offline improves the system prompt before deployment; Online updates agent memory during live use.

### 4.4 Domain-Specific Benchmark: Finance (DeepSeek-V3.1)

Table 2: Financial analysis results. ACE achieves +12.8% average improvement with GT labels in offline adaptation.

#### Why does domain-specific improvement matter?

Financial NLP tasks like FiNER (named-entity recognition in financial text) and Formula (financial formula reasoning) require highly specialised knowledge — abbreviations, reporting conventions, sector-specific jargon — that general LLM training underrepresents. ACE's playbook accumulates these domain nuances from in-domain examples, effectively giving the model a field-specific reference sheet that grows with every solved case. The +12.8% jump with ground-truth labels shows just how much headroom domain-specific context can unlock compared to a generic base LLM.

## Key Findings
