Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Abstract

Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation: modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time.

We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (system prompts) and online (agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance.

What is Agentic Context Engineering?

Traditional LLM improvement requires retraining — changing the model's internal weights, which is expensive and slow. Agentic Context Engineering (ACE) takes a different approach: instead of changing the model, it continuously improves what you feed into the model (the context). Think of it as writing an ever-better instruction manual that the AI reads before each task, rather than retraining the AI itself. The "agentic" part means this improvement process is itself automated by AI agents.

1. Introduction — Overall Performance

Modern AI applications increasingly depend on context adaptation. Instead of modifying model weights, context adaptation improves performance after training by incorporating clarified instructions, structured reasoning steps, or domain-specific knowledge directly into the model's inputs.

Context Adaptation vs. Fine-Tuning

Fine-tuning bakes knowledge permanently into model weights — it is costly, requires labelled data, and produces a new model checkpoint for every update. Context adaptation instead prepends instructions, examples, or strategies to the model's input at inference time. No GPU training is needed, updates are instant, and the same base model can serve many different tasks by simply swapping the context. ACE automates and optimises that context over time.

Overall performance comparison bar charts for AppWorld, FiNER, and Formula benchmarks — **Figure 1: Overall Performance Results.** ACE consistently outperforms strong baselines (Base LLM, ICL, GEPA, DC) across all tasks: AppWorld (42.4% → 59.5%), FiNER (70.7% → 78.3%), Formula (67.5% → 76.5%).

Our key findings:

ACE consistently outperforms strong baselines, yielding average gains of +10.6% on agents and +8.6% on domain-specific benchmarks.
ACE constructs effective contexts without labeled supervision, leveraging execution feedback and environment signals.
On AppWorld, ACE matches the top-ranked production-level agent IBM-CUGA (GPT-4.1) while using the smaller open-source model DeepSeek-V3.1.
ACE requires significantly fewer rollouts and achieves lower adaptation latency than existing methods.

2. Background & Motivation

2.1 Context Adaptation

Context adaptation refers to methods that improve model behavior by constructing or modifying inputs to an LLM, rather than altering its weights. Representative methods include Reflexion, TextGrad, GEPA, and Dynamic Cheatsheet — all leveraging natural language feedback for iterative context improvement.

2.2 Limitations: Brevity Bias & Context Collapse

Brevity Bias: Prompt optimizers prioritize concise applicable instructions over comprehensive accumulation, omitting domain-specific heuristics and common failure modes critical in practice.

Context Collapse: Methods relying on monolithic rewriting by an LLM degrade into shorter, less informative summaries over time, causing sharp performance declines.

Why do these failure modes matter?

Imagine asking an AI to maintain a strategy guide for chess. Two problems emerge:

Brevity bias: The AI keeps trimming the guide to a short bullet list, discarding rare but important tactics ("zugzwang in king-and-pawn endgames").
Context collapse: Every time the guide is rewritten in one pass, it loses more detail — after enough rewrites the guide is shorter than having no guide at all, and performance actually drops below the no-context baseline.

ACE fixes both by adding new items rather than rewriting everything, and only pruning exact duplicates.

Context collapse graph: tokens vs adaptation steps showing sharp collapse at step 60 — **Figure 2: Context Collapse.** Monolithic rewriting collapses context from 18,282 tokens (accuracy 66.7) to just 122 tokens (accuracy 57.1) in a single step — worse than the no-context baseline of 63.7.

3. Agentic Context Engineering (ACE)

ACE introduces a structured division of labor across three specialized LLM components, inspired by the agentic design of Dynamic Cheatsheet:

🔧 Generator

Produces reasoning trajectories for new queries using the current Context Playbook.

🔍 Reflector

Distills concrete insights from successes and errors; supports iterative refinement.

📚 Curator

Integrates insights into structured delta context updates, merged into the Playbook.

The "Evolving Playbook" metaphor

A playbook in sports or business is a living document of proven strategies — it grows as teams learn from wins and losses. ACE applies this metaphor to AI: the Context Playbook starts empty and is updated after every task attempt. The Generator uses it, the Reflector extracts lessons from the outcome, and the Curator writes those lessons as new entries. Over dozens of iterations the playbook becomes a rich, task-specific knowledge base — without ever retraining the underlying model.

ACE Framework architecture diagram: Generator to Reflector to Curator with delta updates — **Figure 4: The ACE Framework.** An agentic architecture with three specialized components: Generator, Reflector, and Curator. Delta Context Items are merged incrementally — no costly full rewrites.

3.1 Incremental Delta Updates

Rather than regenerating contexts in full, ACE incrementally produces compact delta contexts: small sets of candidate bullets distilled by the Reflector and integrated by the Curator. This preserves past knowledge and avoids the computational cost of full rewrites.

Delta updates — like Git for AI knowledge

Instead of rewriting the whole playbook after each task, ACE only appends the difference — new tips, corrected strategies, observed failure patterns. This is analogous to a Git commit: each run produces a small diff that is merged into the main branch. The full history is preserved, nothing is lost, and the total cost per update is much lower than regenerating the entire document.

3.2 Grow-and-Refine

Beyond incremental growth, ACE ensures contexts remain compact through periodic or lazy refinement. A de-duplication step prunes redundancy by comparing bullets via semantic embeddings, maintaining comprehensive but non-redundant playbooks.

Semantic de-duplication explained

As the playbook grows, some entries will say essentially the same thing in different words. ACE detects this using semantic embeddings: each bullet is converted into a numerical vector that captures its meaning. Bullets whose vectors are very close together (cosine similarity above a threshold) are considered duplicates and one is removed. This keeps the playbook lean without discarding genuinely distinct knowledge.

Example ACE-generated context playbook on AppWorld with strategies, code snippets, and troubleshooting — **Figure 3: Example ACE-Generated Context on AppWorld (partial).** ACE contexts are detailed playbooks with strategies, code templates, and troubleshooting notes — not concise summaries.

4. Results

4.3 Agent Benchmark: AppWorld (DeepSeek-V3.1-671B)

Method	GT Labels	Test-Normal		Test-Challenge		Average
Method	GT Labels	TGC↑	SGC↑	TGC↑	SGC↑	Average
DeepSeek-V3.1-671B as Base LLM
ReAct	—	63.7	42.9	41.5	21.6	42.4
Offline Adaptation
ReAct + ICL	✓	64.3	46.4	46.0	27.3	46.0
ReAct + GEPA	✓	64.9	44.6	46.0	30.2	46.4
ReAct + ACE	✓	76.2	64.3	57.3	39.6	59.4
ReAct + ACE	✗	75.0	64.3	54.4	35.2	57.2
Online Adaptation
ReAct + DC (CU)	✗	65.5	58.9	52.3	30.8	51.9
ReAct + ACE	✗	69.6	53.6	66.0	48.9	59.5

Table 1: AppWorld results. ACE outperforms all baselines by an average of +10.6%, and remains effective even without ground-truth labels.

Reading the AppWorld benchmark metrics

TGC (Task Goal Completion): Did the agent fully accomplish the assigned goal? Strict binary pass/fail.
SGC (Sub-Goal Completion): What fraction of intermediate steps were completed correctly? A softer measure.
Test-Normal vs Test-Challenge: "Challenge" tasks are harder multi-step workflows; performance gaps are larger there.
GT Labels (✓/✗): Whether the method used ground-truth outcome labels during adaptation. ACE without labels (✗) still beats supervised competitors (✓), showing it can self-improve from execution feedback alone.
Offline vs Online: Offline improves the system prompt before deployment; Online updates agent memory during live use.

4.4 Domain-Specific Benchmark: Finance (DeepSeek-V3.1)

Method	GT Labels	FiNER (Acc↑)	Formula (Acc↑)	Average
DeepSeek-V3.1 as Base LLM
Base LLM	—	70.7	67.5	69.1
Offline Adaptation
ICL	✓	72.3	67.0	69.6
MIPROv2	✓	72.4	69.5	70.9
GEPA	✓	73.5	71.5	72.5
ACE	✓	78.3	85.5	81.9
ACE	✗	71.1	83.0	77.1
Online Adaptation
DC (CU)	✓	74.2	69.5	71.8
DC (CU)	✗	68.3	62.5	65.4
ACE	✓	76.7	76.5	76.6
ACE	✗	67.3	78.5	72.9

Table 2: Financial analysis results. ACE achieves +12.8% average improvement with GT labels in offline adaptation.

Why does domain-specific improvement matter?

Financial NLP tasks like FiNER (named-entity recognition in financial text) and Formula (financial formula reasoning) require highly specialised knowledge — abbreviations, reporting conventions, sector-specific jargon — that general LLM training underrepresents. ACE's playbook accumulates these domain nuances from in-domain examples, effectively giving the model a field-specific reference sheet that grows with every solved case. The +12.8% jump with ground-truth labels shows just how much headroom domain-specific context can unlock compared to a generic base LLM.

Key Findings

+10.6%

High-Performance Agents

Average gain on AppWorld agent benchmark, matching production-level GPT-4.1 agent with DeepSeek-V3.1.

+8.6%

Domain Reasoning

Average gain on financial analysis benchmarks (FiNER, Formula) with structured evolving playbooks.

−86.9%

Adaptation Latency

ACE achieves these gains efficiently, drastically reducing adaptation latency vs existing methods.

No GT

Label-Free Adaptation

ACE adapts effectively without labeled supervision — uses execution feedback and environment signals only.