Stanford University · SambaNova Systems · UC Berkeley | *Equal contribution
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation: modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time.
We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (system prompts) and online (agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance.
Traditional LLM improvement requires retraining: changing the model's internal weights, which is expensive and slow. Agentic Context Engineering (ACE) takes a different approach: instead of changing the model, it continuously improves what you feed into the model (the context). Think of it as writing an ever-better instruction manual that the AI reads before each task, rather than retraining the AI itself. The "agentic" part means this improvement process is itself automated by AI agents.
Modern AI applications increasingly depend on context adaptation. Instead of modifying model weights, context adaptation improves performance after training by incorporating clarified instructions, structured reasoning steps, or domain-specific knowledge directly into the model's inputs.
Fine-tuning bakes knowledge permanently into model weights: it is costly, requires labeled data, and produces a new model checkpoint for every update. Context adaptation instead prepends instructions, examples, or strategies to the model's input at inference time. No GPU training is needed, updates are instant, and the same base model can serve many different tasks by simply swapping the context. ACE automates and optimizes that context over time.
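The idea of swapping contexts at inference time can be sketched in a few lines. This is a minimal illustration, not ACE itself: `call_llm` is a hypothetical stand-in for any chat-completion API, and the playbook strings are invented examples.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an HTTP chat-completion API)."""
    return f"<answer to: {prompt[:40]}...>"

def answer(task: str, context: str) -> str:
    # Context adaptation happens purely at inference time:
    # no weights change, only the prompt does.
    prompt = f"{context}\n\nTask: {task}"
    return call_llm(prompt)

# Two invented playbooks; the same frozen base model serves both tasks.
finance_playbook = "- Report XBRL tags in lowercase.\n- Check fiscal-year offsets."
agent_playbook = "- Verify API login before issuing calls.\n- Retry on rate limits."

print(answer("Tag this filing excerpt.", finance_playbook))
print(answer("Book a meeting via the calendar app.", agent_playbook))
```

Because the model itself never changes, "updating" a deployed system reduces to editing a text file, which is what makes automated context optimization attractive.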
Our key findings: ACE consistently outperforms strong baselines, improving agent benchmarks by +10.6% and financial-analysis benchmarks by +8.6%, and it remains effective even without ground-truth labels.
Context adaptation refers to methods that improve model behavior by constructing or modifying inputs to an LLM, rather than altering its weights. Representative methods include Reflexion, TextGrad, GEPA, and Dynamic Cheatsheet, all of which leverage natural language feedback for iterative context improvement.
Imagine asking an AI to maintain a strategy guide for chess. Two problems emerge: brevity bias, where each rewrite compresses the guide toward short generic summaries and hard-won specifics get dropped, and context collapse, where repeated wholesale rewriting gradually erodes the details that made the guide useful. ACE fixes both by appending new items rather than rewriting everything, and by pruning only semantically duplicate entries.
ACE introduces a structured division of labor across three specialized LLM components, inspired by the agentic design of Dynamic Cheatsheet: a Generator that attempts tasks using the current context, a Reflector that extracts lessons from the outcomes, and a Curator that integrates those lessons back into the context.
A playbook in sports or business is a living document of proven strategies that grows as teams learn from wins and losses. ACE applies this metaphor to AI: the Context Playbook starts empty and is updated after every task attempt. The Generator uses it, the Reflector extracts lessons from the outcome, and the Curator writes those lessons as new entries. Over dozens of iterations the playbook becomes a rich, task-specific knowledge base, without ever retraining the underlying model.
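The Generator/Reflector/Curator loop described above can be sketched as plain control flow. This is a hedged skeleton, not the paper's implementation: each role would be an LLM call with its own prompt, but here the roles are stubbed so the loop is runnable.

```python
from dataclasses import dataclass, field

@dataclass
class Playbook:
    bullets: list[str] = field(default_factory=list)

def generator(task: str, playbook: Playbook) -> dict:
    # In ACE this is an LLM attempting the task with the playbook in context.
    return {"task": task, "trace": f"attempted '{task}' with {len(playbook.bullets)} tips"}

def reflector(outcome: dict) -> list[str]:
    # In ACE this is an LLM distilling lessons from the execution trace.
    return [f"lesson from {outcome['task']}"]

def curator(playbook: Playbook, lessons: list[str]) -> None:
    # Merge lessons as new entries; skip entries already present.
    for lesson in lessons:
        if lesson not in playbook.bullets:
            playbook.bullets.append(lesson)

pb = Playbook()
for task in ["task-1", "task-2", "task-1"]:
    outcome = generator(task, pb)
    curator(pb, reflector(outcome))

print(pb.bullets)  # grows across tasks; the repeat of task-1 adds nothing new
```

The separation matters: the Generator never has to judge its own output, and the Curator never has to re-derive lessons, so each role's prompt stays focused on one job.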
Rather than regenerating contexts in full, ACE incrementally produces compact delta contexts: small sets of candidate bullets distilled by the Reflector and integrated by the Curator. This preserves past knowledge and avoids the computational cost of full rewrites.
Instead of rewriting the whole playbook after each task, ACE only appends the difference: new tips, corrected strategies, observed failure patterns. This is analogous to a Git commit: each run produces a small diff that is merged into the main branch. The full history is preserved, nothing is lost, and the total cost per update is much lower than regenerating the entire document.
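A delta context can be modeled as a short list of operations applied to the playbook, diff-style, rather than a full rewrite. The operation names below are illustrative, not taken from the paper:

```python
def apply_delta(playbook: list[str], delta: list[tuple[str, str]]) -> list[str]:
    """Apply a small list of (op, bullet) operations to the playbook."""
    updated = list(playbook)  # existing entries are preserved, never regenerated
    for op, bullet in delta:
        if op == "add" and bullet not in updated:
            updated.append(bullet)          # new knowledge is appended
        elif op == "remove" and bullet in updated:
            updated.remove(bullet)          # explicit pruning only
    return updated

playbook = ["Use retries on flaky APIs."]
delta = [
    ("add", "Log in before calling user endpoints."),
    ("add", "Use retries on flaky APIs."),   # duplicate: skipped
    ("remove", "Stale tip"),                 # absent: no-op
]
print(apply_delta(playbook, delta))
```

The cost of an update scales with the size of the delta, not the size of the playbook, which is what keeps long-running adaptation affordable.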
Beyond incremental growth, ACE ensures contexts remain compact through periodic or lazy refinement. A de-duplication step prunes redundancy by comparing bullets via semantic embeddings, maintaining comprehensive but non-redundant playbooks.
As the playbook grows, some entries will say essentially the same thing in different words. ACE detects this using semantic embeddings: each bullet is converted into a numerical vector that captures its meaning. Bullets whose vectors are very close together (cosine similarity above a threshold) are considered duplicates and one is removed. This keeps the playbook lean without discarding genuinely distinct knowledge.
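The dedup step can be sketched with cosine similarity over embedding vectors. In a real system `embed` would be a sentence-embedding model; here it is a deterministic toy bag-of-words hash just so the sketch runs, and the 0.9 threshold is an assumed value, not one from the paper.

```python
import math

def embed(text: str, dim: int = 32) -> list[float]:
    # Toy stand-in for a sentence-embedding model: hash each word
    # (by character-code sum) into a small bag-of-words vector.
    v = [0.0] * dim
    for word in text.lower().split():
        v[sum(ord(c) for c in word) % dim] += 1.0
    return v

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedup(bullets: list[str], threshold: float = 0.9) -> list[str]:
    # Keep a bullet only if it is not too similar to any already-kept bullet.
    kept: list[str] = []
    for bullet in bullets:
        if all(cosine(embed(bullet), embed(k)) < threshold for k in kept):
            kept.append(bullet)
    return kept

bullets = [
    "Always validate API responses.",
    "Always validate API responses.",   # near-identical: pruned
    "Cache results of expensive calls.",
]
print(dedup(bullets))  # two distinct bullets survive
```

Running dedup lazily (only when the playbook exceeds a size budget) amortizes the embedding cost across many updates.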
| Method | GT Labels | Test-Normal TGC↑ | Test-Normal SGC↑ | Test-Challenge TGC↑ | Test-Challenge SGC↑ | Average |
|---|---|---|---|---|---|---|
| *DeepSeek-V3.1-671B as Base LLM* | | | | | | |
| ReAct | ✗ | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| *Offline Adaptation* | | | | | | |
| ReAct + ICL | ✓ | 64.3 | 46.4 | 46.0 | 27.3 | 46.0 |
| ReAct + GEPA | ✓ | 64.9 | 44.6 | 46.0 | 30.2 | 46.4 |
| ReAct + ACE | ✓ | 76.2 | 64.3 | 57.3 | 39.6 | 59.4 |
| ReAct + ACE | ✗ | 75.0 | 64.3 | 54.4 | 35.2 | 57.2 |
| *Online Adaptation* | | | | | | |
| ReAct + DC (CU) | ✗ | 65.5 | 58.9 | 52.3 | 30.8 | 51.9 |
| ReAct + ACE | ✗ | 69.6 | 53.6 | 66.0 | 48.9 | 59.5 |
Table 1: AppWorld results. ACE outperforms all baselines by an average of +10.6%, and remains effective even without ground-truth labels.
| Method | GT Labels | FiNER (Acc↑) | Formula (Acc↑) | Average |
|---|---|---|---|---|
| *DeepSeek-V3.1 as Base LLM* | | | | |
| Base LLM | ✗ | 70.7 | 67.5 | 69.1 |
| *Offline Adaptation* | | | | |
| ICL | ✓ | 72.3 | 67.0 | 69.6 |
| MIPROv2 | ✓ | 72.4 | 69.5 | 70.9 |
| GEPA | ✓ | 73.5 | 71.5 | 72.5 |
| ACE | ✓ | 78.3 | 85.5 | 81.9 |
| ACE | ✗ | 71.1 | 83.0 | 77.1 |
| *Online Adaptation* | | | | |
| DC (CU) | ✓ | 74.2 | 69.5 | 71.8 |
| DC (CU) | ✗ | 68.3 | 62.5 | 65.4 |
| ACE | ✓ | 76.7 | 76.5 | 76.6 |
| ACE | ✗ | 67.3 | 78.5 | 72.9 |
Table 2: Financial analysis results. ACE achieves +12.8% average improvement with GT labels in offline adaptation.
Financial NLP tasks like FiNER (named-entity recognition in financial text) and Formula (financial formula reasoning) require highly specialized knowledge, such as abbreviations, reporting conventions, and sector-specific jargon, that general LLM training underrepresents. ACE's playbook accumulates these domain nuances from in-domain examples, effectively giving the model a field-specific reference sheet that grows with every solved case. The +12.8% jump with ground-truth labels shows just how much headroom domain-specific context can unlock compared to a generic base LLM.
arXiv:2510.04618v3 [cs.LG] · Published at ICLR 2026