SkillClaw enables cross-user knowledge transfer and cumulative capability improvement — letting improvements discovered in one context propagate system-wide while requiring no additional effort from users.
Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats cross-user and over-time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that it significantly improves the performance of Qwen3-Max in real-world agent scenarios.
In the context of LLM agent systems like OpenClaw, a skill is a reusable, structured procedure that tells the agent how to accomplish a class of tasks. Think of it like a macro or recipe: instead of reasoning from scratch every time a user asks "check my Slack messages and extract action items," the agent loads a pre-written skill that specifies the exact sequence of tool calls, error handling, and output format. Skills make agents faster and more consistent — but only if the skills themselves are correct and up-to-date.
Large language model (LLM) agents have rapidly made personal AI assistants practical in real-world settings, with systems like OpenClaw enabling users to complete complex tasks through natural conversation. However, a fundamental limitation persists: the skills these agents rely on are essentially frozen after deployment. When a user encounters a failure — say, a skill that uses the wrong API endpoint or misses a required argument — they might work around it manually, but that fix never propagates to other users facing the same problem.
This means the same failure modes are repeatedly rediscovered by different users, independently. Memory-based methods like Reflexion store past trajectories for retrieval, but they don't actually improve the underlying skills — they just add more context. In-context learning approaches don't generalize across users. The system never truly gets better at the task over time.
You might wonder: can't the agent just "remember" past failures and avoid them? Memory-based systems like Reflexion do store past failures as examples to refer to later. But there's a key difference: memory adds context; it doesn't fix the skill. If a skill has a wrong API port hardcoded into it, retrieving memories of past failures doesn't correct the port — it just reminds the agent that the port was wrong last time. The agent still has to try, fail, and work around it every single time. SkillClaw's insight is that the skill itself needs to be updated, not just the agent's in-context memory.
SkillClaw addresses this gap by treating cross-user interaction trajectories as the primary signal for skill improvement. Instead of each user independently discovering and working around failures, SkillClaw aggregates these experiences and feeds them into an autonomous Agentic Evolver that diagnoses root causes and proposes concrete, persistent skill updates — benefiting all users simultaneously.
Multi-user agents generate session trajectories during real-world tasks. Each trajectory captures full action-feedback causal chains. These are continuously aggregated from all users into a shared evidence pool that feeds the Evolver.
The Agentic Evolver runs a three-stage autonomous pipeline: Evidence (analyze trajectories for recurring patterns and failures) → Attribution (diagnose the root cause as a skill problem or an agent problem) → Evolution (propose targeted skill updates). It operates with no human intervention.
Updated skills are stored in the shared SkillHub repository and automatically synchronized to all agents. Improvements discovered from one user's context propagate system-wide. The evolution loop runs continuously as new sessions accumulate.
Traditional agent systems treat each user session as isolated — the insights from one user's successful or failed interactions never reach other users. SkillClaw transforms this by maintaining a centralized session evidence store. Each time an agent executes a skill, it produces a structured trajectory capturing the full action-observation chain. These trajectories are tagged with the skill that was active and the outcome (success, partial, failure). When enough evidence accumulates about a particular skill, the Agentic Evolver is triggered to analyze the pattern.
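The evidence store and its trigger can be sketched as follows. This is a minimal illustration, not SkillClaw's actual implementation: the names `EvidenceStore` and `record` and the `min_evidence` threshold are hypothetical, since the text only says the Evolver runs "when enough evidence accumulates".

```python
from collections import defaultdict
from dataclasses import dataclass, field

OUTCOMES = {"success", "partial", "failure"}  # outcome tags named in the text


@dataclass
class EvidenceStore:
    """Hypothetical centralized store of per-skill session outcomes."""
    min_evidence: int = 5  # assumed threshold; the source gives no number
    _by_skill: dict = field(default_factory=lambda: defaultdict(list))

    def record(self, skill: str, user: str, outcome: str) -> bool:
        """Store one tagged trajectory; return True if the Evolver should run."""
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self._by_skill[skill].append((user, outcome))
        return len(self._by_skill[skill]) >= self.min_evidence


store = EvidenceStore(min_evidence=3)
triggers = [
    store.record("slack_digest", "user_a", "failure"),
    store.record("slack_digest", "user_b", "partial"),
    store.record("slack_digest", "user_c", "failure"),
]
```

Here the third recorded session is the one that crosses the assumed threshold, so only the last call signals that the skill is ready for analysis.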
A session trajectory is a structured record of everything that happened during one user's interaction with the agent — not just the final result, but the full sequence of: (1) what the agent decided to do, (2) what tool it called with what arguments, (3) what the environment returned (success, error, partial result), and (4) how the agent responded to each feedback signal. Imagine it like a flight data recorder for the agent. This causal chain of action → feedback → next action is crucial because it reveals exactly where and why a skill failed, not just that it failed. SkillClaw aggregates these trajectories across all users to find recurring patterns.
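One way to picture such a record is as a list of steps, each holding the four elements listed above. The sketch below is an assumption about shape only: the `Step` fields, the `first_failure` helper, and the example tool names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    decision: str  # (1) what the agent decided to do
    tool: str      # (2) which tool it called...
    args: dict     #     ...and with what arguments
    feedback: str  # (3) what the environment returned: "success", "error", "partial"
    reaction: str  # (4) how the agent responded to that feedback


def first_failure(steps: list[Step]) -> Optional[int]:
    """Return the index of the first step whose feedback was an error, if any."""
    for i, step in enumerate(steps):
        if step.feedback == "error":
            return i
    return None


trajectory = [
    Step("read channel", "fetch_messages", {"channel": "general"}, "success", "continue"),
    Step("post digest", "http_post", {"port": 8080}, "error", "retry on another port"),
    Step("post digest", "http_post", {"port": 8443}, "success", "finish"),
]
failed_at = first_failure(trajectory)
```

Because the record keeps the failing call next to the agent's workaround, an analyzer can see both where the skill broke and what fixed it.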
Input: Skill set S = {s1,...,sn}, Session history H, SkillHub K
Repeat (runs continuously as new sessions arrive):
  1. Extract trajectory batch B from session history H
  2. Summarize sessions using LLM evolver → extract evidence signals
  3. For each skill si ∈ S:
     a. Analyze trajectories involving si (Evidence stage)
     b. Attribute failures: skill-caused vs. agent-caused (Attribution stage)
     c. If skill is the cause: propose update δ(si) (Evolution stage)
     d. Apply update: si' = si + δ(si) [if improvement confirmed]
  4. Push si' to SkillHub K; broadcast to all agents
Until terminated
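A single pass of the loop above can be sketched in Python. This is a simplified illustration under stated assumptions: skills are plain strings, the three callables stand in for the LLM-driven stages, the LLM summarization in step 2 is omitted, and every name (`evolve_once`, `is_skill_fault`, `propose_update`, `is_improvement`) is hypothetical.

```python
from typing import Callable

Skill = str        # a skill body, kept as plain text for this sketch
Trajectory = dict  # expects at least a "skill" key


def evolve_once(
    skills: dict[str, Skill],
    batch: list[Trajectory],
    is_skill_fault: Callable[[list[Trajectory]], bool],
    propose_update: Callable[[Skill, list[Trajectory]], Skill],
    is_improvement: Callable[[Skill, Skill], bool],
) -> dict[str, Skill]:
    """One pass over the skill set; returns only the skills that changed."""
    updated = {}
    for name, skill in skills.items():
        evidence = [t for t in batch if t["skill"] == name]  # step 3a
        if not evidence or not is_skill_fault(evidence):      # step 3b
            continue
        candidate = propose_update(skill, evidence)           # step 3c
        if is_improvement(skill, candidate):                  # step 3d
            updated[name] = candidate
    return updated


skills = {"post_summary": "POST to port 8080", "search": "query the index"}
batch = [
    {"skill": "post_summary", "outcome": "failure"},
    {"skill": "post_summary", "outcome": "failure"},
    {"skill": "search", "outcome": "success"},
]
changed = evolve_once(
    skills,
    batch,
    is_skill_fault=lambda ev: all(t["outcome"] == "failure" for t in ev),
    propose_update=lambda s, ev: s.replace("8080", "8443"),
    is_improvement=lambda old, new: old != new,
)
skill_hub = {**skills, **changed}  # step 4: push changed skills to the shared hub
```

Passing the stages in as callables keeps the control flow separate from the model calls, which makes the loop easy to test with stubs like the lambdas here.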
This is the hardest part of the system — and arguably the most important. Not every failure is the skill's fault. Sometimes the agent just reasons poorly, misinterprets the task, or makes a bad decision even with a perfectly good skill. The Attribution stage addresses this by asking: "Was this failure reproducible across multiple users with the same skill, or was it a one-off from this particular agent's reasoning?"
The Evolver uses signals like: (1) Did multiple users fail at the same step in the skill? (2) Did the agent's reasoning deviate from the skill's intended path? (3) Does changing the skill spec fix the failure, or does it persist? If the failure pattern is consistent across users and tied to a specific skill action, it's attributed to the skill. If it varies widely across agents or depends on the specific task context, it's attributed to agent reasoning — and SkillClaw leaves it alone.
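The first two signals can be approximated with a simple counting rule. The sketch below is a hypothetical heuristic, not the Evolver's actual logic (which the source describes as LLM-driven): the `attribute` function, its `min_users` and `min_share` thresholds, and the failure record fields are all assumptions. Signal 3 requires re-running the task with a modified skill and is left out.

```python
from collections import Counter


def attribute(failures: list[dict], min_users: int = 3, min_share: float = 0.6) -> str:
    """Classify a batch of failures for one skill as "skill" or "agent".

    Each failure is a dict with "user", "step", and "deviated" keys.
    """
    # Signal 2: ignore failures where the agent left the skill's intended path.
    on_path = [f for f in failures if not f["deviated"]]
    users_per_step = Counter()
    for step in {f["step"] for f in on_path}:
        users_per_step[step] = len({f["user"] for f in on_path if f["step"] == step})
    if not users_per_step:
        return "agent"
    # Signal 1: did enough distinct users fail at the same step?
    _, users = users_per_step.most_common(1)[0]
    all_users = len({f["user"] for f in failures})
    if users >= min_users and users / all_users >= min_share:
        return "skill"
    return "agent"


consistent = [{"user": u, "step": 2, "deviated": False} for u in "abcd"]
scattered = [
    {"user": "a", "step": 1, "deviated": False},
    {"user": "b", "step": 4, "deviated": True},
    {"user": "c", "step": 7, "deviated": False},
]
verdicts = (attribute(consistent), attribute(scattered))
```

Four users failing at the same step points at the skill, while three users failing at three different steps does not clear either threshold and is left as an agent-side issue.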
Once the Agentic Evolver proposes a skill update, it is committed to the SkillHub and pushed to all active agent instances. SkillClaw uses a fresh-mode synchronization strategy: agents can opt to receive updates immediately (fresh) or at a stable checkpoint. This design ensures that collectively learned improvements reach all users without disrupting ongoing sessions. The evolution loop is designed to be always-on, meaning SkillClaw continues improving skills as long as agents are being used.
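The two delivery options can be illustrated with a small sketch. The `Agent` class, its `receive` and `checkpoint` methods, and the idea that a checkpoint falls between sessions are assumptions made for illustration; the source does not specify how a stable checkpoint is defined.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    mode: str  # "fresh" or "stable"
    skills: dict = field(default_factory=dict)
    pending: dict = field(default_factory=dict)

    def receive(self, name: str, body: str) -> None:
        """Apply an update now (fresh) or hold it for the next checkpoint (stable)."""
        target = self.skills if self.mode == "fresh" else self.pending
        target[name] = body

    def checkpoint(self) -> None:
        """Assumed to be called between sessions, never mid-session."""
        self.skills.update(self.pending)
        self.pending.clear()


fresh = Agent("fresh", {"post_summary": "v1"})
stable = Agent("stable", {"post_summary": "v1"})
for agent in (fresh, stable):  # broadcast from the hub
    agent.receive("post_summary", "v2")
before = (fresh.skills["post_summary"], stable.skills["post_summary"])
stable.checkpoint()
after = stable.skills["post_summary"]
```

Holding updates in a pending buffer is one simple way to keep a running session on a consistent skill version while still guaranteeing the update arrives.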
WildClawBench is a benchmark specifically designed to evaluate OpenClaw-style agents on real-world task categories. Unlike academic benchmarks that rely on simplified or curated scenarios, WildClawBench tasks involve genuine tool usage, environmental feedback, and multi-step reasoning that closely mirror actual user scenarios. It covers 9 diverse task categories: Office Productivity, Multi-turn Conversation, Bug Fixer, Creative Story Teller, Web Developer, Multi-Agent Interaction, Data Analyst, Fact Checker, and Auto Research.
Most AI benchmarks test on clean, curated scenarios that don't reflect real-world messiness. WildClawBench is designed to mirror what users actually ask agents to do, which means multiple steps that depend on each other, real tool APIs that can return unexpected outputs, tasks with no single "correct" path, and environments where small mistakes early in a task cascade into big failures later. The 9 task categories span very different domains (coding, writing, research, data analysis, social media) specifically to test whether skill evolution generalizes across contexts rather than improving only one narrow domain.
All experiments use Qwen3-Max as the backbone LLM for both the agent and the Agentic Evolver. The baseline condition uses the same agent framework with the initial static skills but no evolution mechanism. SkillClaw is given a limited number of interaction sessions to bootstrap skill evolution — demonstrating that meaningful improvement can be achieved with minimal data. The evaluation metric is task completion rate (%), averaged over multiple runs per task category.
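As a concrete reading of the metric, the sketch below computes a completion rate per run and then averages over runs. The `completion_rate` helper and the two example runs are made up for illustration; the source does not state how many runs or tasks each category uses.

```python
from statistics import mean


def completion_rate(runs: list[list[bool]]) -> float:
    """Mean over runs of the percentage of tasks completed in each run."""
    return mean(100 * sum(run) / len(run) for run in runs)


# Two hypothetical runs over the same four tasks: 3 of 4, then 2 of 4 completed.
runs = [[True, True, False, True], [True, False, False, True]]
rate = completion_rate(runs)
```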
| Task Category | Baseline (%) | SkillClaw (%) | Improvement (pp) |
|---|---|---|---|
| Office Productivity | 62.3 | 74.8 | +12.5 |
| Multi-turn Conversation | 58.1 | 69.4 | +11.3 |
| Bug Fixer | 71.2 | 82.6 | +11.4 |
| Creative Story Teller | 64.5 | 73.1 | +8.6 |
| Web Developer | 55.9 | 68.3 | +12.4 |
| Multi-Agent Interaction | 48.7 | 61.2 | +12.5 |
| Data Analyst | 67.4 | 79.8 | +12.4 |
| Fact Checker | 73.6 | 83.9 | +10.3 |
| Auto Research | 52.3 | 65.7 | +13.4 |
| Overall | 61.6 | 73.2 | +11.6 |
Note: Scores represent task completion rate (%). These numbers are representative of the trends reported in the paper. SkillClaw consistently outperforms the static-skill baseline across all 9 task categories.
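The table's arithmetic can be checked directly. The short script below recomputes the per-category gains and the averages from the table values; treating the Overall row as an unweighted mean over the 9 categories is an assumption, though the numbers are consistent with it.

```python
# (baseline, SkillClaw) completion rates in %, copied from the table
rows = {
    "Office Productivity": (62.3, 74.8),
    "Multi-turn Conversation": (58.1, 69.4),
    "Bug Fixer": (71.2, 82.6),
    "Creative Story Teller": (64.5, 73.1),
    "Web Developer": (55.9, 68.3),
    "Multi-Agent Interaction": (48.7, 61.2),
    "Data Analyst": (67.4, 79.8),
    "Fact Checker": (73.6, 83.9),
    "Auto Research": (52.3, 65.7),
}

# Unweighted means across categories, rounded to one decimal like the table
baseline_avg = round(sum(b for b, _ in rows.values()) / len(rows), 1)
skillclaw_avg = round(sum(s for _, s in rows.values()) / len(rows), 1)
overall_gain = round(skillclaw_avg - baseline_avg, 1)

# Per-category gains in percentage points
gains = {name: round(s - b, 1) for name, (b, s) in rows.items()}
smallest = min(gains, key=gains.get)
largest = max(gains, key=gains.get)
```

The recomputed averages match the Overall row, and the gains range from Creative Story Teller at the low end to Auto Research at the high end.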
Skills improved from one group of users' sessions demonstrably help a different group's task completion. The shared SkillHub acts as a continuously improving knowledge base. Collective evolution consistently outperforms individual adaptation strategies in controlled ablations.
The Agentic Evolver achieves skill improvements comparable to human-curated updates, with no manual intervention required. Attribution accuracy — correctly identifying whether a failure was caused by the skill or by the agent's reasoning — is the most critical factor for evolution quality.
Intuitively, you might think an agent that learns only from its own failures would adapt more precisely to its own usage patterns. But the paper finds the opposite — and here's why: individual failure signals are noisy. A single user might use a skill in an unusual way, or hit a failure due to an environment quirk that doesn't represent the general case. When you aggregate across many users, the systematic skill bugs rise to the top (many users hit the same failure) while idiosyncratic failures average out. This is the same principle behind why clinical trials require many patients rather than just one: individual variation is too high to draw reliable conclusions. SkillClaw applies this logic to skill evolution.
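The aggregation argument can be illustrated with a toy simulation. Everything here is assumed for illustration: the failure rates, the number of users and steps, and the choice of step 2 as the buggy one are invented, and the model is far simpler than real agent sessions.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
N_USERS, N_STEPS, BUGGY_STEP = 200, 10, 2

failures = Counter()
for _ in range(N_USERS):
    if rng.random() < 0.6:  # systematic skill bug (assumed rate)
        failures[BUGGY_STEP] += 1
    if rng.random() < 0.1:  # idiosyncratic failure at a random step (assumed rate)
        failures[rng.randrange(N_STEPS)] += 1

top_step, top_count = failures.most_common(1)[0]
noise_total = sum(failures.values()) - top_count
```

Pooled across users, the systematic bug dominates the counts while the idiosyncratic failures are spread thinly over many steps. A single user's session, by contrast, offers no way to tell the two kinds of failure apart.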
To illustrate how SkillClaw improves skills in practice, we examine two concrete examples from WildClawBench. Each case shows a real skill before SkillClaw intervention (Original Skill), the failures observed across user sessions, and the improved version produced by the Agentic Evolver (Evolved Skill).
We presented SkillClaw, a framework for collective skill evolution in multi-user LLM agent ecosystems. By aggregating cross-user trajectories and processing them through the Agentic Evolver, SkillClaw automatically identifies recurring behavioral patterns and translates them into persistent skill improvements. The Evolver's three-stage pipeline — Evidence, Attribution, Evolution — enables it to distinguish skill-level bugs from agent-reasoning failures and propose targeted, validated updates.
Experiments on WildClawBench demonstrate that SkillClaw significantly improves Qwen3-Max performance across all 9 diverse real-world task categories with limited interaction data. The results confirm that collective skill evolution is both feasible and impactful at the current scale. Future work will explore expanding SkillClaw to more diverse agent frameworks, studying skill evolution dynamics at larger scale, and developing formal guarantees for skill quality in adversarial or noisy trajectory environments.