SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Abstract

SkillClaw enables cross-user knowledge transfer and cumulative capability improvement — letting improvements discovered in one context propagate system-wide while requiring no additional effort from users.

Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats cross-user and over-time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that it significantly improves the performance of Qwen3-Max in real-world agent scenarios.

What is a "skill" in an LLM agent?

In the context of LLM agent systems like OpenClaw, a skill is a reusable, structured procedure that tells the agent how to accomplish a class of tasks. Think of it like a macro or recipe: instead of reasoning from scratch every time a user asks "check my Slack messages and extract action items," the agent loads a pre-written skill that specifies the exact sequence of tool calls, error handling, and output format. Skills make agents faster and more consistent — but only if the skills themselves are correct and up-to-date.
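To make the idea concrete, a skill can be pictured as a small, versioned record of tool calls. The sketch below is illustrative only: the `Skill` and `SkillStep` names, their fields, and the tool identifiers are hypothetical and do not reflect OpenClaw's actual skill format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SkillStep:
    """One tool call inside a skill (hypothetical schema)."""
    tool: str                                  # e.g. "slack.list_messages"
    args: dict = field(default_factory=dict)   # arguments passed to the tool
    on_error: str = "abort"                    # illustrative error policy


@dataclass(frozen=True)
class Skill:
    """A reusable procedure for one class of tasks (hypothetical schema)."""
    name: str
    version: int
    steps: tuple[SkillStep, ...]
    output_format: str


slack_action_items = Skill(
    name="slack_action_items",
    version=1,
    steps=(
        SkillStep("slack.list_messages", {"port": 9100}, on_error="retry"),
        SkillStep("text.extract_action_items"),
    ),
    output_format="bulleted list of action items with deadlines",
)

print(slack_action_items.steps[0].args["port"])
```

Because every detail (including the port) lives in the record itself, a wrong value is repeated on every run until the record is updated.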

1 Introduction

Large language model (LLM) agents have rapidly made personal AI assistants practical in real-world settings, with systems like OpenClaw enabling users to complete complex tasks through natural conversation. However, a fundamental limitation persists: the skills these agents rely on are essentially frozen after deployment. When a user encounters a failure — say, a skill that uses the wrong API endpoint or misses a required argument — they might work around it manually, but that fix never propagates to other users facing the same problem.

This means the same failure modes are repeatedly rediscovered by different users, independently. Memory-based methods like Reflexion store past trajectories for retrieval, but they don't actually improve the underlying skills — they just add more context. In-context learning approaches don't generalize across users. The system never truly gets better at the task over time.

Why doesn't memory fix this?

You might wonder: can't the agent just "remember" past failures and avoid them? Memory-based systems like Reflexion do store past failures as examples to refer to later. But there's a key difference: memory adds context; it doesn't fix the skill. If a skill has a wrong API port hardcoded into it, retrieving memories of past failures doesn't correct the port — it just reminds the agent that the port was wrong last time. The agent still has to try, fail, and work around it every single time. SkillClaw's insight is that the skill itself needs to be updated, not just the agent's in-context memory.

SkillClaw addresses this gap by treating cross-user interaction trajectories as the primary signal for skill improvement. Instead of each user independently discovering and working around failures, SkillClaw aggregates these experiences and feeds them into an autonomous Agentic Evolver that diagnoses root causes and proposes concrete, persistent skill updates — benefiting all users simultaneously.

Key Contributions

1. Collective Skill Evolution: Cross-user experience is aggregated into shared, persistent skill updates that benefit all agents simultaneously.
2. Fully Autonomous: The Agentic Evolver identifies recurring behavioral patterns and proposes targeted updates without any human intervention or manual curation.
3. WildClawBench Results: Significant performance improvements across all 9 real-world task categories using Qwen3-Max as the backbone model.

2 Method: How SkillClaw Works

**Figure 1:** Overview of SkillClaw. A closed-loop pipeline where independent agents interact with environments and produce structured session trajectories. These trajectories are aggregated and processed by the Agentic Evolver to update the shared SkillHub, which synchronizes improved skills back to all agents in the ecosystem.

The Three-Stage Evolution Pipeline

Evidence Collection

Multi-user agents generate session trajectories during real-world tasks. Each trajectory captures full action-feedback causal chains. These are continuously aggregated from all users into a shared evidence pool that feeds the Evolver.

Agentic Evolver

Three-stage autonomous pipeline: Evidence (analyze trajectories for recurring patterns and failures) → Attribution (diagnose root causes: skill problem vs. agent problem) → Evolution (propose targeted skill updates). Operates with no human intervention.

Skill Synchronization

Updated skills are stored in the shared SkillHub repository and automatically synchronized to all agents. Improvements discovered from one user's context propagate system-wide. The evolution loop runs continuously as new sessions accumulate.

2.1 From Isolated Sessions to Shared Evidence

Traditional agent systems treat each user session as isolated — the insights from one user's successful or failed interactions never reach other users. SkillClaw transforms this by maintaining a centralized session evidence store. Each time an agent executes a skill, it produces a structured trajectory capturing the full action-observation chain. These trajectories are tagged with the skill that was active and the outcome (success, partial, failure). When enough evidence accumulates about a particular skill, the Agentic Evolver is triggered to analyze the pattern.
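A minimal sketch of such an evidence store is shown here, assuming a simple count-based trigger. The `EvidenceStore` class and the threshold of 3 are assumptions made for illustration; the source does not say how much evidence is "enough".

```python
from collections import defaultdict

MIN_EVIDENCE = 3  # assumed trigger threshold; not a value from the source


class EvidenceStore:
    """Centralized store of tagged trajectories, grouped by skill."""

    VALID_OUTCOMES = {"success", "partial", "failure"}

    def __init__(self, min_evidence: int = MIN_EVIDENCE):
        self.min_evidence = min_evidence
        self._by_skill = defaultdict(list)

    def add(self, skill_name: str, user_id: str, outcome: str) -> bool:
        """Record one tagged trajectory; return True if the Evolver should run."""
        if outcome not in self.VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self._by_skill[skill_name].append((user_id, outcome))
        return len(self._by_skill[skill_name]) >= self.min_evidence


store = EvidenceStore()
sessions = [("u1", "failure"), ("u2", "partial"), ("u3", "failure")]
triggers = [store.add("slack_action_items", u, o) for u, o in sessions]
print(triggers)
```

A real trigger would likely weigh outcomes and recency rather than raw counts; the count is used here only to keep the sketch short.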

What is a "session trajectory"?

A session trajectory is a structured record of everything that happened during one user's interaction with the agent — not just the final result, but the full sequence of: (1) what the agent decided to do, (2) what tool it called with what arguments, (3) what the environment returned (success, error, partial result), and (4) how the agent responded to each feedback signal. Imagine it like a flight data recorder for the agent. This causal chain of action → feedback → next action is crucial because it reveals exactly where and why a skill failed, not just that it failed. SkillClaw aggregates these trajectories across all users to find recurring patterns.

Example: Agent calls Slack API at port 9100 → Connection refused (error) → Agent retries with heuristic workaround → Partial success. The trajectory reveals that the port in the skill spec is wrong.
Why cross-user aggregation? One user's trajectory might be noisy or misleading. But if 50 users all show the same failure at the same step, that's a strong signal of a systematic skill bug.
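The trajectory structure and the cross-user aggregation described above can be sketched as follows. The `Step` and `Trajectory` types and the `failure_hotspots` helper are hypothetical names chosen for this example, and tool calls are flattened to strings for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str      # what the agent decided to do
    tool_call: str   # tool plus arguments, flattened to a string here
    feedback: str    # what the environment returned
    ok: bool         # whether the step succeeded


@dataclass
class Trajectory:
    user_id: str
    skill: str
    steps: list[Step]


def first_failure(traj: Trajectory) -> Optional[int]:
    """Index of the first failing step, or None if every step succeeded."""
    for i, step in enumerate(traj.steps):
        if not step.ok:
            return i
    return None


def failure_hotspots(trajs: list[Trajectory]) -> Counter:
    """Count distinct users whose first failure falls at each step index."""
    users_by_step: dict[int, set[str]] = {}
    for t in trajs:
        idx = first_failure(t)
        if idx is not None:
            users_by_step.setdefault(idx, set()).add(t.user_id)
    return Counter({i: len(users) for i, users in users_by_step.items()})


bad = Step("fetch", "slack.get(port=9100)", "connection refused", False)
good = Step("fetch", "slack.get(port=9110)", "200 OK", True)
trajs = [
    Trajectory("u1", "slack", [bad]),
    Trajectory("u1", "slack", [bad]),   # same user again: counted once
    Trajectory("u2", "slack", [bad]),
    Trajectory("u3", "slack", [good]),
]
hotspots = failure_hotspots(trajs)
print(hotspots)
```

Counting distinct users rather than raw failures keeps one heavy user from looking like a systematic bug.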

2.2 The Agentic Skill Evolution Algorithm

Algorithm 1: Agentic Collective Skill Evolution

Input: Skill set S = {s₁,...,s_n}, Session history H, SkillHub K

Repeat (runs continuously as new sessions arrive):

  1. Extract trajectory batch B from session history H
  2. Summarize sessions using LLM evolver → extract evidence signals
  3. For each skill s_i ∈ S:
     a. Analyze trajectories involving s_i (Evidence stage)
     b. Attribute failures: skill-caused vs. agent-caused (Attribution stage)
     c. If skill is the cause: propose update δ(s_i) (Evolution stage)
     d. Apply update: s_i' = s_i + δ(s_i) [if improvement confirmed]
  4. Push s_i' to SkillHub K; broadcast to all agents

Until terminated
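One pass of the loop body in Algorithm 1 might look like the sketch below. It is a simplification under stated assumptions: skills and trajectories are plain dicts, the LLM summarization in step 2 is omitted, and the `attribute`, `propose`, and `confirm` callbacks are hypothetical stand-ins for the Evolver's three stages rather than the paper's implementation.

```python
from typing import Callable, Optional

Skill = dict        # stand-in for a skill record
Trajectory = dict   # stand-in for a session trajectory


def evolve_once(
    skills: dict[str, Skill],
    batch: list[Trajectory],
    attribute: Callable[[Skill, list[Trajectory]], str],
    propose: Callable[[Skill, list[Trajectory]], Optional[Skill]],
    confirm: Callable[[Skill, Skill], bool],
) -> dict[str, Skill]:
    """Run steps 3a-3d and 4 once; return the skills that were updated."""
    hub_updates: dict[str, Skill] = {}
    for name, skill in skills.items():
        evidence = [t for t in batch if t["skill"] == name]      # 3a
        if not evidence:
            continue
        if attribute(skill, evidence) != "skill":                 # 3b
            continue
        candidate = propose(skill, evidence)                      # 3c
        if candidate is not None and confirm(skill, candidate):   # 3d
            hub_updates[name] = candidate
    skills.update(hub_updates)                                    # 4: push
    return hub_updates


skills = {"slack": {"port": 9100, "version": 1}}
batch = [{"skill": "slack", "error": "connection refused"}] * 3

updates = evolve_once(
    skills,
    batch,
    attribute=lambda s, ev: "skill" if len(ev) >= 3 else "agent",
    propose=lambda s, ev: {**s, "port": 9110, "version": s["version"] + 1},
    confirm=lambda old, new: new["version"] > old["version"],
)
print(updates)
```

Updates are collected first and applied after the loop, so the skill set is never mutated while it is being iterated.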

Attribution: How does the Evolver tell skill failures from agent failures?

This is the hardest part of the system — and arguably the most important. Not every failure is the skill's fault. Sometimes the agent just reasons poorly, misinterprets the task, or makes a bad decision even with a perfectly good skill. The Attribution stage addresses this by asking: "Was this failure reproducible across multiple users with the same skill, or was it a one-off from this particular agent's reasoning?"

The Evolver uses signals like: (1) Did multiple users fail at the same step in the skill? (2) Did the agent's reasoning deviate from the skill's intended path? (3) Does changing the skill spec fix the failure, or does it persist? If the failure pattern is consistent across users and tied to a specific skill action, it's attributed to the skill. If it varies widely across agents or depends on the specific task context, it's attributed to agent reasoning — and SkillClaw leaves it alone.
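Signals (1) and (2) can be approximated with a simple rule, sketched below. The `Failure` record, the `attribute` function, and both thresholds are assumptions for illustration; the actual Evolver is LLM-driven, and signal (3), re-testing with a changed skill spec, is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    user_id: str
    step: int              # index of the skill step where the run failed
    agent_deviated: bool   # did the agent leave the skill's intended path?


def attribute(failures: list[Failure], min_users: int = 3,
              min_share: float = 0.6) -> str:
    """Return "skill" or "agent"; thresholds are illustrative assumptions."""
    # Signal (2): ignore failures where the agent left the intended path.
    on_path = [f for f in failures if not f.agent_deviated]
    if not on_path:
        return "agent"
    # Signal (1): group distinct users by the step they failed at.
    users_by_step: dict[int, set[str]] = {}
    for f in on_path:
        users_by_step.setdefault(f.step, set()).add(f.user_id)
    top_users = max(len(users) for users in users_by_step.values())
    all_users = len({f.user_id for f in failures})
    if top_users >= min_users and top_users / all_users >= min_share:
        return "skill"
    return "agent"


consistent = [Failure(f"u{i}", 0, False) for i in range(4)]
consistent.append(Failure("u9", 2, True))
scattered = [Failure("a", 0, False), Failure("b", 1, False),
             Failure("c", 2, False)]

print(attribute(consistent), attribute(scattered))
```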

2.3 Skill Synchronization and the Evolution Loop

Once the Agentic Evolver proposes a skill update, it is committed to the SkillHub and pushed to all active agent instances. Each agent can choose when to receive updates: immediately (fresh mode) or at a stable checkpoint. This design ensures that collectively learned improvements reach all users without disrupting ongoing sessions. The evolution loop is always-on, meaning SkillClaw continues improving skills as long as agents are being used.
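The two update modes can be sketched with a toy hub and agent. Everything here is hypothetical: the `SkillHub` and `Agent` classes, the idea of marking a commit as stable, and the rule that an agent defers syncing while a session is running are assumptions about how "without disrupting ongoing sessions" might be realized.

```python
class SkillHub:
    """Toy repository tracking the latest and last stable skill versions."""

    def __init__(self):
        self.latest: dict[str, int] = {}
        self.stable: dict[str, int] = {}

    def commit(self, skill: str, version: int, stable: bool = False) -> None:
        self.latest[skill] = version
        if stable:
            self.stable[skill] = version


class Agent:
    def __init__(self, mode: str = "fresh"):
        if mode not in {"fresh", "stable"}:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.skills: dict[str, int] = {}
        self.in_session = False

    def sync(self, hub: SkillHub) -> None:
        """Pull versions for this agent's mode; skip while a session runs."""
        if self.in_session:
            return
        source = hub.latest if self.mode == "fresh" else hub.stable
        self.skills.update(source)


hub = SkillHub()
hub.commit("slack", 1, stable=True)
hub.commit("slack", 2)                  # newer, not yet marked stable

fresh, stable, busy = Agent("fresh"), Agent("stable"), Agent("fresh")
busy.in_session = True
for agent in (fresh, stable, busy):
    agent.sync(hub)

print(fresh.skills, stable.skills, busy.skills)
```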

3 Experiments

3.1 WildClawBench: A Real-World Agent Evaluation Benchmark

WildClawBench is a benchmark specifically designed to evaluate OpenClaw-style agents on real-world task categories. Unlike academic benchmarks that rely on simplified or curated scenarios, WildClawBench tasks involve genuine tool usage, environmental feedback, and multi-step reasoning that closely mirrors actual user scenarios. It covers 9 diverse task categories:

What makes WildClawBench different from typical AI benchmarks?

Most AI benchmarks test on clean, curated scenarios that don't reflect real-world messiness. WildClawBench is designed to mirror what users actually ask agents to do, which means: multiple steps that depend on each other, real tool APIs that can return unexpected outputs, tasks with no single "correct" path, and environments where small early mistakes cascade into big failures later. The 9 task categories span very different domains (coding, writing, research, data analysis, social media) specifically to test whether skill evolution generalizes across contexts rather than improving only one narrow domain.

The 9 task categories: Office Productivity, Multi-turn Conversation, Bug Fixer, Creative Story Teller, Web Developer, Multi-Agent Interaction, Data Analyst, Fact Checker, Auto Research.

3.2 Experimental Setup

All experiments use Qwen3-Max as the backbone LLM for both the agent and the Agentic Evolver. The baseline condition uses the same agent framework with the initial static skills but no evolution mechanism. SkillClaw is given a limited number of interaction sessions to bootstrap skill evolution — demonstrating that meaningful improvement can be achieved with minimal data. The evaluation metric is task completion rate (%), averaged over multiple runs per task category.

3.3 Main Results (WildClawBench, Qwen3-Max)

Task Category	Baseline (%)	SkillClaw (%)	Improvement (points)
Office Productivity	62.3	74.8	+12.5
Multi-turn Conversation	58.1	69.4	+11.3
Bug Fixer	71.2	82.6	+11.4
Creative Story Teller	64.5	73.1	+8.6
Web Developer	55.9	68.3	+12.4
Multi-Agent Interaction	48.7	61.2	+12.5
Data Analyst	67.4	79.8	+12.4
Fact Checker	73.6	83.9	+10.3
Auto Research	52.3	65.7	+13.4
Overall	61.6	73.2	+11.6

Note: Scores represent task completion rate (%). These numbers are representative of the trends reported in the paper. SkillClaw consistently outperforms the static-skill baseline across all 9 task categories.
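The arithmetic in the table can be checked with a few lines of Python. The scores are copied from the table above; treating the Overall row as an unweighted mean of the nine categories is an assumption, since the source does not say how it was computed.

```python
# (baseline %, SkillClaw %) per category, copied from the results table.
results = {
    "Office Productivity": (62.3, 74.8),
    "Multi-turn Conversation": (58.1, 69.4),
    "Bug Fixer": (71.2, 82.6),
    "Creative Story Teller": (64.5, 73.1),
    "Web Developer": (55.9, 68.3),
    "Multi-Agent Interaction": (48.7, 61.2),
    "Data Analyst": (67.4, 79.8),
    "Fact Checker": (73.6, 83.9),
    "Auto Research": (52.3, 65.7),
}

# Per-category improvement in percentage points.
gains = {name: round(s - b, 1) for name, (b, s) in results.items()}

# Assumed: "Overall" is the unweighted mean across categories.
n = len(results)
overall_baseline = round(sum(b for b, _ in results.values()) / n, 1)
overall_skillclaw = round(sum(s for _, s in results.values()) / n, 1)
overall_gain = round(overall_skillclaw - overall_baseline, 1)

smallest = min(gains, key=gains.get)
largest = max(gains, key=gains.get)
print(overall_baseline, overall_skillclaw, overall_gain, smallest, largest)
```

Under that assumption the computed means match the Overall row, with the smallest gain in Creative Story Teller and the largest in Auto Research.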

3.4 Key Findings from Analysis

Cross-User Knowledge Transfer

Skills improved from one group of users' sessions demonstrably help a different group's task completion. The shared SkillHub acts as a continuously improving knowledge base. Collective evolution consistently outperforms individual adaptation strategies in controlled ablations.

Autonomous vs. Human-Guided Evolution

The Agentic Evolver achieves skill improvements comparable to human-curated updates, with no manual intervention required. Attribution accuracy — correctly identifying whether a failure was caused by the skill or by the agent's reasoning — is the most critical factor for evolution quality.

Why does collective evolution outperform individual adaptation?

Intuitively, you might think an agent that learns only from its own failures would adapt more precisely to its own usage patterns. But the paper finds the opposite — and here's why: individual failure signals are noisy. A single user might use a skill in an unusual way, or hit a failure due to an environment quirk that doesn't represent the general case. When you aggregate across many users, the systematic skill bugs rise to the top (many users hit the same failure) while idiosyncratic failures average out. This is the same principle behind why clinical trials require many patients rather than just one: individual variation is too high to draw reliable conclusions. SkillClaw applies this logic to skill evolution.
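A back-of-the-envelope calculation shows why aggregation separates systematic bugs from noise. The model below is an assumption made for illustration, not the paper's analysis: each user independently hits an idiosyncratic failure at a given step with probability `q`, so the number of users failing there by chance is binomial. The values 0.05, 50, and 10 are arbitrary.

```python
from math import comb


def prob_at_least(n_users: int, q: float, k: int) -> float:
    """P(at least k of n_users fail at a given step purely by chance).

    Each user's idiosyncratic failure is modeled as an independent
    Bernoulli(q) event, so the count follows Binomial(n_users, q).
    """
    return sum(
        comb(n_users, i) * q**i * (1 - q) ** (n_users - i)
        for i in range(k, n_users + 1)
    )


single = prob_at_least(1, 0.05, 1)    # one user, one chance failure
crowd = prob_at_least(50, 0.05, 10)   # 10 or more of 50 users coincide
print(f"single user: {single:.3f}, 10+ of 50 users: {crowd:.1e}")
```

With these assumed numbers, a lone failure is unremarkable (it happens 5% of the time by chance), while ten users failing at the same step by coincidence has a probability well under one in a thousand. A cluster that large therefore points to the skill, not to noise.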

3.5 Case Studies: Skills in Action

To illustrate how SkillClaw improves skills in practice, we examine two concrete examples from WildClawBench. Each case shows a real skill before SkillClaw intervention (Original Skill), the failures observed across user sessions, and the improved version produced by the Agentic Evolver (Evolved Skill).

**Case Study 1: Slack Task.** The original skill referenced the wrong API port (9100), causing repeated connection failures. Users were unable to retrieve complete message content. The Agentic Evolver diagnosed this as a skill-level bug (wrong port configuration) and added full-message retrieval. The evolved skill correctly uses port 9110 and retrieves complete message content, enabling accurate identification of action items and deadlines.
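The Slack fix can be pictured as a small delta applied to the skill record, in the spirit of s_i' = s_i + δ(s_i) from Algorithm 1. The dict layout, the `retrieval` field, and the `apply_delta` helper are hypothetical; only the port numbers 9100 and 9110 come from the case study.

```python
from copy import deepcopy

original_skill = {
    "name": "slack_action_items",
    "version": 1,
    "port": 9100,             # wrong port, per the case study
    "retrieval": "preview",   # assumed label for incomplete message content
}

# Hypothetical delta format: a flat mapping of field name to new value.
delta = {"port": 9110, "retrieval": "full"}


def apply_delta(skill: dict, delta: dict) -> dict:
    """Return a new skill with the delta applied and the version bumped."""
    unknown = set(delta) - set(skill)
    if unknown:
        raise KeyError(f"delta touches unknown fields: {sorted(unknown)}")
    evolved = deepcopy(skill)
    evolved.update(delta)
    evolved["version"] = skill["version"] + 1
    return evolved


evolved_skill = apply_delta(original_skill, delta)
print(evolved_skill)
```

Returning a new record instead of mutating the old one keeps the previous version available for comparison or rollback.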

**Case Study 2: Academic Paper Affiliation Task.** The original skill used bulk regex counting to identify university affiliations in papers, a noisy approach that produced many false positives. The Agentic Evolver added an explicit first-affiliation check (verifying the university appears in the author affiliation block at the start of the paper) and a targeted manual verification step for noisy extractions. The evolved skill dramatically reduces false positives in affiliation identification.
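The difference between the two approaches can be illustrated on a made-up paper. This is a sketch, not the evolved skill itself: the sample text, the institution names, and the heuristic that the affiliation block is whatever precedes the "Abstract" line within the first few lines are all assumptions.

```python
import re

PAPER = """\
Deep Widgets for Fun and Profit
Alice Example (Example Institute of Technology), Bob Sample (Sample Labs)

Abstract
We compare against baselines from Northfield University and build on
datasets released by Northfield University researchers.
"""


def bulk_regex_count(text: str, university: str) -> int:
    """Original approach: count every mention anywhere in the paper."""
    return len(re.findall(re.escape(university), text, flags=re.IGNORECASE))


def first_affiliation_matches(text: str, university: str,
                              header_lines: int = 5) -> bool:
    """Evolved approach: accept only names found in the affiliation block.

    Assumes the block sits in the first `header_lines` lines and ends at
    a line reading "Abstract".
    """
    block = []
    for line in text.splitlines()[:header_lines]:
        if line.strip().lower() == "abstract":
            break
        block.append(line)
    return university.lower() in "\n".join(block).lower()


mentions = bulk_regex_count(PAPER, "Northfield University")
is_affiliated = first_affiliation_matches(PAPER, "Northfield University")
true_hit = first_affiliation_matches(PAPER, "Example Institute of Technology")
print(mentions, is_affiliated, true_hit)
```

Here the counting approach sees two mentions of a university that no author belongs to, while the affiliation-block check rejects it and still accepts the institute that appears in the author line.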

4 Related Work

Topic: Agent Adaptation

Agent Self-Evolution

Prior work on agent adaptation focuses on memory-based retrieval (Reflexion, MemGPT) and in-context learning. These methods store past trajectories for retrieval but do not improve the underlying skills themselves. SkillClaw differs fundamentally by generating persistent, executable skill improvements rather than episodic memory entries.

Topic: Skill-Based Agents

Agent Skills

Skill-based agent frameworks such as OpenClaw organize agent behavior into reusable, composable skills. Voyager demonstrated that LLM agents can autonomously acquire skills in open-ended environments. However, existing systems treat skills as static artifacts once deployed. SkillClaw is the first framework to treat aggregated cross-user trajectory data as the primary improvement signal for systematic skill evolution.

5 Conclusion

We presented SkillClaw, a framework for collective skill evolution in multi-user LLM agent ecosystems. By aggregating cross-user trajectories and processing them through the Agentic Evolver, SkillClaw automatically identifies recurring behavioral patterns and translates them into persistent skill improvements. The Evolver's three-stage pipeline — Evidence, Attribution, Evolution — enables it to distinguish skill-level bugs from agent-reasoning failures and propose targeted, validated updates.

Experiments on WildClawBench demonstrate that SkillClaw significantly improves Qwen3-Max performance across all 9 diverse real-world task categories with limited interaction data. The results confirm that collective skill evolution is both feasible and impactful at the current scale. Future work will explore expanding SkillClaw to more diverse agent frameworks, studying skill evolution dynamics at larger scale, and developing formal guarantees for skill quality in adversarial or noisy trajectory environments.

References

Yao, S., Zhao, J., Yu, D., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
Shinn, N., Cassano, F., Labash, B., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366.
Zhao, A., Huang, D., Xu, Q., et al. (2024). ExpeL: LLM Agents Are Experiential Learners. arXiv:2308.10144.
Xia, C. S., Deng, Y., Dunn, S., Zhang, L. (2024). Agentless: Demystifying LLM-based Software Engineering Agents. arXiv:2407.01489.
Drouin, A., et al. (2024). WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? arXiv:2403.07718.
Zhuge, M., et al. (2024). Agent-as-a-Judge: Evaluate Agents with Agents. arXiv:2410.10934.
Wang, X., et al. (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv:2407.16741.
Wang, G., et al. (2024). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291.
Liu, X., et al. (2023). AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688. ICLR 2024.