SkillNet: Create, Evaluate, and Connect AI Skills

01 · ABSTRACT

Why agents keep reinventing the wheel

Today's AI agents can fluently invoke tools and complete complex tasks, yet their long-term growth is limited by the absence of a unified way to accumulate and transfer the capabilities they discover. Without a mechanism for skill consolidation, agents constantly rediscover the same solutions in isolation. SkillNet is an open infrastructure for creating, evaluating, and organizing AI skills at scale. It introduces a three-layer Skill Ontology — a taxonomy, a relation graph, and a package library — and transforms fragmented experience from execution trajectories, open-source repositories, and documents into a structured network of more than 200,000 candidate skills, curated down to 150,000+ high-quality entries. Every skill is scored along five dimensions: Safety, Completeness, Executability, Maintainability, and Cost-Awareness. On three benchmarks — ALFWorld, WebShop, and ScienceWorld — agents augmented with SkillNet improve average reward by +40% and cut interaction steps by −30% across DeepSeek V3.2, Gemini 2.5 Pro, and o4 Mini.

02 · MOTIVATION

Why Skills, Why Now

Richard Sutton has argued that we are now in the era of experience. Intelligence is grounded less in static knowledge acquisition and more in the efficient retrieval and adaptive reuse of heuristics distilled from prior runs. That shift marks the arrival of the agent era, where AI moves beyond static question answering to orchestrate long-horizon, executable tasks.

But current systems lean on manual engineering or transient in-context learning. Humans routinely consolidate episodic experience into reusable schemas — a programmer internalizes an algorithm's logic rather than memorizing its syntax — yet contemporary AI struggles to bridge the gap between fleeting context and durable capability. The central challenge is no longer learning from experience, but turning fragmented experience into composable skill units.

ERA 01

The Symbolic Era

Rigid symbolic logic was interpretable but brittle and did not scale.

ERA 02

The Deep Learning Era

Knowledge became parametric and powerful, but opaque and hard to modularize or reuse.

ERA 03

The Agentic Era

Skills emerge as simple, transferable capability units that separate intelligence from monolithic parameter spaces.

A unified skill framework
SkillNet turns fragmented agent experience into a structured network of modular, composable skills with rich relational modeling, giving teams a scalable foundation for actionable knowledge engineering.
A rigorous 5-axis evaluation protocol
Every skill is quantitatively measured along Safety, Completeness, Executability, Maintainability, and Cost-Awareness — ensuring the repository remains reliable as it scales.
An open ecosystem
A 200k+ candidate / 150k+ curated skill repository, a Python toolkit (skillnet-ai), and benchmarks that demonstrate significant gains in agent planning and execution.

03 · PRIMER

What Is an Agent Skill?

A skill is a lightweight, modular, reusable abstraction for extending an agent's capabilities. Functionally, a skill is a structured folder containing a central SKILL.md file that defines metadata — name, a short description, and usage conditions — alongside step-by-step instructions, and optionally bundled code, templates, or reference assets.

Skills give agents on-demand access to reusable procedural knowledge: automating a data analysis pipeline, performing domain reasoning, or generating structured reports. The self-documenting nature of skills makes each capability easy to audit and iterate on — they become the interface through which business logic gets institutionalized.

Think of a skill as a Claude Code "skill folder"

If you use Claude Code or similar frameworks, an "agent skill" looks exactly like the skill folders you already know: a directory with a SKILL.md that tells the agent when to activate ("use this for Postgres migrations"), and how to run ("1. dump schema 2. diff 3. run migration"). The agent only loads the full instructions when the task matches. That's why this paper treats skills as modular units — they're literally copy-pasteable between projects.

STEP 01

Discovery

The agent loads minimal metadata — name and description — to decide which skills are potentially relevant for the task at hand.

STEP 02

Activation

When a task matches a skill's description, the agent reads the full SKILL.md instructions and prepares any bundled resources.

STEP 03

Execution

The agent follows the instructions step-by-step, optionally invoking bundled code or consuming referenced assets to complete the task.

04 · SYSTEM

SkillNet Overview

SkillNet systematically creates, evaluates, and organizes high-quality skills for agent systems. Its goal is to transform fragmented agent experience and human knowledge into reusable and verifiable skill entities, enabling scalable and reliable capability growth.

MODULE 01

Skill Creation

Analyzes diverse inputs — user trajectories, office documents, GitHub projects, direct prompts, and open internet resources — and generates new skills by extracting executable patterns and structuring them as reusable capabilities.

MODULE 02

Skill Evaluation

Filters and grades candidate skills along five dimensions — Safety, Completeness, Executability, Maintainability, and Cost-Awareness — ensuring only high-quality skills enter the repository.

MODULE 03

Skill Analysis

Automatically mines structural and functional relationships among skills, building a large-scale typed graph that captures similarity, hierarchy, composition, and dependency for global reasoning and composition.

MODULE 04

Open Resources

A curated repository, a front-end website, an open API, and the skillnet-ai Python toolkit — one ecosystem for skill management, search, evaluation, and contribution.

05 · ONTOLOGY

The Skill Ontology

The Skill Ontology organizes individual skills into a structured, composable network. The architecture stacks three progressive layers: a taxonomy, a relation graph, and a package library. Taken together, they transform a flat bag of skills into a navigable knowledge structure.

Three-layer skill ontology diagram with taxonomy, relation graph, and package library. — **Figure 3.** The Skill Ontology has three layers. The Skill Taxonomy defines functional categories and tags. The Skill Relation Graph models inter-skill edges (`similar_to`, `belong_to`, `compose_with`, `depend_on`). The Skill Package Library bundles skills into modular packages such as `data-science-visualization` or `e2e-browser-testing`.

LAYER 01

Skill Taxonomy

A multi-level hierarchy built from category and tag relations. Broad domains (Development, AIGC, Research, Science, Testing, …) are refined into fine-grained tags (frontend, llm, physics), providing a high-level semantic skeleton.

LAYER 02

Skill Relation Graph

Instantiates abstract tags into concrete Skill Entities (e.g. matplotlib, playwright) and connects them with typed edges: similar_to, compose_with, belong_to, depend_on. This is the backbone for reasoning and planning.

Why four relation edges matter

A flat list of 150,000 skills is unsearchable. The relation graph is what turns the repository into something an agent can reason over:

similar_to lets the agent swap one skill for another when the first fails (redundancy).
compose_with tells the agent which skills feed each other (automatic pipelines).
depend_on prevents broken runs by surfacing prerequisites.
belong_to enables zoom-out planning by grouping atomic skills into larger workflows.

Concretely: if an agent wants to "build a Next.js landing page", it can discover react-patterns via belong_to nextjs-expert, verify it depend_on react, and swap it out for frontend-patterns if needed.

LAYER 03

Skill Package Library

Physical organization of skills. Individual skills are encapsulated into Skill Packages via the packaged_in relation (e.g. react-nextjs-fullstack, data-science-visualization), enabling modular release and deployment.

06 · PIPELINE

How New Skills Enter the Network

SkillNet treats skills as intermediate capability units that bridge declarative knowledge and concrete, executable programs. An automated creation pipeline turns four kinds of input into standardized skills: execution trajectories and conversational logs, open-source GitHub repositories, semi-structured documents (PDF, PowerPoint, Word), and direct natural-language prompts from users.

Automatic construction does not mean indiscriminate accumulation. A data-driven filtering and consolidation pipeline deduplicates, validates, classifies, scores, and finally selects each candidate, so the repository grows as a self-evolving skill ecosystem rather than a static collection.

FILTERING PIPELINE

Deduplication — jointly compare directory structure and MD5 hashes of skill Markdown files.
Filtering — rule-based validation plus model-based checking to eliminate low-quality or semantically empty skills.
Categorization & Tagging — classify each skill into one of 10 functional categories and attach fine-grained tags.
Evaluation — grade along Safety, Completeness, Executability, Maintainability, Cost-Awareness.
Selective Consolidation — keep only skills that survive all previous stages.

A self-evolving skill ecosystem, not a static collection.

07 · EVALUATION

Multi-Dimensional Skill Evaluation

Although skill creation matters, the usefulness of a repository ultimately depends on its reliability. SkillNet bridges the gap left by existing repositories — which rarely have standardized assessment — with a five-axis rubric plus sandboxed execution checks for runtime correctness.

AXIS 01

Safety

Hazardous system operations (e.g. unauthorized file deletion) and robustness against prompt injection or adversarial manipulation.

AXIS 02

Completeness

Whether all critical procedural steps are captured and prerequisites, dependencies, and execution constraints are explicit.

AXIS 03

Executability

Agents can actually run the skill in a sandboxed environment, with hallucinated tool calls and ambiguous instructions flagged.

AXIS 04

Maintainability

Modularity and composability: can the skill be updated locally without breaking global dependencies or backward compatibility?

AXIS 05

Cost-Awareness

Time latency, compute, and API cost awareness, to support efficiency optimization in production.

The 5 axes, translated for practitioners

Think of a skill as a library your team ships: these five axes mirror the same quality gates you already use in code review.

Safety ≈ "does it have dangerous side effects?" (like a migration that drops tables without a prompt)
Completeness ≈ "does the README actually explain every step?" (prerequisites + edge cases)
Executability ≈ "does it run green on CI?" (actual sandbox execution, not just LLM opinion)
Maintainability ≈ "can I fork it without breaking downstream?" (modularity)
Cost-Awareness ≈ "what's the p99 latency and the API bill?" (production operability)

That analogy is why the next figure matters: the automated evaluator agrees with PhD human reviewers on every axis — it's essentially an automated code reviewer for agent skills.

Stacked-bar quality distribution for the five dimensions and a heatmap of MAE and QWK between LLM-based and human raters. — **Figure 4.** (a) Quality distribution of curated skills across the five dimensions using a three-level grading scheme — most skills are rated Good or Average. (b) Reliability check on 200 randomly sampled skills: the automated LLM-based evaluator achieves MAE < 0.03 and near-perfect QWK (1.000) across every dimension against three PhD human annotators.

KEY INSIGHT

Across 200 randomly sampled skills reviewed blind by three PhD-level computer science annotators, the LLM-based evaluator achieved Mean Absolute Error under 0.03 and near-perfect Quadratic Weighted Kappa (1.000) on every dimension — strong evidence that SkillNet's automated quality gate is trustworthy at scale.

08 · RELATIONS

Mining the Skill Relation Graph

Beyond isolated skill creation and evaluation, large-scale repositories introduce a new challenge: systematically understanding, organizing, and exploiting relations between skills. SkillNet formulates skill analysis as a structured relation discovery problem and automatically annotates four core relation types between skills.

similar_to Functionally equivalent skills — enables redundancy detection and substitution.

belong_to One skill is a sub-component of a larger workflow, capturing hierarchy.

compose_with Two skills are frequently co-invoked; one consumes the other's output.

depend_on A skill cannot execute without a prerequisite (e.g. environment setup).

How the graph is actually built (without manual labeling)

Hand-labeling relations between 150,000 skills is impossible. SkillNet's trick is a three-stage pipeline: (1) embedding similarity proposes candidate similar_to edges; (2) static analysis of code and execution traces proposes depend_on and compose_with edges; (3) an LLM reasoner reviews the candidates and labels the edge type. This is the same recipe modern knowledge-graph construction uses, applied to executable code instead of Wikipedia text.

SkillNet builds these relations through a hybrid pipeline: semantic embedding similarity, dependency extraction, and execution-trace alignment generate candidate edges, then LLM reasoning refines them into a directed, typed, multi-relational graph. The result is a living skill graph that supports global reasoning, workflow composition, and task-oriented skill collection releases.

09 · ECOSYSTEM

Repository, API, and Python Toolkit

SkillNet ships as a full open infrastructure: a large-scale curated skill repository, a front-end website at skillnet.openkg.cn, an open access API with both keyword and vector search, and a versatile Python library plus CLI tool skillnet-ai.

Through this ecosystem, users can browse skills by category, view API documentation, contribute new skills, search and download skills directly into local workspaces, create structured skills from trajectories or codebases, evaluate across the five quality dimensions, and analyze relations between skills at scale.

Code snippets showing skillnet-ai CLI usage on the left and Python API usage on the right. — **Figure 5.** Practical use of `skillnet-ai`. The package offers a unified functional experience through both a command-line interface (left) for interactive usage and a Python library (right) for seamless integration into AI development pipelines.

200k+ candidate skills

150k+ curated high-quality

20+ task-specific collections

10 · BENCHMARKS

Benchmarks: +40% Reward, −30% Steps

SkillNet is evaluated on three text-based simulated environments. ALFWorld is an embodied household benchmark where agents navigate and manipulate objects for daily tasks. WebShop simulates realistic online shopping with product search, comparison, and constrained purchasing. ScienceWorld presents a virtual scientific laboratory in which agents perform experiments and reason about results.

Baselines are ReAct, ExpeL, and a standard Few-Shot approach. The SkillNet variant is augmented with task-specific skill collections; during evaluation agents dynamically select, activate, and execute the most relevant skills. Backbones: DeepSeek V3.2, Gemini 2.5 Pro, and o4 Mini. Seen and unseen test splits have no overlap, preventing data leakage.

ReAct and ExpeL — the baselines to beat

ReAct interleaves Reasoning (think step) and Acting (tool call) in a loop — the most widely used baseline for LLM agents. ExpeL (Experiential Learner) adds a memory of past successful trajectories and retrieves them as few-shot examples at inference time. Both treat experience as in-context prompting. SkillNet goes further: it turns experience into named, versioned, evaluated capability units that any agent can install. That's why the gains hold across three different backbone models — the improvement isn't coming from better prompting, it's coming from giving the agent executable tools that survive between tasks.

Grouped bar chart comparing average reward (top row) and average steps (bottom row) across ALFWorld, WebShop, and ScienceWorld for ReAct, Expel, Few-Shot, and +SkillNet. — **Figure 6.** Performance comparison across methods and backbone models. SkillNet consistently delivers higher average reward (top row) and fewer execution steps (bottom row) across ALFWorld, WebShop, and ScienceWorld.

TABLE 1 · MAIN RESULTS

Model	Method	ALFWorld Seen		ALFWorld Unseen		WebShop		ScienceWorld Seen		ScienceWorld Unseen
Model	Method	R↑	S↓	R↑	S↓	R↑	S↓	R↑	S↓	R↑	S↓
DeepSeek V3.2	ReAct	66.43	19.51	69.40	19.27	31.55	24.06	69.86	17.59	64.67	19.26
	ExpeL	67.86	18.86	76.12	17.41	29.23	24.00	74.91	15.98	74.09	17.53
	+ SkillNet	80.60	14.54	83.57	14.81	46.18	17.84	84.87	11.89	81.31	12.48
Gemini 2.5 Pro	ReAct	60.00	18.72	61.94	19.18	31.66	22.12	58.24	18.42	56.13	19.07
	ExpeL	68.57	17.88	70.15	17.04	33.12	19.31	72.76	15.01	67.37	14.91
	+ SkillNet	91.43	12.80	91.04	11.96	53.02	14.91	88.84	11.49	86.26	11.30
o4 Mini	ReAct	45.71	23.31	49.25	23.33	24.19	22.02	64.89	15.12	59.93	14.99
	ExpeL	56.43	21.35	58.96	21.85	26.71	21.91	67.95	13.65	65.68	13.95
	+ SkillNet	68.57	18.94	73.28	17.08	36.21	18.79	73.24	13.30	71.06	12.35

R denotes average reward (↑ higher is better); S denotes average steps (↓ lower is better). Best +SkillNet numbers are highlighted in bold. Results are reproduced from the paper.

Read rows as: Model × Method. R↑ is the benchmark reward (higher is better) and S↓ is the number of steps to completion (lower is better). The + SkillNet row in each model block is the headline result — compare it against its ReAct row to see the raw lift.

HEADLINE RESULT

Integrating SkillNet improves average reward by 40% and reduces interaction steps by 30% relative to ReAct. Gains scale with backbone size: +15.7 R for o4 Mini, +28.5 R for Gemini 2.5 Pro. Improvements hold under both seen and unseen splits, evidence that SkillNet adds capability on top of parametric knowledge rather than memorization.

11 · APPLICATIONS

Autonomous Science & Coding Agents

SkillNet bridges high-level user intentions and executable agent actions by organizing specialized skills into coherent, evaluable workflows. The three scenarios below illustrate its utility in scientific discovery, coding, and integration with a general-purpose personal agent.

Application scenario diagram with a top task workflow and two example skill cards for science and code. — **Figure 7.** Example of SkillNet application scenarios. A user task is decomposed into actionable steps such as Data Processing, Mechanism Analysis, Target Validation, and Reporting (top), each backed by SkillNet skills like `cellxgene-census`, `kegg-database`, `gget`, and `citation-manage`. The bottom shows representative skill cards for Science and Coding scenarios with multi-dimensional evaluations.

USE CASE 01

Autonomous Scientific Discovery

SkillNet chains data processing, mechanism analysis, target validation, and report generation skills. It can, for example, take raw single-cell RNA-seq data, cluster it, map key genes onto biological pathways, cross-validate their clinical significance, and synthesize the results into a citation-grade scientific report.

USE CASE 02

Autonomous Coding Agent

For large-scale software engineering tasks, SkillNet schedules code analysis skills to build a structured view of the system, decomposes requirements, assesses regression risks, and runs a closed-loop implementation → testing → validation pipeline backed by real-time observability.

USE CASE 03

OpenClaw Integration

OpenClaw, a customizable open-source personal agent framework, loads SkillNet as a skill. With lazy-loading metadata, it performs pre-task search against the repository, installs matched skills on demand, and periodically analyzes the local workspace — turning a general-purpose agent into a continuously self-improving system.

Why OpenClaw integration is the real test

Benchmarks like ALFWorld live in controlled sandboxes. The OpenClaw scenario is different: a real, production, installable personal agent. When SkillNet works here — with lazy skill loading, pre-task search against a 150k repository, on-demand install from GitHub URLs, and periodic self-analysis — it stops being a paper and becomes an ecosystem: community skills feed tasks, successful tasks get consolidated into new skills, and the agent improves for everyone who shares the same SkillNet. That flywheel is why the authors call it the "primary unit of knowledge integration".

12 · RELATED WORK

How SkillNet Differs from Existing Hubs

Prior LLM-agent work has explored skill abstraction (Reflexion, ExpeL, Voyager) and community-driven repositories (ClawHub, SkillHub, SkillsMP, Skills.sh). Those platforms predominantly operate as static package managers or marketplaces — they rely on manual curation, simple community metrics like stars or downloads, and lack both automatic skill creation and relational modeling.

TABLE 2 · FEATURE COMPARISON

Dimension	SkillNet	ClawHub	SkillsMP	SkillHub	Universal Aggregator
Core Positioning	Full-lifecycle infrastructure (create, evaluate, connect)	npm-like version management hub	Massive open-source ecosystem catalog	Premium marketplace (curated & stacked)	Universal skill directory & leaderboard
Automated Creation	Supported (from trajectories, code, docs via LLMs)	Not supported	Not supported	Not supported	Not supported
Quality Evaluation	Multi-dimensional (safety, completeness, executability, maintainability, cost)	None (highlighted recommendations)	Basic filtering (GitHub stars)	Built-in LLM rating (S/B-rank)	Community leaderboard (downloads/trends)
Relationship Analysis	Supported (extracts structural relations to build a skill graph)	Not supported	Not supported	Manual pre-configured stacks	Not supported
Integration & Tooling	Python SDK & CLI (`skillnet-ai`)	npx-based (`clawhub install`)	GitHub aggregation & CLI	Desktop 1-click install & CLI	Universal CLI (`npx skills add`)
Search Mechanism	Vector semantic + keyword	Vector semantic + version control	AI semantic + keyword	Semantic + ranking filters	Global directory search
Skill Quantity	250k+ Total · 150k+ Curated	~9k+	~261k+	~21k+	~71k+

13 · TAKEAWAY

Skills as the Bridge Between Memory and Workflows

SkillNet is grounded in a unified view of three complementary constraints on an agent's generative capability. Workflows impose explicit procedural structure — reliable but rigid. Memory accumulates contextual experience — adaptive but unbounded. Skills bridge these extremes by packaging reusable capability units that both constrain generation and organize memory into actionable patterns.

Looking ahead, the emerging "one-person lab" envisions a single expert orchestrating a society of agents. SkillNet makes skills the primary unit of knowledge integration and delegation: individuals curate skill repositories, agents compose them into workflows, and memory continuously refines them through experience — transforming isolated automation into cumulative machine expertise.

DIRECTION 01

Open-World Skill Evolution

Automatic skill discovery, abstraction, and cross-domain transfer remain challenging in open-world settings. Industry-specific, privately curated SkillNets may themselves become foundational components of future agent infrastructure.

DIRECTION 02

Model-Skill Synergy

How to leverage neuro-symbolic integration and memory mechanisms so that skill structures guide model decision paths — and how skill hierarchies should restructure as model capabilities evolve — is still an open question.

DIRECTION 03

Multi-Agent Collaboration

In multi-agent environments, SkillNet can serve as a shared representation and exchange layer, supporting collaborative planning, knowledge transfer, and cumulative cross-agent experience.

DIRECTION 04

Digital Avatars & Cumulative Expertise

Continuously consolidating agent behaviors into reusable skills opens a path toward digital avatars whose capabilities are progressively distilled from accumulated skills.

Limitations

Two honest caveats. First, skill coverage is inevitably incomplete: many private or highly specialized capabilities cannot be captured, and low-frequency tacit abilities resist explicit linguistic description. Second, the quality of self-constructed skills cannot be fully guaranteed — the evaluation pipeline filters obvious problems but subtle errors may remain.

References (showing a curated subset)

Richard S. Sutton. The Era of Experience. 2024.
OpenAI. GPT-4 Technical Report. 2023.
Noah Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
Andrew Zhao et al. ExpeL: LLM Agents Are Experiential Learners. AAAI 2024.
Shunyu Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
Shunyu Yao et al. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. NeurIPS 2022.
Mohit Shridhar et al. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. ICLR 2021.
Ruoyao Wang et al. ScienceWorld: Is your Agent Smarter than a 5th Grader? EMNLP 2022.
Guanzhi Wang et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. 2023.
Anthropic. Introducing Claude Code. 2024.
Tao Yu et al. Towards Unified Skill Hubs for Autonomous Agents. 2025.
DeepSeek-AI. DeepSeek V3.2 Technical Report. 2025.
Google. Gemini 2.5 Pro Report. 2025.
OpenAI. o4 Mini Model Card. 2025.

The full reference list (50+ entries) is available in the arXiv PDF. The items above are a curated subset of the most directly cited works.