SoK: Agentic Skills — Beyond Tool Use in LLM Agents

What Is an Agentic Skill?

A skill is more than a tool call or a prompt. It is a reusable, callable module that encapsulates a sequence of actions or policies enabling an agent to achieve a class of goals under recurring conditions — complete with its own applicability logic, execution policy, termination criteria, and callable interface.

2.1 Formal Definition

An agentic skill is formalized as a four-tuple that captures the essential properties distinguishing it from related abstractions. Let an agent interact with environment E via action space A, observation space O, and goal space G:

Definition 1 (Agentic Skills)

\[ S = (C, \pi, T, R) \]

Each component plays a distinct role in making skills simultaneously executable, reusable, and governable: the three properties that no other existing abstraction fully provides.

Breaking Down S = (C, π, T, R)

This four-tuple is the paper's core claim: a skill is not just a callable function — it's a contract. Think of it like a microservice with a service-level agreement:

C (Condition) — When can this skill be invoked? e.g., "I can navigate websites, but not if the page requires authentication"
π (Policy) — How does it execute? This is the actual behavior — code, NL instructions, or a learned policy
T (Termination) — When does it stop? Success, failure, or timeout conditions — not just "run until done"
R (Interface) — What do other skills or the agent see? Inputs, outputs, side effects — the API boundary

Without T and C, a "skill" is just a function call. The contract is what makes it composable and safe to reuse across contexts.

C

Applicability Condition

Maps observations and goals to {0,1}: determines whether this skill is appropriate for the current context. Acts as a gating function — skills only activate when their conditions are met. Think of it as the skill's 'when to use me' knowledge.

π

Executable Policy

Maps observations and history to actions: the core logic of the skill. Can be a prompt template, Python function, RL policy, or hybrid. When π selects another skill from library Σ instead of a primitive action, hierarchical composition arises.

T

Termination Condition

Specifies when the skill has completed — successfully or not — relative to the current goal. This is what enables composability: callers know exactly when control returns to them. Without T, skills cannot be safely chained.

R

Callable Interface

Defines the skill's programmatic boundary: name, parameter schema, and return type. Enables the agent, other skills, and external orchestrators to invoke the skill reliably. Without R, internal knowledge cannot be used programmatically.
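The four components above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the paper's implementation: the `Skill` dataclass, the `run_skill` loop, the dict-based observation format, and the status strings are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Observation = Dict[str, Any]  # hypothetical observation format
Action = str                  # hypothetical action format


@dataclass
class Skill:
    """Illustrative container for S = (C, pi, T, R)."""
    name: str                                              # R: callable name
    params: Dict[str, str]                                 # R: parameter schema
    applicable: Callable[[Observation, str], bool]         # C: (o, g) -> {0, 1}
    policy: Callable[[Observation, List[Action]], Action]  # pi: (o, history) -> action
    terminated: Callable[[Observation, str], str]          # T: "running", "success" or "failure"


def run_skill(skill, goal, observe, act, max_steps=10):
    """Gate on C, step pi, and return control as soon as T says so (or on timeout)."""
    obs = observe()
    if not skill.applicable(obs, goal):
        return "not_applicable"
    history = []
    for _ in range(max_steps):
        status = skill.terminated(obs, goal)
        if status != "running":
            return status
        action = skill.policy(obs, history)
        act(action)
        history.append(action)
        obs = observe()
    return "timeout"


# Toy environment: a counter that the skill increments until it reaches 3.
state = {"count": 0, "authenticated": True}

count_to_three = Skill(
    name="count_to_three",
    params={},
    applicable=lambda o, g: o["authenticated"],
    policy=lambda o, h: "increment",
    terminated=lambda o, g: "success" if o["count"] >= 3 else "running",
)


def observe():
    return dict(state)


def act(action):
    if action == "increment":
        state["count"] += 1


result = run_skill(count_to_three, "reach 3", observe, act)
print(result)  # success
```

The caller always gets a status back ("success", "failure", "timeout", or "not_applicable"), which is the property that lets skills be chained.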

Fig. 1: Internal anatomy of an agentic skill S = (C, π, T, R). Observations O enter the applicability gate C; the policy π produces actions A; the termination condition T determines whether to continue or halt. The interface R wraps the entire module as a callable API boundary.

2.2 Skills vs. Related Abstractions

Agentic skills occupy a distinct position in the design space — they are not just tools, plans, or memories. The table below compares them across five key dimensions:

The key distinction: a tool answers "what can I call?" — a skill answers "what do I know how to do, and when?" Tools are atomic API endpoints. Skills are reusable workflows that include the judgment about when and how to use those tools. A skill can call multiple tools, maintain state, handle failures, and return structured results — all defined in advance and reusable across different tasks.

Table I: Skills vs. related abstractions across five dimensions: unit of reuse, execution semantics, verification surface, composability, and governance surface.

vs. Tools

A tool is an atomic primitive (e.g., a web-search API) with a fixed interface and no internal decision-making. A skill may invoke tools, but extends them with applicability logic, multi-step sequencing, and explicit termination criteria. The distinction is like a system call vs. a library routine.

vs. Plans

A plan is a one-time reasoning artifact that decomposes a task into sub-goals. Plans are session-scoped and not directly executable without further interpretation. Skills persist across sessions, carry executable policies, and expose callable interfaces.

vs. Memory

Episodic and semantic memory store observations and facts. Skills are procedural memory: they encode how to act, not what happened. This mirrors the cognitive psychology distinction between knowing-that (declarative) and knowing-how (procedural).

vs. Prompt Templates

Prompt templates are text fragments injected into the context window with no applicability conditions or termination logic. They cannot self-select, compose hierarchically, or be governed independently. Skills subsume and formalize the best patterns from prompt engineering.

The Skill Lifecycle: From Discovery to Deployment

Skills are not static artifacts — they are evolving system components shaped by interaction, feedback, and deployment constraints. The lifecycle comprises seven stages tracing a skill from initial formation to eventual retirement or update.

01

Discovery

Identifying recurring task patterns from interaction logs or demonstrations. The key question: which behaviors are frequent enough and stable enough to warrant encapsulation as a reusable skill?

02

Practice & Refinement

Trial-and-error execution with feedback. The skill candidate is tested, its policy is refined, and edge cases are handled. Systems like Voyager implement this as an iterative loop with environment feedback.

03

Distillation

Compressing trajectory experience into a compact, reusable form — the four-tuple S = (C, π, T, R). This stage transforms ephemeral agent experience into persistent procedural knowledge.

04

Storage

Indexing skills in a searchable library. Skills must be stored with rich metadata (name, description, applicability conditions) to enable efficient retrieval. Vector databases and semantic indices are common approaches.

05

Retrieval & Composition

Selecting and composing appropriate skills for a given task. Retrieval uses embedding-based similarity search or LLM routing. Composition creates hierarchical skill trees for complex long-horizon tasks.

06

Execution

Running the skill in a sandboxed runtime environment. Execution must enforce permission boundaries and monitor for anomalous behavior. The sandboxing approach differs between code skills (containerization) and NL skills (context window isolation).

07

Evaluation & Update

Continuously measuring skill performance and updating or retiring underperforming skills. A skill that underperforms triggers a new cycle from Practice/Refinement. A skill that becomes unsafe or obsolete is retired from the library.

Why the Feedback Loop Matters

The lifecycle isn't linear — it's a loop. When a skill underperforms (Stage 7 signals failure), the system should automatically flag it for refinement (back to Stage 2) or flag a capability gap for new skill creation (back to Stage 1). Most current systems lack this auto-update loop, which means skills silently degrade as APIs and environments change. This is one of the paper's key open research problems.
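The feedback loop described above can be sketched as a small decision function. This is a minimal illustration, not a mechanism from the paper: `SkillStats`, `next_lifecycle_action`, the returned labels, and both thresholds are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SkillStats:
    """Hypothetical per-skill execution record collected in Stage 7."""
    name: str
    successes: int
    failures: int
    unsafe: bool = False


def next_lifecycle_action(stats, min_runs=20, refine_below=0.8):
    """Map evaluation results to a lifecycle action (thresholds are illustrative)."""
    runs = stats.successes + stats.failures
    if stats.unsafe:
        return "retire"           # unsafe or obsolete skills leave the library
    if runs < min_runs:
        return "keep_collecting"  # too little evidence to judge
    if stats.successes / runs < refine_below:
        return "refine"           # back to Stage 2 (Practice & Refinement)
    return "keep"


print(next_lifecycle_action(SkillStats("fill_form", successes=10, failures=15)))  # refine
```

Running this check on a schedule, rather than only when someone notices a failure, is what turns the lifecycle from a line into a loop.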

Conceptual illustration of the 7-stage agentic skill lifecycle as a cyclic flow.

Seven Design Patterns for Skill Packaging

Across 65 analyzed systems, the paper identifies seven recurring design patterns capturing how skills are packaged, loaded, and executed in practice. Each pattern makes different trade-offs between context cost, determinism, composability, and governance.

P1

Metadata-Driven Disclosure

Voyager, LARS

✓ Scalable, low context cost ⚠ Metadata poisoning

P2

Code-as-Skill (Executable Scripts)

LATM, CodeAct

✓ High determinism & composability ⚠ Sandbox escape risk

P3

Workflow Enforcement

LangChain, DEPS

✓ Predictable multi-step flows ⚠ Rigid; hard to update

P4

Self-Evolving Skill Libraries

Voyager, JARVIS

✓ High adaptability ⚠ Uncontrolled skill drift

P4 in practice: Self-evolving libraries (like Voyager in Minecraft) let the agent write new skills as it explores. The risk is quality drift — the agent may generate subtly broken skills that pass initial checks but fail in edge cases. Without a verification gate at skill admission, the library fills with unreliable code. Think of it as an auto-merging CI/CD pipeline with no tests.
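A verification gate at skill admission, as described above, might look like the following sketch. It is a hedged illustration only: `admit_skill`, the dict-based library, and the `(args, expected)` test format are hypothetical, and a real system would run candidate code inside a sandbox rather than in-process.

```python
def admit_skill(library, name, fn, tests):
    """Add fn to the library only if it passes every (args, expected) test case."""
    for args, expected in tests:
        try:
            if fn(*args) != expected:
                return False
        except Exception:
            return False  # a crashing candidate is rejected, not admitted
    library[name] = fn
    return True


library = {}

# A subtly broken "add" skill: correct when b == 0, wrong otherwise.
broken_add = lambda a, b: a - b

weak_tests = [((0, 0), 0), ((5, 0), 5)]
strong_tests = weak_tests + [((1, 2), 3)]

print(admit_skill(library, "add_v1", broken_add, weak_tests))    # True: drift slips through
print(admit_skill(library, "add_v2", broken_add, strong_tests))  # False: edge case caught
```

The first call shows the failure mode the paragraph warns about: a candidate that passes shallow checks is admitted, so the gate is only as good as its test suite.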

P5

Hybrid NL+Code Macros

HuggingGPT, ToolBench

✓ Flexible & human-readable ⚠ Ambiguous NL/code boundary

P6

Meta-Skills

SKILL-4-LLM, MetaGPT

✓ Orchestrates complex pipelines ⚠ High context cost, low determinism

P7

Plugin / Marketplace Distribution

OpenAI GPT Store, ClawHub

✓ Community scale & discovery 🔴 Supply-chain attack (ClawHavoc)

Why Marketplace Skills (P7) Carry the Highest Risk

When skills are distributed via a marketplace (P7), the threat surface grows sharply. Unlike P1–P6, where skills are authored in-house or by the agent itself, P7 skills come from unknown third parties, and a single malicious skill can be installed by thousands of users before detection. The ClawHavoc attack (Section 7) demonstrated exactly this: 1,184 malicious skills were identified on a single marketplace, and 36.8% of all published skills had security flaws. The combination of reach (thousands of users) and privilege (agent-level execution) makes P7 the highest-risk distribution pattern in the taxonomy.

Fig. 3: P7 Marketplace Distribution, the spectrum from human-controlled to fully autonomous skill execution. P7 acts as an overarching distribution channel for all other patterns.

Table III: Seven design patterns with representative systems, strengths, weaknesses, and primary security risks.

Table IV: Pattern trade-off matrix covering context cost, determinism, composability, and governance (L = Low, M = Medium, H = High). P2 (Code-as-Skill) achieves high determinism and high composability at low context cost.

Key insight: Code-as-skill (P2) offers the best engineering trade-offs — high determinism, high composability, and low context cost — but requires sandboxing. Marketplace distribution (P7) maximizes scale but introduces the highest supply-chain risk, as demonstrated by ClawHavoc.

Representation × Scope Taxonomy

Orthogonal to the seven design patterns, skills can also be classified by what they are (representation) and what environments they operate over (scope). This two-dimensional taxonomy reveals the coverage gaps in current research.

Skill Representation Types

NL

Natural Language (NL)

Procedural instructions written in natural language (playbooks, recipes). Easy to author and understand by humans. Low determinism — execution depends on the LLM interpreter. Dominant in early agentic systems.

Code

Executable scripts (Python, JavaScript) with deterministic behavior. High composability — can be unit-tested, version-controlled, and formally verified. Requires sandboxed execution environment to mitigate code injection risks.

Policy

Policy (Learned)

Neural network policies or RL-trained controllers. Highly adaptive to distribution shifts but difficult to inspect or audit. Primarily used in robotics and embodied AI settings where discrete NL instructions are insufficient.

Hybrid

Combinations of NL instructions + executable code + optional learned components. Provides flexibility while maintaining some degree of auditability. Most production systems converge toward hybrid representations.

Operational Scope Environments

🌐

Web

Browser navigation, web scraping, form interaction. Well-studied domain with benchmarks like WebArena and Mind2Web. Primary attack surface for confused deputy attacks.

💻

OS / Desktop

File system, process management, GUI automation. Covered by benchmarks like OSWorld. High-privilege environment — skills operating here require strict permission boundaries.

⚙️

Software Engineering

Code generation, debugging, testing, repository management. SWE-bench is a key benchmark and SWE-agent a representative system. In SkillsBench, curated skills yield a +4.5pp improvement in this domain, the lowest gain of any domain.

🤖

Robotics

Physical robot control, navigation, manipulation. Primarily policy-based skills. SayCan and NavCat are representative systems. Evaluation is challenging due to physical world variance.

🔗

Multi-agent

Coordination across multiple collaborating or competing agents. Meta-skills (P6) and marketplace patterns (P7) are most relevant here. Cross-tenant skill access introduces additional governance complexity.

Table II: Representative systems mapped to lifecycle stages (Discovery through Evaluation) and primary skill representation (Code, NL, Latent, Hybrid).

Table V: Comprehensive system survey covering patterns, representation, scope, and lifecycle coverage.

⚠ Governance Gap: The comprehensive survey (Table V) reveals that most academic systems lack explicit governance mechanisms. The governance column is predominantly empty — a critical gap that the paper identifies as the most urgent open challenge for production-grade skill-based agents.

Security, Trust, and Governance

How the ClawHavoc Attack Worked

The attack exploited the trust that marketplace users place in published skills:

Step 1 — Poisoned Upload: Attackers published seemingly useful productivity skills to the ClawHub marketplace
Step 2 — Prompt Injection: The skills contained hidden instructions that, when executed by an LLM agent, caused it to exfiltrate sensitive data (API keys, credentials, wallet addresses) to attacker-controlled endpoints
Step 3 — Supply-Chain Spread: Because marketplaces allow one-click installation, the malicious skills spread widely before detection; 1,184 malicious skills were ultimately identified

The attack is analogous to a malicious npm package, except that part of the payload is delivered as prompt injection rather than code, which is far harder to detect statically.

Threat Model: Six Primary Attack Categories

Poisoned Skill Retrieval

Crafting skill metadata to cause retrieval mechanisms to surface malicious skills in response to benign queries — analogous to SEO poisoning. Exploits Pattern-1 (metadata-driven disclosure).

Malicious Skill Payloads

A skill's policy π contains instructions or code that perform unauthorized actions when executed. In code skills (P2), this resembles traditional software supply-chain attacks. In NL skills (P5), the payload is a form of prompt injection.

Cross-Tenant Leakage

In multi-agent or multi-user deployments with shared skill repositories, skills authored by one tenant may access data or resources belonging to another. Critical risk in enterprise deployments.

Skill Drift Exploitation

Skills safe at authoring time may become unsafe as the environment evolves. An attacker controlling part of the environment (e.g., a web page a skill navigates) can manipulate behavior without modifying the skill code.

Confused Deputy via Environmental Injection

Untrusted observations (web pages, user documents) contain adversarial instructions that coerce the agent into misusing an otherwise benign privileged skill. The skill itself is uncompromised; the attack exploits the agent's instruction-following.

Applicability Condition Poisoning

Manipulating the inputs to C so that a malicious skill's condition evaluates to C(o, g) = 1 in almost any context, including ones where it should not apply. The skill then activates across broad task categories, maximizing its attack surface.

Trust Tier Model: Four-Level Progressive Disclosure

T1

T1: Metadata Only

The agent sees only the skill name and description. No instructions or code are loaded. Supports discovery without execution risk. Safe for all untrusted skills.

T2

T2: Instruction Access

The agent loads the skill's natural-language instructions into its context window. Requires read-only mode enforcement during loading. Prompt injection risk at this tier must be mitigated by architectural isolation.

T3

T3: Supervised Execution

The skill can execute actions (tool calls, code execution) but each action requires user approval or runs within a constrained sandbox. Appropriate for skills from verified but not fully trusted sources.

T4

T4: Autonomous Execution

The skill executes without per-action approval, subject to pre-configured permission boundaries and monitoring. Reserved for fully vetted, provenance-verified skills with demonstrated track records at T3.
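The four tiers above can be read as a progressive permission check. The sketch below is one possible encoding under stated assumptions: `TrustTier`, `may_load_instructions`, `may_execute`, and their keyword arguments are hypothetical names for this example, not an API from the paper.

```python
from enum import IntEnum


class TrustTier(IntEnum):
    METADATA_ONLY = 1         # T1: name and description only
    INSTRUCTION_ACCESS = 2    # T2: instructions may enter the context window
    SUPERVISED_EXECUTION = 3  # T3: actions need approval or a sandbox
    AUTONOMOUS_EXECUTION = 4  # T4: no per-action approval, bounded permissions


def may_load_instructions(tier):
    """T2 and above may have their natural-language instructions loaded."""
    return tier >= TrustTier.INSTRUCTION_ACCESS


def may_execute(tier, *, user_approved=False, sandboxed=False, within_permissions=True):
    """Decide whether a single action may run for a skill at the given tier."""
    if tier < TrustTier.SUPERVISED_EXECUTION:
        return False                       # T1 and T2 never execute actions
    if tier == TrustTier.SUPERVISED_EXECUTION:
        return user_approved or sandboxed  # T3: approval or constrained sandbox
    return within_permissions              # T4: pre-configured boundaries only


print(may_execute(TrustTier.METADATA_ONLY, user_approved=True))    # False
print(may_execute(TrustTier.SUPERVISED_EXECUTION, sandboxed=True))  # True
```

Using an ordered enum keeps the tiers comparable, so promoting a skill from T3 to T4 is a single state change that can be logged and audited.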

Reading the Trust Tiers in Practice

The four tiers map to concrete deployment decisions:

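T1 (Metadata Only): The default for skills from unknown sources, such as unvetted marketplace uploads. The agent can discover the skill but loads no instructions and runs nothing.
T2 (Instruction Access): The skill's instructions enter the context window, but no actions run. Isolate the loading step, because prompt injection is the main risk at this tier.
T3 (Supervised Execution): For skills from verified but not fully trusted sources. Every action needs user approval or runs in a constrained sandbox.
T4 (Autonomous Execution): Only for fully vetted, provenance-verified skills with a track record at T3. Actions run without per-action approval, inside pre-configured permission boundaries and under monitoring.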

Fig. 5: Trust-tiered security threat model. Four nested trust boundaries (T1–T4) alongside three primary attack vectors: Poisoned Retrieval, Malicious Payload, and Supply-Chain Attack. Exfiltration of API keys, credentials, and crypto wallets is the ultimate threat outcome.

ClawHavoc: The First Large-Scale Skill Supply-Chain Attack

The ClawHavoc campaign against OpenClaw's ClawHub skill registry provides the first large-scale empirical evidence of skill supply-chain exploitation. Within weeks of ClawHub's launch, security researchers identified the attack — concretizing every threat category in the paper's model.

1,184 malicious skills identified

36.8% of all published skills had security flaws

12 publisher accounts involved

60+ crypto wallet types targeted by AMOS stealer

The primary payload (Atomic macOS Stealer / AMOS) systematically harvested LLM API keys from .env files, cryptocurrency wallet keys across 60+ wallet types, browser credentials, and session tokens — enabling billing fraud, model abuse, and financial theft at scale.

Attack Vectors Through the Pattern Taxonomy

Poisoned Retrieval (P1)

Attackers cloned popular legitimate skills under near-identical names, exploiting Pattern-1's metadata-driven discovery to rank malicious versions alongside or above originals.

Malicious Code Payloads (P2)

Skills included reverse shells, credential-exfiltration webhooks, and social-engineering 'setup' instructions telling users to run curl | bash pipelines — exploiting P2's code execution trust.

Confused Deputy Injection

Prompt injection in skill documentation coerced the agent into executing malicious commands using its legitimate tool access — bypassing any skill-level trust check.

Applicability Condition Manipulation (C-Poisoning)

Overbroad skill descriptions ensured malicious skills activated across broad task categories (crypto, productivity, automation), maximizing attack surface through P1 metadata manipulation.
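The poisoned-retrieval vector above relied on near-identical skill names. One simple registry-side check is to flag new names that closely resemble existing ones. This is a hypothetical mitigation sketch, not something the source describes: `lookalike_names`, the example skill names, and the 0.85 threshold are assumptions; only `difflib.SequenceMatcher` is a real standard-library class.

```python
from difflib import SequenceMatcher


def lookalike_names(candidate, known_names, threshold=0.85):
    """Return known skill names the candidate closely resembles without matching exactly."""
    hits = []
    for known in known_names:
        if candidate == known:
            continue  # an exact match is the same skill, not a lookalike
        ratio = SequenceMatcher(None, candidate.lower(), known.lower()).ratio()
        if ratio >= threshold:
            hits.append(known)
    return hits


registry = ["pdf-summarizer", "web-search"]  # hypothetical published skills
print(lookalike_names("pdf-sumarizer", registry))  # ['pdf-summarizer']
```

A name check like this addresses only the cloning step. It does nothing about overbroad descriptions or malicious payloads, which need separate controls.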

Acquisition, Composition, and Orchestration

How skills enter the library matters as much as how they are executed. The acquisition method shapes the skill's quality, generalizability, and governance properties.

Skill Acquisition Strategies

Human-Authored

Skills written directly by human developers or domain experts. Highest quality and trustworthiness. Standard in enterprise deployments. Scalability is the primary limitation.

Demonstration Distillation

Skills extracted from human or expert agent demonstrations via trajectory distillation. Balances scalability and quality. Key challenge: ensuring the extracted skill generalizes beyond the demonstrated contexts.

Self-Practice & Exploration

Agent autonomously discovers and creates skills through environment interaction. Highest scalability but lowest reliability. Benchmark evidence shows self-generated skills may degrade performance — systematic verification at admission is essential.

Hierarchical Skill Composition

When a skill's policy π selects another skill from library Σ instead of a primitive action, hierarchical composition arises, mirroring the option-subroutine structure of the options framework in RL. This allows complex long-horizon tasks to be completed from simpler atomic skills.

The failure recovery property is critical: when a sub-skill fails (T_γ = failure), control returns to the parent with sufficient context to either retry with a different skill or escalate to human oversight. Fig. 4 shows this flow in the 'Deploy Web App' example.

Fig. 4: Skill retrieval and hierarchical composition with failure recovery. A task triggers retrieval via embedding match or LLM routing. The high-level skill 'Deploy Web App' decomposes into the sub-skills Setup DB, Config Server, and Run Tests. Failure triggers recovery with alternative skill selection.
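The 'Deploy Web App' flow can be sketched as a parent skill that runs its sub-skills in order, tries an alternative when one fails, and escalates when no alternative succeeds. This is an illustrative sketch only: `run_composite`, the status strings, and the `config_server_fallback` skill are hypothetical and not taken from the paper.

```python
def run_composite(steps, library, alternatives):
    """Run sub-skills in order; on failure try alternatives, otherwise escalate."""
    log = []
    for step in steps:
        candidates = [step] + alternatives.get(step, [])
        for name in candidates:
            status = library[name]()  # each sub-skill returns "success" or "failure"
            log.append((name, status))
            if status == "success":
                break
        else:
            # No candidate for this step succeeded: hand control upward.
            return "escalate_to_human", log
    return "success", log


library = {
    "setup_db": lambda: "success",
    "config_server": lambda: "failure",           # simulated sub-skill failure
    "config_server_fallback": lambda: "success",  # hypothetical alternative skill
    "run_tests": lambda: "success",
}
alternatives = {"config_server": ["config_server_fallback"]}

outcome, log = run_composite(
    ["setup_db", "config_server", "run_tests"], library, alternatives
)
print(outcome)  # success
print(log)
```

The returned log is the "sufficient context" the parent needs: it records which sub-skill failed and which alternative recovered the run.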

Retrieval Mechanisms

Embedding Match

Vector similarity search over skill descriptions. Fast and scalable. May miss semantically-similar but lexically-dissimilar skills. Standard approach in most deployed systems.

LLM Routing

LLM reads task description and skill metadata to make a routing decision. Higher accuracy for nuanced disambiguation. Slower and more expensive. Best for critical or high-stakes skill selection.
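Embedding-match retrieval can be illustrated with a toy example. The sketch below substitutes bag-of-words counts for a real embedding model, which is a deliberate simplification: `embed`, `cosine`, `retrieve`, and the skill descriptions are all hypothetical, and a deployed system would use learned embeddings and a vector index.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(task, skills, top_k=1):
    """Rank skills by similarity between the task and each skill description."""
    query = embed(task)
    ranked = sorted(
        skills, key=lambda name: cosine(query, embed(skills[name])), reverse=True
    )
    return ranked[:top_k]


skills = {
    "deploy_web_app": "deploy a web app to a server",
    "fill_web_form": "fill in and submit a web form",
    "run_unit_tests": "run unit tests for a repository",
}
print(retrieve("deploy my app to the server", skills))  # ['deploy_web_app']
```

Because this toy embedding is purely lexical, a task phrased with different words than the skill description scores zero. That gap is one reason a system might fall back to LLM routing for ambiguous or high-stakes selections.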

Evaluating Agentic Skills

The paper proposes a five-dimensional evaluation framework and maps existing benchmarks to measurable skill properties. Key finding: no single benchmark covers all dimensions — comprehensive evaluation requires combining multiple benchmarks.

Five Evaluation Dimensions

Correctness

Does the skill achieve its intended outcome? Evaluated via ground-truth annotations or deterministic verifiers. For code skills, unit tests provide direct verification. For web interaction skills, environment state comparison checks goal completion.

Robustness

Does the skill maintain performance under input variations, environment perturbations, and edge cases? A robust skill handles both legacy and updated UI layouts, for example. Critical for production deployment longevity.

Efficiency

Token consumption, wall-clock time, tool call count, and API costs. Efficiency directly affects deployment cost and composability — inefficient sub-skills slow downstream workflows. Especially important for long-horizon tasks.

Generalization

Does the skill transfer to unseen tasks or domains? Out-of-distribution evaluation is challenging. Cross-website generalization (Mind2Web) and cross-application evaluation (OSWorld) provide partial evidence. Self-generated skills often fail here.

Safety

Does the skill avoid harmful actions, respect permission boundaries, and handle failures gracefully? Evaluated via adversarial testing, red-teaming, and runtime monitoring for unauthorized or unsafe behaviors. Directly links to the trust tier model.

Anchor Case Study: SkillsBench

SkillsBench (86 tasks, 7,308 trajectories) provides the most direct evidence to date for the value of curated skills. The study tests curated vs. self-generated skills across multiple domains, revealing dramatic performance differences.

+16.2pp average pass rate improvement with curated skills

24.3%→40.6% overall pass rate (baseline → curated)

+51.9pp improvement in healthcare domain (largest gain)

Contextualizing +16.2pp: A 16.2 percentage point improvement means the overall pass rate jumped from 24.3% to 40.6% — a roughly 67% relative gain. In agent benchmarks, most state-of-the-art advances are 2–5pp; +16pp is exceptional. The +51.9pp improvement in the healthcare domain is even more striking: in high-stakes domains, the gap between a curated skill ("verify dosage against formulary before recommending") and a self-generated one can be the difference between a correct answer and a dangerous one.
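The relative-gain figure follows directly from the two pass rates quoted above. A short worked calculation, using only those reported numbers:

\[ \frac{40.6 - 24.3}{24.3} = \frac{16.3}{24.3} \approx 0.67 \]

The difference of the rounded endpoints is 16.3pp rather than the reported +16.2pp, presumably because the published rates are themselves rounded (an assumption, not something the source states). With 16.2 the ratio is 16.2 / 24.3 ≈ 0.667, so the roughly 67% relative gain holds either way.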

Open Problems and Research Roadmap

Skill-based agents expose several unresolved tensions that limit reliable deployment at scale. Five research directions stand out as most urgent.

10.1

Verified Autonomous Skill Generation

Automatically generated skills can degrade performance. The key obstacle is no longer skill generation itself, but verification at admission. Skills should be treated like software artifacts in CI/CD pipelines — evaluated against test suites before entering the library.

10.2

Unsupervised Skill Discovery

Most systems still rely on predefined curricula or explicit reward signals. Open-ended capability growth requires adapting unsupervised skill discovery from RL to LLM-based agents — allowing reusable behaviors to emerge from interaction traces without human scaffolding.

10.3

Formal Verification Across Representations

Code skills benefit from decades of software assurance. NL and policy skills lack equivalent verification tools. A practical governance challenge: combining rule-based analysis for executable components with semantic inspection for language-based skills.

10.4

Robustness Under Environmental Drift

Even correctly implemented skills may fail as APIs, tools, and workflows evolve. Proactive drift detection through continuous monitoring of execution statistics and deviation from historical behavior is largely absent from current systems.
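Monitoring execution statistics for deviation from historical behavior can be sketched with a rolling success-rate check. This is a minimal illustration under stated assumptions: `DriftMonitor`, its window size, and its tolerance are hypothetical choices, not a method from the paper.

```python
from collections import deque


class DriftMonitor:
    """Flags a skill when its recent success rate falls well below its historical rate."""

    def __init__(self, baseline_rate, window=20, tolerance=0.2):
        self.baseline_rate = baseline_rate  # historical success rate, e.g. from Stage 7
        self.tolerance = tolerance          # allowed drop before flagging (illustrative)
        self.recent = deque(maxlen=window)  # sliding window of recent outcomes

    def record(self, success):
        self.recent.append(bool(success))

    def drifted(self):
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent evidence yet
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate < self.baseline_rate - self.tolerance


monitor = DriftMonitor(baseline_rate=0.9, window=10)
for _ in range(10):
    monitor.record(True)
print(monitor.drifted())  # False

for _ in range(5):        # e.g. an upstream API changed and calls start failing
    monitor.record(False)
print(monitor.drifted())  # True
```

A flagged skill would then re-enter the lifecycle at the refinement stage instead of silently degrading.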

10.5

Governance Economics and Liability

Marketplace ecosystems create incentives for contribution but expand supply-chain attack surfaces. Liability models must clarify responsibility among skill authors, platform operators, and users. Certification mechanisms should reward dependable skills and discourage risky ones.

Conclusion

Agentic skills are reusable procedural modules that extend LLM agents beyond single-turn tool use toward reliable long-horizon task execution. This SoK paper offers six contributions:

A unified formal definition S = (C, π, T, R) with precise boundary conditions separating skills from tools, plans, and memory
A seven-stage lifecycle model from discovery through evaluation and update
A seven-pattern design taxonomy for how skills are packaged, loaded, and executed in real systems
An orthogonal representation × scope taxonomy describing what skills are and what environments they act over
A security and governance analysis covering threat models, trust tiers, and the ClawHavoc case study
An evaluation framework with benchmark mapping and the SkillsBench anchor study (+16.2pp from curated skills)

The field faces open challenges in unsupervised discovery, cross-representation verification, drift detection, and governance economics. Progress toward robust, verifiable, and certifiable skills will determine whether skill-based agents can be trusted in high-stakes real-world deployments.

Keywords

Agentic AI, LLM Agents, Skill Learning, Procedural Knowledge, Multi-agent Systems, Security, Governance, cs.CR, SoK

References (representative selection)

S. Zhou et al., "WebArena: A realistic web environment for building autonomous agents," ICLR 2024, arXiv:2307.13854.
J. Yang et al., "SWE-agent: Agent-computer interfaces enable automated software engineering," NeurIPS 2024, arXiv:2405.15793.
Z. Ji et al., "Measuring and augmenting large language models for solving capture-the-flag challenges," CCS 2025.
Y. Shen et al., "HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face," NeurIPS 2023.
C. Xie et al., "Can large language model agents simulate human trust behavior?" NeurIPS 2024.
S. Hong et al., "MetaGPT: Meta programming for a multi-agent collaborative framework," ICLR 2024.
Q. Wu et al., "AutoGen: Enabling next-gen LLM applications via multi-agent conversation," COLM 2024.
J. R. Anderson et al., "An integrated theory of the mind," Psychological Review, 2004.
J. E. Laird, The Soar Cognitive Architecture. MIT Press, 2012.
R. S. Sutton, D. Precup, S. Singh, "Between MDPs and semi-MDPs," Artificial Intelligence, 1999.