Beyond Tool Use in LLM Agents
Abstract
Agentic systems increasingly rely on reusable procedural capabilities — called agentic skills — to execute long-horizon workflows reliably. These callable modules package procedural knowledge with explicit applicability conditions, execution policies, termination criteria, and reusable interfaces. Unlike one-off plans or atomic tool calls, skills operate effectively across diverse tasks.
This paper maps the skill layer across the full lifecycle and introduces two complementary taxonomies: a seven-pattern design taxonomy for how skills are packaged and executed, and a representation × scope taxonomy. It analyzes security implications, grounded in the ClawHavoc case study, surveys evaluation approaches, and outlines open challenges for robust, verifiable, and certifiable skills.
A skill is more than a tool call or a prompt. It is a reusable, callable module that encapsulates a sequence of actions or policies enabling an agent to achieve a class of goals under recurring conditions — complete with its own applicability logic, execution policy, termination criteria, and callable interface.
An agentic skill is formalized as a four-tuple that captures the essential properties distinguishing it from related abstractions. Let an agent interact with environment E via action space A, observation space O, and goal space G:
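Definition 1 (Agentic Skill): S = (C, π, T, R), where C is the applicability condition, π the execution policy, T the termination criterion, and R the reusable interface.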
Each component plays a distinct role in making skills simultaneously executable, reusable, and governable — the three properties that no other existing abstraction fully provides.
This four-tuple is the paper's core claim: a skill is not just a callable function; it is a contract, much like a microservice with a service-level agreement.
Without T and C, a "skill" is just a function call. The contract is what makes it composable and safe to reuse across contexts.
Applicability condition (C): Maps observations and goals to {0,1}, determining whether the skill is appropriate for the current context. It acts as a gating function: a skill activates only when its conditions are met. Think of it as the skill's 'when to use me' knowledge.
Execution policy (π): Maps observations and history to actions; this is the core logic of the skill. It can be a prompt template, a Python function, an RL policy, or a hybrid. When π selects another skill from library Σ instead of a primitive action, hierarchical composition arises.
Termination criterion (T): Specifies when the skill has completed, successfully or not, relative to the current goal. This is what enables composability: callers know exactly when control returns to them. Without T, skills cannot be safely chained.
Reusable interface (R): Defines the skill's programmatic boundary: name, parameter schema, and return type. It lets the agent, other skills, and external orchestrators invoke the skill reliably. Without R, the skill's internal knowledge cannot be used programmatically.
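The four components can be made concrete with a small data structure. The following is a minimal sketch, not the paper's implementation: the `Skill` class, its field names, the `run` loop, and the toy `count_up` skill are all hypothetical, chosen only to show how C gates activation, π picks actions, T returns control, and R exposes a name and parameter schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Observation = Dict[str, Any]  # hypothetical observation type
Goal = str                    # hypothetical goal type
Action = str                  # hypothetical action type


@dataclass
class Skill:
    """Illustrative container for the four-tuple S = (C, pi, T, R)."""
    name: str                                               # R: callable name
    params: Dict[str, str]                                  # R: parameter schema
    applies: Callable[[Observation, Goal], bool]            # C: applicability
    policy: Callable[[Observation, List[Action]], Action]   # pi: next action
    done: Callable[[Observation, Goal], bool]               # T: termination

    def run(self, obs: Observation, goal: Goal,
            step: Callable[[Observation, Action], Observation],
            max_steps: int = 10) -> Observation:
        """Gate on C, act with pi, stop when T holds or the budget runs out."""
        if not self.applies(obs, goal):
            raise ValueError(f"{self.name} is not applicable here")
        history: List[Action] = []
        for _ in range(max_steps):
            if self.done(obs, goal):
                break
            action = self.policy(obs, history)
            history.append(action)
            obs = step(obs, action)
        return obs


# Toy skill (assumption): increment a counter until it reaches a target.
count_up = Skill(
    name="count_up",
    params={"target": "int"},
    applies=lambda obs, goal: goal == "reach_target" and "count" in obs,
    policy=lambda obs, history: "increment",
    done=lambda obs, goal: obs["count"] >= obs["target"],
)


def toy_step(obs: Observation, action: Action) -> Observation:
    """Toy environment transition used only for this example."""
    if action == "increment":
        return {**obs, "count": obs["count"] + 1}
    return obs


final = count_up.run({"count": 0, "target": 3}, "reach_target", toy_step)
```

Dropping `applies` or `done` from this sketch leaves an ordinary function call, which is the point of the contract framing: the caller can no longer tell when the skill is valid or when control comes back.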
Agentic skills occupy a distinct position in the design space — they are not just tools, plans, or memories. The table below compares them across five key dimensions:
The key distinction: a tool answers "what can I call?" — a skill answers "what do I know how to do, and when?" Tools are atomic API endpoints. Skills are reusable workflows that include the judgment about when and how to use those tools. A skill can call multiple tools, maintain state, handle failures, and return structured results — all defined in advance and reusable across different tasks.
A tool is an atomic primitive (e.g., a web-search API) with a fixed interface and no internal decision-making. A skill may invoke tools, but extends them with applicability logic, multi-step sequencing, and explicit termination criteria. The distinction is like a system call vs. a library routine.
A plan is a one-time reasoning artifact that decomposes a task into sub-goals. Plans are session-scoped and not directly executable without further interpretation. Skills persist across sessions, carry executable policies, and expose callable interfaces.
Episodic and semantic memory stores observations and facts. Skills are procedural memory: they encode how to act, not what happened. This mirrors the cognitive psychology distinction between knowing-that (declarative) and knowing-how (procedural).
Prompt templates are text fragments injected into the context window with no applicability conditions or termination logic. They cannot self-select, compose hierarchically, or be governed independently. Skills subsume and formalize the best patterns from prompt engineering.
Skills are not static artifacts — they are evolving system components shaped by interaction, feedback, and deployment constraints. The lifecycle comprises seven stages tracing a skill from initial formation to eventual retirement or update.
Stage 1: Identifying recurring task patterns from interaction logs or demonstrations. The key question: which behaviors are frequent enough and stable enough to warrant encapsulation as a reusable skill?
Stage 2: Trial-and-error execution with feedback. The skill candidate is tested, its policy is refined, and edge cases are handled. Systems like Voyager implement this as an iterative loop with environment feedback.
Stage 3: Compressing trajectory experience into a compact, reusable form: the four-tuple S = (C, π, T, R). This stage transforms ephemeral agent experience into persistent procedural knowledge.
Stage 4: Indexing skills in a searchable library. Skills must be stored with rich metadata (name, description, applicability conditions) to enable efficient retrieval. Vector databases and semantic indices are common approaches.
Stage 5: Selecting and composing appropriate skills for a given task. Retrieval uses embedding-based similarity search or LLM routing. Composition creates hierarchical skill trees for complex long-horizon tasks.
Stage 6: Running the skill in a sandboxed runtime environment. Execution must enforce permission boundaries and monitor for anomalous behavior. The sandboxing approach differs between code skills (containerization) and NL skills (context window isolation).
Stage 7: Continuously measuring skill performance and updating or retiring underperforming skills. A skill that underperforms triggers a new cycle from Practice/Refinement. A skill that becomes unsafe or obsolete is retired from the library.
The lifecycle isn't linear — it's a loop. When a skill underperforms (Stage 7 signals failure), the system should automatically flag it for refinement (back to Stage 2) or flag a capability gap for new skill creation (back to Stage 1). Most current systems lack this auto-update loop, which means skills silently degrade as APIs and environments change. This is one of the paper's key open research problems.
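The feedback loop described above can be sketched as a routing decision taken after each evaluation pass. This is an illustrative sketch only: the `NextStage` names, the `route_after_evaluation` function, and the 0.8 success threshold are assumptions, not values from the paper.

```python
from enum import Enum


class NextStage(Enum):
    KEEP = "keep in library"
    REFINE = "back to Stage 2 (practice/refinement)"
    DISCOVER = "back to Stage 1 (new skill creation)"
    RETIRE = "retire from library"


def route_after_evaluation(success_rate: float,
                           is_unsafe: bool,
                           covers_task: bool,
                           min_success: float = 0.8) -> NextStage:
    """Decide what happens to a skill after an evaluation pass.

    min_success is a hypothetical threshold; a real system would tune it.
    """
    if is_unsafe:
        return NextStage.RETIRE      # unsafe or obsolete skills leave the library
    if not covers_task:
        return NextStage.DISCOVER    # capability gap: a new skill is needed
    if success_rate < min_success:
        return NextStage.REFINE      # underperforming: refine the policy
    return NextStage.KEEP


decision = route_after_evaluation(success_rate=0.55, is_unsafe=False, covers_task=True)
```

Checking safety before performance is a deliberate ordering in this sketch: an unsafe skill should be retired even when its success rate looks healthy.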
Across 65 analyzed systems, the paper identifies seven recurring design patterns capturing how skills are packaged, loaded, and executed in practice. Each pattern makes different trade-offs between context cost, determinism, composability, and governance.
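Example systems for P1: Voyager, LARS
Example systems for P2: LATM, CodeAct
Example systems for P3: LangChain, DEPS
Example systems for P4: Voyager, JARVIS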
P4 in practice: Self-evolving libraries (like Voyager in Minecraft) let the agent write new skills as it explores. The risk is quality drift — the agent may generate subtly broken skills that pass initial checks but fail in edge cases. Without a verification gate at skill admission, the library fills with unreliable code. Think of it as an auto-merging CI/CD pipeline with no tests.
Example systems for P5: HuggingGPT, ToolBench
Example systems for P6: SKILL-4-LLM, MetaGPT
Example systems for P7: OpenAI GPT Store, ClawHub
When skills are distributed via a marketplace (P7), the threat surface explodes. Unlike P1–P6 where skills are authored in-house or by the agent itself, P7 skills come from unknown third parties. A single malicious skill can be installed by thousands of users before detection. The ClawHavoc attack (Section 7) demonstrated this exactly: 1,184 malicious skills reached 36.8% of active users through a marketplace. The combination of reach (thousands of users) and privilege (agent-level execution) makes P7 the highest-risk distribution pattern in the taxonomy.
Key insight: Code-as-skill (P2) offers the best engineering trade-offs — high determinism, high composability, and low context cost — but requires sandboxing. Marketplace distribution (P7) maximizes scale but introduces the highest supply-chain risk, as demonstrated by ClawHavoc.
Orthogonal to the seven design patterns, skills can also be classified by what they are (representation) and what environments they operate over (scope). This two-dimensional taxonomy reveals the coverage gaps in current research.
Natural-language (NL) skills: Procedural instructions written in natural language (playbooks, recipes). Easy for humans to author and understand. Low determinism, because execution depends on the LLM interpreter. Dominant in early agentic systems.
Code skills: Executable scripts (Python, JavaScript) with deterministic behavior. High composability: they can be unit-tested, version-controlled, and formally verified. They require a sandboxed execution environment to mitigate code injection risks.
Policy skills: Neural network policies or RL-trained controllers. Highly adaptive to distribution shifts but difficult to inspect or audit. Primarily used in robotics and embodied AI settings where discrete NL instructions are insufficient.
Hybrid skills: Combinations of NL instructions, executable code, and optional learned components. They provide flexibility while maintaining some degree of auditability. Most production systems converge toward hybrid representations.
Web: Browser navigation, web scraping, form interaction. A well-studied domain with benchmarks like WebArena and Mind2Web. Primary attack surface for confused deputy attacks.
Operating system and desktop: File system, process management, GUI automation. Covered by benchmarks like OSWorld. A high-privilege environment, so skills operating here require strict permission boundaries.
Software engineering: Code generation, debugging, testing, repository management. SWE-bench is the key benchmark and SWE-agent a representative system. In SkillsBench, curated skills yield a +4.5pp improvement in this domain, the lowest gain of any domain.
Robotics and embodied: Physical robot control, navigation, manipulation. Primarily policy-based skills. SayCan and NavCat are representative systems. Evaluation is challenging due to physical world variance.
Multi-agent: Coordination across multiple collaborating or competing agents. Meta-skills (P6) and marketplace patterns (P7) are most relevant here. Cross-tenant skill access introduces additional governance complexity.
⚠ Governance Gap: The comprehensive survey (Table V) reveals that most academic systems lack explicit governance mechanisms. The governance column is predominantly empty — a critical gap that the paper identifies as the most urgent open challenge for production-grade skill-based agents.
The ClawHavoc attack exploited the trust that marketplace users place in published skills.
The attack is analogous to a malicious npm package — except the payload isn't code injection, it's prompt injection, which is far harder to detect statically.
Crafting skill metadata to cause retrieval mechanisms to surface malicious skills in response to benign queries — analogous to SEO poisoning. Exploits Pattern-1 (metadata-driven disclosure).
A skill's policy π contains instructions or code that perform unauthorized actions when executed. In code skills (P2), this resembles traditional software supply-chain attacks. In NL skills (P5), the payload is a form of prompt injection.
In multi-agent or multi-user deployments with shared skill repositories, skills authored by one tenant may access data or resources belonging to another. Critical risk in enterprise deployments.
Skills safe at authoring time may become unsafe as the environment evolves. An attacker controlling part of the environment (e.g., a web page a skill navigates) can manipulate behavior without modifying the skill code.
Untrusted observations (web pages, user documents) contain adversarial instructions that coerce the agent into misusing an otherwise benign privileged skill. The skill itself is uncompromised; the attack exploits the agent's instruction-following.
Manipulating input to C such that a malicious skill returns C(o,g) = 1 universally, activating in contexts where it should not. Enables malicious skills to activate across broad task categories, maximizing attack surface.
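One way to surface an applicability condition that fires almost everywhere is to probe it against deliberately unrelated contexts before admitting the skill. The sketch below is an assumption-laden illustration, not a defense proposed in the paper: `activation_rate`, `is_overbroad`, the probe set, and the 0.5 cutoff are all hypothetical.

```python
from typing import Callable, List, Tuple

Probe = Tuple[dict, str]  # (observation, goal) pair; hypothetical shape


def activation_rate(applies: Callable[[dict, str], bool],
                    probes: List[Probe]) -> float:
    """Fraction of probe contexts in which C(o, g) returns 1."""
    hits = sum(1 for obs, goal in probes if applies(obs, goal))
    return hits / len(probes)


def is_overbroad(applies: Callable[[dict, str], bool],
                 probes: List[Probe],
                 max_rate: float = 0.5) -> bool:
    """Flag a skill whose C activates on too many unrelated contexts."""
    return activation_rate(applies, probes) > max_rate


# Probe contexts drawn from unrelated task categories (illustrative).
probes: List[Probe] = [
    ({"app": "wallet"}, "check crypto balance"),
    ({"app": "calendar"}, "schedule a meeting"),
    ({"app": "shell"}, "rotate log files"),
    ({"app": "browser"}, "fill in a web form"),
]


def narrow_condition(obs: dict, goal: str) -> bool:
    """A well-scoped C: only calendar scheduling."""
    return obs["app"] == "calendar" and "schedule" in goal


def hijacked_condition(obs: dict, goal: str) -> bool:
    """A C that claims to apply everywhere."""
    return True


narrow_flagged = is_overbroad(narrow_condition, probes)
hijacked_flagged = is_overbroad(hijacked_condition, probes)
```

A probe audit like this only catches conditions that are broad at review time; it does not address inputs crafted at run time to flip an otherwise narrow condition.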
Tier 1 (T1): The agent sees only the skill name and description. No instructions or code are loaded. Supports discovery without execution risk. Safe for all untrusted skills.
Tier 2 (T2): The agent loads the skill's natural-language instructions into its context window. Requires read-only mode enforcement during loading. Prompt injection risk at this tier must be mitigated by architectural isolation.
Tier 3 (T3): The skill can execute actions (tool calls, code execution), but each action requires user approval or runs within a constrained sandbox. Appropriate for skills from verified but not fully trusted sources.
Tier 4 (T4): The skill executes without per-action approval, subject to pre-configured permission boundaries and monitoring. Reserved for fully vetted, provenance-verified skills with demonstrated track records at T3.
The four tiers map to concrete deployment decisions.
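The four tiers described above can be read as an authorization check applied before every request a skill makes. The following is a minimal sketch under stated assumptions: the `TrustTier` member names, the three request strings, and the `authorize` function are hypothetical labels for the behaviors the tier descriptions give, not an API from the paper or from any registry.

```python
from enum import IntEnum


class TrustTier(IntEnum):
    # Member names are illustrative labels for the four tiers.
    METADATA_ONLY = 1   # name and description visible, nothing loaded
    INSTRUCTIONS = 2    # NL instructions may be loaded read-only
    SUPERVISED = 3      # actions need approval or a sandbox
    AUTONOMOUS = 4      # actions run within preset permission boundaries


def authorize(tier: TrustTier, request: str, *,
              user_approved: bool = False,
              sandboxed: bool = False,
              within_permissions: bool = False) -> bool:
    """Return True if a skill at `tier` may perform `request`."""
    if request == "read_metadata":
        return True                               # safe for every tier
    if request == "load_instructions":
        return tier >= TrustTier.INSTRUCTIONS
    if request == "execute_action":
        if tier == TrustTier.SUPERVISED:
            return user_approved or sandboxed     # per-action gate
        if tier == TrustTier.AUTONOMOUS:
            return within_permissions             # boundary check only
        return False                              # tiers 1 and 2 never execute
    raise ValueError(f"unknown request: {request}")


untrusted_exec = authorize(TrustTier.METADATA_ONLY, "execute_action")
supervised_exec = authorize(TrustTier.SUPERVISED, "execute_action", user_approved=True)
```

Defaulting every flag to `False` keeps the sketch fail-closed: a caller that forgets to supply approval or a permission check gets a denial rather than silent execution.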
The ClawHavoc campaign against OpenClaw's ClawHub skill registry provides the first large-scale empirical evidence of skill supply-chain exploitation. Within weeks of ClawHub's launch, security researchers identified the attack — concretizing every threat category in the paper's model.
The primary payload (Atomic macOS Stealer / AMOS) systematically harvested LLM API keys from .env files, cryptocurrency wallet keys across 60+ wallet types, browser credentials, and session tokens — enabling billing fraud, model abuse, and financial theft at scale.
Attackers cloned popular legitimate skills under near-identical names, exploiting Pattern-1's metadata-driven discovery to rank malicious versions alongside or above originals.
Skills included reverse shells, credential-exfiltration webhooks, and social-engineering 'setup' instructions telling users to run curl | bash pipelines — exploiting P2's code execution trust.
Prompt injection in skill documentation coerced the agent into executing malicious commands using its legitimate tool access — bypassing any skill-level trust check.
Overbroad skill descriptions ensured malicious skills activated across broad task categories (crypto, productivity, automation), maximizing attack surface through P1 metadata manipulation.
How skills enter the library matters as much as how they are executed. The acquisition method shapes the skill's quality, generalizability, and governance properties.
Skills written directly by human developers or domain experts. Highest quality and trustworthiness. Standard in enterprise deployments. Scalability is the primary limitation.
Skills extracted from human or expert agent demonstrations via trajectory distillation. Balances scalability and quality. Key challenge: ensuring the extracted skill generalizes beyond the demonstrated contexts.
Agent autonomously discovers and creates skills through environment interaction. Highest scalability but lowest reliability. Benchmark evidence shows self-generated skills may degrade performance — systematic verification at admission is essential.
When a skill's policy π selects another skill from library Σ instead of a primitive action, hierarchical composition arises, mirroring the option-subroutine structure of the options framework in RL. This enables complex long-horizon task completion from simpler atomic skills.
The failure recovery property is critical: when a sub-skill fails (T_γ = failure), control returns to the parent with sufficient context to either retry with a different skill or escalate to human oversight. Fig. 4 shows this flow in the 'Deploy Web App' example.
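The failure-recovery behavior can be sketched as a parent skill that tries candidate sub-skills in order and escalates only when all of them fail. This is an illustration under assumptions: `run_composite`, the status strings, and the sub-skill names (`build`, `deploy_primary`, `deploy_fallback`) are hypothetical and are not taken from the paper's figure.

```python
from typing import Callable, Dict, List

SUCCESS, FAILURE = "success", "failure"

SubSkill = Callable[[dict], str]  # takes shared state, returns a status


def run_composite(steps: List[List[str]],
                  library: Dict[str, SubSkill],
                  state: dict) -> dict:
    """Run each step, trying its candidate sub-skills in preference order.

    If every candidate for a step fails, stop and report an escalation
    together with the execution log, so a human has context.
    """
    log = []
    for candidates in steps:
        for name in candidates:
            status = library[name](state)
            log.append((name, status))
            if status == SUCCESS:
                break                      # step done, move to the next one
        else:
            # No candidate succeeded: hand control back for escalation.
            return {"status": "escalate", "failed_step": candidates, "log": log}
    return {"status": SUCCESS, "log": log}


def build(state: dict) -> str:
    state["built"] = True
    return SUCCESS


def deploy_primary(state: dict) -> str:
    return FAILURE                         # simulate an outage


def deploy_fallback(state: dict) -> str:
    state["deployed"] = True
    return SUCCESS


library = {"build": build,
           "deploy_primary": deploy_primary,
           "deploy_fallback": deploy_fallback}

plan = [["build"], ["deploy_primary", "deploy_fallback"]]
result = run_composite(plan, library, {})
```

Returning the log alongside the status is what gives the parent "sufficient context": it can see which sub-skill failed before choosing between a retry and human escalation.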
Vector similarity search over skill descriptions. Fast and scalable. May miss semantically-similar but lexically-dissimilar skills. Standard approach in most deployed systems.
LLM reads task description and skill metadata to make a routing decision. Higher accuracy for nuanced disambiguation. Slower and more expensive. Best for critical or high-stakes skill selection.
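The two retrieval approaches combine naturally: a cheap similarity search produces a shortlist, and a slower router picks from it when the decision matters. The sketch below assumes pre-computed embedding vectors and uses hand-written toy vectors; `cosine`, `retrieve`, `select`, and the `router` callback (a stand-in for an LLM routing call) are hypothetical names.

```python
import math
from typing import Callable, Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: List[float],
             index: Dict[str, List[float]],
             k: int = 2) -> List[str]:
    """Return the names of the k skills most similar to the query."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]),
                    reverse=True)
    return ranked[:k]


def select(query_vec: List[float],
           index: Dict[str, List[float]],
           router: Optional[Callable[[List[str]], str]] = None,
           k: int = 2) -> str:
    """Shortlist by similarity, then let an optional router decide."""
    shortlist = retrieve(query_vec, index, k)
    return router(shortlist) if router else shortlist[0]


# Toy embeddings (assumption): real systems would use a learned encoder.
index = {
    "deploy_app": [1.0, 0.0, 0.0],
    "scrape_page": [0.0, 1.0, 0.0],
    "fill_form": [0.0, 0.9, 0.1],
}
query = [0.0, 1.0, 0.2]

shortlist = retrieve(query, index)
fast_choice = select(query, index)
# Stand-in for an LLM router that prefers the scraping skill for this task.
routed_choice = select(
    query, index,
    router=lambda names: "scrape_page" if "scrape_page" in names else names[0],
)
```

In this toy index the similarity search alone ranks `fill_form` first, while the router overrides it, which mirrors the trade-off described above: the router costs more but can disambiguate near-ties.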
The paper proposes a five-dimensional evaluation framework and maps existing benchmarks to measurable skill properties. Key finding: no single benchmark covers all dimensions — comprehensive evaluation requires combining multiple benchmarks.
Does the skill achieve its intended outcome? Evaluated via ground-truth annotations or deterministic verifiers. For code skills, unit tests provide direct verification. For web interaction skills, environment state comparison checks goal completion.
Does the skill maintain performance under input variations, environment perturbations, and edge cases? A robust skill handles both legacy and updated UI layouts, for example. Critical for production deployment longevity.
Token consumption, wall-clock time, tool call count, and API costs. Efficiency directly affects deployment cost and composability — inefficient sub-skills slow downstream workflows. Especially important for long-horizon tasks.
Does the skill transfer to unseen tasks or domains? Out-of-distribution evaluation is challenging. Cross-website generalization (Mind2Web) and cross-application evaluation (OSWorld) provide partial evidence. Self-generated skills often fail here.
Does the skill avoid harmful actions, respect permission boundaries, and handle failures gracefully? Evaluated via adversarial testing, red-teaming, and runtime monitoring for unauthorized or unsafe behaviors. Directly links to the trust tier model.
SkillsBench (86 tasks, 7,308 trajectories) provides the most direct evidence to date for the value of curated skills. The study tests curated vs. self-generated skills across multiple domains, revealing dramatic performance differences.
Contextualizing +16.2pp: A 16.2 percentage point improvement means the overall pass rate jumped from 24.3% to 40.6% — a roughly 67% relative gain. In agent benchmarks, most state-of-the-art advances are 2–5pp; +16pp is exceptional. The +51.9pp improvement in the healthcare domain is even more striking: in high-stakes domains, the gap between a curated skill ("verify dosage against formulary before recommending") and a self-generated one can be the difference between a correct answer and a dangerous one.
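The relative-gain figure can be checked directly from the quoted numbers. The snippet below uses only the values stated above; the variable names are illustrative. The two rounded pass rates differ by 16.3pp rather than 16.2pp, which is presumably a rounding artifact in the reported figures.

```python
baseline_pct = 24.3        # pass rate without curated skills
with_skills_pct = 40.6     # pass rate with curated skills
reported_gain_pp = 16.2    # headline improvement in percentage points

# Absolute gain implied by the two rounded pass rates.
implied_gain_pp = round(with_skills_pct - baseline_pct, 1)

# Relative gain: improvement as a fraction of the baseline.
relative_gain = reported_gain_pp / baseline_pct

print(f"implied gain: {implied_gain_pp}pp, relative gain: {relative_gain:.1%}")
```

Percentage points measure the absolute difference between two rates, while the relative gain divides that difference by the starting rate, which is why a 16.2pp change on a 24.3% baseline reads as roughly a two-thirds improvement.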
Skill-based agents expose several unresolved tensions that limit reliable deployment at scale. Five research directions stand out as most urgent.
Automatically generated skills can degrade performance. The key obstacle is no longer skill generation itself, but verification at admission. Skills should be treated like software artifacts in CI/CD pipelines — evaluated against test suites before entering the library.
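Admission-time verification can be sketched as a gate that runs a candidate skill against a test suite before it is added to the library. This is a minimal illustration, assuming skills are plain Python callables; `admit`, `register`, the slug-generation candidates, and the default pass-rate threshold are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

TestCase = Tuple[tuple, Any]  # (positional arguments, expected result)


def admit(skill_fn: Callable[..., Any],
          test_cases: List[TestCase],
          min_pass_rate: float = 1.0) -> Tuple[bool, float]:
    """Run the candidate against its test suite; return (admitted, pass rate)."""
    passed = 0
    for args, expected in test_cases:
        try:
            if skill_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    rate = passed / len(test_cases)
    return rate >= min_pass_rate, rate


library: Dict[str, Callable[..., Any]] = {}


def register(name: str, skill_fn: Callable[..., Any],
             test_cases: List[TestCase]) -> bool:
    """Add the skill to the library only if it clears the gate."""
    admitted, _ = admit(skill_fn, test_cases)
    if admitted:
        library[name] = skill_fn
    return admitted


def slugify_good(text: str) -> str:
    return "-".join(text.lower().split())


def slugify_broken(text: str) -> str:
    # Subtly wrong: repeated spaces produce repeated hyphens.
    return text.lower().replace(" ", "-")


suite: List[TestCase] = [
    (("Hello World",), "hello-world"),
    (("a  b",), "a-b"),            # edge case: double space
]

good_admitted = register("slugify", slugify_good, suite)
broken_admitted = register("slugify_v2", slugify_broken, suite)
```

The broken candidate passes the common case and fails only the edge case, which is the kind of quality drift a gate without edge-case tests would let through.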
Most systems still rely on predefined curricula or explicit reward signals. Open-ended capability growth requires adapting unsupervised skill discovery from RL to LLM-based agents — allowing reusable behaviors to emerge from interaction traces without human scaffolding.
Code skills benefit from decades of software assurance. NL and policy skills lack equivalent verification tools. A practical governance challenge: combining rule-based analysis for executable components with semantic inspection for language-based skills.
Even correctly implemented skills may fail as APIs, tools, and workflows evolve. Proactive drift detection through continuous monitoring of execution statistics and deviation from historical behavior is largely absent from current systems.
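Drift detection of this kind can be sketched as a rolling comparison between recent success rate and a historical baseline. The example is illustrative only: the `DriftMonitor` class, its window size, and its tolerance are assumptions, and a production monitor would likely track more signals than success or failure.

```python
from collections import deque


class DriftMonitor:
    """Flag a skill when its recent success rate falls below its baseline."""

    def __init__(self, baseline_rate: float, window: int = 20,
                 tolerance: float = 0.15) -> None:
        self.baseline_rate = baseline_rate     # historical success rate
        self.tolerance = tolerance             # allowed drop before flagging
        self.outcomes = deque(maxlen=window)   # most recent executions only

    def record(self, success: bool) -> bool:
        """Record one execution; return True if drift is detected."""
        self.outcomes.append(success)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False                       # not enough data yet
        recent = sum(self.outcomes) / len(self.outcomes)
        return self.baseline_rate - recent > self.tolerance


monitor = DriftMonitor(baseline_rate=0.9, window=10, tolerance=0.15)

# Ten healthy runs, then an upstream API change starts causing failures.
flags = [monitor.record(True) for _ in range(10)]
flags += [monitor.record(False) for _ in range(3)]
```

With these settings the monitor stays quiet through the first two failures and raises a flag on the third, when the rolling success rate drops to 0.7.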
Marketplace ecosystems create incentives for contribution but expand supply-chain attack surfaces. Liability models must clarify responsibility among skill authors, platform operators, and users. Certification mechanisms should reward dependable skills and discourage risky ones.
Agentic skills are reusable procedural modules that extend LLM agents beyond single-turn tool use toward reliable long-horizon task execution. This SoK paper offers six contributions.
The field faces open challenges in unsupervised discovery, cross-representation verification, drift detection, and governance economics. Progress toward robust, verifiable, and certifiable skills will determine whether skill-based agents can be trusted in high-stakes real-world deployments.