Research Paper
Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. Claw-Eval is an end-to-end evaluation suite addressing all three gaps with 300 human-verified tasks, trajectory-aware grading over 2,159 fine-grained rubric items, and experiments on 14 frontier models.
of safety violations missed by trajectory-opaque evaluation methods
Pass3 drop from controlled error injection, revealing consistency gaps
frontier models evaluated across 300 tasks spanning 9 categories
Large language models have rapidly evolved from conversational assistants into autonomous agents capable of executing complex, multi-step workflows in real-world software environments. Modern agent harnesses like Claude Code and OpenClaw can write code, manage files, browse the web, and orchestrate multi-service workflows with minimal human intervention.
Yet existing benchmarks have three critical gaps that limit their diagnostic power:
Trajectory-Opaque Grading: Most benchmarks check only the final output, ignoring how the agent got there. An agent that stumbles through unsafe intermediate steps but produces a correct final answer gets a passing grade.
Underspecified Safety Evaluation: Safety and robustness are tested in narrow, isolated settings rather than as integral dimensions of real-world task completion.
Narrow Modality Coverage: Most suites focus on a single modality (text-only tool use, or GUI interaction) and ignore the multi-modal, multi-turn scenarios agents face in practice.
Claw-Eval addresses all three gaps within a unified platform, organized around three corresponding design principles.
Every agent action is recorded through three independent evidence channels: execution traces (the full sequence of tool calls and their results), audit logs (system-level records of file changes, network requests, and process spawning), and environment snapshots (periodic captures of the sandbox state). This enables trajectory-aware grading over 2,159 fine-grained rubric items.
300 human-verified tasks spanning 9 categories across three groups: general service orchestration (Easy, Medium, Hard), multimodal perception and generation (Video, Document & Image, Code), and multi-turn professional dialogue (STEM, Social Science, Business). Each task comes with workspace files, mock services, and detailed rubrics.
The scoring protocol evaluates three orthogonal dimensions: Completion (did the agent fulfill the task?), Safety (did it avoid harmful actions?), and Robustness (did it handle edge cases gracefully?). Results are reported as Average Score, Pass@k (best of k trials), and Passk (worst of k trials) to distinguish genuine capability from lucky outcomes.
Experiments were conducted on 14 frontier models spanning seven model families. Each model was evaluated three times per task to compute both Pass@3 (best-of-three, measuring peak capability) and Pass3 (worst-of-three, measuring consistency).
When a standard LLM judge (Gemini-3-Flash) was given the full conversation history and final output but not the execution traces, it missed 44% of safety violations (12 out of 27) and 13% of robustness failures (15 out of 118). The hybrid grading pipeline, which incorporates execution traces, audit logs, and environment snapshots, caught every single one.
This finding is striking because the vanilla judge had access to the conversation history, not just the final answer. The problem is that many safety violations occur in intermediate tool calls that are invisible in the conversation transcript.
When tool calls intermittently fail (simulating real-world API instability), an interesting pattern emerges: Pass@3 remains relatively stable while Pass3 drops dramatically. At a 60% error injection rate, the gap between Pass@3 and Pass3 reaches 42% for Gemini 3.1 Pro.
This means models can still solve tasks on their best attempt, but they struggle to do so consistently. Claude Opus 4.6 shows the highest resilience, with the smallest gap (21%) even at the highest error rate. This highlights that consistency, not just peak capability, should be a primary evaluation criterion.
In multi-turn professional dialogue tasks, models must elicit critical information from simulated users through clarifying questions. A surprising finding: the number of questions asked has virtually no correlation with performance (r = 0.07).
In contrast, question precision (measuring how targeted and trajectory-relevant the questions are) shows a very strong correlation (r = 0.87, R² = 0.76). The best-performing models ask fewer but more precise questions, efficiently zeroing in on the information they need.
Across 101 multimodal tasks spanning Video, Document & Image, and Code domains, no single model dominates. Claude Opus 4.6 leads in Video (11.5% Pass3), GPT 5.4 leads in Document & Image (54.5%), and Claude Sonnet 4.6 leads in Code (33.3%).
Video tasks are the hardest, with a conversion ratio of only 0.37 (meaning only 37% of tasks that a model can solve on its best attempt are solved consistently). This suggests that domain-targeted training, rather than uniform scaling, is needed to improve multimodal agent capability.
Claw-Eval includes diverse task types. Below is an example of a multimodal task where an agent must reconstruct a floor plan from a room walkthrough video.
Claw-Eval is a transparent evaluation suite for LLM-based agents that combines full trajectory auditing, cross-modal task coverage, and controlled perturbation mechanisms to assess whether agents are not only capable but reliably deployable.
The experiments reveal four actionable directions for agent development:
B2B Content
Any content, beautifully transformed for your organization
PDFs, videos, web pages — we turn any source material into production-quality content. Rich HTML · Custom slides · Animated video.