Research Paper

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. Claw-Eval is an end-to-end evaluation suite addressing all three gaps with 300 human-verified tasks, trajectory-aware grading over 2,159 fine-grained rubric items, and experiments on 14 frontier models.

Key Findings at a Glance

44%

of safety violations missed by trajectory-opaque evaluation methods

24%

Pass^3 drop from controlled error injection, revealing consistency gaps

14

frontier models evaluated across 300 tasks spanning 9 categories

The Problem: Why Current Benchmarks Fall Short

Large language models have rapidly evolved from conversational assistants into autonomous agents capable of executing complex, multi-step workflows in real-world software environments. Modern agent harnesses like Claude Code and OpenClaw can write code, manage files, browse the web, and orchestrate multi-service workflows with minimal human intervention.

Yet existing benchmarks have three critical gaps that limit their diagnostic power:

Gap 1

Trajectory-Opaque Grading: Most benchmarks check only the final output, ignoring how the agent got there. An agent that stumbles through unsafe intermediate steps but produces a correct final answer gets a passing grade.

Gap 2

Underspecified Safety Evaluation: Safety and robustness are tested in narrow, isolated settings rather than as integral dimensions of real-world task completion.

Gap 3

Narrow Modality Coverage: Most suites focus on a single modality (text-only tool use, or GUI interaction) and ignore the multi-modal, multi-turn scenarios agents face in practice.

Claw-Eval addresses all three gaps within a unified platform, organized around three corresponding design principles.

Table 1: Benchmark comparison
Table 1: Feature comparison of existing agent benchmarks. Claw-Eval is the only suite supporting all six evaluation axes: trajectory auditing, multimodal tasks, safety, robustness, multi-turn dialogue, and cross-modal coverage.

How Claw-Eval Works

Figure 1: Claw-Eval Architecture
Figure 1: The Claw-Eval architecture. In the Setup phase, task definitions and workspace files are provisioned into a sandbox. During Execution, agent actions are recorded through three independent evidence channels. In the Judge phase, all evidence is combined for multi-dimensional scoring.

Auditable Execution Pipeline

Every agent action is recorded through three independent evidence channels: execution traces (the full sequence of tool calls and their results), audit logs (system-level records of file changes, network requests, and process spawning), and environment snapshots (periodic captures of the sandbox state). This enables trajectory-aware grading over 2,159 fine-grained rubric items.
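
As a rough illustration of why independent evidence channels matter, the sketch below models the three channels as plain lists and runs one rubric-style check against the audit log. The `Evidence` class, the event fields (`type`, `path`), and the `touched_forbidden_path` helper are hypothetical names chosen for this example; they are not taken from the Claw-Eval codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Hypothetical container for the three evidence channels."""
    trace: list = field(default_factory=list)      # tool calls and their results
    audit_log: list = field(default_factory=list)  # system-level events
    snapshots: list = field(default_factory=list)  # periodic sandbox state captures


def touched_forbidden_path(evidence, forbidden_prefix):
    """Rubric-style check: did any audited file write land under a forbidden path?"""
    return any(
        event.get("type") == "file_write"
        and event.get("path", "").startswith(forbidden_prefix)
        for event in evidence.audit_log
    )


# Illustrative data only: the final answer looks harmless,
# but the audit log records a write outside the workspace.
evidence = Evidence(
    trace=[{"tool": "shell", "args": "cp report.txt /etc/report.txt", "result": "ok"}],
    audit_log=[{"type": "file_write", "path": "/etc/report.txt"}],
    snapshots=[{"step": 1, "files": ["/etc/report.txt"]}],
)
final_answer = "Report saved."

violation = touched_forbidden_path(evidence, "/etc/")
print(violation)
```

A grader that reads only `final_answer` has nothing to flag here, whereas a check over the audit log does.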

Cross-Modal Task Suite

300 human-verified tasks spanning 9 categories across three groups: general service orchestration (Easy, Medium, Hard), multimodal perception and generation (Video, Document & Image, Code), and multi-turn professional dialogue (STEM, Social Science, Business). Each task comes with workspace files, mock services, and detailed rubrics.

Scoring Protocol

The scoring protocol evaluates three orthogonal dimensions: Completion (did the agent fulfill the task?), Safety (did it avoid harmful actions?), and Robustness (did it handle edge cases gracefully?). Results are reported as Average Score, Pass@k (best of k trials), and Pass^k (worst of k trials) to distinguish genuine capability from lucky outcomes.
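
The two pass metrics can be sketched as follows, assuming each trial has already been reduced to a boolean pass/fail outcome (how a trial's score is thresholded into a pass is not specified here, so that step is left out). The function names and the sample data are illustrative.

```python
from statistics import mean


def pass_at_k(trials_per_task):
    """Best of k: fraction of tasks passed in at least one trial."""
    return mean(1.0 if any(trials) else 0.0 for trials in trials_per_task)


def pass_hat_k(trials_per_task):
    """Worst of k: fraction of tasks passed in every trial."""
    return mean(1.0 if all(trials) else 0.0 for trials in trials_per_task)


# Hypothetical outcomes for 4 tasks, 3 trials each (k = 3).
results = [
    [True, True, True],     # always passes
    [True, False, True],    # passes sometimes
    [False, False, False],  # never passes
    [False, True, False],   # passes once
]

print(pass_at_k(results))   # 0.75
print(pass_hat_k(results))  # 0.25
```

In this toy data three of four tasks are solvable at least once, but only one is solved every time, which is exactly the gap between peak capability and consistency that the two metrics are meant to expose.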

Table 3: Task Distribution
Table 3: Distribution of 300 tasks across categories and difficulty levels.

Evaluation Results

Experiments were conducted on 14 frontier models spanning seven model families. Each model was evaluated three times per task to compute both Pass@3 (best-of-three, measuring peak capability) and Pass^3 (worst-of-three, measuring consistency).

Table 4: Main Results
Table 4: Main results on General and Multi-turn tasks. Claude Opus 4.6 leads in consistency (Pass^3 = 70.4%), while Claude Sonnet 4.6 achieves the highest peak score (Pass@3 = 82.9%).
Figure 2: Pass^3 by difficulty
Figure 2: Pass^3 by difficulty level on General tasks. All models degrade from Easy to Hard, but the rate of degradation varies significantly. Nemotron 3 Super drops to 0% on Hard tasks.
Table 5: Multimodal Results
Table 5: Multimodal task results. These tasks are substantially harder: the highest Pass^3 is only 25.7% (GPT 5.4), far below the 70.8% achieved on General tasks.

Deep Dive: Four Key Analyses

Trajectory-Opaque Judges Miss 44% of Safety Violations

When a standard LLM judge (Gemini-3-Flash) was given the full conversation history and final output but not the execution traces, it missed 44% of safety violations (12 out of 27) and 13% of robustness failures (15 out of 118). The hybrid grading pipeline, which incorporates execution traces, audit logs, and environment snapshots, caught every single one.

This finding is striking because the vanilla judge had access to the conversation history, not just the final answer. The problem is that many safety violations occur in intermediate tool calls that are invisible in the conversation transcript.
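
The reported miss rates follow directly from the counts above. A small check, using only the figures quoted in this section (the helper name is illustrative):

```python
def miss_rate(missed, total):
    """Fraction of known violations that a judge failed to flag."""
    return missed / total


safety_miss = miss_rate(12, 27)       # safety violations missed by the vanilla judge
robustness_miss = miss_rate(15, 118)  # robustness failures missed by the vanilla judge

print(f"{safety_miss:.1%}")      # 44.4%
print(f"{robustness_miss:.1%}")  # 12.7%
```

Rounded to whole percentages, these give the 44% and 13% quoted in the text.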

Figure 3a: Safety violations
Figure 3a: Safety violation detection. The hybrid pipeline detected all 27 violations, while the vanilla judge missed 12 (44%).
Figure 3b: Robustness violations
Figure 3b: Robustness violation detection. 15 out of 118 violations (13%) were missed by the vanilla judge.

Error Injection Erodes Consistency, Not Peak Capability

When tool calls intermittently fail (simulating real-world API instability), an interesting pattern emerges: Pass@3 remains relatively stable while Pass^3 drops dramatically. At a 60% error injection rate, the gap between Pass@3 and Pass^3 reaches 42% for Gemini 3.1 Pro.

This means models can still solve tasks on their best attempt, but they struggle to do so consistently. Claude Opus 4.6 shows the highest resilience, with the smallest gap (21%) even at the highest error rate. This highlights that consistency, not just peak capability, should be a primary evaluation criterion.
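
One way to picture controlled error injection is a wrapper that makes a tool fail with a fixed probability. The sketch below is an assumption about how such a perturbation could be built, not the mechanism used in the experiments; `with_error_injection`, `ToolError`, and the `lookup` tool are hypothetical names.

```python
import random


class ToolError(RuntimeError):
    """Raised in place of a tool result when a failure is injected."""


def with_error_injection(tool, error_rate, rng):
    """Wrap a tool so that each call fails with probability `error_rate`."""
    def wrapped(*args, **kwargs):
        if rng.random() < error_rate:
            raise ToolError("injected failure")
        return tool(*args, **kwargs)
    return wrapped


def lookup(city):
    """Stand-in for a real tool call (illustrative only)."""
    return {"city": city, "temp_c": 21}


# A seeded generator keeps the perturbation reproducible across runs.
rng = random.Random(0)
flaky_lookup = with_error_injection(lookup, error_rate=0.6, rng=rng)

failures = 0
for _ in range(1000):
    try:
        flaky_lookup("Paris")
    except ToolError:
        failures += 1

print(failures / 1000)  # observed failure fraction, close to the configured rate
```

An agent that retries after a `ToolError` can still succeed on some runs, which is consistent with best-of-three holding up while worst-of-three degrades.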

Figure 4a: Error injection pass rates
Figure 4a: Pass@3 (solid) stays stable while Pass^3 (dashed) drops with increasing error injection rate.
Figure 4b: Gap growth
Figure 4b: The gap between Pass@3 and Pass^3 widens as error rate increases, revealing consistency degradation.

Better Questions, Not More, Yield Better Performance

In multi-turn professional dialogue tasks, models must elicit critical information from simulated users through clarifying questions. A surprising finding: the number of questions asked has virtually no correlation with performance (r = 0.07).

In contrast, question precision (measuring how targeted and trajectory-relevant the questions are) shows a very strong correlation (r = 0.87, R² = 0.76). The best-performing models ask fewer but more precise questions, efficiently zeroing in on the information they need.
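
The two statistics quoted above are mutually consistent if R² comes from a single-predictor linear fit, in which case it equals the square of Pearson's r. That is an assumption about how R² was computed. The sketch below shows the calculation; the sample data points are invented for illustration and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented, perfectly linear data: precision versus pass rate.
precision = [0.2, 0.4, 0.6, 0.8]
pass_rate = [0.1, 0.2, 0.3, 0.4]
r = pearson_r(precision, pass_rate)
print(round(r, 6))  # 1.0

# Consistency check on the reported figures: r = 0.87 implies R^2 of about 0.76.
print(round(0.87 ** 2, 2))  # 0.76
```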

Figure 5a: Rounds vs Pass^3
Figure 5a: Average dialogue rounds vs. Pass^3. No correlation (r = 0.07) between quantity of questions and performance.
Figure 5b: Precision vs Pass^3
Figure 5b: Question Precision vs. Pass^3. Very strong correlation (r = 0.87) showing quality matters far more than quantity.

Multimodal Capability Is Domain-Specific

Across 101 multimodal tasks spanning Video, Document & Image, and Code domains, no single model dominates. Claude Opus 4.6 leads in Video (11.5% Pass^3), GPT 5.4 leads in Document & Image (54.5%), and Claude Sonnet 4.6 leads in Code (33.3%).

Video tasks are the hardest, with a conversion ratio of only 0.37 (meaning only 37% of tasks that a model can solve on its best attempt are solved consistently). This suggests that domain-targeted training, rather than uniform scaling, is needed to improve multimodal agent capability.
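
Following the description above, the conversion ratio can be read as the share of tasks solved at least once that are also solved in every trial. A minimal sketch under that reading, with an illustrative function name and invented trial data:

```python
def conversion_ratio(trials_per_task):
    """Tasks solved in every trial, divided by tasks solved in at least one trial."""
    solved_any = sum(1 for trials in trials_per_task if any(trials))
    solved_all = sum(1 for trials in trials_per_task if all(trials))
    return solved_all / solved_any if solved_any else 0.0


# Invented outcomes for 5 tasks, 3 trials each.
results = [
    [True, True, True],
    [True, False, True],
    [False, True, False],
    [False, False, False],
    [True, True, True],
]

print(conversion_ratio(results))  # 0.5
```

Here four tasks are solvable on the best attempt and two of them are solved consistently, so the ratio is 0.5. A ratio of 0.37 would mean roughly one in three solvable tasks is solved every time.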

Table 6: Domain-specific results
Table 6: Pass^3 by model and multimodal domain. Each domain has a different leader.
Figure 6: Domain comparison
Figure 6: Aggregated Pass@3 and Pass^3 across multimodal domains. Video is hardest (conversion ratio 0.37), while Document & Image is the most consistent (conversion ratio 0.53).

Case Studies: Example Tasks

Claw-Eval includes diverse task types. Below is an example of a multimodal task where an agent must reconstruct a floor plan from a room walkthrough video.

Input: Room Walkthrough Video Frames
Case study input
Output: Agent-Generated Floor Plan
Case study output

Conclusions

Claw-Eval is a transparent evaluation suite for LLM-based agents that combines full trajectory auditing, cross-modal task coverage, and controlled perturbation mechanisms to assess whether agents are not only capable but reliably deployable.

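The experiments point to four actionable directions for agent development, one per analysis above: grade full trajectories rather than final outputs alone, optimize for consistency under perturbation and not only peak capability, improve the precision of clarifying questions rather than their number, and pursue domain-targeted multimodal training.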
