Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Deep Dive: Four Key Analyses

Trajectory-Opaque Judges Miss 44% of Safety Violations

When a standard LLM judge (Gemini-3-Flash) was given the full conversation history and final output but not the execution traces, it missed 44% of safety violations (12 out of 27) and 13% of robustness failures (15 out of 118). The hybrid grading pipeline, which incorporates execution traces, audit logs, and environment snapshots, caught every single one.

This finding is striking because the vanilla judge had access to the conversation history, not just the final answer. The problem is that many safety violations occur in intermediate tool calls that are invisible in the conversation transcript.

Figure 3a: Safety violations — **Figure 3a:** Safety violation detection. The hybrid pipeline detected all 27 violations, while the vanilla judge missed 12 (44%).

Figure 3b: Robustness violations — **Figure 3b:** Robustness violation detection. 15 out of 118 violations (13%) were missed by the vanilla judge.

Error Injection Erodes Consistency, Not Peak Capability

When tool calls intermittently fail (simulating real-world API instability), an interesting pattern emerges: Pass@3 remains relatively stable while Pass³ drops dramatically. At a 60% error injection rate, the gap between Pass@3 and Pass³ reaches 42% for Gemini 3.1 Pro.

This means models can still solve tasks on their best attempt, but they struggle to do so consistently. Claude Opus 4.6 shows the highest resilience, with the smallest gap (21%) even at the highest error rate. This highlights that consistency, not just peak capability, should be a primary evaluation criterion.

Figure 4a: Error injection pass rates — **Figure 4a:** Pass@3 (solid) stays stable while Pass³ (dashed) drops with increasing error injection rate.

Figure 4b: Gap growth — **Figure 4b:** The gap between Pass@3 and Pass³ widens as error rate increases, revealing consistency degradation.

Better Questions, Not More, Yield Better Performance

In multi-turn professional dialogue tasks, models must elicit critical information from simulated users through clarifying questions. A surprising finding: the number of questions asked has virtually no correlation with performance (r = 0.07).

In contrast, question precision (measuring how targeted and trajectory-relevant the questions are) shows a very strong correlation (r = 0.87, R² = 0.76). The best-performing models ask fewer but more precise questions, efficiently zeroing in on the information they need.

Figure 5a: Rounds vs Pass^3 — **Figure 5a:** Average dialogue rounds vs. Pass³. No correlation (r = 0.07) between quantity of questions and performance.

Figure 5b: Precision vs Pass^3 — **Figure 5b:** Question Precision vs. Pass³. Very strong correlation (r = 0.87) showing quality matters far more than quantity.

Multimodal Capability Is Domain-Specific

Across 101 multimodal tasks spanning Video, Document & Image, and Code domains, no single model dominates. Claude Opus 4.6 leads in Video (11.5% Pass³), GPT 5.4 leads in Document & Image (54.5%), and Claude Sonnet 4.6 leads in Code (33.3%).

Video tasks are the hardest, with a conversion ratio of only 0.37 (meaning only 37% of tasks that a model can solve on its best attempt are solved consistently). This suggests that domain-targeted training, rather than uniform scaling, is needed to improve multimodal agent capability.

Table 6: Domain-specific results — **Table 6:** Pass³ by model and multimodal domain. Each domain has a different leader.

Figure 6: Domain comparison — **Figure 6:** Aggregated Pass@3 and Pass³ across multimodal domains. Video is hardest (r = 0.37), Document & Image has highest consistency (r = 0.53).

References (49 citations)

[1] Z. AI. Glm-5v-turbo. https://docs.z.ai/guides/vlm/glm-5v-turbo, 2026.
[2] Anthropic. Claude code. https://www.anthropic.com/product/claude-code, 2025.
[3] Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/ claude-opus-4-6, 2026.
[4] Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/ claude-sonnet-4-6, 2026.
[5] A. Blakeman, A. Grattafiori, A. Basant, A. Gupta, A. Khattar, A. Renduchintala, A. Vavre, A. Shukla, A. Bercovich, A. Ficek, et al. Nvidia nemotron 3: Efficient and open intelligence. arXiv preprint arXiv:2512.20856, 2025.
[6] G. DeepMind. Gemini 3 flash. https://deepmind.google/models/gemini/flash/, 2025.
[7] G. DeepMind. Gemini 3.1 pro. https://deepmind.google/models/gemini/pro/, 2026.
[8] S. Ding, X. Dai, L. Xing, S. Ding, Z. Liu, J. Yang, P. Yang, Z. Zhang, X. Wei, Y. Ma, H. Duan, J. Shao, J. Wang, D. Lin, K. Chen, and Y. Zang. Wildclawbench. https://github.com/InternLM/WildClawBench, 2026. GitHub repository.
[9] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE- bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=VTF8yNQM66.
[10] Kilo AI team. Pinchbench, 2026. URL https://github.com/pinchbench/skill. Bench- marking system for evaluating LLM models as OpenClaw agents.
[11] J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. Lim, P .- Y. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881-905, 2024.
[12] J. Li, W. Zhao, J. Zhao, W. Zeng, H. Wu, X. Wang, R. Ge, Y. Cao, Y. Huang, W. Liu, et al. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution. arXiv preprint arXiv:2510.25726, 2025.
[13] M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li. Api-bank: A comprehensive benchmark for tool-augmented llms. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 3102-3116, 2023.
[14] R. Li, L. Li, S. Ren, H. Tian, S. Gu, S. Li, Z. Yue, Y. Wang, W. Ma, Z. Yang, et al. Groundingme: Exposing the visual grounding gap in mllms through multi-dimensional evaluation. arXiv preprint arXiv:2512.17495, 2025.
[15] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[16] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, et al. Agentbench: Evaluating Ilms as agents. In The Twelfth International Conference on Learning Representations.
[17] M. MacDiarmid, B. Wright, J. Uesato, J. Benton, J. Kutasov, S. Price, N. Bouscal, S. Bowman, T. Bricken, A. Cloud, et al. Natural emergent misalignment from reward hacking in production rl. arXiv preprint arXiv:2511.18397, 2025.
[18] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026.
[19] G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023.
[20] X. MiMo. Xiaomi mimo-v2-omni. https://mimo.xiaomi.com/mimo-v2-omni, 2026.
[21] X. MiMo. Xiaomi mimo-v2-pro. https://mimo.xiaomi.com/mimo-v2-pro, 2026.
[22] MiniMax. Minimax m2.7. https://www.minimax.io/models/text/m27, 2026.
[23] OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4, 2026.
[24] OpenClaw. Openclaw. https://github.com/openclaw/openclaw, 2026. GitHub repository.
[25] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
[26] B. Qiao, L. Li, X. Zhang, S. He, Y. Kang, C. Zhang, F. Yang, H. Dong, J. Zhang, L. Wang, et al. Taskweaver: A code-first agent framework. arXiv preprint arXiv:2311.17541, 2023.
[27] Y. Ruan, H. Dong, A. Wang, S. Pitis, Y. Zhou, J. Ba, Y. Dubois, C. J. Maddison, and T. Hashimoto. Identifying the risks of Im agents with an Im-emulated sandbox. arXiv preprint arXiv:2309.15817, 2023.
[28] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539-68551, 2023.
[29] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154-38180, 2023.
[30] Q. Sun, M. Li, Z. Liu, Z. Xie, F. Xu, Z. Yin, K. Cheng, Z. Li, Z. Ding, Q. Liu, et al. Os-sentinel: Towards safety-enhanced mobile gui agents via hybrid validation in realistic workflows. arXiv preprint arXiv:2510.24411, 2025.
[31] E. B. Sydney Von Arx, Lawrence Chan. Recent frontier models are reward hacking. https: //metr.org/blog/2025-06-05-recent-reward-hacking/, 06 2025.
[32] K. Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen, et al. Kimi k2. 5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
[33] X. Wang, Z. Wang, J. Liu, Y. Chen, L. Yuan, H. Peng, and H. Ji. Mint: Evaluating llms in multi- turn interaction with tools and language feedback. In The Twelfth International Conference on Learning Representations.
[34] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040-52094, 2024.
[35] T. Xie, M. Yuan, D. Zhang, X. Xiong, Z. Shen, Z. Zhou, X. Wang, Y. Chen, J. Deng, J. Chen, B. Wang, H. Wu, J. Chen, J. Wang, D. Lu, H. Hu, and T. Yu. Introducing osworld-verified. xlang.ai, Jul 2025. URL https://xlang.ai/blog/osworld-verified.
[36] T. Xiong, Y. Ge, M. Li, Z. Zhang, P. Kulkarni, K. Wang, Q. He, Z. Zhu, C. Liu, R. Chen, et al. Multi-crit: Benchmarking multimodal judges on pluralistic criteria-following. arXiv preprint arXiv:2511.21662, 2025.
[37] T. Xiong, X. Wang, D. Guo, Q. Ye, H. Fan, Q. Gu, H. Huang, and C. Li. Llava-critic: Learning to evaluate multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13618-13628, 2025.
[38] T. Xiong, S. Wang, G. Liu, Y. Dong, M. Li, H. Huang, J. Kautz, and Z. Yu. Phycritic: Multimodal critic models for physical ai. arXiv preprint arXiv:2602.11124, 2026.
[39] W. Xiong, Y. Song, X. Zhao, W. Wu, X. Wang, K. Wang, C. Li, W. Peng, and S. Li. Watch every step! llm agent learning via iterative step-level process refinement. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1556-1572, 2024.
[40] W. Xiong, Y. Song, Q. Dong, B. Zhao, F. Song, X. Wang, and S. Li. Mpo: Boosting llm agents with meta plan optimization. arXiv preprint arXiv:2503.02682, 5(6):7, 2025.
[41] F. F. Xu, Y. Song, B. Li, Y. Tang, K. Jain, M. Bao, Z. Z. Wang, X. Zhou, Z. Guo, M. Cao, et al. Theagentcompany: Benchmarking llm agents on consequential real world tasks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[42] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, and J. Zhang. On the tool manipulation capability of open-source large language models. arXiv preprint arXiv:2305.16504, 2023.
[43] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[44] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022.
[45] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. T-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URL https://arxiv.org/abs/2406.12045.
[46] T. Yuan, Z. He, L. Dong, Y. Wang, R. Zhao, T. Xia, L. Xu, B. Zhou, F. Li, Z. Zhang, et al. R-judge: Benchmarking safety risk awareness for Ilm agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1467-1490, 2024.
[47] A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.
[48] Z. Zhang, S. Cui, Y. Lu, J. Zhou, J. Yang, H. Wang, and M. Huang. Agent-safetybench: Evaluating the safety of Ilm agents. arXiv preprint arXiv:2412.14470, 2024.
[49] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2):1-124, 2023.

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Key Findings at a Glance

The Problem: Why Current Benchmarks Fall Short

How Claw-Eval Works

Auditable Execution Pipeline

Cross-Modal Task Suite

Scoring Protocol

Evaluation Results

Deep Dive: Four Key Analyses

Trajectory-Opaque Judges Miss 44% of Safety Violations

Error Injection Erodes Consistency, Not Peak Capability

Better Questions, Not More, Yield Better Performance

Multimodal Capability Is Domain-Specific

Conclusions