1. Introduction
Multimodal Large Language Models are rapidly evolving from passive observers to active investigators. Rather than answering from a static snapshot, modern systems increasingly solve problems by interacting: they manipulate images to surface fine-grained evidence and consult external resources to verify facts not visually present. This shift represents multimodal agentic capability, decomposable into two core dimensions: (1) Visual Expansion, which enables models to think with images by actively transforming and analyzing inputs (e.g., cropping, rotating, enhancing) to uncover latent cues; and (2) Knowledge Expansion, which enables models to go beyond parametric memory via open-ended web search to validate real-world facts and resolve ambiguity.
What are Visual Expansion and Knowledge Expansion?
Think of a detective: Visual Expansion means using a magnifying glass on the image (crop, zoom, rotate, OCR) to surface hidden evidence. Knowledge Expansion means going to the library (web search) to verify facts not present in the image. Agentic-MME measures how well AI combines both, because real-world complex tasks demand both simultaneously, not sequentially.
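The two expansion modes can be made concrete with a toy sketch. Here the "image" is a plain 2D grid and the "web" is a stubbed dictionary; the tool names (`crop`, `rotate90`, `web_search`) are illustrative stand-ins, not the benchmark's actual API.

```python
# Toy sketch: Visual Expansion as operations on a pixel grid,
# Knowledge Expansion as a stubbed external lookup.
# Tool names and signatures are illustrative assumptions.

def crop(image, top, left, height, width):
    """Return the sub-grid covering the region of interest."""
    return [row[left:left + width] for row in image[top:top + height]]

def rotate90(image):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def web_search(query, knowledge_base):
    """Stand-in for open-web retrieval: fetch facts outside the image."""
    return knowledge_base.get(query, "no result")

# A 3x3 "image" with a small cue hidden in the bottom-right corner.
img = [[0, 0, 0],
       [0, 0, 0],
       [0, 0, 7]]
cue = crop(img, 2, 2, 1, 1)  # Visual Expansion: zoom in on the cue
fact = web_search("meaning of 7", {"meaning of 7": "lucky number"})
print(cue, fact)
```

An "intertwined" task in this miniature world would require both calls: the crop surfaces a symbol the model cannot interpret, and the search resolves what it means.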
While this active paradigm promises to solve complex real-world problems, current evaluation of multimodal agentic capabilities remains fragmented and insufficient. Most existing benchmarks capture specific aspects of tool use but fall short in three critical dimensions, and a true multimodal agent benchmark must address all three simultaneously.
Three Critical Gaps in Prior Benchmarks
- Inflexible tool integration: Current evaluations decouple visual tool use from open-web search, treating them as independent modules. No unified framework allows agents to fluidly select and switch between arbitrary visual and search tools.
- Unexplored synergy: The interplay between Visual and Knowledge Expansion is largely untested. True multimodal agents must excel at "intertwined" tasks that cannot be solved by simple visual expansion or isolated knowledge expansion alone.
- No process verification: Existing evaluations focus on final-answer correctness, offering no insight into whether tools were invoked, applied correctly, or used efficiently. Unfaithful tool execution remains hidden.
Agentic-MME addresses all three gaps with a carefully designed benchmark that includes 418 real-world tasks, a unified execution harness supporting both Code (Gen) and Atomic (Atm) tool interfaces, and over 2,000 human-annotated stepwise checkpoints that enable fine-grained process-level verification.
Why process verification matters
Under final-answer grading, a student who guesses the correct exam answer without doing the work looks identical to one who reasoned correctly. In deployed AI systems (medical imaging, legal analysis), this difference is critical. Agentic-MME's process verification catches it by auditing each intermediate state: Was the right region cropped? Did the search return relevant results? Was the calculation faithful? It is the difference between auditing a pilot's flight log and merely checking the landing.
2. Agentic-MME Benchmark
2.1 Overview
Agentic-MME is designed to evaluate multimodal agentic capabilities in realistic scenarios where agents actively utilize visual tools to transform and perceive image content, and, contingent on task requirements, coordinate with open-web search to retrieve essential external knowledge. Unlike benchmarks that isolate visual operations or web search, Agentic-MME targets the deep synergy between these two capabilities. The benchmark comparison (Table 1) shows that Agentic-MME is the only benchmark that simultaneously supports heterogeneous tool interfaces, tests tool synergy, enables process verification, measures efficiency, and defines difficulty levels.
Table 1: Comparison with Existing Multimodal Agentic Benchmarks
Comparison across key capabilities and evaluation protocol dimensions. Agentic-MME (bottom row) is the only benchmark covering all dimensions: image tools, search core, process verification, unified code+tool interface, efficiency metric, and difficulty levels.
2.2 Task Setup, Difficulty, and Metrics
Each instance provides one or more images and a question. Agents solve the task by actively manipulating images within a unified tool-augmented interface equipped with 13 distinct visual operations for Visual Expansion and 4 open-web retrieval tools for Knowledge Expansion. Tasks are systematically stratified into three difficulty levels based on the interaction complexity along a reasonable solution path.
Easy: Single Operation
Requires a single visual operation (e.g., one crop or rotation). Avg 2.89 checkpoints, 1.21 tools per task.
Mid: Multi-Step Workflow
Requires multi-step workflows combining image manipulation with optional web search. Avg 4.64 checkpoints, 2.42 tools per task.
Hard: Advanced Synergy
Demands intertwined, multi-round interactions between visual manipulation and web search. Cannot be solved by simple sequential tool chaining. Avg 6.67 checkpoints, 4.07 tools per task.
Table 2: Task difficulty distribution. Level 3 tasks comprise 19.4% of the benchmark and require 4.07 tools and 6.67 checkpoints on average, making them significantly more complex than Level 1 tasks.
Evaluation Metrics
What are S-axis and V-axis?
Unlike benchmarks that only check final answers, Agentic-MME uses two evaluation axes. S-axis (Strategy & Tool Execution): grades each intermediate tool call: did the model choose the right tool, apply it correctly, and extract useful evidence? V-axis (Visual Evidence Verification): verifies that visual artifacts the model claims to have found actually exist in the image, preventing fabricated reasoning chains that accidentally reach correct answers.
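A minimal sketch of how grading along the two axes might work over a trajectory of stepwise checkpoints. The `Checkpoint` fields and the fraction-of-passed-checks scoring rule are illustrative assumptions, not Agentic-MME's actual schema.

```python
# Illustrative dual-axis grading over human-annotated checkpoints.
# Field names and the scoring rule are assumptions for exposition.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    expected_tool: str      # S-axis: which tool should be invoked here
    evidence_present: bool  # V-axis: does the claimed visual artifact exist?

def grade_trajectory(checkpoints, tool_calls):
    """Return (S-axis, V-axis) scores as fractions of passed checks."""
    n = len(checkpoints)
    s_hits = sum(1 for cp, call in zip(checkpoints, tool_calls)
                 if call == cp.expected_tool)
    v_hits = sum(1 for cp in checkpoints if cp.evidence_present)
    return s_hits / n, v_hits / n

cps = [Checkpoint("crop", True), Checkpoint("web_search", True),
       Checkpoint("ocr", False)]
s_score, v_score = grade_trajectory(cps, ["crop", "web_search", "rotate"])
print(s_score, v_score)
```

The key property this captures is that a correct final answer cannot mask a wrong intermediate step: a fabricated artifact lowers the V-axis score even when the answer happens to be right.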
2.3 Data Collection and Annotation
The data collection pipeline uses a Backward Drafting approach: rather than writing questions first, annotators start from high-resolution, visually complex images that require visual tools to perceive, then construct multi-step trajectories grounding each step in tool actions and visual ground truth. This ensures tool invocation is necessary, not optional. The pipeline proceeds through four stages: image sourcing, backward drafting, granular annotation, and quality assurance.
Why use Backward Drafting?
Most datasets start with a question and find an image, accidentally making tool use optional. Backward Drafting reverses this: annotators start from a visually complex image, identify what evidence is buried inside, build the tool trajectory to extract it, then write the question. This guarantees every task genuinely requires tool use. Similar to how escape rooms are designed: build the puzzle first, write the instructions second.
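The image-first ordering can be sketched as a small assembly function that builds the annotation record in the order annotators actually work. The field names and question template are hypothetical, for illustration only.

```python
# Toy illustration of Backward Drafting: image first, trajectory second,
# question last. The record schema is an assumption, not the benchmark's.

def backward_draft(image_id, buried_evidence, tool_steps):
    """Assemble a task record in annotator order."""
    question = f"What does the evidence at {buried_evidence} reveal?"
    return {
        "image": image_id,            # 1. start from a complex image
        "evidence": buried_evidence,  # 2. identify what is buried inside
        "trajectory": tool_steps,     # 3. build the tool path to extract it
        "question": question,         # 4. only then write the question
    }

task = backward_draft("img_001", "the bottom-right serial number",
                      ["crop", "ocr", "web_search"])
print(task["question"])
```

Because the trajectory exists before the question does, no question can be answerable without the tool path that produced it.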
Table 3: Dataset Key Properties
430 images, 899 tools, 6 domains / 35 sub-domains. Average image resolution: 1952×1747 px. 43.1% of tasks have small visual cues (<10% of image area). 29.4% of tasks require external web search.
2.4 Quality Control and Assurance
Each annotated task undergoes multiple rounds of independent verification. Quality control involves step-wise oracle testing โ systematically probing edge cases and failure modes โ followed by consensus auditing where multiple experts must agree on the ground truth trajectory. Tasks failing to meet consistency thresholds are revised or discarded. This rigorous process ensures the 2,000+ stepwise checkpoints are grounded, reproducible, and faithfully represent human-level reasoning trajectories.
Each task averages more than 10 person-hours of manual annotation, reflecting the depth of process-level verification required to capture faithful, step-by-step reasoning trajectories.
2.5 Unified Tool Interface and Execution Harness
A central design goal of Agentic-MME is to benchmark agentic capability across heterogeneous tool implementations. The unified execution harness supports two interfaces: Code mode (Gen), where models write sandboxed Python to perform visual transforms, and Atomic mode (Atm), where models interact via structured function calls following OpenAI-compatible JSON schemas. This controlled comparison tests whether tool competence generalizes across interfaces rather than being tied to one training format.
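In Atomic mode, a visual tool would be declared to the model as an OpenAI-compatible function schema. The declaration below is a hypothetical example for a crop tool; the exact tool name and parameters are assumptions, not the harness's actual definitions.

```python
# Illustrative Atomic-mode (Atm) declaration for a hypothetical crop tool,
# following the OpenAI-compatible function-calling JSON schema style.
import json

crop_tool = {
    "type": "function",
    "function": {
        "name": "crop_image",
        "description": "Crop a region of the input image to surface detail.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_id": {"type": "string"},
                "bbox": {
                    "type": "array",
                    "items": {"type": "number"},
                    "description": "[x1, y1, x2, y2] in pixel coordinates",
                },
            },
            "required": ["image_id", "bbox"],
        },
    },
}

# In Code (Gen) mode, the same operation would instead be emitted as
# sandboxed Python, e.g. something like image.crop((x1, y1, x2, y2)).
print(crop_tool["function"]["name"])
```

The contrast is the point of the controlled comparison: the same underlying operation is reachable either through a rigid schema (Atm) or through free-form code (Gen), so differences in accuracy isolate interface effects rather than tool availability.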
Visual Expansion (13 tools)
Active image manipulation tools that transform images to surface hidden evidence, extract fine-grained details, or apply spatial transformations.
Knowledge Expansion (4 tools)
Open-web retrieval tools that extend beyond parametric memory to verify real-world facts and resolve ambiguity through external search.
3. Experiments
3.1 Experimental Setup
We evaluate a diverse set of models on Agentic-MME, including open-source models (Thyme-rl, DeepEyes-V2, Qwen3-VL-235B, Qwen3-VL-8B-thinking, Qwen3-VL-32B-thinking) and closed-source models (Gemini 3 family, Kimi-k2.5, GPT-5.2, Qwen3.5-plus). A human reference baseline is obtained by averaging three independent human solvers allowed to use search engines and perception tools.
Each model is evaluated under both tool interfaces: Code mode (Gen) for sandboxed Python execution and Atomic mode (Atm) for structured function calls. This controlled comparison directly tests whether tool competence generalizes across interfaces. All evaluations run on fully logged, replayable traces, with GPT-5-mini as the primary judge; validation by human experts shows consistent results across judge choices (Table 8).
3.2 Main Results on Agentic-MME
All models fall far below human performance, with a sharp accuracy drop on Level-3. Human solvers reach 93.8% overall and remain strong on the hardest split (L3: 82.3%). The best model, Gemini 3 Pro (Atm), achieves 56.3% overall, but only 33.3% on Level-3. Without tools, Gemini 3 Pro drops to 7.5% on L3; tool access provides a 4.4× improvement to 33.3%, yet the gap to human performance (82.3%) remains vast.
Open-source models lag behind closed-source, primarily in search and planning. The gap is most visible on Level-3: Qwen3 VL-235B drops to 10.1% and Thyme-rl collapses to 2.5%. The S-axis reveals the mechanism: current open-source models can invoke tools but have not yet acquired the retrieval and planning sophistication needed to chain multi-step workflows reliably.
Atomic (Atm) mode generally improves accuracy over Code (Gen) mode across models. This suggests that structured function-call interfaces reduce implementation errors and provide clearer boundaries for tool use, enabling more reliable stepwise execution.
Table 4: Main Results on Agentic-MME
Results for all evaluated models in Gen and Atm modes across Overall, Level 1 (L1), Level 2 (L2), and Level 3 (L3). Metrics: Acc = accuracy, S = S-axis score, V = V-axis validity, VIT/VFT = intent/fidelity tracking. Human achieves 93.8% overall vs best model 56.3%.
3.3 Further Analysis
We conduct two analyses to understand the sources of performance gaps: (1) an ablation study isolating the contribution of each tool category, and (2) an upper-bound analysis providing visual cues and stepwise guidance to quantify the potential improvement achievable with better tool execution.
Table 5: Tool Ablation Study
Ablation results for Gemini 3 Flash and Qwen3 VL-235B. Settings: Perception-only (no tools), Image-only (visual tools only), Search-only (web search only), Full (both). Full integration achieves the best performance: visual and search tools are complementary, not redundant.
Table 6: Upper-Bound Analysis
Performance when providing visual cues (+Visual Cues) and stepwise guidance (+Stepwise Guidance). Stepwise guidance boosts Gemini 3 Flash from 52.24% to 76.21%, showing the ceiling is achievable with better planning. The gap between current autonomous performance and guided performance represents the frontier for agentic reasoning improvement.
Adding stepwise guidance boosts performance from 52.24% to 76.21%, a 24-point jump showing that current models fail primarily on planning and execution reliability, not fundamental perceptual capability.
3.4 Fine-Grained Error Analysis
To understand how models fail, we conduct fine-grained error analysis across all three difficulty levels. The heatmap (Figure 4) shows error category distribution for L1, L2, L3, and Overall, revealing that error patterns change significantly with difficulty. L3 tasks show much higher rates of multi-hop reasoning failures, search integration errors, and visual ambiguity mismanagement, the core challenges unique to advanced synergistic workflows.
Table 7: Tool Call Efficiency (Calls & Overthinking)
Average tool calls and Overthinking (OT) score per model in Gen and Atm modes. GPT-5-mini makes the most tool calls (Gen: 12.13, Atm: 7.22). High OT indicates redundant tool usage relative to human reference trajectories.
What is the Overthinking (OT) metric?
An AI calling 12 tools when a human needs 3 is inefficient and costly in production. The OT metric measures excess tool calls relative to the human reference trajectory. Low OT means efficient, focused tool use; high OT means the model is "spinning", repeatedly querying without converging. GPT-5-mini makes the most tool calls (Gen: 12.13), suggesting it struggles to decide when sufficient evidence has been gathered.
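The paper summary does not give the exact OT formula, but an excess-calls ratio against the human reference captures the idea; the function below is an illustrative assumption, not the benchmark's definition.

```python
# Sketch of an overthinking-style score: excess tool calls relative to the
# human reference trajectory. The exact OT formula used by Agentic-MME is
# not stated here, so this ratio is an illustrative assumption.

def overthinking(model_calls: int, human_calls: int) -> float:
    """Excess calls as a fraction of the human reference (0.0 = matched)."""
    return max(0.0, (model_calls - human_calls) / human_calls)

# An agent using 12 calls where humans need 3 overshoots by 3x.
print(overthinking(12, 3))  # 3.0
```

Clipping at zero means an agent that solves the task in fewer calls than the human reference is simply scored as efficient rather than rewarded for undershooting.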
Table 8: Judge Consistency Validation
Evaluation results using different judges (GPT-5-mini, Gemini-2.5-Flash, GPT-4o-mini, Human Expert). All judges produce identical Acc (56.28), confirming evaluation stability regardless of judge choice.
4. Conclusion
We introduce Agentic-MME, a process-verified benchmark designed to systematically evaluate the deep synergy between active visual manipulation (Visual Expansion) and open-web retrieval (Knowledge Expansion) in multimodal agents. Moving beyond opaque final-answer grading, we contribute a unified execution harness supporting heterogeneous tool interfaces, grounded in over 2,000 human-annotated stepwise checkpoints. This dual-axis framework enables granular auditing of intermediate tool intent, visual artifact faithfulness, and execution efficiency.
Our evaluation exposes a critical gap between frontier models and human performance, particularly in complex workflows. While current models can execute simple sequential tool chaining, they struggle severely with advanced synergistic tasks such as resolving visual ambiguity through fuzzy search and conducting iterative hypothesis verification across modalities. By pinpointing these bottlenecks (unfaithful tool execution and redundant "overthinking" loops), Agentic-MME provides a rigorous, diagnostic roadmap for developing robust, long-horizon multimodal agents.
Key Takeaway for Practitioners
If you're building multimodal AI systems, Agentic-MME reveals that the bottleneck is not perception; it is planning under ambiguity and faithful multi-tool orchestration. Models that score well on static benchmarks can still collapse on Level-3 tasks requiring iterative visual search and cross-modal verification. Agentic-MME is the diagnostic tool to identify and fix these gaps.
References (39)
- Bai, J., et al. (2025). Qwen3-VL Technical Report.
- Chen, J., et al. (2025). Knowledge Expansion in Multimodal Agents.
- Deng, G., et al. (2023). Mind2Web: Towards a Generalist Agent for the Web. NeurIPS.
- Froger, A., et al. (2026). Process-Verified Evaluation for Agentic AI. ICLR.
- Fu, C., et al. (2023). MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models.
- Guo, T., et al. (2025). VisualBench: Systematic Evaluation of Visual Tool Use. CVPR.
- Hong, M., et al. (2025). DeepEyes: Deep Visual Reasoning in MLLMs. ICLR.
- Hou, X., et al. (2025). CodeV: Process-Level Evaluation of Code-Augmented Visual Agents. ICML.
- Huang, T., et al. (2026a). Deep Research with Multimodal Agents. NAACL.
- Huang, T., et al. (2026b). Vision-in-the-Loop Web Research. ACL.
- Jiang, Z., et al. (2024). Open-World Information Seeking with Multimodal Agents.
- Kimi Team. (2026). Kimi-k2.5: Frontier Multimodal Reasoning. Technical Report.
- Lai, S., et al. (2025). Active Visual Manipulation for MLLMs. CVPR.
- Li, J., et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training. ICML.
- Li, P., et al. (2025a). GTA: Tool-Augmented Multimodal Benchmark. NeurIPS.
- Li, W., et al. (2025b). MM-BrowseComp: Browser-based Multimodal Completion. ACL.
- Ma, Y., et al. (2024). VisualAgent: Multi-tool Visual Reasoning. ECCV.
- Narayan, A., et al. (2025). Fact Verification via Multimodal Search. SIGIR.
- OpenAI. (2025). GPT-5.2 Technical Report.
- Shen, Y., et al. (2023). HuggingGPT: Solving AI Tasks with ChatGPT and its Friends.
- Shi, K., et al. (2025a). Vision-Language Agents for Scientific Discovery. Nature MI.
- Shi, K., et al. (2025b). MMSPHy-Plus: Physical Scene Understanding Benchmark. CVPR.
- Shi, K., et al. (2025c). Visual Tool Selection and Execution. ICLR.
- Su, D., et al. (2025). Image Transformation for Evidence Extraction.
- Tao, R., et al. (2025). MMSearch: Multimodal Web Search Evaluation. ACL.
- Team, G., et al. (2023). Gemini: A Family of Highly Capable Multimodal Models.
- Team, G., et al. (2026). Gemini 3: Frontier Multimodal Model. Technical Report.
- Team, Q. (2026). Qwen3.5-plus Technical Report.
- Wang, J., et al. (2024). ToolBench: Benchmarking Tool Use in Language Models. NeurIPS.
- Wang, L., et al. (2025). Agentic Visual Perception Systems. CVPR.
- Wei, Q., et al. (2026). Visual Expansion for Multimodal Agents.
- Yu, W., et al. (2023). MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities.
- Yue, X., et al. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding. CVPR.
- Zeng, Z., et al. (2026). Knowledge Retrieval in Multimodal Settings. ACL.
- Zhang, M., et al. (2025a). Multimodal Scientific Agents. Science AI.
- Zhang, Q., et al. (2025b). Thyme-rl: Temporal Reasoning with Visual Tools. NeurIPS.
- Zhang, T., et al. (2026). AgentMME: Process-Verified Multimodal Benchmark.
- Zheng, H., et al. (2025). Fine-grained Visual Manipulation. ECCV.
- Tao, R., et al. (2025). TIR-Bench: Tool-Interactive Reasoning Benchmark.