---
arxiv_id: 2604.03016
title: "Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?"
authors:
  - Qianshan Wei
  - Yishan Yang
  - Siyi Wang
  - Jinglin Chen
  - Binyu Wang
  - Jiaming Wang
  - Shuang Chen
  - Zechen Li
  - Yang Shi
  - Yuqi Tang
  - Weining Wang
  - Yi Yu
  - Chaoyou Fu
  - Qi Li
  - Yi-Fan Zhang
difficulty: Advanced
tags:
  - Agent
  - Benchmark
  - Multimodal
  - Vision
published_at: 2026-04-03
flecto_url: https://flecto.zer0ai.dev/papers/2604.03016/
lang: en
---

## Page Title

### Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? | Flecto

## Meta Description

A process-verified benchmark with 418 real-world tasks evaluating multimodal agentic capabilities. Best model: Gemini 3 Pro 56.3% vs Human 93.8%.

## Hero H1

### Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

## Hero Subtitle

A process-verified benchmark with 418 real-world multimodal tasks, 3 difficulty levels, and 2,000+ stepwise checkpoints — going beyond final-answer grading to audit how models use tools.

## Hero Metric 1

### Real-world tasks across 6 domains

## Hero Metric 2

### Best model overall accuracy (Gemini 3 Pro)

## Hero Metric 3

### Human performance (vs best model 23.0% on Level 3)

## Hero Metric 4

### Stepwise checkpoints, avg 10+ person-hours/task

## Hero Abstract Heading

### Abstract

## Hero Abstract Para1

Multimodal Large Language Models (MLLMs) are rapidly evolving from passive observers into active agents. These agents increasingly solve problems through Visual Expansion — actively invoking visual tools to transform images — and Knowledge Expansion — combining visual operations with open-web search to retrieve facts beyond what is visually present. However, existing evaluations fall short in three critical ways: they lack flexibility in tool integration, test visual and search tools in isolation, and evaluate only by final-answer correctness. This means they cannot verify whether tools were actually invoked, applied correctly, or used efficiently.

## Hero Abstract Para2

To answer what agentic capability truly brings to multimodal intelligence, we introduce Agentic-MME — a process-verified benchmark containing 418 real-world tasks across 6 domains and 3 difficulty levels, featuring over 2,000 stepwise checkpoints averaging more than 10 person-hours of manual annotation per task. Each task is paired with a unified evaluation framework supporting sandboxed code execution and structured tool APIs, and a human reference trajectory annotated along two axes: the S-axis (strategy and tool execution) and the V-axis (visual evidence verification). Experimental results show that the best model, Gemini 3 Pro, achieves 56.3% overall accuracy but falls to 23.0% on Level-3 tasks, underscoring the immense difficulty of real-world multimodal agentic problem solving.

## Hero Action Button

### Read on arXiv ↗

## Introduction H2

### 1. Introduction

## Introduction Para1

Multimodal Large Language Models are rapidly evolving from passive observers to active investigators. Rather than answering from a static snapshot, modern systems increasingly solve problems by interacting: they manipulate images to surface fine-grained evidence and consult external resources to verify facts not visually present. This shift represents multimodal agentic capability, decomposable into two core dimensions: (1) Visual Expansion, which enables models to think with images by actively transforming and analyzing inputs (e.g., cropping, rotating, enhancing) to uncover latent cues; and (2) Knowledge Expansion, which enables models to go beyond parametric memory via open-ended web search to validate real-world facts and resolve ambiguity.

## Introduction Para2

While this active paradigm promises to solve complex real-world problems, current evaluation for multimodal agentic capabilities remains fragmented and insufficient. Most existing benchmarks capture specific aspects of tool use but fail in three critical dimensions — and a true multimodal agent benchmark must address all three simultaneously.

## Introduction Gap Callout Heading

### Three Critical Gaps in Prior Benchmarks

## Introduction Gap 1

Inflexible tool integration: Current evaluations decouple visual tool use from open-web search, treating them as independent modules. No unified framework allows agents to fluidly select and switch between arbitrary visual and search tools.

## Introduction Gap 2

Unexplored synergy: The interplay between Visual and Knowledge Expansion is largely untested. True multimodal agents must excel at "intertwined" tasks that cannot be solved by simple visual expansion or isolated knowledge expansion alone.

## Introduction Gap 3

No process verification: Existing evaluations focus on final-answer correctness, offering no insight into whether tools were invoked, applied correctly, or used efficiently. Unfaithful tool execution remains hidden.

## Introduction Solution

Agentic-MME addresses all three gaps with a carefully designed benchmark that includes 418 real-world tasks, a unified execution harness supporting both Code (Gen) and Atomic (Atm) tool interfaces, and over 2,000 human-annotated stepwise checkpoints that enable fine-grained process-level verification.

## Figure_001 Caption

Figure 1: Three difficulty levels of Agentic-MME. Level 1 requires a single decisive visual operation. Level 2 requires multi-step workflows combining visual manipulation and web search. Level 3 demands advanced synergistic reasoning under ambiguity.

## Benchmark H2

### 2. Agentic-MME Benchmark

## Benchmark 2.1 H3

### 2.1 Overview

## Benchmark Overview Para

Agentic-MME is designed to evaluate multimodal agentic capabilities in realistic scenarios where agents actively utilize visual tools to transform and perceive image content, and — contingent on task requirements — coordinate with open-web search to retrieve essential external knowledge. Unlike benchmarks that isolate visual operations or web search, Agentic-MME targets the deep synergy between these two capabilities. The benchmark comparison (Table 1) shows that Agentic-MME is the only benchmark that simultaneously supports heterogeneous tool interfaces, tests tool synergy, enables process verification, measures efficiency, and defines difficulty levels.

## Benchmark Table1 Heading

### Table 1: Comparison with Existing Multimodal Agentic Benchmarks

## Benchmark Table1 Caption

Comparison across key capabilities and evaluation protocol dimensions. Agentic-MME (bottom row) is the only benchmark covering all dimensions: image tools, search core, process verification, unified code+tool interface, efficiency metric, and difficulty levels.

## Benchmark 2.2 H3

### 2.2 Task Setup, Difficulty, and Metrics

## Benchmark Task Setup Para

Each instance provides one or more images and a question. Agents solve the task by actively manipulating images within a unified tool-augmented interface equipped with 13 distinct visual operations for Visual Expansion and 4 open-web retrieval tools for Knowledge Expansion. Tasks are systematically stratified into three difficulty levels based on the interaction complexity along a reasonable solution path.
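The task setup above can be sketched as a minimal data record. This is a hypothetical illustration only: the field names below are assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass

@dataclass
class AgenticMMETask:
    """Hypothetical record for one Agentic-MME task (all field names assumed)."""
    task_id: str
    images: list       # paths to one or more input images
    question: str
    domain: str        # one of the 6 domains, e.g. "Finance"
    level: int         # 1 (single operation), 2 (multi-step), 3 (advanced synergy)
    checkpoints: list  # human-annotated stepwise checkpoints

    def __post_init__(self):
        # Agentic-MME defines exactly three difficulty levels.
        if self.level not in (1, 2, 3):
            raise ValueError("level must be 1, 2, or 3")

task = AgenticMMETask(
    task_id="demo-001",
    images=["chart.png"],
    question="Which region shows the anomaly circled in the inset?",
    domain="Diagram",
    level=2,
    checkpoints=["crop the inset", "search the region name", "verify the match"],
)
print(task.level)  # 2
```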

## Level1 Card Heading

### Easy — Single Operation

## Level1 Card Desc

Requires a single visual operation (e.g., one crop or rotation). Avg 2.89 checkpoints, 1.21 tools per task.

## Level2 Card Heading

### Mid — Multi-Step Workflow

## Level2 Card Desc

Requires multi-step workflows combining image manipulation with optional web search. Avg 4.64 checkpoints, 2.42 tools per task.

## Level3 Card Heading

### Hard — Advanced Synergy

## Level3 Card Desc

Demands intertwined, multi-round interactions between visual manipulation and web search. Cannot be solved by simple sequential tool chaining. Avg 6.67 checkpoints, 4.07 tools per task.

## Benchmark Table2 Caption

Table 2: Task difficulty distribution. Level 3 tasks comprise 19.4% of the benchmark and require 4.07 tools and 6.67 checkpoints on average — significantly more complex than Level 1.

## Benchmark Metrics Heading

### Evaluation Metrics

## Acc Metric

### Final answer accuracy

## S Metric

### S-axis: strategy & tool execution quality

## V Metric

### V-axis: visual evidence verification

## Ot Metric

### Overthinking: excess tool calls relative to human trajectories
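The Overthinking metric is described as excess tool calls relative to human reference trajectories; the paper's exact formula is not reproduced in this summary, so the normalized-excess definition below is an assumption for illustration.

```python
def overthinking_score(model_calls: int, human_calls: int) -> float:
    """Hypothetical sketch of Overthinking (OT): excess tool calls made by
    the model, normalized by the human reference trajectory's call count.
    The benchmark's actual formula may differ."""
    if human_calls <= 0:
        raise ValueError("reference trajectory must contain at least one call")
    return max(0, model_calls - human_calls) / human_calls

# A model issuing 12 calls where the human needed 4 overshoots by 2x.
print(overthinking_score(12, 4))  # 2.0
# A model that is as frugal as (or more frugal than) the human scores 0.
print(overthinking_score(3, 4))   # 0.0
```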

## Benchmark 2.3 H3

### 2.3 Data Collection and Annotation

## Benchmark Data Collection Para

The data collection pipeline uses a Backward Drafting approach: rather than writing questions first, annotators start from high-resolution, visually complex images that require visual tools to perceive, then construct multi-step trajectories grounding each step in tool actions and visual ground truth. This ensures tool invocation is necessary, not optional. The pipeline proceeds through four stages: image sourcing, backward drafting, granular annotation, and quality assurance.

## Figure_002 Caption

Figure 2: Data collection and annotation pipeline. (1) High-resolution, visually complex images are sourced. (2) Backward drafting: annotators work backward from evidence to formulate questions requiring tool use. (3) Step-wise annotation of tool actions and visual ground truth. (4) Quality assurance through consensus and independent verification.

## Figure_003 Caption

Figure 3: Dataset statistics. (a) Hierarchical distribution across 6 domains (Culture 12.5%, Finance 19.5%, Diagram 31.3%, Science 12.2%, Society 18.4%, Life 14.4%). (b) Token distribution for prompts and answers. (c) Word cloud of prompt keywords. (d) Average tool calls and checkpoints per difficulty level — Level 3 requires the most tool interactions.

## Benchmark Table3 Heading

### Table 3: Dataset Key Properties

## Benchmark Table3 Caption

430 images, 899 annotated tool calls, 6 domains / 35 sub-domains. Average image resolution: 1952×1747 px. 43.1% of tasks have small visual cues (<10% of image area). 29.4% of tasks require external web search.

## Benchmark 2.4 H3

### 2.4 Quality Control and Assurance

## Benchmark Quality Para

Each annotated task undergoes multiple rounds of independent verification. Quality control involves step-wise oracle testing — systematically probing edge cases and failure modes — followed by consensus auditing where multiple experts must agree on the ground truth trajectory. Tasks failing to meet consistency thresholds are revised or discarded. This rigorous process ensures the 2,000+ stepwise checkpoints are grounded, reproducible, and faithfully represent human-level reasoning trajectories.

## Benchmark Quality Callout

Each task averages more than 10 person-hours of manual annotation — reflecting the depth of process-level verification required to capture faithful, step-by-step reasoning trajectories.

## Benchmark 2.5 H3

### 2.5 Unified Tool Interface and Execution Harness

## Benchmark Tool Interface Para

A central design goal of Agentic-MME is to benchmark agentic capability across heterogeneous tool implementations. The unified execution harness supports two interfaces: Code mode (Gen), where models write sandboxed Python to perform visual transforms, and Atomic mode (Atm), where models interact via structured function calls following OpenAI-compatible JSON schemas. This controlled comparison tests whether tool competence generalizes across interfaces rather than being tied to one training format.
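In Atomic (Atm) mode the harness exposes tools through OpenAI-compatible JSON schemas. The `crop` definition below is a hypothetical example of what such a schema might look like (the benchmark's actual tool names and parameters are not listed in this summary).

```python
import json

# Illustrative Atm-mode tool definition in the OpenAI function-calling
# schema style. Tool name and parameters are assumptions, not the
# benchmark's actual specification.
crop_tool = {
    "type": "function",
    "function": {
        "name": "crop",
        "description": "Crop a region of an image to surface fine-grained evidence.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_id": {"type": "string"},
                "bbox": {
                    "type": "array",
                    "items": {"type": "integer"},
                    "minItems": 4,
                    "maxItems": 4,
                    "description": "[left, top, right, bottom] in pixels",
                },
            },
            "required": ["image_id", "bbox"],
        },
    },
}

# The schema serializes cleanly, as a structured-call harness would require.
print(json.dumps(crop_tool)[:40])
```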

## Benchmark Visual Expansion

### Visual Expansion (13 tools)

## Benchmark Visual Expansion Desc

Active image manipulation tools that transform images to surface hidden evidence, extract fine-grained details, or apply spatial transformations.
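In Code (Gen) mode the model writes the transformation itself. Below is a minimal pure-Python sketch of a crop over a 2D pixel grid; the real sandbox presumably operates on actual image files via an imaging library, so this is a stand-in for the kind of snippet a model might emit.

```python
def crop(pixels, left, top, right, bottom):
    """Crop a row-major 2D pixel grid to the given box (exclusive right/bottom).
    A stand-in for a model-written Gen-mode visual transform."""
    return [row[left:right] for row in pixels[top:bottom]]

# 4x4 grayscale "image"; crop the 2x2 center to zoom in on a small visual cue.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
print(crop(img, 1, 1, 3, 3))  # [[5, 6], [9, 10]]
```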

## Benchmark Knowledge Expansion

### Knowledge Expansion (4 tools)

## Benchmark Knowledge Expansion Desc

Open-web retrieval tools that extend beyond parametric memory to verify real-world facts and resolve ambiguity through external search.

## Experiments H2

### 3. Experiments

## Experiments 3.1 H3

### 3.1 Experimental Setup

## Experiments Setup Para1

We evaluate a diverse set of models on Agentic-MME, including open-source models (Thyme-rl, DeepEyes-V2, Qwen3-VL-235B, Qwen3-VL-8B-thinking, Qwen3-VL-32B-thinking) and closed-source models (the Gemini 3 family, Kimi-k2.5, GPT-5.2, Qwen3.5-plus). A human reference baseline is obtained by averaging three independent human solvers allowed to use search engines and perception tools.

## Experiments Setup Para2

Each model is evaluated under both tool interfaces: Code mode (Gen) for sandboxed Python execution and Atomic mode (Atm) for structured function calls. This controlled comparison directly tests whether tool competence generalizes across interfaces. All evaluations run on fully logged, replayable traces, with GPT-5-mini as the primary judge — validated by human experts showing consistent results across judge choices (Table 8).

## Experiments 3.2 H3

### 3.2 Main Results on Agentic-MME

## Experiments Finding1 Badge

### Finding 1

## Experiments Finding1 Text

All models fall far below human performance, with a sharp accuracy drop on Level-3. Human solvers reach 93.8% overall and remain strong on the hardest split (L3: 82.3%). The best model, Gemini 3 Pro (Atm), achieves 56.3% overall — but only 33.3% on Level-3. Without tools, Gemini 3 Pro drops to 7.5% on L3; tool access provides a 4.4× improvement to 33.3%, yet the gap to human performance (82.3%) remains vast.

## Experiments Finding2 Badge

### Finding 2

## Experiments Finding2 Text

Open-source models lag behind closed-source models, primarily in search and planning. The gap is most visible on Level-3: Qwen3-VL-235B drops to 10.1% and Thyme-rl collapses to 2.5%. The S-axis reveals the mechanism: current open-source models can invoke tools but have not yet acquired the retrieval and planning sophistication needed to chain multi-step workflows reliably.

## Experiments Finding3 Badge

### Finding 3

## Experiments Finding3 Text

Atomic (Atm) mode generally improves accuracy over Code (Gen) mode across models. This suggests that structured function-call interfaces reduce implementation errors and provide clearer boundaries for tool use, enabling more reliable stepwise execution.

## Experiments Table4 Heading

### Table 4: Main Results on Agentic-MME

## Experiments Table4 Caption

Results for all evaluated models in Gen and Atm modes across Overall, Level 1 (L1), Level 2 (L2), and Level 3 (L3). Metrics: Acc = accuracy, S = S-axis score, V = V-axis validity, V_IT / V_FT = intent / fidelity tracking. Humans achieve 93.8% overall vs. the best model's 56.3%.

## Experiments 3.3 H3

### 3.3 Further Analysis

## Experiments Further Analysis Para

We conduct two analyses to understand the sources of performance gaps: (1) an ablation study isolating the contribution of each tool category, and (2) an upper-bound analysis providing visual cues and stepwise guidance to quantify the potential improvement achievable with better tool execution.
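The ablation in Table 5 amounts to toggling tool categories on and off. The sketch below shows how those settings map onto the exposed tool set; the tool identifiers are placeholders, since the benchmark's 13 visual and 4 search tools are not individually named in this summary.

```python
# Placeholder identifiers for the 13 Visual Expansion and
# 4 Knowledge Expansion tools described in Section 2.2.
VISUAL = [f"visual_op_{i}" for i in range(13)]
SEARCH = [f"search_op_{i}" for i in range(4)]

# Table 5's four ablation settings as (visual_enabled, search_enabled) toggles.
SETTINGS = {
    "Perception-only": (False, False),
    "Image-only": (True, False),
    "Search-only": (False, True),
    "Full": (True, True),
}

def toolset(use_visual: bool, use_search: bool) -> list:
    """Tools exposed to the agent under one ablation setting."""
    return (VISUAL if use_visual else []) + (SEARCH if use_search else [])

for name, (v, s) in SETTINGS.items():
    print(f"{name}: {len(toolset(v, s))} tools")  # 0, 13, 4, 17
```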

## Experiments Table5 Heading

### Table 5: Tool Ablation Study

## Experiments Table5 Caption

Ablation results for Gemini 3 Flash and Qwen3 VL-235B. Settings: Perception-only (no tools), Image-only (visual tools only), Search-only (web search only), Full (both). Full integration achieves the best performance — visual and search tools are complementary, not redundant.

## Experiments Table6 Heading

### Table 6: Upper-Bound Analysis

## Experiments Table6 Caption

Performance when providing visual cues (+Visual Cues) and stepwise guidance (+Stepwise Guidance). Stepwise guidance boosts Gemini 3 Flash from 52.24% to 76.21% — showing the ceiling is achievable with better planning. The gap between current autonomous performance and guided performance represents the frontier for agentic reasoning improvement.

## Experiments Insight Callout

Adding stepwise guidance boosts performance from 52.24% to 76.21% — a 24-point jump that shows current models fail primarily on planning and execution reliability, not fundamental perceptual capability.

## Experiments 3.4 H3

### 3.4 Fine-Grained Error Analysis

## Experiments Error Analysis Para

To understand how models fail, we conduct fine-grained error analysis across all three difficulty levels. The heatmap (Figure 4) shows error category distribution for L1, L2, L3, and Overall, revealing that error patterns change significantly with difficulty. L3 tasks show much higher rates of multi-hop reasoning failures, search integration errors, and visual ambiguity mismanagement — the core challenges unique to advanced synergistic workflows.

## Figure_004 Caption

Figure 4: Fine-grained error analysis heatmaps for L1, L2, L3, and Overall. Each row is a model; each column is an error category. Darker cells indicate higher error rates in that category. Error distribution shifts substantially from L1 (simpler single-step errors) to L3 (complex multi-hop and synergy failures).

## Experiments Table7 Heading

### Table 7: Tool Call Efficiency (Calls & Overthinking)

## Experiments Table7 Caption

Average tool calls and Overthinking (OT) score per model in Gen and Atm modes. GPT-5-mini makes the most tool calls (Gen: 12.13, Atm: 7.22). High OT indicates redundant tool usage relative to human reference trajectories.

## Experiments Table8 Heading

### Table 8: Judge Consistency Validation

## Experiments Table8 Caption

Evaluation results using different judges (GPT-5-mini, Gemini-2.5-Flash, GPT-4o-mini, Human Expert). All judges produce identical Acc (56.28), confirming evaluation stability regardless of judge choice.

## Related H2

### 4. Related Work

## Related Card1 Heading

### Tool-Augmented Visual Reasoning

## Related Card1 Desc

Traditional benchmarks target static multimodal inputs. Recent work explores active multi-tool execution and visual manipulation, but typically treats open-web retrieval as peripheral — Google Search constitutes less than 7% of tool calls in prior benchmarks — failing to assess synergy between Visual and Knowledge Expansion.

## Related Card2 Heading

### Multimodal Search & Information Seeking

## Related Card2 Desc

Complementary work focuses on open-world information seeking and multimodal web browsing. However, relying solely on final-answer correctness — as shown by CodeV — can mask unfaithful tool execution. Intermediate visual artifacts often remain unverified, despite growing consensus on the need for strict, step-wise process verification.

## Related Card3 Heading

### Process-Level Evaluation

## Related Card3 Desc

Recent multimodal deep-research frameworks advance long-form report synthesis, but their primary objective is knowledge retrieval with limited visual grounding rigor. Agentic-MME uniquely combines fine-grained process verification with tool efficiency measurement, establishing a diagnostic framework for the next generation of multimodal agents.

## Conclusion H2

### 5. Conclusion

## Conclusion Para1

We introduce Agentic-MME, a process-verified benchmark designed to systematically evaluate the deep synergy between active visual manipulation (Visual Expansion) and open-web retrieval (Knowledge Expansion) in multimodal agents. Moving beyond opaque final-answer grading, we contribute a unified execution harness supporting heterogeneous tool interfaces, grounded in over 2,000 human-annotated stepwise checkpoints. This dual-axis framework enables granular auditing of intermediate tool intent, visual artifact faithfulness, and execution efficiency.

## Conclusion Para2

Our evaluation exposes a critical gap between frontier models and human performance, particularly in complex workflows. While current models can execute simple sequential tool chaining, they struggle severely with advanced synergistic tasks such as resolving visual ambiguity through fuzzy search and conducting iterative hypothesis verification across modalities. By pinpointing these bottlenecks — unfaithful tool execution and redundant "overthinking" loops — Agentic-MME provides a rigorous, diagnostic roadmap for developing robust, long-horizon multimodal agents.

## Conclusion Takeaway Heading

### Key Takeaway for Practitioners

## Conclusion Takeaway Text

If you're building multimodal AI systems, Agentic-MME reveals that the bottleneck is not perception — it's planning under ambiguity and faithful multi-tool orchestration. Models that score well on static benchmarks can still collapse on Level-3 tasks requiring iterative visual search and cross-modal verification. Agentic-MME is the diagnostic tool to identify and fix these gaps.

## References Accordion Summary

### References (39)
