With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers, backed by 3,300 human-hours and up to 5 rounds of quality assurance. Extensive experiments reveal a substantial gap between the current best model Gemini-3-Pro (49.4) and human experts (90.7).
Recent advancements in video-based multimodal large language models (video MLLMs) have led to remarkable progress across a variety of understanding and reasoning tasks. Despite this progress, existing evaluations often lack a comprehensive hierarchy: they emphasize performance on task-specific benchmarks or isolated topics, which makes holistic assessment difficult.
Furthermore, previous work mainly focuses on per-question accuracy, overlooking the need for consistent and trustworthy video comprehension. A model that answers many individual questions correctly may still fail to demonstrate genuine understanding when related questions are grouped and evaluated for consistency. These limitations hinder thorough assessment of frontier video MLLMs.
Overall score on Video-MME-v2. The best AI model (Gemini-3-Pro) achieves only 49.4% versus human experts at 90.7%, revealing a fundamental gap that existing benchmarks fail to expose clearly.
Video-MME-v2 categorizes core video understanding skills into three progressive levels of increasing complexity: multi-point visual information aggregation (Level 1), temporal dynamics modeling (Level 2), and complex multimodal reasoning (Level 3). Unlike flat benchmarks that treat all skills as equal, this hierarchy reveals bottlenecks where lower-level failures propagate upward to limit higher-level reasoning.
Traditional per-question accuracy lets models "get lucky" on individual questions without genuine understanding. Video-MME-v2 introduces a group-based evaluation where a set of 4 related questions (Q1–Q4) about the same video content must all be answered correctly for the group to count as correct. This penalizes fragmented or guess-based correctness.
Two types of groups are defined: Consistency Groups test whether a model gives coherent answers across multiple perspectives of the same event; Coherence Groups test multi-step reasoning chains where each answer should logically follow from prior answers.
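The difference between per-question and group-based scoring can be sketched in a few lines. This is a minimal illustration of the all-or-nothing rule described above, not the benchmark's official scorer; the record fields `group_id` and `correct` and the sample data are hypothetical, and the actual non-linear scoring may differ in detail.

```python
from collections import defaultdict


def per_question_accuracy(results):
    """Fraction of individual questions answered correctly."""
    return sum(r["correct"] for r in results) / len(results)


def group_accuracy(results):
    """Fraction of groups in which every question was answered correctly."""
    groups = defaultdict(list)
    for r in results:
        groups[r["group_id"]].append(r["correct"])
    return sum(all(flags) for flags in groups.values()) / len(groups)


# Hypothetical results: two groups of four questions each.
# Group g1 is fully correct; group g2 has a single wrong answer.
results = [
    {"group_id": "g1", "correct": True},
    {"group_id": "g1", "correct": True},
    {"group_id": "g1", "correct": True},
    {"group_id": "g1", "correct": True},
    {"group_id": "g2", "correct": True},
    {"group_id": "g2", "correct": True},
    {"group_id": "g2", "correct": True},
    {"group_id": "g2", "correct": False},
]

print(per_question_accuracy(results))  # 0.875
print(group_accuracy(results))         # 0.5
```

In this toy case a single wrong answer leaves per-question accuracy at 87.5% but halves the group score, which is the penalty on fragmented correctness that the group design is meant to impose.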
Think of it like a combination lock: getting each digit right individually isn't enough, because they all need to be correct simultaneously. Traditional benchmarks check each digit separately and award partial credit. Video-MME-v2's group scoring instead requires every related question to be right together, which prevents models from gaming the benchmark through lucky guessing or partial understanding.
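A rough calculation shows why lucky guessing is penalized so heavily. The sketch below assumes that each question in a group is answered correctly independently with the same probability, which is a simplification, and the 25% guessing rate assumes four answer options per question, which the text above does not state.

```python
def group_pass_probability(p_question, group_size=4):
    """Probability that every question in a group is correct, assuming each
    one is answered correctly independently with probability p_question."""
    return p_question ** group_size


# Assumed random-guess rate of 0.25 (four answer options; an assumption).
print(group_pass_probability(0.25))           # 0.00390625
# A model that is right 80% of the time on individual questions.
print(round(group_pass_probability(0.8), 4))  # 0.4096
```

Under these assumptions, pure guessing passes a group well under 1% of the time, and even 80% per-question accuracy yields a group score of roughly 41%.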
Data quality is ensured through a controlled annotation pipeline: 12 annotators created the questions and ground truth answers, while 50 independent reviewers verified the content across up to 5 rounds of quality assurance. This process ensures Video-MME-v2 contains only unambiguous, high-quality benchmark items that genuinely test video understanding rather than surface pattern matching.
Video-MME-v2 was used to evaluate 14+ frontier models spanning closed-source APIs (Gemini-3-Pro, GPT-5) and open-source models (Qwen, LLaVA, doubao-seed). All models were evaluated both with subtitles and without subtitles to measure visual understanding capability separately from text-dependent reasoning.
| Model | Frames | Overall (w. sub) | Overall (wo sub) | Level 1 (w. sub) | Level 2 (w. sub) | Level 3 (w. sub) |
|---|---|---|---|---|---|---|
| Human Expert | — | 90.7 | — | 94.8 | 91.1 | 87.9 |
| Gemini-3-Pro | 1fps | 49.4 | 38.2 | 64.0 | 50.0 | 40.6 |
| GPT-5 | 1fps | 43.3 | 35.2 | 54.4 | 47.0 | 34.1 |
| doubao-seed 2.0 pro | 1fps | 42.5 | 32.9 | 58.3 | 44.8 | 31.7 |
| llava-v2-onevision | 1fps | 38.6 | 29.9 | 52.6 | 43.1 | 27.4 |
| qwen2.5-70b-instruct | 50 | 37.0 | 26.4 | 44.5 | 39.1 | 31.1 |
Partial results shown; the human expert baseline is included for reference. All scores are group-based accuracy (%): not how often a model answered individual questions correctly, but how often it answered all 4 related questions in a group correctly. w. sub = with subtitles (text cues available); wo sub = without subtitles (visual understanding only).
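The gaps implied by the table can be made explicit with some simple arithmetic. The sketch below copies the scores from the rows above and derives three illustrative quantities; the names `subtitle_gain`, `human_gap`, and `level_drop` are labels introduced here and are not metrics defined by the benchmark.

```python
# Scores copied from the table above (group-based accuracy, %).
# Tuple order: overall with subtitles, overall without subtitles,
# Level 1, Level 2, Level 3 (levels are with subtitles).
scores = {
    "Gemini-3-Pro": (49.4, 38.2, 64.0, 50.0, 40.6),
    "GPT-5": (43.3, 35.2, 54.4, 47.0, 34.1),
    "doubao-seed 2.0 pro": (42.5, 32.9, 58.3, 44.8, 31.7),
    "llava-v2-onevision": (38.6, 29.9, 52.6, 43.1, 27.4),
    "qwen2.5-70b-instruct": (37.0, 26.4, 44.5, 39.1, 31.1),
}
HUMAN_OVERALL = 90.7  # human expert overall score, with subtitles


def summarize(table):
    """Derive per-model gaps, rounded to one decimal place."""
    summary = {}
    for model, (w_sub, wo_sub, level1, _level2, level3) in table.items():
        summary[model] = {
            # Points gained when subtitles are available.
            "subtitle_gain": round(w_sub - wo_sub, 1),
            # Distance to the human expert baseline.
            "human_gap": round(HUMAN_OVERALL - w_sub, 1),
            # Score lost between Level 1 and Level 3.
            "level_drop": round(level1 - level3, 1),
        }
    return summary


summary = summarize(scores)
for model, row in summary.items():
    print(model, row)
```

For Gemini-3-Pro this gives a subtitle gain of 11.2 points, a human gap of 41.3 points, and a Level 1 to Level 3 drop of 23.4 points, compared with a 6.9-point drop (94.8 to 87.9) for human experts.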
Video-MME-v2's findings highlight the multimodal model community's path forward: closing the gap at Level 1 (visual aggregation) is the critical first step. Without robust multi-frame perception, temporal modeling and complex reasoning remain fundamentally constrained. Future work should prioritize architectural advances in cross-frame attention and temporal grounding.
Video-MME-v2 establishes a demanding new testbed for next-generation video MLLMs. By combining a progressive tri-level evaluation hierarchy with group-based non-linear scoring, it exposes the limitations of current video understanding systems in a way that per-question accuracy cannot. The substantial human–AI gap (90.7 vs. 49.4) and the clear hierarchical bottleneck effect provide actionable insights for the research community. By exposing these limitations, Video-MME-v2 aims to drive the development of video MLLMs that are not just capable on leaderboards but genuinely robust and faithful in understanding real-world video content.
Video-MME-v2's findings suggest the research community should prioritize robust multi-frame perception first, since it underpins both temporal modeling and complex reasoning.
The benchmark is designed to remain relevant as models improve: its group-based evaluation makes it harder to "saturate" than question-level benchmarks.
Video-MME-v2 builds upon and extends previous video benchmarks including Video-MME (the original), MVBench, EgoSchema, TemporalBench, and VideoVista. While these benchmarks evaluate individual question accuracy on curated video sets, Video-MME-v2 uniquely introduces the group-based non-linear evaluation that requires consistent understanding across multiple related questions. On the evaluation methodology side, Video-MME-v2 is inspired by robustness evaluation approaches in NLP (such as adversarial test sets) and extends them to the video multimodal domain.
Video-MME-v2 covers a diverse range of real-world video content across multiple genres.
[Figure: sample frames illustrating the visual diversity and challenge level of the dataset.]