Video-MME-v2: Next-Gen Video Understanding Benchmark

Abstract

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers, backed by 3,300 human-hours and up to 5 rounds of quality assurance. Extensive experiments reveal a substantial gap between the current best model Gemini-3-Pro (49.4) and human experts (90.7).

Introduction

Why Existing Video Benchmarks Fall Short

Recent advancements in video-based multimodal large language models (video MLLMs) have led to remarkable progress across a variety of understanding and reasoning tasks. Despite this progress, existing evaluations often lack a comprehensive evaluation hierarchy, emphasizing performance on task-specific benchmarks or isolated topics, which makes holistic assessment difficult.

Furthermore, previous work mainly reports per-question accuracy, overlooking the need for consistent and trustworthy video comprehension. A model that answers many individual questions correctly may still fail to demonstrate genuine understanding once related questions are grouped and checked for consistency. These limitations hinder thorough assessment of frontier video MLLMs.

The AI–Human Performance Gap

49.4

Best AI Model

90.7

Human Expert

Overall score on Video-MME-v2. The best AI model (Gemini-3-Pro) achieves only 49.4% versus human experts at 90.7%, revealing a fundamental gap that existing benchmarks fail to expose clearly.

Methodology

Tri-Level Evaluation Hierarchy

Video-MME-v2 categorizes core video understanding skills into three progressive levels that incrementally increase in complexity. Unlike flat benchmarks that treat all skills as equal, this hierarchy reveals bottlenecks where lower-level failures propagate upward to limit higher-level reasoning.

Level 1

Visual Information Aggregation

Perceiving and aggregating cross-frame and cross-modal information. Tests: color, action detection, temporal ordering, frame-audio alignment, physical world perception.

→

Level 2

Temporal Dynamics Modeling

Evaluating causal reasoning, state change tracking, and sequential understanding. Tests: change detection, temporal reasoning, event causality, sequence prediction.

→

Level 3

Complex Multimodal Reasoning

Advanced video comprehension mimicking real-world scenarios. Tests: social intelligence, complex plot comprehension, video-based knowledge acquisition.
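
One way to make the hierarchy concrete is to encode it as a small lookup structure. The sketch below is illustrative only: the level names and task lists are taken from the descriptions above, while the `Level` class and the `level_of` helper are hypothetical names, not part of any released benchmark tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    index: int
    name: str
    tasks: tuple[str, ...]


# Level names and task lists are copied from the text above.
# The structure itself is an assumption made for illustration.
HIERARCHY = (
    Level(1, "Visual Information Aggregation",
          ("color", "action detection", "temporal ordering",
           "frame-audio alignment", "physical world perception")),
    Level(2, "Temporal Dynamics Modeling",
          ("change detection", "temporal reasoning",
           "event causality", "sequence prediction")),
    Level(3, "Complex Multimodal Reasoning",
          ("social intelligence", "complex plot comprehension",
           "video-based knowledge acquisition")),
)


def level_of(task: str) -> Level:
    """Return the level whose task list contains `task` (case-insensitive)."""
    wanted = task.strip().lower()
    for level in HIERARCHY:
        if wanted in level.tasks:
            return level
    raise KeyError(f"unknown task: {task!r}")


print(level_of("Event Causality").name)  # Temporal Dynamics Modeling
```

Keeping the mapping in one place makes it straightforward to report scores per level once each question group is tagged with a task.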

Group-Based Non-Linear Evaluation Strategy

Traditional per-question accuracy lets models "get lucky" on individual questions without genuine understanding. Video-MME-v2 introduces a group-based evaluation where a set of 4 related questions (Q1–Q4) about the same video content must all be answered correctly for the group to count as correct. This penalizes fragmented or guess-based correctness.

Score(group) = Q1 × Q2 × Q3 × Q4, where each Qi is 1 if answered correctly and 0 otherwise, so all four must be correct for group credit
Consistency Group: Q1–Q4 probe the same video segment from multiple angles
Coherence Group: Q1–Q4 form a multi-step reasoning chain

Two types of groups are defined: Consistency Groups test whether a model gives coherent answers across multiple perspectives of the same event; Coherence Groups test multi-step reasoning chains where each answer should logically follow from prior answers.
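
The difference between per-question and group-based accuracy can be sketched in a few lines. This is a minimal illustration of the all-or-nothing rule as described above, not the benchmark's official scoring code; the function names and the toy `results` data are hypothetical.

```python
from collections import defaultdict


def per_question_accuracy(results):
    """results: list of (group_id, is_correct) pairs, one per question."""
    return sum(ok for _, ok in results) / len(results)


def group_accuracy(results):
    """Fraction of groups in which every question was answered correctly."""
    groups = defaultdict(list)
    for group_id, ok in results:
        groups[group_id].append(ok)
    return sum(all(answers) for answers in groups.values()) / len(groups)


# Hypothetical toy data: two groups of four questions each.
results = [
    ("g1", True), ("g1", True), ("g1", True), ("g1", True),   # fully correct
    ("g2", True), ("g2", True), ("g2", False), ("g2", True),  # one miss
]

print(per_question_accuracy(results))  # 0.875
print(group_accuracy(results))         # 0.5
```

A single miss in the second group leaves per-question accuracy at 87.5% but cuts group accuracy to 50%, which is the kind of fragmented correctness the group rule is meant to expose.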

Why "non-linear" evaluation matters

Think of it like a combination lock: getting each digit right individually isn't enough — they all need to be correct simultaneously. Traditional benchmarks work like checking each digit separately and rewarding partial credit. Video-MME-v2's group scoring instead requires a model to get all related questions right together — just like a lock requires every digit to align. This prevents models from gaming the benchmark through lucky guessing or partial understanding.
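
A rough calculation shows why the mapping from per-question to group accuracy is non-linear. The sketch below assumes that questions in a group are answered independently with the same per-question accuracy, which is a simplification (related questions are likely correlated), and the 0.25 baseline assumes four answer options, which is not stated above.

```python
def expected_group_accuracy(p_question: float, group_size: int = 4) -> float:
    """Chance that every question in a group is correct, assuming
    independent questions with identical per-question accuracy."""
    return p_question ** group_size


for p in (0.25, 0.5, 0.8, 0.9):
    print(f"per-question {p:.2f} -> group {expected_group_accuracy(p):.4f}")
# per-question 0.25 -> group 0.0039
# per-question 0.50 -> group 0.0625
# per-question 0.80 -> group 0.4096
# per-question 0.90 -> group 0.6561
```

Under these assumptions, random guessing almost never earns group credit, and even 80% per-question accuracy yields only about 41% group accuracy.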

Dataset

Dataset: Diverse, Fresh, and Rigorously Annotated

5

Major Categories

800

Videos

3,300+

Human Hours

Up to 5

QA Rounds

**Figure 2(a):** Category hierarchy showing 5 major domains (Sports & Competition, Knowledge & Education, Arts & Entertainment, Daily Life, Entertainment & Culture) with dozens of fine-grained subcategories.

**Figure 2(b):** Video publication date distribution (2024–2026). The majority of videos are from 2025–2026, minimizing data contamination risk from model training sets.

**Figure 2(c,d):** Video length distribution (top) and word count statistics for questions, answers, and choices (bottom), showing diverse difficulty levels.

**Figure 2(e):** Video view count distribution (log scale) covering niche to viral content, ensuring representativeness across content popularity levels.

Rigorous Annotation Pipeline

Data quality is ensured through a controlled annotation pipeline: 12 annotators created the questions and ground truth answers, while 50 independent reviewers verified the content across up to 5 rounds of quality assurance. This process ensures Video-MME-v2 contains only unambiguous, high-quality benchmark items that genuinely test video understanding rather than surface pattern matching.

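Why 3,300 hours? Spread over about 800 videos, the reported 3,300 human-hours average roughly 4 hours per video. Even if first-pass annotation (watching, writing questions, verifying answers) takes about 30 minutes per video, that accounts for only around 400 hours, so most of the effort presumably goes to review and revision. With 50 independent reviewers and up to 5 rounds of quality assurance, each item can be checked several times before inclusion. This contrasts with many benchmarks that use automated or crowd-sourced annotation with far less rigorous verification.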

Model Performance

Model Performance Overview

**Figure 1:** Comprehensive model comparison across all 5 categories. The circular chart (left) shows per-category scores; the bar chart (right) ranks models from Human Expert down to open-source models. Gemini-3-Pro leads all AI models but remains far below human performance.

Experiments & Results

Video-MME-v2 was used to evaluate 14+ frontier models spanning closed-source APIs (Gemini-3-Pro, GPT-5) and open-source models (Qwen, LLaVA, doubao-seed). All models were evaluated both with subtitles and without subtitles to measure visual understanding capability separately from text-dependent reasoning.

| Model | Frames | Overall (w. sub) | Overall (wo sub) | Level 1 (w. sub) | Level 2 (w. sub) | Level 3 (w. sub) |
| --- | --- | --- | --- | --- | --- | --- |
| Human Expert | n/a | 90.7 | n/a | 94.8 | 91.1 | 87.9 |
| Gemini-3-Pro | 1 fps | 49.4 | 38.2 | 64.0 | 50.0 | 40.6 |
| GPT-5 | 1 fps | 43.3 | 35.2 | 54.4 | 47.0 | 34.1 |
| doubao-seed 2.0 pro | 1 fps | 42.5 | 32.9 | 58.3 | 44.8 | 31.7 |
| llava-v2-onevision | 1 fps | 38.6 | 29.9 | 52.6 | 43.1 | 27.4 |
| qwen2.5-70b-instruct | 50 | 37.0 | 26.4 | 44.5 | 39.1 | 31.1 |

Partial results shown. w. sub = with subtitles, wo sub = without subtitles. All scores are group-based accuracy (%). Human Expert baseline included for reference.

How to read this table

Each row shows a model's "group accuracy" — not how often it got individual questions right, but how often it got all 4 related questions correct simultaneously. w. sub = with subtitles (text cues available), wo sub = without subtitles (visual understanding only).

The gap between w. sub and wo sub shows how much a model relies on subtitle text rather than visual understanding.
Level 1 → Level 3: Scores drop at every level for every row listed, consistent with the hierarchical bottleneck.
Even Gemini-3-Pro (49.4 overall) reaches only about 54% of the human expert score (90.7).
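
The three readings above can be reproduced directly from the partial results table. The snippet below is a small sketch that hard-codes the reported scores; the variable names are illustrative and the numbers are copied from the table, not recomputed from raw predictions.

```python
# Scores copied from the partial results table (group-based accuracy, %).
# Tuple order: overall with subtitles, overall without subtitles,
# then Level 1, Level 2, Level 3 (all with subtitles).
scores = {
    "Gemini-3-Pro":         (49.4, 38.2, 64.0, 50.0, 40.6),
    "GPT-5":                (43.3, 35.2, 54.4, 47.0, 34.1),
    "doubao-seed 2.0 pro":  (42.5, 32.9, 58.3, 44.8, 31.7),
    "llava-v2-onevision":   (38.6, 29.9, 52.6, 43.1, 27.4),
    "qwen2.5-70b-instruct": (37.0, 26.4, 44.5, 39.1, 31.1),
}
HUMAN_OVERALL = 90.7

for model, (w_sub, wo_sub, l1, l2, l3) in scores.items():
    subtitle_gap = w_sub - wo_sub        # points lost when subtitles are removed
    level_drop = l1 - l3                 # points lost from Level 1 to Level 3
    human_ratio = w_sub / HUMAN_OVERALL  # share of the human expert score
    print(f"{model:22s} gap={subtitle_gap:5.1f} "
          f"L1-L3={level_drop:5.1f} vs human={human_ratio:.0%}")
```

For Gemini-3-Pro this gives a subtitle gap of 11.2 points, a Level 1 to Level 3 drop of 23.4 points, and roughly 54% of the human expert score.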

**Figure 4:** Accuracy trends across Q1–Q4 in consistency groups (a) and coherence groups (b), plus mean vs. variance scatter (c). Shows how models that score well per-question can fail under group evaluation.

**Figure 5:** Per-level (L1/L2/L3) performance with Thinking Gain/Regression markers. Thinking-based models improve with subtitles but sometimes regress without them, showing over-reliance on textual cues.

**Figure 6:** Radar chart across all sub-tasks (Physical World, Temporal Ordering, Action & Motion, Color, Complex Plot, etc.). Human expert dominates across all axes; AI models show particular weakness in Temporal Reasoning and Complex Plot Comprehension.

Discussion

Key Findings

🧠

Hierarchical Bottleneck Effect

Errors in Level 1 (visual aggregation) propagate upward and limit Level 3 reasoning ability. Models that struggle to aggregate multi-frame information cannot compensate at the complex reasoning level.

📖

Subtitle Dependency in Thinking Models

Thinking-based reasoning models (those using extended inference-time compute) improve with subtitles but sometimes degrade in purely visual settings, exposing their dependence on textual cues rather than genuine visual understanding.

Real-world analogy: "Thinking models" that reason step by step do well when they can treat the subtitles or transcript as a reading-comprehension task, but struggle when they must rely on visual perception alone. They resemble a student who can follow a written report of a soccer match but cannot follow the match when watching it live.

🎯

Group Evaluation Reveals Hidden Failures

Per-question accuracy misses consistency and coherence failures. The group-based strategy reveals that models scoring well individually may still show fragmented understanding when tested across related question sets.

👤

Substantial Human Expert Gap

The best AI model (Gemini-3-Pro, 49.4%) achieves only about half the performance of human experts (90.7%), indicating that despite rapid progress, fundamental video understanding capabilities remain far below human level.

Video-MME-v2's findings highlight the multimodal model community's path forward: closing the gap at Level 1 (visual aggregation) is the critical first step. Without robust multi-frame perception, temporal modeling and complex reasoning remain fundamentally constrained. Future work should prioritize architectural advances in cross-frame attention and temporal grounding.

Conclusion

Video-MME-v2 establishes a demanding new testbed for next-generation video MLLMs. By combining a progressive tri-level evaluation hierarchy with group-based non-linear scoring, it exposes the limitations of current video understanding systems in a way that per-question accuracy cannot. The substantial human–AI gap (90.7 vs. 49.4) and the clear hierarchical bottleneck effect provide actionable insights for the research community. By exposing these limitations, Video-MME-v2 aims to drive the development of video MLLMs that are not just capable on leaderboards but genuinely robust and faithful in understanding real-world video content.

What comes next for video AI?

Video-MME-v2's findings suggest the research community should prioritize:

Cross-frame attention: Better architectures for aggregating information across multiple frames without losing temporal ordering
Visual grounding: Reducing dependence on text/subtitle cues by improving genuine pixel-level understanding
Temporal reasoning: New training objectives that explicitly reward causal and sequential reasoning in video

The benchmark is designed to remain relevant as models improve — its group-based evaluation makes it harder to "saturate" than question-level benchmarks.

Related Work

Video-MME-v2 builds upon and extends previous video benchmarks including Video-MME (the original), MVBench, EgoSchema, TemporalBench, and VideoVista. While these benchmarks evaluate individual question accuracy on curated video sets, Video-MME-v2 uniquely introduces the group-based non-linear evaluation that requires consistent understanding across multiple related questions. On the evaluation methodology side, Video-MME-v2 is inspired by robustness evaluation approaches in NLP (such as adversarial test sets) and extends them to the video multimodal domain.