A Simple Baseline for Streaming Video Understanding

Key Findings

🎯

67.7% OVO-Bench

SOTA with Just 4 Frames

SimpleStream (Qwen3-VL, 4 frames) surpasses all published streaming methods including HERMES by +8.5 percentage points.

💾

~15.6 GB flat

Lowest GPU Memory

Fixed sliding window keeps peak GPU memory constant regardless of stream length, while other methods grow to 18–20 GB.

⚡

35–38 ms TTFT

Fastest Inference

SimpleStream matches HERMES as the fastest method, remaining latency-competitive without any specialized memory module.

Abstract

Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SIMPLESTREAM and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SIMPLESTREAM delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception.

What is SimpleStream?

SimpleStream framework overview — **Figure 1(a):** SimpleStream feeds only the most recent N frames directly to the VLM — no memory bank, no retrieval, no compression. Top: complex streaming VLMs with Context Management layers. Bottom: SIMPLESTREAM's minimal design.

A Deliberate Minimalist Design

Given a query at time t, SIMPLESTREAM feeds the last N observed frames and the query text directly to the base VLM — nothing more. The design is minimal by construction: preserve only a short recent window and let a strong backbone operate on clear, uncompressed recent evidence.

By construction, SIMPLESTREAM omits the additional memory mechanisms used in prior streaming systems. Frames outside the sliding window are discarded, so per-query memory and computation remain bounded and constant regardless of how long the stream has been running.

Unlike methods that maintain a growing memory database of past observations, SimpleStream simply forgets anything older than the last N frames. This is the key insight: modern VLMs are already strong enough that a small window of recent, uncompressed frames beats a larger window of compressed or retrieved context.

Taxonomy of streaming VLM approaches — **Figure 2:** Prior streaming VLMs manage long context via External Memory, Retrieval, Compression, or Latent Memory — all complexity that SIMPLESTREAM deliberately avoids.

Benchmark Results

SIMPLESTREAM is evaluated against 13 baselines on OVO-Bench and StreamingBench under a unified protocol.

Main results table on OVO-Bench and StreamingBench

Table 1: Main results on OVO-Bench and StreamingBench. SIMPLESTREAM (Qwen3-VL-8B, 4 frames) achieves 67.70% average on OVO-Bench and 80.59% on StreamingBench, exceeding all published streaming methods.

SimpleStream Outperforms All Published Streaming Methods

67.7% on OVO-Bench (4 frames) — beats HERMES (59.2%) by +8.5pp
80.59% on StreamingBench — the highest reported score
Lowest peak GPU memory: ~15.6 GB flat vs. up to 20 GB for competing methods
TTFT: 35–38 ms — on par with the fastest published method (HERMES)

The gap between SimpleStream and complex streaming methods is striking: a 7B model with just 4 recent frames beats 7B models with sophisticated memory banks. This suggests the bottleneck is not context length but the quality of that context — clean, recent frames trump compressed, noisy history.

Efficiency: Memory & Latency

Peak GPU memory vs observed frames — **Figure 3:** Peak GPU memory vs. observed frames. SimpleStream-4f maintains a flat ~15.6 GB regardless of stream length, while competing methods grow to 18–20 GB.

TTFT latency comparison — **Table 3:** Time To First Token (ms) at 16/64/256 observed frames. SimpleStream-4f: **35/33/38 ms** — matching the fastest published method.

Analysis: Why Does Simplicity Win?

Longer Context Is Not Always Better

Window size ablation chart — **Figure 4:** Window-size ablation. Realtime accuracy peaks at 4 frames (81.4%) then declines. Overall accuracy saturates quickly beyond 4 frames.

A common assumption in streaming video understanding is that more historical context should improve answers. However, our window-size ablation shows that 4 recent frames is already optimal — and going beyond that actually hurts realtime perception accuracy. Adding the 16-frame window reduces realtime accuracy from 81.4% to 77.9%.

Why does realtime accuracy drop with more frames? Each additional frame shifts the model's attention distribution — the VLM's limited attention budget means recent frames compete with older ones. When you need to answer questions about the current scene, older frames are noise, not signal.

Model Scale Effects

Model scaling ablation — **Figure 5:** Model-scaling ablation on OVO-Bench. Optimal window size is backbone-dependent, not uniformly increasing. Qwen2.5-VL-72B prefers 16 frames; Qwen3-VL-8B peaks at 4.

Table 2: Model scale effects under fixed window evaluation on OVO-Bench across Qwen2.5-VL and Qwen3-VL families.

Perception-Memory Trade-off

Visual-RAG ablation results — **Table 4:** Visual-RAG ablation on OVO-Bench. Retrieving historical chunks improves EPM (+7.1) and ASI (+6.1) but degrades realtime perception by −2.3 on average.

Key Insight: Memory vs. Perception Trade-off — Adding historical context improves recall tasks (EPM: +7.1pp, ASI: +6.1pp) but consistently degrades real-time perception tasks (OJR: −9.2pp, ACR: −71.6pp). This is not a coincidence but a fundamental tension: the attention mechanism must choose between processing the current frame and processing retrieved history.

These results suggest that strong streaming VLMs are already excellent short-horizon reasoners. Injecting historical context introduces noise, compression artifacts, or attention dilution that degrades the model's ability to reason about the current scene — even when the historical content is genuinely relevant.

Conclusion & Implications

SIMPLESTREAM is already strong enough to exceed recently published complex-memory streaming systems on both OVO-Bench and StreamingBench while remaining latency-competitive with the lowest peak GPU memory. This challenges the prevailing assumption that stronger memory mechanisms are necessary for streaming video understanding progress.

For Model Builders

A strong modern VLM backbone + short recent window is already SOTA. Add memory complexity only if it clearly outperforms this simple baseline under the same evaluation protocol.

For Benchmark Designers

Separate recent-scene perception from long-range memory in future streaming benchmarks to correctly attribute performance gains from added complexity.

Read Full Paper on arXiv ↗

References (40+)

Qian et al. (2024, 2025). Streaming video understanding with memory mechanisms. arXiv.
Li et al. (2025b). OVO-Bench: How Far is Your Video-Language Model from a Proficient Omnidirectional Video Observer? arXiv:2501.05510.
Lin et al. (2024). StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding. arXiv:2411.03628.
Zhang et al. (2026). HERMES: A Unified Self-Driving Perception, Prediction and Planning Model. arXiv:2601.08510.
Zeng et al. (2025). StreamForest: Towards Streaming Video Understanding with Tree-structured Memory. arXiv:2503.12254.
Bai et al. (2025a). Qwen3-VL Technical Report. arXiv:2504.10479.
Bai et al. (2025b). Qwen2.5-VL Technical Report. arXiv:2502.13923.
Yao et al. (2025). TimeChat-Online: Time-sensitive Multimodal Large Language Models for Streaming Video Comprehension. arXiv:2504.06958.
Di et al. (2025). Dispider: Multi-Scale Temporal Perception for Streaming Video LLMs. arXiv:2501.03218.
Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
Radford et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML (CLIP).
Xia et al. (2025). Streamo-7B: Streaming Video LLM with Latent Memory. arXiv.