---
arxiv_id: 2604.02317
title: "A Simple Baseline for Streaming Video Understanding"
authors:
  - Yujiao Shen
  - Shulin Tian
  - Jingkang Yang
  - Ziwei Liu
difficulty: Intermediate
tags:
  - Video
  - VLM
  - Streaming
  - Benchmark
published_at: 2026-04-02
flecto_url: https://flecto.zer0ai.dev/papers/2604.02317/
lang: en
---

> A Simple Baseline for Streaming Video Understanding

**Authors**: Yujiao Shen · Shulin Tian · Jingkang Yang · Ziwei Liu

## Abstract

### Abstract

Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SIMPLESTREAM and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SIMPLESTREAM delivers consistently strong performance: with only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception.

## Results

### Benchmark Results

SIMPLESTREAM is evaluated against 13 baselines on OVO-Bench and StreamingBench under a unified protocol.

## Conclusion

### Conclusion & Implications

SIMPLESTREAM is already strong enough to exceed recently published complex-memory streaming systems on both OVO-Bench and StreamingBench, while remaining latency-competitive and using the lowest peak GPU memory. This challenges the prevailing assumption that stronger memory mechanisms are necessary for streaming video understanding progress.

## References

### References (40+)

## Head Title

### A Simple Baseline for Streaming Video Understanding | Flecto

## Head Meta

SimpleStream: a sliding-window baseline feeding only the most recent N frames to a VLM already matches or surpasses published streaming models. 67.7% on OVO-Bench, 80.59% on StreamingBench with just 4 frames.

## Hero Button

### Read on arXiv ↗

### Jump to Results

## Key Findings

### Key Findings

## Key Findings Card=1

### SOTA with Just 4 Frames

### 67.7% OVO-Bench

SimpleStream (Qwen3-VL, 4 frames) surpasses all published streaming methods including HERMES by +8.5 percentage points.

## Key Findings Card=2

### Lowest GPU Memory

### ~15.6 GB flat

Fixed sliding window keeps peak GPU memory constant regardless of stream length, while other methods grow to 18–20 GB.

## Key Findings Card=3

### Fastest Inference

### 35–38 ms TTFT

SimpleStream matches HERMES as the fastest method, remaining latency-competitive without any specialized memory module.

## Overview

### What is SimpleStream?

### A Deliberate Minimalist Design

Given a query at time t, SIMPLESTREAM feeds the last N observed frames and the query text directly to the base VLM — nothing more. The design is minimal by construction: preserve only a short recent window and let a strong backbone operate on clear, uncompressed recent evidence.

SIMPLESTREAM omits the additional memory mechanisms used in prior streaming systems. Frames outside the sliding window are discarded, so per-query memory and computation remain bounded and constant regardless of how long the stream has been running.
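The sliding window above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `vlm` is a hypothetical callable standing in for the base model (the paper's actual backbones are Qwen2.5-VL / Qwen3-VL), and frames are placeholder objects.

```python
from collections import deque

class SimpleStreamBuffer:
    """Minimal sketch of a SimpleStream-style sliding window.

    Keeps only the most recent N frames; everything older is discarded,
    so per-query memory stays bounded regardless of stream length.
    """

    def __init__(self, window_size=4):
        # deque(maxlen=N) silently drops the oldest frame on append.
        self.window = deque(maxlen=window_size)

    def observe(self, frame):
        self.window.append(frame)

    def answer(self, query, vlm):
        # Feed only the last N frames plus the query text -- nothing more.
        return vlm(frames=list(self.window), query=query)
```

After 1,000 observed frames, the buffer still holds only the last four, which is why peak memory stays flat as the stream grows.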

## Overview Figure_001

Figure 1(a): SimpleStream feeds only the most recent N frames directly to the VLM — no memory bank, no retrieval, no compression. Top: complex streaming VLMs with Context Management layers. Bottom: SIMPLESTREAM's minimal design.

## Overview Figure_002

Figure 2: Prior streaming VLMs manage long context via External Memory, Retrieval, Compression, or Latent Memory — all complexity that SIMPLESTREAM deliberately avoids.

## Results Table_001

Table 1: Main results on OVO-Bench and StreamingBench. SIMPLESTREAM (Qwen3-VL-8B, 4 frames) achieves 67.70% average on OVO-Bench and 80.59% on StreamingBench, exceeding all published streaming methods.

## Results Callout

### SimpleStream Outperforms All Published Streaming Methods

### 67.7% on OVO-Bench (4 frames) — beats HERMES (59.2%) by +8.5pp

### 80.59% on StreamingBench — the highest reported score

### Lowest peak GPU memory: ~15.6 GB flat vs. up to 20 GB for competing methods

### TTFT: 35–38 ms — on par with the fastest published method (HERMES)

## Efficiency

### Efficiency: Memory & Latency

## Efficiency Figure_003

Figure 3: Peak GPU memory vs. observed frames. SimpleStream-4f maintains a flat ~15.6 GB regardless of stream length, while competing methods grow to 18–20 GB.

## Efficiency Table_003

Table 3: Time To First Token (ms) at 16/64/256 observed frames. SimpleStream-4f: 35/33/38 ms — matching the fastest published method.

## Analysis

### Analysis: Why Does Simplicity Win?

A common assumption in streaming video understanding is that more historical context should improve answers. However, the window-size ablation shows that 4 recent frames are already optimal, and going beyond that actually hurts real-time perception accuracy: extending the window to 16 frames reduces real-time accuracy from 81.4% to 77.9%.

These results suggest that strong streaming VLMs are already excellent short-horizon reasoners. Injecting historical context introduces noise, compression artifacts, or attention dilution that degrades the model's ability to reason about the current scene — even when the historical content is genuinely relevant.

## Analysis Window

### Longer Context Is Not Always Better

## Analysis Figure_004

Figure 4: Window-size ablation. Realtime accuracy peaks at 4 frames (81.4%) then declines. Overall accuracy saturates quickly beyond 4 frames.

## Analysis Scale

### Model Scale Effects

## Analysis Figure_005

Figure 5: Model-scaling ablation on OVO-Bench. Optimal window size is backbone-dependent, not uniformly increasing. Qwen2.5-VL-72B prefers 16 frames; Qwen3-VL-8B peaks at 4.

## Analysis Table_002

Table 2: Model scale effects under fixed window evaluation on OVO-Bench across Qwen2.5-VL and Qwen3-VL families.

## Analysis Tradeoff

### Perception-Memory Trade-off

## Analysis Figure_006

Figure 6: Perception-memory trade-off across methods. Almost every external baseline falls below SimpleStream on real-time perception accuracy (AP). Methods that add history improve recall but hurt perception.

## Analysis Table_004

Table 4: Visual-RAG ablation on OVO-Bench. Retrieving historical chunks improves EPM (+7.1) and ASI (+6.1) but degrades realtime perception by 2.3 points on average.
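The retrieval step in a Visual-RAG variant can be sketched as a similarity ranking over historical chunks. This is an assumed, generic cosine-similarity retriever for illustration; the paper's actual retrieval pipeline and embedding model are not specified here.

```python
import math

def retrieve_chunks(query_emb, chunk_embs, k=2):
    """Rank historical chunk embeddings by cosine similarity to the
    query embedding and return the top-k indices; the selected chunks
    would be prepended to the recent window before querying the VLM."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    ranked = sorted(range(len(chunk_embs)),
                    key=lambda i: cos(query_emb, chunk_embs[i]),
                    reverse=True)
    return ranked[:k]
```

Even with a perfect retriever, Table 4's pattern holds: the retrieved history helps memory-oriented tasks while the extra context costs real-time perception accuracy.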

## Conclusion Card=1

### For Model Builders

A strong modern VLM backbone + short recent window is already SOTA. Add memory complexity only if it clearly outperforms this simple baseline under the same evaluation protocol.

## Conclusion Card=2

### For Benchmark Designers

Separate recent-scene perception from long-range memory in future streaming benchmarks to correctly attribute performance gains from added complexity.

## Conclusion Cta

### Read Full Paper on arXiv ↗
