S-Lab, Nanyang Technological University
SimpleStream (Qwen3-VL, 4 frames) surpasses all published streaming methods, outperforming HERMES by 8.5 percentage points.
Fixed sliding window keeps peak GPU memory constant regardless of stream length, while other methods grow to 18–20 GB.
SimpleStream matches HERMES as the fastest method, remaining latency-competitive without any specialized memory module.
Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SIMPLESTREAM and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SIMPLESTREAM delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception.
Given a query at time t, SIMPLESTREAM feeds the last N observed frames and the query text directly to the base VLM, nothing more. The design is minimal by construction: preserve only a short recent window and let a strong backbone operate on clear, uncompressed recent evidence.
By construction, SIMPLESTREAM omits the additional memory mechanisms used in prior streaming systems. Frames outside the sliding window are discarded, so per-query memory and computation remain bounded and constant regardless of how long the stream has been running.
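The bounded-memory behavior described above follows directly from the data structure: a fixed-capacity buffer that discards old frames as new ones arrive. The sketch below illustrates the idea under stated assumptions; `stub_vlm` is a hypothetical stand-in for the real backbone call (the paper uses Qwen3-VL), and the frame strings stand in for decoded video frames:

```python
from collections import deque

def make_stream_buffer(n=4):
    # deque(maxlen=n) evicts the oldest frame automatically, so per-query
    # memory stays bounded no matter how long the stream has been running
    return deque(maxlen=n)

def answer_query(buffer, query, vlm):
    # SimpleStream: pass only the last N frames plus the query text to the
    # backbone VLM -- no memory module, no compression of older history
    return vlm(list(buffer), query)

# --- toy usage with a stub in place of a real VLM backbone ---
def stub_vlm(frames, query):
    return f"saw {len(frames)} frames for query: {query}"

buf = make_stream_buffer(n=4)
for t in range(100):          # simulate a 100-frame stream
    buf.append(f"frame_{t}")  # ingest the next frame as it arrives

answer = answer_query(buf, "what is happening now?", stub_vlm)
```

After 100 ingested frames the buffer still holds only the last four (`frame_96` through `frame_99`), which is the whole mechanism: constant memory and computation per query, with everything outside the window discarded.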
SIMPLESTREAM is evaluated against 13 baselines on OVO-Bench and StreamingBench under a unified protocol.
Table 1: Main results on OVO-Bench and StreamingBench. SIMPLESTREAM (Qwen3-VL-8B, 4 frames) achieves 67.70% average on OVO-Bench and 80.59% on StreamingBench, exceeding all published streaming methods.
A common assumption in streaming video understanding is that more historical context should improve answers. However, our window-size ablation shows that a window of 4 recent frames is already optimal; going beyond that actually hurts real-time perception accuracy. Expanding the window to 16 frames reduces real-time accuracy from 81.4% to 77.9%.
Table 2: Model scale effects under fixed window evaluation on OVO-Bench across Qwen2.5-VL and Qwen3-VL families.
These results suggest that strong streaming VLMs are already excellent short-horizon reasoners. Injecting historical context introduces noise, compression artifacts, or attention dilution that degrades the model's ability to reason about the current scene, even when the historical content is genuinely relevant.
SIMPLESTREAM is already strong enough to exceed recently published complex-memory streaming systems on both OVO-Bench and StreamingBench while remaining latency-competitive with the lowest peak GPU memory. This challenges the prevailing assumption that stronger memory mechanisms are necessary for streaming video understanding progress.
A strong modern VLM backbone plus a short recent window is already state-of-the-art. Add memory complexity only if it clearly outperforms this simple baseline under the same evaluation protocol.
Separate recent-scene perception from long-range memory in future streaming benchmarks to correctly attribute performance gains from added complexity.