KAIST | * Equal contribution
A training-free method that warps image tokens (not pixels) from a source view to a target viewpoint using depth estimation and camera pose, making it robust to depth errors that would catastrophically distort pixel-level methods.
Systematic comparison of forward and backward token warping reveals that backward warping, which defines a dense target-view grid and retrieves source tokens, achieves greater stability and semantic coherence.
A new benchmark with three subtasks (spatial reasoning, shape reasoning, and object description) evaluating MLLM viewpoint reasoning across rotation ranges of 5°–35°.
Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
A core aspect of spatial reasoning from images is understanding the scene's three-dimensional structure. Although monocular depth estimation has become highly accurate, incorporating predicted depth into MLLMs does not yield genuine 3D understanding. Even for simple tasks such as describing the same scene from a different viewpoint, MLLMs fine-tuned with explicit 3D supervision show little improvement. Similar limitations arise in models that incorporate 3D-aware features, which still struggle to reason about viewpoint transformations.
Classical research on mental imagery, from Shepard to Minsky, Pylyshyn, and Hinton, proposes that mental images rely on structural descriptions defined at the part level. From this perspective, image tokens used by Transformer architectures represent a machine-perceivable, part-level representation. It is therefore natural to extend the concept of mental imagery to these perceptual atomic units rather than to object-level abstractions.
Image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes: transformations applied to tokens generate consistent internal representations under viewpoint shifts, improving spatial reasoning. Moreover, token-level transformations are robust to the geometric noise that catastrophically degrades pixel-level warping.
We verify this with a patch perturbation experiment: gradually increasing the positional offset when fetching patches shows that MLLMs remain surprisingly stable with perturbed tokens, while pixel-level perturbation causes severe accuracy drops. This provides strong evidence that when constructing tokens from a different viewpoint using an imperfect depth map, the geometric noise introduced does not significantly undermine the MLLM's visual understanding.
A training-free approach using depth estimation and camera pose to construct viewpoint-conditioned image tokens for MLLMs.
We tested MLLM robustness by gradually increasing the positional offset when fetching patches, from 0 to 20 px. Even with perturbations approaching the patch size, MLLMs showed only marginal accuracy drops with perturbed tokens. In contrast, pixel-level perturbation at the same scale caused severe degradation. This confirms that the geometric noise introduced by imperfect depth estimation does not significantly undermine token-level visual understanding.
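The perturbation probe described above can be sketched as follows. This is a minimal illustration, not the paper's code: it rebuilds a ViT-style patch grid while fetching each patch from a randomly jittered location, mimicking the geometric noise an imperfect depth map would introduce. The function name, patch size, and offset scheme are all illustrative assumptions.

```python
import numpy as np

def perturb_token_fetch(image, offset_px, patch_size=14, seed=0):
    """Rebuild the patch grid, fetching each patch from a jittered position.

    Each patch is copied from its nominal location plus a random offset of
    up to `offset_px` pixels in each direction, so downstream tokenization
    sees geometrically perturbed patches.
    """
    rng = np.random.default_rng(seed)
    H, W, C = image.shape
    Hp, Wp = H // patch_size, W // patch_size
    out = np.zeros_like(image)
    for i in range(Hp):
        for j in range(Wp):
            # Random per-patch jitter in [-offset_px, offset_px].
            dy, dx = rng.integers(-offset_px, offset_px + 1, size=2)
            # Clamp so the fetched patch stays inside the image.
            y = int(np.clip(i * patch_size + dy, 0, H - patch_size))
            x = int(np.clip(j * patch_size + dx, 0, W - patch_size))
            out[i*patch_size:(i+1)*patch_size,
                j*patch_size:(j+1)*patch_size] = \
                image[y:y+patch_size, x:x+patch_size]
    return out
```

Sweeping `offset_px` from 0 to 20 and encoding the perturbed images reproduces the setup: token-level accuracy should degrade only marginally, while equivalent pixel-level shifts degrade it severely.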
Apply an off-the-shelf monocular depth estimator to the source image. The resulting depth map, together with the target camera pose, defines the 3D geometry for token warping.
Define a dense grid on the target view. For each target grid point, unproject it into 3D using the depth map, transform the 3D point into the source camera frame, and retrieve the corresponding source-view patch (nearest or adaptive fetching).
Feed the warped token sequence (representing the scene from the target viewpoint) into the MLLM together with the viewpoint-conditioned question. No fine-tuning required: any ViT-based MLLM works out of the box.
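The pipeline above can be sketched in a few lines. This is a hedged sketch, not the paper's implementation: it assumes a pinhole camera with shared intrinsics `K`, a relative pose `(R, t)` mapping target-frame points into the source frame, per-patch depth for the target grid, and nearest-patch fetching; `warp_tokens` and its arguments are illustrative names.

```python
import numpy as np

def warp_tokens(src_tokens, depth, K, R, t, patch_size=14):
    """Backward-warp source patch tokens to a target viewpoint.

    src_tokens : (H_p, W_p, D) patch tokens from the source view.
    depth      : (H_p, W_p) per-patch depth for the target grid.
    K          : (3, 3) camera intrinsics (shared by both views).
    R, t       : rotation (3, 3) and translation (3,), target -> source frame.
    """
    Hp, Wp, D = src_tokens.shape
    K_inv = np.linalg.inv(K)
    out = np.zeros_like(src_tokens)
    for i in range(Hp):
        for j in range(Wp):
            # Center of this target-grid patch in pixel coordinates.
            u = (j + 0.5) * patch_size
            v = (i + 0.5) * patch_size
            # Unproject the target grid point into 3D using its depth.
            p_cam = depth[i, j] * (K_inv @ np.array([u, v, 1.0]))
            # Transform into the source camera frame and reproject.
            p_src = K @ (R @ p_cam + t)
            if p_src[2] <= 0:
                continue  # behind the source camera: leave token as zeros
            us, vs = p_src[0] / p_src[2], p_src[1] / p_src[2]
            # Nearest-patch fetch: copy the closest source token.
            sj = int(round(us / patch_size - 0.5))
            si = int(round(vs / patch_size - 0.5))
            if 0 <= si < Hp and 0 <= sj < Wp:
                out[i, j] = src_tokens[si, sj]
    return out
```

With an identity pose the warp reduces to a copy, which is a useful sanity check; the adaptive-fetching variant mentioned above would replace the nearest-token lookup with a weighted blend of neighboring tokens.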
We introduce ViewBench, a benchmark specifically designed to evaluate the ability of MLLMs to reason about scenes from nearby viewpoints. It covers three complementary subtasks and three rotation ranges (5°–15°, 15°–25°, 25°–35°).
View-conditioned spatial reasoning: binary left/right questions about object spatial relationships from a target viewpoint. Tests whether MLLMs can correctly orient their spatial reasoning to the rotated perspective.
Shape identification from the target view: the model must identify the correct shape of an object as it appears from the rotated viewpoint, testing geometric perspective understanding.
Target-view object description: open-ended description of how an object appears from the target viewpoint. Evaluated by an LLM judge on a -10 to +10 similarity scale, rewarding descriptions that match the actual target-view appearance.
Token Warping (Backward-Adaptive) consistently outperforms all baselines across all three ViewBench subtasks and all rotation ranges (5°–35°), surpassing pixel-wise warping, spatially fine-tuned MLLMs, novel view synthesis, and generative warping methods, all without any training.
This paper introduces token warping as a training-free approach to enable viewpoint-conditioned visual reasoning in MLLMs. Our key findings include:
Token-level mental imagery, rather than pixel manipulation or explicit 3D reconstruction, is a promising and practical path toward robust spatial reasoning in multimodal AI systems.
@article{lee2026tokenwarping,
title={Token Warping Helps MLLMs Look from Nearby Viewpoints},
author={Lee, Phillip Y. and Park, Chanho and Park, Mingue
and Yoo, Seungwoo and Koo, Juil and Sung, Minhyuk},
journal={arXiv preprint arXiv:2604.02870},
year={2026}
}