โ† Flecto๐Ÿค– Agent Ready
Token Warping ยท MLLMs ยท Viewpoint Reasoning ยท ViewBench

Token Warping Helps MLLMs Look from Nearby Viewpoints

Phillip Y. Lee*, Chanho Park*, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung

KAIST ย ย |ย ย  * Equal contribution

Can warping tokens, rather than pixels, help MLLMs understand how a scene looks from a nearby viewpoint? We show that backward token warping consistently outperforms all baselines โ€” without any fine-tuning.

Key Contributions

๐Ÿงฉ

Token Warping

A training-free method that warps image tokens (not pixels) from a source view to a target viewpoint using depth estimation and camera pose โ€” robust to depth errors that would catastrophically distort pixel-level methods.

๐Ÿ”„

Backward Warping is Best

Systematic comparison of forward and backward token warping reveals that backward warping โ€” defining a dense target-view grid and retrieving source tokens โ€” achieves greater stability and semantic coherence.

๐Ÿ†

ViewBench Benchmark

A new benchmark with three subtasks (spatial reasoning, shape reasoning, and object description) evaluating MLLM viewpoint reasoning across rotation ranges of 5ยฐโ€“35ยฐ.

Abstract

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, and the natural remedy of pixel-wise warping is highly sensitive to small depth errors, often introducing geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
In plain terms: the paper asks whether we can give an MLLM a "mental rotation" ability โ€” making it imagine a scene from a slightly different camera angle โ€” by rearranging the image patches it sees, rather than processing the original image from that angle. The answer is yes, and the key insight is that this rearrangement is forgiving: if you fetch a slightly wrong patch, the MLLM still understands the scene well enough.
Viewpoint Change via Token Warping teaser
Figure 1. Viewpoint Change via Token Warping. Given a source image (View A), backward token warping synthesizes tokens representing the scene from a rotated viewpoint (View B), enabling the MLLM to correctly reason about spatial relationships without pixel synthesis.

Introduction

A core aspect of spatial reasoning from images is understanding the scene's three-dimensional structure. Although monocular depth estimation has reached remarkable accuracy, incorporating predicted depth into MLLMs does not yield genuine 3D understanding. Even for simple tasks such as describing the same scene from a different viewpoint, MLLMs fine-tuned with explicit 3D supervision show little improvement. Similar limitations arise in models that incorporate 3D-aware features, which still struggle to reason about viewpoint transformations.

Classical research on mental imagery โ€” from Shepard to Minsky, Pylyshyn, and Hinton โ€” proposes that mental images rely on structural descriptions defined at the part level. From this perspective, image tokens used by Transformer architectures represent a machine-perceivable, part-level representation. It is therefore natural to extend the concept of mental imagery to these perceptual atomic units rather than to object-level abstractions.

Core Hypothesis

Image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes: transformations applied to tokens generate consistent internal representations under viewpoint shifts, improving spatial reasoning โ€” and token-level transformations are robust to the geometric noise that catastrophically degrades pixel-level warping.

We verify this with a patch perturbation experiment: gradually increasing the positional offset when fetching patches shows that MLLMs remain surprisingly stable with perturbed tokens, while pixel-level perturbation causes severe accuracy drops. This provides strong evidence that when constructing tokens from a different viewpoint using an imperfect depth map, the geometric noise introduced does not significantly undermine the MLLM's visual understanding.

ViT image tokenization
Figure 2. ViT image tokenization: (A) source image โ†’ (B) patch grid โ†’ (C) image tokens fed to the MLLM. Token warping operates on this representation.
Why tokens and not pixels? When you warp a pixel image with an imperfect depth map, each pixel moves to the wrong place and the result looks broken. But when you warp tokens, a "wrong" position still lands on a nearby part of the scene โ€” the MLLM reads it as a slightly misaligned patch, not as visual garbage. It's like navigating by landmarks instead of exact GPS coordinates: being a few meters off still gets you to the right place.

Method: Token Warping for Viewpoint Change

A training-free approach using depth estimation and camera pose to construct viewpoint-conditioned image tokens for MLLMs.

Forward vs Backward warping comparison
Figure 3. Forward warping introduces holes and artifacts in the warped view. Backward warping produces a more complete and geometrically coherent result.
Pixel-wise vs Token warping comparison
Figure 4. (A) Pixel-wise warping amplifies small depth errors into severe visual distortions. (B) Token warping maps target grid positions back to the source, where small errors merely retrieve a nearby token โ€” no visible distortion.

Token Robustness to Spatial Noise

Token perturbation robustness chart
Figure 5. Token perturbation robustness. MLLMs maintain high recognition accuracy (green) even with large positional perturbations of patch centers, while pixel-level perturbation (yellow) degrades severely.

We tested MLLM robustness by gradually increasing the positional offset when fetching patches โ€” from 0 to 20px. Even with perturbations approaching the patch size, MLLMs showed only marginal accuracy drops with perturbed tokens. In contrast, pixel-level perturbation at the same scale caused severe degradation. This confirms that the geometric noise introduced by imperfect depth estimation does not significantly undermine token-level visual understanding.
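The perturbation setup can be sketched as follows. This is a minimal NumPy illustration, not the paper's code; the 14-px patch size, the 224ร—224 resolution, and the uniform offset distribution are assumptions chosen to match a typical ViT configuration.

```python
import numpy as np

def extract_patches(image, centers, patch=14):
    """Extract square patches of side `patch` around the given (y, x) centers."""
    H, W, _ = image.shape
    half = patch // 2
    out = []
    for cy, cx in centers:
        # Clamp so each patch stays fully inside the image.
        y0 = int(np.clip(cy - half, 0, H - patch))
        x0 = int(np.clip(cx - half, 0, W - patch))
        out.append(image[y0:y0 + patch, x0:x0 + patch])
    return np.stack(out)

def perturb_centers(centers, max_offset, rng):
    """Shift every patch center by a uniform offset in [-max_offset, max_offset]."""
    noise = rng.uniform(-max_offset, max_offset, size=centers.shape)
    return centers + noise

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
# Regular 16x16 grid of patch centers for a 224x224 image with 14-px patches.
ys, xs = np.meshgrid(np.arange(7, 224, 14), np.arange(7, 224, 14), indexing="ij")
centers = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)

clean = extract_patches(image, centers)
noisy = extract_patches(image, perturb_centers(centers, max_offset=20, rng=rng))
print(clean.shape, noisy.shape)  # both (256, 14, 14, 3)
```

The experiment then compares MLLM accuracy when fed the `noisy` patches (tokens from shifted centers) versus an image degraded by pixel-level noise of the same magnitude.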

Think of each image patch as a puzzle piece. If you slide a puzzle piece a few millimeters, you can still tell what it depicts. But if you smear or pixelate it (pixel-level distortion), the piece becomes unrecognizable. MLLMs process tokens like puzzle pieces โ€” small position errors don't destroy meaning.

Nearest vs. Adaptive Fetching

Nearest Fetching vs Adaptive Fetching
Figure 7. (A) Nearest Fetching: assigns the pre-computed token nearest to the mapped target location. (B) Adaptive Fetching: re-patchifies the source image at the exact mapped center, capturing finer-grained visual detail.
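The two fetching strategies in Figure 7 can be sketched as below. This is an illustrative NumPy snippet under assumed conventions (14-px patches, a shared 16ร—16 token grid); the function names are ours, not the paper's.

```python
import numpy as np

PATCH = 14  # assumed ViT patch size

def nearest_fetch(tokens, mapped_yx, grid_hw):
    """Nearest Fetching: return the pre-computed token whose grid cell
    contains the mapped target location."""
    gy = int(np.clip(round(mapped_yx[0] / PATCH - 0.5), 0, grid_hw[0] - 1))
    gx = int(np.clip(round(mapped_yx[1] / PATCH - 0.5), 0, grid_hw[1] - 1))
    return tokens[gy * grid_hw[1] + gx]

def adaptive_fetch(image, mapped_yx):
    """Adaptive Fetching: re-patchify by cropping a fresh patch centered
    at the exact mapped location (clamped to the image bounds)."""
    H, W, _ = image.shape
    y0 = int(np.clip(mapped_yx[0] - PATCH // 2, 0, H - PATCH))
    x0 = int(np.clip(mapped_yx[1] - PATCH // 2, 0, W - PATCH))
    return image[y0:y0 + PATCH, x0:x0 + PATCH]

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
tokens = np.arange(256)  # stand-in for 16x16 pre-computed patch tokens
tok = nearest_fetch(tokens, (100.0, 37.0), grid_hw=(16, 16))
patch = adaptive_fetch(image, (100.0, 37.0))
```

Nearest fetching reuses tokens computed once for the source grid, while adaptive fetching pays an extra crop per target position in exchange for finer-grained visual detail.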
1

Depth Estimation

Apply an off-the-shelf monocular depth estimator to the source image. The resulting depth map, together with the target camera pose, defines the 3D geometry for token warping.

2

Backward Token Warping

Define a dense grid on the target view. For each target grid point, unproject it into 3D using the depth map, transform the 3D point into the source camera frame, project it onto the source image, and retrieve the corresponding source-view patch (nearest or adaptive fetching).

3

MLLM Inference

Feed the warped token sequence (representing the scene from the target viewpoint) into the MLLM together with the viewpoint-conditioned question. No fine-tuning required โ€” any ViT-based MLLM works out of the box.

The whole pipeline in one sentence: take the source image, estimate depth with any off-the-shelf model, then for each position in the target-view grid, find where that point came from in the source image and grab the source patch there โ€” then feed these re-arranged patches (as tokens) to the MLLM and ask your question.
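The geometric core of backward warping, mapping each target-view patch center back to a source-image location, can be sketched as follows. This is a minimal sketch, assuming shared intrinsics across views and a depth value available at each target grid point (e.g., forward-warped from the source depth map); the function name and API are ours, not the paper's.

```python
import numpy as np

def backward_warp_centers(depth_tgt, K, R, t, patch=14):
    """Map each target-view patch center back to source-image coordinates.

    depth_tgt: (H, W) depth for the target view (assumed available here).
    K: (3, 3) camera intrinsics, shared by both views.
    R, t: rotation (3, 3) and translation (3,) from target to source frame.
    Returns an (N, 2) array of source-pixel (x, y) coordinates,
    one per target patch center, ready for nearest or adaptive fetching.
    """
    H, W = depth_tgt.shape
    half = patch // 2
    ys, xs = np.meshgrid(np.arange(half, H, patch),
                         np.arange(half, W, patch), indexing="ij")
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])  # (3, N) homogeneous
    d = depth_tgt[ys.ravel(), xs.ravel()]                       # (N,) depths
    pts_tgt = np.linalg.inv(K) @ pix * d                        # unproject to 3D
    pts_src = R @ pts_tgt + t[:, None]                          # target -> source frame
    proj = K @ pts_src                                          # reproject
    return (proj[:2] / proj[2]).T                               # (N, 2) source (x, y)

# Sanity check: with an identity pose, every center maps back onto itself.
K = np.array([[100.0, 0.0, 112.0], [0.0, 100.0, 112.0], [0.0, 0.0, 1.0]])
depth = np.full((224, 224), 2.0)
src_xy = backward_warp_centers(depth, K, np.eye(3), np.zeros(3))
```

Each returned coordinate is then handed to nearest or adaptive fetching, and the fetched patches are embedded as the token sequence the MLLM sees.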

ViewBench: A New Benchmark for Viewpoint Reasoning

We introduce ViewBench, a benchmark specifically designed to evaluate the ability of MLLMs to reason about scenes from nearby viewpoints. It covers three complementary subtasks and three rotation ranges (5ยฐโ€“15ยฐ, 15ยฐโ€“25ยฐ, 25ยฐโ€“35ยฐ).

๐Ÿ’ฌ

ViewBench-Text

View-conditioned spatial reasoning: binary left/right questions about object spatial relationships from a target viewpoint. Tests whether MLLMs can correctly orient their spatial reasoning to the rotated perspective.

๐Ÿ“

ViewBench-Shape

Shape identification from the target view: the model must identify the correct shape of an object as it appears from the rotated viewpoint, testing geometric perspective understanding.

๐Ÿ‘๏ธ

ViewBench-Object

Target-view object description: open-ended description of how an object appears from the target viewpoint. Evaluated by an LLM judge on a -10 to +10 similarity scale, rewarding descriptions that match the actual target-view appearance.

ViewBench benchmark examples
Figure 6. ViewBench examples across three subtasks: spatial reasoning (left), shape identification (center), and target-view object description (right). Green checkmarks indicate correct responses.
ViewBench is designed so that questions have definitive ground-truth answers derivable from geometry. The "object description" task uses an LLM judge โ€” not a binary answer โ€” because the target view might reveal new object features not visible from the source. The -10 to +10 scale rewards answers that describe what actually becomes visible from the new angle.

Experiments & Results

Token Warping (Backward-Adaptive) consistently outperforms all baselines across all three ViewBench subtasks and all rotation ranges (5ยฐโ€“35ยฐ), surpassing pixel-wise warping, spatially fine-tuned MLLMs, novel view synthesis, and generative warping methods โ€” without any training.

Table 1: Main Results on ViewBench

Main quantitative results on ViewBench
Table 1. Accuracy on ViewBench across three subtasks (ViewBench-Text, ViewBench-Shape, ViewBench-Object) and three rotation ranges. Backward Token Warping (Adaptive) achieves the highest scores, especially on larger rotations.

Qualitative Comparison

Qualitative comparison across methods
Figure 8. Qualitative comparison on ViewBench-Text examples. Token Warping (Backward-Adaptive) correctly answers spatial questions from the target viewpoint, while pixel-wise warping and other baselines frequently give incorrect responses.

Comparison with Fine-Tuned Baselines

Comparison with fine-tuned MLLM baselines
Table 2. Comparison with spatially fine-tuned MLLM baselines. Despite requiring no training, token warping surpasses dedicated spatially-supervised models.
Why does training-free token warping beat models that were explicitly trained on 3D spatial tasks? Because spatial fine-tuning teaches the model what typical views look like, but doesn't fix the underlying lack of viewpoint transformation ability. Token warping actively synthesizes the target viewpoint โ€” it gives the model the correct visual input rather than hoping it can imagine the transformation.

Conclusion

This paper introduces token warping as a training-free approach to enable viewpoint-conditioned visual reasoning in MLLMs. Our key findings: backward token warping is more stable and semantically coherent than forward or pixel-wise warping; image tokens are robust to the positional noise introduced by imperfect depth estimation; and, without any fine-tuning, token warping outperforms spatially fine-tuned MLLMs and generative warping methods across all ViewBench subtasks.

Take-away

Token-level mental imagery โ€” rather than pixel manipulation or explicit 3D reconstruction โ€” is a promising and practical path toward robust spatial reasoning in multimodal AI systems.

Citation

@article{lee2026tokenwarping,
  title={Token Warping Helps MLLMs Look from Nearby Viewpoints},
  author={Lee, Phillip Y. and Park, Chanho and Park, Mingue
          and Yoo, Seungwoo and Koo, Juil and Sung, Minhyuk},
  journal={arXiv preprint arXiv:2604.02870},
  year={2026}
}
