KAIST | * Equal contribution
A training-free method that warps image tokens (not pixels) from a source view to a target viewpoint using depth estimation and camera pose, making it robust to depth errors that would catastrophically distort pixel-level methods.
Systematic comparison of forward and backward token warping reveals that backward warping, which defines a dense target-view grid and retrieves source tokens, achieves greater stability and semantic coherence.
A new benchmark with three subtasks (spatial reasoning, shape reasoning, and object description) evaluating MLLM viewpoint reasoning across rotation ranges of 5°–35°.
Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
A core aspect of spatial reasoning from images is understanding the scene's three-dimensional structure. Although monocular depth estimation has become highly accurate, incorporating predicted depth into MLLMs does not yield genuine 3D understanding. Even for simple tasks such as describing the same scene from a different viewpoint, MLLMs fine-tuned with explicit 3D supervision show little improvement. Similar limitations arise in models that incorporate 3D-aware features, which still struggle to reason about viewpoint transformations.
Classical research on mental imagery, from Shepard to Minsky, Pylyshyn, and Hinton, proposes that mental images rely on structural descriptions defined at the part level. From this perspective, image tokens used by Transformer architectures represent a machine-perceivable, part-level representation. It is therefore natural to extend the concept of mental imagery to these perceptual atomic units rather than to object-level abstractions.
Image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes: transformations applied to tokens generate consistent internal representations under viewpoint shifts, improving spatial reasoning. Moreover, token-level transformations are robust to the geometric noise that catastrophically degrades pixel-level warping.
We verify this with a patch perturbation experiment: gradually increasing the positional offset when fetching patches shows that MLLMs remain surprisingly stable with perturbed tokens, while pixel-level perturbation causes severe accuracy drops. This provides strong evidence that when constructing tokens from a different viewpoint using an imperfect depth map, the geometric noise introduced does not significantly undermine the MLLM's visual understanding.
A training-free approach using depth estimation and camera pose to construct viewpoint-conditioned image tokens for MLLMs.
We tested MLLM robustness by gradually increasing the positional offset when fetching patches, from 0 to 20 px. Even with perturbations approaching the patch size, MLLMs showed only marginal accuracy drops with perturbed tokens. In contrast, pixel-level perturbation at the same scale caused severe degradation. This confirms that the geometric noise introduced by imperfect depth estimation does not significantly undermine token-level visual understanding.
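The perturbation probe described above can be sketched as follows. This is a minimal illustration, not the paper's code: it rebuilds a ViT-style patch grid while fetching each patch from a randomly jittered location, mimicking the geometric noise an imperfect depth map would introduce. The function name, patch size, and offset scheme are all illustrative assumptions.

```python
import numpy as np

def perturb_token_fetch(image, offset_px, patch_size=14, seed=0):
    """Rebuild the patch grid, fetching each patch from a jittered position.

    Each patch is copied from its nominal location plus a random offset of
    up to `offset_px` pixels in each direction, so downstream tokenization
    sees geometrically perturbed patches.
    """
    rng = np.random.default_rng(seed)
    H, W, C = image.shape
    Hp, Wp = H // patch_size, W // patch_size
    out = np.zeros_like(image)
    for i in range(Hp):
        for j in range(Wp):
            # Random per-patch jitter in [-offset_px, offset_px].
            dy, dx = rng.integers(-offset_px, offset_px + 1, size=2)
            # Clamp so the fetched patch stays inside the image.
            y = int(np.clip(i * patch_size + dy, 0, H - patch_size))
            x = int(np.clip(j * patch_size + dx, 0, W - patch_size))
            out[i*patch_size:(i+1)*patch_size,
                j*patch_size:(j+1)*patch_size] = \
                image[y:y+patch_size, x:x+patch_size]
    return out
```

Sweeping `offset_px` from 0 to 20 and encoding the perturbed images reproduces the setup: token-level accuracy should degrade only marginally, while equivalent pixel-level shifts degrade it severely.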
Apply an off-the-shelf monocular depth estimator to the source image. The resulting depth map, together with the target camera pose, defines the 3D geometry for token warping.
Define a dense grid on the target view. For each target grid point, unproject it into 3D using the depth map, transform the 3D point into the source camera frame, and retrieve the corresponding source-view patch (nearest or adaptive fetching).
Feed the warped token sequence (representing the scene from the target viewpoint) into the MLLM together with the viewpoint-conditioned question. No fine-tuning required: any ViT-based MLLM works out of the box.
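The pipeline above can be sketched in a few lines. This is a hedged sketch, not the paper's implementation: it assumes a pinhole camera with shared intrinsics `K`, a relative pose `(R, t)` mapping target-frame points into the source frame, per-patch depth for the target grid, and nearest-patch fetching; `warp_tokens` and its arguments are illustrative names.

```python
import numpy as np

def warp_tokens(src_tokens, depth, K, R, t, patch_size=14):
    """Backward-warp source patch tokens to a target viewpoint.

    src_tokens : (H_p, W_p, D) patch tokens from the source view.
    depth      : (H_p, W_p) per-patch depth for the target grid.
    K          : (3, 3) camera intrinsics (shared by both views).
    R, t       : rotation (3, 3) and translation (3,), target -> source frame.
    """
    Hp, Wp, D = src_tokens.shape
    K_inv = np.linalg.inv(K)
    out = np.zeros_like(src_tokens)
    for i in range(Hp):
        for j in range(Wp):
            # Center of this target-grid patch in pixel coordinates.
            u = (j + 0.5) * patch_size
            v = (i + 0.5) * patch_size
            # Unproject the target grid point into 3D using its depth.
            p_cam = depth[i, j] * (K_inv @ np.array([u, v, 1.0]))
            # Transform into the source camera frame and reproject.
            p_src = K @ (R @ p_cam + t)
            if p_src[2] <= 0:
                continue  # behind the source camera: leave token as zeros
            us, vs = p_src[0] / p_src[2], p_src[1] / p_src[2]
            # Nearest-patch fetch: copy the closest source token.
            sj = int(round(us / patch_size - 0.5))
            si = int(round(vs / patch_size - 0.5))
            if 0 <= si < Hp and 0 <= sj < Wp:
                out[i, j] = src_tokens[si, sj]
    return out
```

With an identity pose the warp reduces to a copy, which is a useful sanity check; the adaptive-fetching variant mentioned above would replace the nearest-token lookup with a weighted blend of neighboring tokens.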
We introduce ViewBench, a benchmark specifically designed to evaluate the ability of MLLMs to reason about scenes from nearby viewpoints. It covers three complementary subtasks and three rotation ranges (5°–15°, 15°–25°, 25°–35°).
View-conditioned spatial reasoning: binary left/right questions about object spatial relationships from a target viewpoint. Tests whether MLLMs can correctly orient their spatial reasoning to the rotated perspective.
Shape identification from the target view: the model must identify the correct shape of an object as it appears from the rotated viewpoint, testing geometric perspective understanding.
Target-view object description: open-ended description of how an object appears from the target viewpoint. Evaluated by an LLM judge on a -10 to +10 similarity scale, rewarding descriptions that match the actual target-view appearance.
Token Warping (Backward-Adaptive) consistently outperforms all baselines across all three ViewBench subtasks and all rotation ranges (5°–35°), surpassing pixel-wise warping, spatially fine-tuned MLLMs, novel view synthesis, and generative warping methods, all without any training.
This paper introduces token warping as a training-free approach to enable viewpoint-conditioned visual reasoning in MLLMs. Our key findings include:
Token-level mental imagery, rather than pixel manipulation or explicit 3D reconstruction, is a promising and practical path toward robust spatial reasoning in multimodal AI systems.
@article{lee2026tokenwarping,
title={Token Warping Helps MLLMs Look from Nearby Viewpoints},
author={Lee, Phillip Y. and Park, Chanho and Park, Mingue
and Yoo, Seungwoo and Koo, Juil and Sung, Minhyuk},
journal={arXiv preprint arXiv:2604.02870},
year={2026}
}