---
arxiv_id: 2604.02870
title: "Token Warping Helps MLLMs Look from Nearby Viewpoints"
authors:
  - Phillip Y. Lee
  - Chanho Park
  - Mingue Park
  - Seungwoo Yoo
  - Juil Koo
  - Minhyuk Sung
difficulty: Intermediate
tags:
  - Multimodal
  - Spatial Reasoning
  - Viewpoint
  - Token Warping
  - Benchmark
published_at: 2026-04-03
flecto_url: https://flecto.zer0ai.dev/papers/2604.02870/
lang: en
---

> Token Warping Helps MLLMs Look from Nearby Viewpoints

**Authors**: Phillip Y. Lee*, Chanho Park*, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung

## Abstract

### Abstract

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.

## Introduction

### Introduction

A core aspect of spatial reasoning from images is understanding the scene's three-dimensional structure. Although monocular depth estimation has become remarkably accurate, incorporating predicted depth into MLLMs does not yield genuine 3D understanding. Even for simple tasks such as describing the same scene from a different viewpoint, MLLMs fine-tuned with explicit 3D supervision show little improvement. Similar limitations arise in models that incorporate 3D-aware features, which still struggle to reason about viewpoint transformations.

Classical research on mental imagery — from Shepard to Minsky, Pylyshyn, and Hinton — proposes that mental images rely on structural descriptions defined at the part level. From this perspective, image tokens used by Transformer architectures represent a machine-perceivable, part-level representation. It is therefore natural to extend the concept of mental imagery to these perceptual atomic units rather than to object-level abstractions.

We verify this with a patch perturbation experiment: gradually increasing the positional offset when fetching patches shows that MLLMs remain surprisingly stable with perturbed tokens, while pixel-level perturbation causes severe accuracy drops. This provides strong evidence that when constructing tokens from a different viewpoint using an imperfect depth map, the geometric noise introduced does not significantly undermine the MLLM's visual understanding.

## Results

### Experiments & Results

### Table 1: Main Results on ViewBench

### Qualitative Comparison

### Comparison with Fine-Tuned Baselines

## Conclusion

### Conclusion

This paper introduces token warping as a training-free approach to enable viewpoint-conditioned visual reasoning in MLLMs. Our key findings include:

- Image tokens in ViT-based MLLMs are robust to spatial perturbations, making them an effective substrate for viewpoint transformation.
- Backward token warping with adaptive fetching consistently outperforms forward warping and all pixel-wise, fine-tuning, and generative baselines.
- ViewBench, our proposed benchmark, provides a comprehensive evaluation framework for viewpoint-conditioned MLLM reasoning across spatial, shape, and object description tasks.
- The approach connects naturally to cognitive theories of mental imagery, suggesting that part-level token representations in neural networks mirror the structural representations proposed to underlie human perspective reasoning.

### Take-away

Token-level mental imagery — rather than pixel manipulation or explicit 3D reconstruction — is a promising and practical path toward robust spatial reasoning in multimodal AI systems.

## Head Title

### Token Warping Helps MLLMs Look from Nearby Viewpoints | Flecto

## Head Meta

Token Warping enables MLLMs to reason about scenes from nearby viewpoints by warping image tokens instead of pixels — a training-free approach that outperforms pixel warping, fine-tuned MLLMs, and generative warping on the new ViewBench benchmark.

## Hero Button

### Read on arXiv ↗

### Project Page ↗

## Contributions

### Key Contributions

## Contributions Card 1

### Token Warping

A training-free method that warps image tokens (not pixels) from a source view to a target viewpoint using depth estimation and camera pose — robust to depth errors that would catastrophically distort pixel-level methods.

## Contributions Card 2

### Backward Warping is Best

Systematic comparison of forward and backward token warping reveals that backward warping — defining a dense target-view grid and retrieving source tokens — achieves greater stability and semantic coherence.

## Contributions Card 3

### ViewBench Benchmark

A new benchmark with three subtasks (spatial reasoning, shape reasoning, and object description) evaluating MLLM viewpoint reasoning across rotation ranges of 5°–35°.

## Teaser Figure

Figure 1. Viewpoint Change via Token Warping. Given a source image (View A), backward token warping synthesizes tokens representing the scene from a rotated viewpoint (View B), enabling the MLLM to correctly reason about spatial relationships without pixel synthesis.

## Introduction Callout

### Core Hypothesis

Image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes: transformations applied to tokens generate consistent internal representations under viewpoint shifts, improving spatial reasoning — and token-level transformations are robust to the geometric noise that catastrophically degrades pixel-level warping.

## Introduction Figure

Fig. 2. ViT image tokenization: (A) source image → (B) patch grid → (C) image tokens fed to the MLLM. Token warping operates on this representation.

## Method

### Method: Token Warping for Viewpoint Change

A training-free approach using depth estimation and camera pose to construct viewpoint-conditioned image tokens for MLLMs.

### Token Robustness to Spatial Noise

We tested MLLM robustness by gradually increasing the positional offset when fetching patches — from 0 to 20px. Even with perturbations approaching the patch size, MLLMs showed only marginal accuracy drops with perturbed tokens. In contrast, pixel-level perturbation at the same scale caused severe degradation. This confirms that the geometric noise introduced by imperfect depth estimation does not significantly undermine token-level visual understanding.
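As a concrete illustration, the probe can be sketched as re-fetching every patch from a randomly jittered center. This is a minimal re-implementation under assumptions (NumPy, a square non-overlapping patch grid, uniform integer offsets); the function name and parameters are ours, not the paper's.

```python
import numpy as np

def perturb_patch_centers(image, patch=14, max_offset=10, seed=0):
    """Rebuild an image by cutting each patch from a jittered location.

    Hypothetical sketch of the robustness probe: instead of the canonical
    patch grid, each patch is fetched from a randomly offset center,
    mimicking the positional noise that token warping introduces.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    out = np.empty_like(image)
    for y0 in range(0, h - h % patch, patch):
        for x0 in range(0, w - w % patch, patch):
            # Jitter the fetch location, clamped so the patch stays in-bounds.
            dy, dx = rng.integers(-max_offset, max_offset + 1, size=2)
            ys = int(np.clip(y0 + dy, 0, h - patch))
            xs = int(np.clip(x0 + dx, 0, w - patch))
            out[y0:y0 + patch, x0:x0 + patch] = image[ys:ys + patch, xs:xs + patch]
    return out
```

With `max_offset=0` the function is an identity on the patch grid; sweeping `max_offset` from 0 to ~20 px reproduces the shape of the experiment, with the perturbed images (or their token sequences) fed to the MLLM.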

### Nearest vs. Adaptive Fetching

## Method Figure

Figure 4. (A) Pixel-wise warping amplifies small depth errors into severe visual distortions. (B) Token warping maps target grid positions back to the source, where small errors merely retrieve a nearby token — no visible distortion.

Figure 3. Forward warping introduces holes and artifacts in the warped view. Backward warping produces a more complete and geometrically coherent result.

Figure 5. Token perturbation robustness. MLLMs maintain high recognition accuracy (green) even with large positional perturbations of patch centers, while pixel-level perturbation (yellow) degrades severely.

Figure 7. (A) Nearest Fetching: assigns the pre-computed token nearest to the mapped target location. (B) Adaptive Fetching: re-patchifies the source image at the exact mapped center, capturing finer-grained visual detail.

## Method Steps

### Depth Estimation

Apply an off-the-shelf monocular depth estimator to the source image. The resulting depth map, together with the target camera pose, defines the 3D geometry for token warping.

### Backward Token Warping

Define a dense grid on the target view. For each target grid point, unproject it into 3D using the depth map, transform the 3D point into the source camera frame, and retrieve the corresponding source-view patch (nearest or adaptive fetching).

### MLLM Inference

Feed the warped token sequence (representing the scene from the target viewpoint) into the MLLM together with the viewpoint-conditioned question. No fine-tuning required — any ViT-based MLLM works out of the box.
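The geometric core of the steps above can be sketched as a single mapping from target-view grid centers to source-image locations. This is a hedged sketch, not the paper's implementation: it assumes a pinhole camera with intrinsics `K`, a relative pose `(R, t)` taking target-camera coordinates to source-camera coordinates, and uses the source depth map as a proxy for target-view depth (a simplification that is plausible only for nearby viewpoints); all names are ours.

```python
import numpy as np

def backward_warp_grid(depth, K, R, t, patch=14):
    """Map target-view patch-grid centers back into the source image.

    depth : (H, W) depth map (source depth used as a target-view proxy).
    K     : (3, 3) pinhole intrinsics.
    R, t  : relative pose taking target-camera coords to source-camera coords.
    Returns an (Ny, Nx, 2) array of (x, y) source locations from which to
    fetch the nearest pre-computed token, or to re-patchify (adaptive).
    """
    h, w = depth.shape
    K_inv = np.linalg.inv(K)
    ys = np.arange(patch // 2, h, patch)  # grid-center rows
    xs = np.arange(patch // 2, w, patch)  # grid-center columns
    src_xy = np.zeros((len(ys), len(xs), 2))
    for i, v in enumerate(ys):
        for j, u in enumerate(xs):
            # Unproject the target grid center into 3D using its depth.
            p3d = depth[v, u] * (K_inv @ np.array([u, v, 1.0]))
            # Transform into the source camera frame and reproject.
            q = K @ (R @ p3d + t)
            src_xy[i, j] = q[:2] / q[2]
    return src_xy
```

With the identity pose (`R = I`, `t = 0`) each grid center maps back to itself; a small rotation shifts the fetch locations smoothly, and a small depth error only moves the fetch point slightly, which is exactly why token-level fetching tolerates it.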

## Viewbench

### ViewBench: A New Benchmark for Viewpoint Reasoning

We introduce ViewBench, a benchmark specifically designed to evaluate the ability of MLLMs to reason about scenes from nearby viewpoints. It covers three complementary subtasks and three rotation ranges (5°–15°, 15°–25°, 25°–35°).
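To make the benchmark's structure concrete, here is a hypothetical record layout; the field names and bucket boundaries below are our assumptions based on the description above, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class ViewBenchItem:
    """Illustrative ViewBench example (field names are assumptions)."""
    source_image: str    # path to the View A image
    rotation_deg: float  # viewpoint rotation, within 5-35 degrees
    subtask: str         # "text" (left/right), "shape", or "object"
    question: str        # viewpoint-conditioned question
    answer: str          # ground truth (or reference description)

def rotation_bucket(item: ViewBenchItem) -> str:
    """Assign an item to one of the three rotation ranges."""
    if item.rotation_deg < 15:
        return "5-15"
    if item.rotation_deg < 25:
        return "15-25"
    return "25-35"
```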

## Viewbench Card 1

### ViewBench-Text

View-conditioned spatial reasoning: binary left/right questions about object spatial relationships from a target viewpoint. Tests whether MLLMs can correctly orient their spatial reasoning to the rotated perspective.

## Viewbench Card 2

### ViewBench-Shape

Shape identification from the target view: the model must identify the correct shape of an object as it appears from the rotated viewpoint, testing geometric perspective understanding.

## Viewbench Card 3

### ViewBench-Object

Target-view object description: open-ended description of how an object appears from the target viewpoint. Evaluated by an LLM judge on a -10 to +10 similarity scale, rewarding descriptions that match the actual target-view appearance.

## Viewbench Figure

Figure 6. ViewBench examples across three subtasks: spatial reasoning (left), shape identification (center), and target-view object description (right). Green checkmarks indicate correct responses.

## Results Highlight

Token Warping (Backward-Adaptive) consistently outperforms all baselines across all three ViewBench subtasks and all rotation ranges (5°–35°), surpassing pixel-wise warping, spatially fine-tuned MLLMs, novel view synthesis, and generative warping methods — without any training.

## Results Table

Table 1. Accuracy on ViewBench across three subtasks (ViewBench-Text, ViewBench-Shape, ViewBench-Object) and three rotation ranges. Backward Token Warping (Adaptive) achieves the highest scores, especially on larger rotations.

Table 2. Comparison with spatially fine-tuned MLLM baselines. Despite requiring no training, token warping surpasses dedicated spatially supervised models.

## Results Figure

Figure 8. Qualitative comparison on ViewBench-Text examples. Token Warping (Backward-Adaptive) correctly answers spatial questions from the target viewpoint, while pixel-wise warping and other baselines frequently give incorrect responses.

## Citation

### Citation

## Abstract Flecto Note

In plain terms: the paper asks whether we can give an MLLM a "mental rotation" ability — making it imagine a scene from a slightly different camera angle — by rearranging the image patches it sees, rather than processing the original image from that angle. The answer is yes, and the key insight is that this rearrangement is forgiving: if you fetch a slightly wrong patch, the MLLM still understands the scene well enough.

## Introduction Flecto Callout

Why tokens and not pixels? When you warp a pixel image with an imperfect depth map, each pixel moves to the wrong place and the result looks broken. But when you warp tokens, a "wrong" position still lands on a nearby part of the scene — the MLLM reads it as a slightly misaligned patch, not as visual garbage. It's like navigating by landmarks instead of exact GPS coordinates: being a few meters off still gets you to the right place.

## Method Flecto Note

Think of each image patch as a puzzle piece. If you slide a puzzle piece a few millimeters, you can still tell what it depicts. But if you smear or pixelate it (pixel-level distortion), the piece becomes unrecognizable. MLLMs process tokens like puzzle pieces — small position errors don't destroy meaning.

## Method Flecto Card

The whole pipeline in one sentence: take the source image, estimate depth with any off-the-shelf model, then for each position in the target-view grid, find where that point came from in the source image and grab the source patch there — then feed these re-arranged patches (as tokens) to the MLLM and ask your question.

## Viewbench Flecto Panel

ViewBench is designed so that questions have definitive ground-truth answers derivable from geometry. The "object description" task uses an LLM judge — not a binary answer — because the target view might reveal new object features not visible from the source. The -10 to +10 scale rewards answers that describe what actually becomes visible from the new angle.

## Results Flecto Callout

Why does training-free token warping beat models that were explicitly trained on 3D spatial tasks? Because spatial fine-tuning teaches the model what typical views look like, but doesn't fix the underlying lack of viewpoint transformation ability. Token warping actively synthesizes the target viewpoint — it gives the model the correct visual input rather than hoping it can imagine the transformation.
