---
arxiv_id: "2604.11626"
title: "RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time"
authors:
  - "Haozhe Wang"
  - "Cong Wei"
  - "Weiming Ren"
  - "Jiaming Liu"
  - "Fangzhen Lin"
  - "Wenhu Chen"
difficulty: "intermediate"
tags:
  - "reward-models"
  - "visual-generation"
  - "reinforcement-learning"
  - "test-time-scaling"
  - "VLM"
published_at: "2026-04-13"
flecto_url: "https://flecto.zer0ai.dev/papers/2604.11626/"
lang: "en"
---

### RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen

Published April 13, 2026

## Abstract

## What This Paper Is About

Most reward models for visual generation compress complex human preferences into a single opaque score, discarding the reasoning behind those preferences. RationalRewards changes this by teaching reward models to produce explicit, multi-dimensional critiques before scoring. Using a novel framework called PARROT (Preference-Anchored Rationalization), the model treats rationales as latent variables inferred from pairwise preference data. This transforms the reward model from a passive evaluator into an active optimization tool that enables two complementary strategies: RL-based fine-tuning in parameter space and Generate-Critique-Refine loops in prompt space. Remarkably, the test-time prompt tuning approach matches or exceeds RL fine-tuning on several benchmarks, without any parameter updates.

## Reasoning-Based Rewards

Multi-dimensional structured critiques replace opaque scalar scores, providing explainable evaluation across text faithfulness, image faithfulness, visual quality, and text rendering.

## PARROT Framework

A variational framework that trains reward models using preference data by treating rationales as latent variables, decomposing the ELBO into three interpretable phases.

## Dual-Space Optimization

Enables both parameter-space tuning via RL (training time) and prompt-space tuning via Generate-Critique-Refine loops (test time), trading compute for quality without parameter updates.

## Introduction

## The Problem with Scalar Reward Models

As visual generation advances toward photorealistic, instruction-following outputs, reward models that evaluate these outputs have become the binding constraint on further progress. Yet most reward models remain scalar black boxes: they compress multi-dimensional human judgments into a single number. This opacity causes two critical problems.

First, reward hacking: models learn to exploit biases in the scalar signal to inflate scores without genuine quality improvement. Second, scalar scores provide no actionable feedback — they tell the generator that something is wrong, but not what or how to fix it. RationalRewards addresses both problems by generating structured, multi-dimensional critiques before deriving scores, enabling the reward model to serve as both evaluator and optimizer.

Parameter Space: Multi-dimensional structured rationales provide semantically grounded, dense feedback for reinforcement learning — replacing opaque scalar gradients prone to reward hacking.

Prompt Space: Natural-language rationales identify concrete deficiencies and translate them into targeted prompt revisions via a Generate-Critique-Refine loop — a purely test-time intervention.

State-of-the-Art: An 8B-parameter model achieves SOTA preference prediction among open-source reward models, competitive with Gemini-2.5-Pro (a far larger proprietary model).

### Figure 1: Benchmark comparison showing RL and Prompt Tuning improvements across three evaluation suites

Figure 1: Train-Time RL and Test-Time Prompt Tuning (PT) with RationalRewards on text-to-image and image-to-image generation benchmarks. (a) ImgEdit-Bench Overall: RL with RationalRewards outperforms prior open-source generators. (b) GEdit-Bench-EN: Combined RL+PT reaches 8.33. (c) UniGenBench++: Radar chart showing category-level improvements in text-to-image generation.

## Key Results

Instantiated via PARROT on a Qwen3-VL-8B-Instruct backbone, RationalRewards achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro. As an RL reward, it consistently improves Qwen-Image and Flux-Kontext generators, rivaling GPT-Image-1. The test-time prompt tuning approach matches or exceeds computationally expensive RL fine-tuning on several benchmarks, without any parameter updates.

## Method

## PARROT: Preference-Anchored Rationalization

PARROT trains reward models to produce explicit, multi-dimensional rationales before scores. Assessment dimensions include text faithfulness, physical and visual quality, image faithfulness, and text rendering quality. Since ground-truth rationales are prohibitively expensive to annotate at scale, PARROT formulates rationales as latent variables inferred from pairwise preference data via a variational objective. The resulting ELBO decomposes into three terms, each corresponding to a training phase.
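As a sketch of the bound this description implies, assume the standard latent-variable form. The symbols are illustrative and not taken from the paper: $x$ is the comparison tuple (user request plus two images), $y$ the preference label, $z$ the rationale, $q_\phi$ the label-conditioned teacher posterior, and $p_\theta$ the student model. The paper's exact three-term grouping may differ from this two-term textbook form.

```latex
\log p_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\bigl[\log p_\theta(y \mid x, z)\bigr]
  \;-\;
  \mathrm{KL}\bigl(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\bigr)
```

Under this reading, Phase 1 draws rationale samples from $q_\phi(z \mid x, y)$, Phase 2 keeps only samples for which the predictive term $\log p_\theta(y \mid x, z)$ is high, and Phase 3 shrinks the KL term by fitting the label-free model $p_\theta(z \mid x)$ to the retained samples.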

## Figure 4: The PARROT three-phase pipeline showing ELBO decomposition

Figure 4: The PARROT framework. The ELBO is decomposed into three terms, each optimized by a distinct phase. Phase 1 (Hindsight): Teacher VLM generates rationales given the preference label. Phase 2 (Consistency Check): Rationales are verified to be predictively sufficient. Phase 3 (Foresight): Student VLM learns to generate rationales without the preference label.

## Phase 1: Hindsight Rationale Generation

A Teacher VLM is prompted with a comparison tuple (two images + user request) and the ground-truth preference label. This label acts as a preference anchor that steers the teacher’s analysis toward the correct judgment, yielding higher-quality posterior samples than unconditioned generation. The teacher produces structured critiques across four quality dimensions with scores and justifications.
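The hindsight step can be sketched as prompt construction that reveals the label to the teacher. This is a minimal illustration, not the paper's prompt: the `Comparison` class, the `hindsight_prompt` function, and the prompt wording are all hypothetical, and only the dimension names come from the text.

```python
from dataclasses import dataclass

# Dimension names follow the text; everything else here is hypothetical.
DIMENSIONS = (
    "text faithfulness",
    "image faithfulness",
    "physical and visual quality",
    "text rendering",
)


@dataclass(frozen=True)
class Comparison:
    request: str    # user instruction
    image_a: str    # identifier or path of the first image
    image_b: str    # identifier or path of the second image
    preferred: str  # ground-truth label, "A" or "B"


def hindsight_prompt(c: Comparison) -> str:
    """Build teacher prompt text that includes the preference label (Phase 1)."""
    if c.preferred not in ("A", "B"):
        raise ValueError("preferred must be 'A' or 'B'")
    dims = "\n".join(f"- {d}: score and justification" for d in DIMENSIONS)
    return (
        f"User request: {c.request}\n"
        f"Image A: {c.image_a}\n"
        f"Image B: {c.image_b}\n"
        f"Human annotators preferred image {c.preferred}.\n"
        f"Explain this preference with a critique per dimension:\n{dims}"
    )


example = hindsight_prompt(Comparison("add a red hat", "a.png", "b.png", "B"))
```

The line stating the annotators' choice is the preference anchor; dropping it yields the label-free prompt the student sees in Phase 3.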

## Phase 2: Predictive Consistency Filtering

While Phase 1 produces linguistically plausible rationales, plausibility alone doesn’t guarantee usefulness. Phase 2 enforces predictive sufficiency: the Teacher is re-queried with the rationale without the preference label, and must correctly predict the original preference. Only rationales that pass this consistency check are retained, filtering out hallucinated or insufficiently informative ones.
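The consistency check amounts to a filter over teacher samples. The sketch below assumes each sample is a `(comparison_id, rationale, label)` triple and that a callable `predict` stands in for re-querying the teacher without the label. Both names are hypothetical, and `toy_predict` is a keyword stub rather than a real VLM call.

```python
from typing import Callable, Iterable, List, Tuple

# (comparison_id, rationale_text, ground_truth_label)
Sample = Tuple[str, str, str]


def consistency_filter(
    samples: Iterable[Sample],
    predict: Callable[[str, str], str],
) -> List[Sample]:
    """Keep samples whose rationale lets `predict` recover the original label.

    `predict(comparison_id, rationale)` must return "A" or "B" and must not
    see the ground-truth label.
    """
    return [s for s in samples if predict(s[0], s[1]) == s[2]]


def toy_predict(comparison_id: str, rationale: str) -> str:
    # Toy stand-in for a VLM call: follow whichever image the rationale favors.
    return "A" if "image a is better" in rationale.lower() else "B"


samples = [
    ("q1", "Image A is better: it follows the instruction.", "A"),
    ("q2", "Image A is better: sharper details.", "B"),  # contradicts the label
    ("q3", "Image B is better: text is rendered correctly.", "B"),
]
kept = consistency_filter(samples, toy_predict)
```

Here `q2` is dropped because its rationale argues for the image that annotators did not prefer.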

## Phase 3: Foresight Student Learning

A smaller Student VLM (8B parameters) is trained via supervised fine-tuning on the filtered rationales to generate critiques without the preference label. This minimizes the KL divergence between the label-conditioned posterior and the label-free prior, so that at inference time the student can produce scoring rationales from the user request and images alone. The student is trained jointly on pairwise and pointwise data.

## Takeaway: Why Reasoning Rewards Resist Reward Hacking

Scalar reward models compress evaluation into a single number that can be inflated by exploiting biases. Multi-dimensional structured rationales provide an internal consistency mechanism: justifications must align with scores, and scores must be consistent across dimensions. If a model tries to inflate one dimension, the rationale’s justification must explain why — making gaming detectable and penalizable. This structural transparency is what gives reasoning rewards their robustness.

### Figure 3: Comparison of reward hacking with scalar vs rational rewards during RL training

Figure 3: RL training comparison. Top row (RationalRewards): image quality improves steadily with stable reward curves. Bottom row (Scalar Rewards): reward hacking occurs — train reward grows but quality degrades with artifacts and color saturation issues.

## Optimization

## From Evaluator to Optimizer: Dual-Space Optimization

RationalRewards is not just an evaluator — it functions as an active optimization tool across two complementary spaces. This dual formulation connects to test-time compute scaling: prompt-space optimization offers an axis for improving generation quality orthogonal to parameter-space training, applicable to any frozen generator without risk of catastrophic forgetting.

### Figure 2: Architecture overview showing Training Time RL and Test Time Prompt Tuning paths

Figure 2: RationalRewards enables dual-space optimization. (a) Training Time RL: multi-dimensional reward signals provide separate gradient information per quality dimension. (b) Test Time Prompt Tuning: natural-language critiques are translated into refined prompts for re-generation.

## Parameter Space (Training Time)

Multi-dimensional scores provide semantically decomposed reward signals for reinforcement learning. Each quality dimension (text faithfulness, image faithfulness, visual quality, text rendering) provides targeted gradient information, enabling fine-grained optimization rather than chasing a single opaque scalar. This dense feedback helps the generator understand what to improve and why.
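The source does not say how the per-dimension scores enter the RL objective. One simple possibility, shown here purely as an assumption, is a weighted mean over the dimensions that received a score. The function name, the dictionary keys, the equal default weights, and the convention of `None` for a dimension that does not apply are all hypothetical.

```python
from typing import Dict, Optional

DIMENSIONS = (
    "text_faithfulness",
    "image_faithfulness",
    "visual_quality",
    "text_rendering",
)


def scalar_reward(
    scores: Dict[str, Optional[float]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted mean over the dimensions that received a score.

    A dimension set to None (for example text rendering when the image
    contains no text) is skipped. Equal weights are an assumption.
    """
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    used = [d for d in DIMENSIONS if scores.get(d) is not None]
    if not used:
        raise ValueError("no dimension was scored")
    total_weight = sum(weights[d] for d in used)
    if total_weight <= 0:
        raise ValueError("weights of scored dimensions must sum to a positive value")
    return sum(weights[d] * scores[d] for d in used) / total_weight


reward = scalar_reward(
    {
        "text_faithfulness": 4.0,
        "image_faithfulness": 3.0,
        "visual_quality": 2.0,
        "text_rendering": None,
    }
)
```

Keeping the per-dimension values alongside the aggregate makes it possible to log which dimension moved when the reward changes.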

## Prompt Space (Test Time)

Natural-language rationales identify concrete deficiencies in generated images — for example, “the instruction says no umbrellas but the image contains one.” These critiques are translated into targeted prompt revisions in a Generate-Critique-Refine (GCR) loop. This purely test-time intervention requires no parameter updates and can be applied to any frozen generator, trading compute for quality.

## Takeaway: Trained Reward Models vs. Generic VLM Judges

Why not just use a capable generic VLM (like Qwen3-VL-32B) as a judge? Beyond the practical advantage of a smaller 8B model, there’s a fundamental reason: structured training on preference data teaches calibrated judgment norms that generic VLMs lack. Generic models can articulate critiques but fail to calibrate severity — they often over-penalize minor issues while under-weighting critical failures. PARROT’s preference-anchored training addresses this by grounding judgments in actual human preferences.

## Test-Time Scaling

## Generate-Critique-Refine: Test-Time Scaling

The Generate-Critique-Refine (GCR) loop provides test-time compute scaling for image generation. After an initial generation, RationalRewards critiques the output and identifies specific deficiencies with detailed reasoning. The critiques are translated into prompt revisions, and the generator produces a new image from the revised prompt. Because the generator stays frozen throughout, any gain indicates that current generators harbor latent capabilities that suboptimal prompts fail to elicit.
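A minimal sketch of such a loop, under stated assumptions: `generate`, `critique`, and `refine` are hypothetical callables standing in for the frozen generator, the reward model, and the prompt rewriter. The round limit, the stopping threshold, and the score scale are illustrative choices, not values from the paper.

```python
from typing import Callable, Dict, Tuple

# Expected keys: "score" (float) and "issues" (list of str). Hypothetical schema.
Critique = Dict[str, object]


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    refine: Callable[[str, Critique], str],
    max_rounds: int = 3,
    target_score: float = 4.5,
) -> Tuple[str, str, float]:
    """Return (best_prompt, best_image, best_score) after at most max_rounds rounds."""
    request = prompt  # the original request stays the reference for judging
    best = (prompt, "", float("-inf"))
    for _ in range(max_rounds):
        image = generate(prompt)
        report = critique(request, image)
        score = float(report["score"])
        if score > best[2]:
            best = (prompt, image, score)
        if score >= target_score or not report["issues"]:
            break
        prompt = refine(prompt, report)  # fold the critique into the next prompt
    return best


# Toy stubs that mimic the "no umbrellas" case described in the text.
def toy_generate(p: str) -> str:
    return "no-umbrella" if "strictly no umbrellas" in p.lower() else "umbrella"


def toy_critique(request: str, image: str) -> Critique:
    if image == "umbrella":
        return {"score": 2.0, "issues": ["image contains an umbrella"]}
    return {"score": 5.0, "issues": []}


def toy_refine(p: str, report: Critique) -> str:
    return p + " Strictly no umbrellas anywhere in the frame."


result = generate_critique_refine(
    "A romantic rain scene, no umbrellas.", toy_generate, toy_critique, toy_refine
)
```

The critique is always computed against the original request rather than the rewritten prompt, so a rewrite cannot lower the bar it is judged against. Tracking the best round guards against a refinement that makes the output worse.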

## Figure 7: Generate-Critique-Refine loop example with a rain scene prompt

Figure 7: A concrete GCR example. The prompt requests a romantic rain scene with “no umbrellas.” The initial generation violates this constraint. RationalRewards critiques the violation with multi-dimensional reasoning, generating a refined prompt that explicitly reinforces the constraint for re-generation.

## Takeaway: Why Reasoning Rewards Enable Test-Time Scaling

Generators often possess latent capacity for high-quality outputs that is under-elicited by suboptimal prompts. RationalRewards unlocks this capacity without weight modification through actionable critiques. Scalar rewards cannot identify what went wrong, only that something did. Structured reasoning pinpoints the specific deficiency and prescribes concrete prompt corrections — enabling effective test-time compute scaling.

## Results

## Experiments & Results

RationalRewards is evaluated across both image editing and text-to-image generation tasks. Training data includes 30K query-preference pairs from EditReward (image editing) and 50K pairs from ImageRewardDB (text-to-image). The Teacher is Qwen3-VL-32B-Instruct and the Student backbone is Qwen3-VL-8B-Instruct.

ImgEdit-Bench: image editing (Qwen-Image +RL+PT)

GEdit-Bench-EN: general editing (Qwen-Image +RL+PT)

UniGenBench++: text-to-image (Qwen-Image [RL])

## Accuracy in Preference Modeling

The 8B-parameter RationalRewards surpasses all open-source scalar reward models by a substantial margin across all three benchmarks (MMRB2, EditReward-Bench, GenAI-Bench), without requiring complex loss designs to handle label noise. It is competitive with Gemini-2.5-Pro, a much larger proprietary model. Ablation shows that PARROT’s variational framework outperforms direct SFT distillation from the same 32B teacher, confirming that the structured training pipeline — not just model capacity — drives the improvement.
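For readers unfamiliar with the metric, preference-prediction accuracy can be computed as the fraction of comparison pairs on which the model's choice matches the human label. The sketch below assumes labels are the strings "A" and "B". The function name and the toy data are hypothetical, and the benchmarks' own protocols (for example tie handling) may differ.

```python
from typing import Sequence


def preference_accuracy(predicted: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairs where the predicted preference equals the human label."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    if not human:
        raise ValueError("at least one pair is required")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(human)


accuracy = preference_accuracy(["A", "B", "B", "A"], ["A", "B", "A", "A"])
```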

## Table 1: Comparison of reward models as evaluators across multiple benchmarks

Table 1: Reward model comparison. RationalRewards (8B) achieves the highest scores across MMRB2, EditReward-Bench, and GenAI-Bench among open-source models.

## Dual-Space Optimization Results

RL with RationalRewards yields consistent improvements across both image editing and text-to-image generation. A striking finding: inference-time prompt tuning frequently matches or exceeds computationally expensive RL. On ImgEdit-Bench, prompt tuning boosts the RL-tuned Flux model from 3.84 to 4.01. This supports the hypothesis that prompt-space optimization offers an orthogonal and complementary axis to parameter-space training.

## Table 2: Text-to-image RL ablation on UniGenBench++

Table 2: Ablation of RationalRewards for text-to-image RL on UniGenBench++, comparing scalar reward (MultiReward) and generic reasoning reward (Qwen3-VL-32B).

## Table 3: Dual-space optimizer ablation on image editing tasks

Table 3: Ablation of RationalRewards as a dual-space optimizer on editing tasks. Prompt tuning (PT) and RL results for both Flux-Kontext and Qwen-Image generators.

## Qualitative Results

## Figure 6: Qualitative comparison of image editing and generation improvements

Figure 6: Qualitative improvements from RL and prompt tuning across diverse editing and generation tasks. Each row shows the source image, base output, and outputs after RL and prompt tuning stages.

## Evaluation

## Pointwise Scoring in Action

### Figure 5: Example pointwise scores across multiple generators on an editing task

Figure 5: RationalRewards evaluates each generated image across multiple quality dimensions (Text Faithfulness, Image Faithfulness, Physical/Visual Quality, Text Rendering) with numerical scores. This multi-dimensional scoring enables targeted feedback for both RL training and prompt refinement.

## Applications

## Four Applications of RationalRewards

## Figure 8: Four application scenarios of RationalRewards

Figure 8: RationalRewards serves as a versatile tool across four complementary application scenarios, from data curation to real-time critique visualization.

## Data Filtering

Automated quality control for data curation using multi-dimensional scores and explainable rationales. Low-quality training examples are identified and filtered with transparency.
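One way such filtering could look, as a sketch only: every example carries per-dimension scores, and an example is rejected when any dimension falls below a threshold. The record layout, the function name, and the threshold of 3.0 are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def split_by_quality(
    examples: List[Dict[str, object]],
    min_score: float = 3.0,
) -> Tuple[List[str], List[Tuple[str, List[str]]]]:
    """Return (kept_ids, rejected), where rejected pairs each id with its failing dimensions."""
    kept: List[str] = []
    rejected: List[Tuple[str, List[str]]] = []
    for ex in examples:
        scores: Dict[str, float] = ex["scores"]  # dimension name -> score
        failing = sorted(d for d, s in scores.items() if s < min_score)
        if failing:
            rejected.append((ex["id"], failing))
        else:
            kept.append(ex["id"])
    return kept, rejected


examples = [
    {"id": "e1", "scores": {"text_faithfulness": 4.0, "visual_quality": 4.5}},
    {"id": "e2", "scores": {"text_faithfulness": 2.0, "visual_quality": 4.0}},
]
kept_ids, rejected = split_by_quality(examples)
```

Recording the failing dimensions next to each rejected id keeps the filtering decision inspectable, which is the transparency the text refers to.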

## RL Reward Signal

Dense, semantically decomposed reward signals drive fine-grained RL optimization. Each quality dimension provides targeted gradient information instead of a single opaque scalar.

## Prompt Rewriting

The GCR loop trades test-time compute for generation quality. Applicable to any frozen generator without parameter updates or risk of catastrophic forgetting.

## Critique Visualization

Integration with Grounding DINO+SAM localizes identified defects in generated images, providing visual evidence that grounds each critique in specific image regions.

## Context

## Related Work

## Reward Models for Visual Generation

The standard paradigm relies on scalar reward models trained on large-scale human preference datasets. Models such as ImageReward, VideoReward, PickScore, and UnifiedReward output a single scalar score trained with binary or ranking losses. While effective for basic preference prediction, these models discard the reasoning behind their judgments and are vulnerable to reward hacking. Recent works like VLMRM, RM-RL, and Video-SALMONN have begun incorporating reasoning capabilities, but none provide the variational framework and dual-space optimization that RationalRewards introduces.

## Training and Test-Time Scaling in Visual Generation

Recent efforts like FlowGRPO, DanceGRPO, BLIP3o-NEXT, and DiffusionNFT have successfully integrated RL into visual generation training. On the test-time side, compute scaling has been extensively explored in language models but remains nascent for visual generation. RationalRewards bridges this gap by demonstrating that structured reasoning enables effective test-time optimization for image generators, complementing parameter-space training with a prompt-space approach.

## Conclusion

RationalRewards replaces opaque scalar scoring with structured, multi-dimensional chain-of-thought critiques. The PARROT framework makes this tractable by treating rationales as latent variables recoverable from readily available pairwise preference data.

An 8B-parameter model achieves state-of-the-art preference prediction among open-source reward models, competitive with much larger proprietary models. Multi-dimensional rewards resist reward hacking through internal consistency mechanisms that scalar models cannot provide.

Perhaps most remarkably, the Generate-Critique-Refine loop — a purely test-time intervention requiring no parameter updates — matches or exceeds RL-based fine-tuning on several benchmarks. This lends strong empirical support to the hypothesis that current generators harbor latent capabilities that suboptimal prompts fail to elicit, and that structured reasoning is the key to unlocking them.
