RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Abstract

What This Paper Is About

Most reward models for visual generation compress complex human preferences into a single opaque score, discarding the reasoning behind those preferences. RationalRewards changes this by teaching reward models to produce explicit, multi-dimensional critiques before scoring. Using a novel framework called PARROT (Preference-Anchored Rationalization), the model treats rationales as latent variables inferred from pairwise preference data. This transforms the reward model from a passive evaluator into an active optimization tool that enables two complementary strategies: RL-based fine-tuning in parameter space and Generate-Critique-Refine loops in prompt space. Remarkably, the test-time prompt tuning approach matches or exceeds RL fine-tuning on several benchmarks, without any parameter updates.

🧠

Reasoning-Based Rewards

Multi-dimensional structured critiques replace opaque scalar scores, providing explainable evaluation across text faithfulness, image faithfulness, visual quality, and text rendering.

🦜

PARROT Framework

A variational framework that trains reward models using preference data by treating rationales as latent variables, decomposing the ELBO into three interpretable phases.

🔄

Dual-Space Optimization

Enables both parameter-space tuning via RL (training time) and prompt-space tuning via Generate-Critique-Refine loops (test time), trading compute for quality without parameter updates.

Introduction

The Problem with Scalar Reward Models

As visual generation advances toward photorealistic, instruction-following outputs, reward models that evaluate these outputs have become the binding constraint on further progress. Yet most reward models remain scalar black boxes: they compress multi-dimensional human judgments into a single number. This opacity causes two critical problems.

First, reward hacking: models learn to exploit biases in the scalar signal to inflate scores without genuine quality improvement. Second, scalar scores provide no actionable feedback — they tell the generator that something is wrong, but not what or how to fix it. RationalRewards addresses both problems by generating structured, multi-dimensional critiques before deriving scores, enabling the reward model to serve as both evaluator and optimizer.

Parameter Space: Multi-dimensional structured rationales provide semantically grounded, dense feedback for reinforcement learning — replacing opaque scalar gradients prone to reward hacking.
Prompt Space: Natural-language rationales identify concrete deficiencies and translate them into targeted prompt revisions via a Generate-Critique-Refine loop — a purely test-time intervention.
State-of-the-Art: An 8B-parameter model achieves SOTA preference prediction among open-source reward models, competitive with Gemini-2.5-Pro (a far larger proprietary model).

Figure 1: Train-Time RL and Test-Time Prompt Tuning (PT) with RationalRewards on text-to-image and image-to-image generation benchmarks. (a) ImgEdit-Bench Overall: RL with RationalRewards outperforms prior open-source generators. (b) GEdit-Bench-EN: Combined RL+PT reaches 8.33. (c) UniGenBench++: Radar chart showing category-level improvements in text-to-image generation.

Key Results

Instantiated via PARROT on a Qwen3-VL-8B-Instruct backbone, RationalRewards achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro. As an RL reward, it consistently improves Qwen-Image and Flux-Kontext generators, rivaling GPT-Image-1. The test-time prompt tuning approach matches or exceeds computationally expensive RL fine-tuning on several benchmarks without any parameter updates.

Method

PARROT: Preference-Anchored Rationalization

PARROT trains reward models to produce explicit, multi-dimensional rationales before scores. Assessment dimensions include text faithfulness, physical and visual quality, image faithfulness, and text rendering quality. Since ground-truth rationales are prohibitively expensive to annotate at scale, PARROT formulates rationales as latent variables inferred from pairwise preference data via a variational objective. The resulting ELBO decomposes into three terms, each corresponding to a training phase.

What is the ELBO and why does it matter here?

The ELBO (Evidence Lower BOund) is a mathematical tool from variational inference. Think of it like this: imagine you want to understand why humans prefer one image over another, but you can't directly observe their reasoning process. The ELBO gives you a way to learn those hidden reasons (rationales) from the data you can observe (which image was preferred).

PARROT decomposes this into three intuitive terms: (1) Do the rationales actually explain the preferences? (2) How different are our inferred rationales from what we'd generate without knowing the answer? (3) Can a student model learn to produce these rationales on its own?
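For readers who want the bound written out, the textbook ELBO for a latent rationale can be sketched as follows. The symbols are assumptions introduced here, not the paper's notation: x is the comparison tuple (user request plus two images), y the preference label, z the rationale, q_φ(z | x, y) the label-conditioned teacher posterior, p_θ(z | x) the label-free student prior, and p(y | x, z) the preference predictor. PARROT is described as splitting its bound into three terms; the standard form below shows only the usual two, so treat it as orientation rather than the paper's exact decomposition.

```latex
\log p(y \mid x)
  \;\ge\;
  \mathbb{E}_{z \sim q_\phi(z \mid x, y)}\bigl[\log p(y \mid x, z)\bigr]
  \;-\;
  \mathrm{KL}\bigl(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\bigr)
```

Read against the three questions above: drawing z from q_φ is the label-aware inference step, the expectation term rewards rationales that actually predict y, and the KL term measures how far those rationales sit from what a label-free model would write.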

Latent Variables: The Key Insight

A latent variable is something you believe exists but can't directly observe. Here, the "rationale" (the reasoning behind why one image is better) is latent — humans express preferences but rarely write down their full reasoning process. PARROT's insight is to treat these unobserved rationales as latent variables and learn them from preference data, rather than requiring expensive human annotation of reasoning.

Real-world analogy: It's like learning what makes a good restaurant review by only seeing star ratings (preferences), then training a model to write the detailed review (rationale) that would lead to that rating.

Figure 4: The PARROT framework. The ELBO is decomposed into three terms, each optimized by a distinct phase. Phase 1 (Hindsight): Teacher VLM generates rationales given the preference label. Phase 2 (Consistency Check): Rationales are verified to be predictively sufficient. Phase 3 (Foresight): Student VLM learns to generate rationales without the preference label.

Phase 1: Hindsight Rationale Generation

A Teacher VLM is prompted with a comparison tuple (two images + user request) and the ground-truth preference label. This label acts as a preference anchor that steers the teacher’s analysis toward the correct judgment, yielding higher-quality posterior samples than unconditioned generation. The teacher produces structured critiques across four quality dimensions with scores and justifications.

Phase 2: Predictive Consistency Filtering

While Phase 1 produces linguistically plausible rationales, plausibility alone does not guarantee usefulness. Phase 2 enforces predictive sufficiency: the Teacher is re-queried with the rationale but without the preference label, and must recover the original preference from the rationale. Only rationales that pass this consistency check are retained, which filters out hallucinated or insufficiently informative ones.

Predictive sufficiency means the rationale alone should contain enough information to predict the original preference. If a critique says "Image A has better lighting" but you can't tell from that which image was preferred, the rationale is not sufficient — it's either hallucinated or too vague.
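The consistency check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `RationaleSample`, `filter_consistent`, and `toy_predict` are hypothetical names, images are omitted for brevity, and `toy_predict` is a keyword stub standing in for re-querying the Teacher VLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RationaleSample:
    request: str    # user instruction for the comparison
    rationale: str  # teacher critique written with the label visible (Phase 1)
    label: str      # ground-truth preference: "A" or "B"


# Hypothetical signature: (request, rationale) -> predicted preference, label hidden.
Predictor = Callable[[str, str], str]


def filter_consistent(samples: List[RationaleSample], predict: Predictor) -> List[RationaleSample]:
    """Keep only samples whose rationale alone recovers the original preference."""
    return [s for s in samples if predict(s.request, s.rationale) == s.label]


def toy_predict(request: str, rationale: str) -> str:
    """Stand-in for the label-free teacher query; not a real model call."""
    text = rationale.lower()
    if "image a is better" in text:
        return "A"
    if "image b is better" in text:
        return "B"
    return "unclear"  # vague rationales never match a label


samples = [
    RationaleSample("add a red hat", "Image A is better: the hat is red and well placed.", "A"),
    RationaleSample("add a red hat", "Image B is better: it preserves the face.", "A"),  # contradicts label
    RationaleSample("add a red hat", "Both images look nice.", "B"),  # too vague to decide
]
kept = filter_consistent(samples, toy_predict)
```

Only the first sample survives: the second rationale points to the wrong image and the third does not commit to either one.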

Phase 3: Foresight Student Learning

A smaller Student VLM (8B parameters) is trained via supervised fine-tuning on the filtered rationales to generate critiques without the preference label. This minimizes the KL divergence between posterior and prior, enabling the student to produce scoring rationales at inference time from images alone. The student is trained jointly on pairwise and pointwise data.
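Why supervised fine-tuning counts as KL minimization can be seen in one line. The notation is an assumption introduced for this sketch: x is the input (request and images), y the preference label, z a rationale, q(z | x, y) the label-conditioned teacher distribution, and p_θ(z | x) the label-free student. Only the student parameters θ are trained.

```latex
\mathrm{KL}\bigl(q(z \mid x, y)\,\|\,p_\theta(z \mid x)\bigr)
  = \underbrace{\mathbb{E}_{q}\bigl[\log q(z \mid x, y)\bigr]}_{\text{constant in } \theta}
  \;-\; \mathbb{E}_{q}\bigl[\log p_\theta(z \mid x)\bigr]
\quad\Longrightarrow\quad
\arg\min_\theta \mathrm{KL}
  = \arg\max_\theta \; \mathbb{E}_{z \sim q}\bigl[\log p_\theta(z \mid x)\bigr]
```

The right-hand side is the usual SFT log-likelihood objective evaluated on teacher rationales. In practice the expectation is taken over the rationales that survived Phase 2 filtering, so it approximates the full posterior expectation rather than matching it exactly.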

Takeaway: Why Reasoning Rewards Resist Reward Hacking

Scalar reward models compress evaluation into a single number that can be inflated by exploiting biases. Multi-dimensional structured rationales provide an internal consistency mechanism: justifications must align with scores, and scores must be consistent across dimensions. If a model tries to inflate one dimension, the rationale’s justification must explain why — making gaming detectable and penalizable. This structural transparency is what gives reasoning rewards their robustness.

What is Reward Hacking?

Reward hacking occurs when an AI model finds shortcuts to maximize its reward score without actually improving quality. For example, a scalar reward model might give higher scores to images with high color saturation, so the generator learns to oversaturate all images — inflating scores while degrading visual quality.

RationalRewards prevents this because its multi-dimensional scoring creates checks and balances: if "text faithfulness" is high but "visual quality" is low, the discrepancy is visible and penalizable.

Figure 3: RL training comparison. Top row (RationalRewards): image quality improves steadily with stable reward curves. Bottom row (Scalar Rewards): reward hacking occurs, with train reward growing while quality degrades through artifacts and color saturation issues.

Optimization

From Evaluator to Optimizer: Dual-Space Optimization

RationalRewards is not just an evaluator — it functions as an active optimization tool across two complementary spaces. This dual formulation connects to test-time compute scaling: prompt-space optimization offers an axis for improving generation quality orthogonal to parameter-space training, applicable to any frozen generator without risk of catastrophic forgetting.

Figure 2: RationalRewards enables dual-space optimization. (a) Training Time RL: multi-dimensional reward signals provide separate gradient information per quality dimension. (b) Test Time Prompt Tuning: natural-language critiques are translated into refined prompts for re-generation.

Two Axes of Improvement

Parameter-space optimization (RL) changes the model's internal weights — like retraining a chef's skills. Prompt-space optimization (GCR) rewrites the instructions given to the model — like giving that same chef a better recipe.

These are orthogonal, meaning they work on different dimensions and can be combined. A key finding: prompt refinement alone sometimes matches or beats RL, suggesting that current generators are more capable than their default prompts reveal.

Business implication: Companies can improve image generation quality at inference time without expensive retraining — just by adding a critique-and-refine step to their existing pipeline.

Parameter Space (Training Time)

Multi-dimensional scores provide semantically decomposed reward signals for reinforcement learning. Each quality dimension (text faithfulness, image faithfulness, visual quality, text rendering) provides targeted gradient information, enabling fine-grained optimization rather than chasing a single opaque scalar. This dense feedback helps the generator understand what to improve and why.
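One simple way to turn per-dimension scores into a single RL reward is a weighted mean over the dimensions that apply. The sketch below is an assumption for illustration: the text does not say how the scores are combined, and the 1-to-5 scale, the dimension keys, the equal default weights, and the `aggregate_reward` name are all hypothetical.

```python
from typing import Dict, Optional

# Dimension names follow the four listed above; the key spellings are assumptions.
DIMENSIONS = ("text_faithfulness", "image_faithfulness", "visual_quality", "text_rendering")


def aggregate_reward(
    scores: Dict[str, Optional[float]],
    weights: Optional[Dict[str, float]] = None,
    max_score: float = 5.0,  # assumed upper end of the scoring scale
) -> float:
    """Weighted mean of applicable dimension scores, divided by max_score."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total, weight_sum = 0.0, 0.0
    for dim in DIMENSIONS:
        value = scores.get(dim)
        if value is None:  # dimension not applicable, e.g. no text in the image
            continue
        total += weights[dim] * value
        weight_sum += weights[dim]
    if weight_sum == 0:
        raise ValueError("no applicable dimensions to aggregate")
    return total / (weight_sum * max_score)


reward = aggregate_reward({
    "text_faithfulness": 4.0,
    "image_faithfulness": 5.0,
    "visual_quality": 3.0,
    "text_rendering": None,  # skipped rather than counted as zero
})
```

Skipping inapplicable dimensions instead of scoring them as zero keeps images without text from being penalized. Keeping the per-dimension values around, rather than only the aggregate, is what lets a training loop log which dimension moved.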

Prompt Space (Test Time)

Natural-language rationales identify concrete deficiencies in generated images — for example, “the instruction says no umbrellas but the image contains one.” These critiques are translated into targeted prompt revisions in a Generate-Critique-Refine (GCR) loop. This purely test-time intervention requires no parameter updates and can be applied to any frozen generator, trading compute for quality.

Takeaway: Trained Reward Models vs. Generic VLM Judges

Why not just use a capable generic VLM (like Qwen3-VL-32B) as a judge? Beyond the practical advantage of a smaller 8B model, there’s a fundamental reason: structured training on preference data teaches calibrated judgment norms that generic VLMs lack. Generic models can articulate critiques but fail to calibrate severity — they often over-penalize minor issues while under-weighting critical failures. PARROT’s preference-anchored training addresses this by grounding judgments in actual human preferences.

Calibrated judgment norms means the model has learned how much different types of issues matter, not just that they matter. A generic VLM might rate a missing shadow as severely as a missing object, but a preference-trained model learns that object fidelity matters far more than subtle lighting issues.

Test-Time Scaling

Generate-Critique-Refine: Test-Time Scaling

The Generate-Critique-Refine (GCR) loop provides test-time compute scaling for image generation. After initial generation, RationalRewards critiques the output, identifying specific deficiencies with detailed reasoning. These critiques are translated into prompt revisions, and the generator re-generates. This purely test-time intervention requires no parameter updates and can be applied to any frozen generator, demonstrating that current generators harbor latent capabilities that suboptimal prompts fail to elicit.

Figure 7: A concrete GCR example. The prompt requests a romantic rain scene with “no umbrellas.” The initial generation violates this constraint. RationalRewards critiques the violation with multi-dimensional reasoning, generating a refined prompt that explicitly reinforces the constraint for re-generation.
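The loop itself is short when the generator, the reward model, and the prompt rewriter are treated as black boxes. The following is a minimal sketch under stated assumptions: `gcr_loop`, `Critique`, the stopping threshold, and the three `toy_*` stubs are hypothetical stand-ins, not RationalRewards APIs. The stubs mimic the rain-scene example, with a generator that only respects the constraint once the prompt emphasizes it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    score: float       # overall score on an assumed 1-5 scale
    issues: List[str]  # natural-language deficiencies found in the image


def gcr_loop(
    request: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    refine: Callable[[str, Critique], str],
    max_rounds: int = 3,
    target_score: float = 4.5,
) -> Tuple[Optional[str], float, int]:
    """Generate, critique against the original request, refine the prompt, repeat."""
    prompt = request
    best_image, best_score, rounds_used = None, float("-inf"), 0
    for _ in range(max_rounds):
        rounds_used += 1
        image = generate(prompt)
        result = critique(request, image)  # always judged against the user's request
        if result.score > best_score:
            best_image, best_score = image, result.score
        if result.score >= target_score or not result.issues:
            break
        prompt = refine(prompt, result)  # the generator's weights are never touched
    return best_image, best_score, rounds_used


def toy_generate(prompt: str) -> str:
    return "rain scene" if "IMPORTANT" in prompt else "rain scene with umbrella"


def toy_critique(request: str, image: str) -> Critique:
    if "no umbrellas" in request and "umbrella" in image:
        return Critique(2.0, ["the instruction says no umbrellas but the image contains one"])
    return Critique(5.0, [])


def toy_refine(prompt: str, result: Critique) -> str:
    return prompt + " IMPORTANT: fix -> " + "; ".join(result.issues)


image, score, rounds = gcr_loop(
    "A romantic rain scene, no umbrellas", toy_generate, toy_critique, toy_refine
)
```

With these stubs the first round produces the violating image, the critique names the umbrella, and the second round succeeds. The loop returns the best image seen so far, so a refinement that makes things worse cannot lower the final result.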

Takeaway: Why Reasoning Rewards Enable Test-Time Scaling

Generators often possess latent capacity for high-quality outputs that is under-elicited by suboptimal prompts. RationalRewards unlocks this capacity without weight modification through actionable critiques. Scalar rewards cannot identify what went wrong, only that something did. Structured reasoning pinpoints the specific deficiency and prescribes concrete prompt corrections — enabling effective test-time compute scaling.

Test-time compute scaling is a concept borrowed from language models (for example, chain-of-thought reasoning in LLMs). The idea is to spend more compute at inference time to get better results, rather than relying solely on a larger or better-trained model. RationalRewards extends this idea to visual generation, where it has so far been much less explored.

Results

Experiments & Results

RationalRewards is evaluated across both image editing and text-to-image generation tasks. Training data includes 30K query-preference pairs from EditReward (image editing) and 50K pairs from ImageRewardDB (text-to-image). The Teacher is Qwen3-VL-32B-Instruct and the Student backbone is Qwen3-VL-8B-Instruct.

Benchmark guide: ImgEdit-Bench measures image editing quality (how well edits follow instructions). GEdit-Bench-EN evaluates general editing across multiple dimensions. UniGenBench++ tests text-to-image generation across categories like logic, relations, and attributes. Higher is better for all three.

ImgEdit-Bench: 4.43 (Image Editing, Qwen-Image +RL+PT)
GEdit-Bench-EN: 8.33 (General Editing, Qwen-Image +RL+PT)
UniGenBench++: 82.60 (Text-to-Image, Qwen-Image [RL])

Accuracy in Preference Modeling

The 8B-parameter RationalRewards surpasses all open-source scalar reward models by a substantial margin across all three benchmarks (MMRB2, EditReward-Bench, GenAI-Bench), without requiring complex loss designs to handle label noise. It is competitive with Gemini-2.5-Pro, a much larger proprietary model. Ablation shows that PARROT’s variational framework outperforms direct SFT distillation from the same 32B teacher, confirming that the structured training pipeline — not just model capacity — drives the improvement.

Table 1: Reward model comparison. RationalRewards (8B) achieves the highest scores across MMRB2, EditReward-Bench, and GenAI-Bench among open-source models.

Dual-Space Optimization Results

RL with RationalRewards yields consistent improvements across both image editing and text-to-image generation. A striking finding is that inference-time prompt tuning frequently matches or exceeds computationally expensive RL, and that the two stack: on ImgEdit-Bench, applying prompt tuning on top of the RL-tuned Flux-Kontext model raises its score from 3.84 to 4.01. This supports the hypothesis that prompt-space optimization offers an orthogonal and complementary axis to parameter-space training.

Table 2: Ablation of RationalRewards for text-to-image RL on UniGenBench++, comparing scalar reward (MultiReward) and generic reasoning reward (Qwen3-VL-32B).

Table 3: Ablation of RationalRewards as a dual-space optimizer on editing tasks. Prompt tuning (PT) and RL results for both Flux-Kontext and Qwen-Image generators.

A Surprising Finding: Prompt Tuning Rivals RL

One of the paper's most striking results is that test-time prompt tuning frequently matches or exceeds expensive RL training. This is counterintuitive — you'd expect changing model weights (RL) to be more powerful than just rewriting prompts.

The explanation: RL with LoRA has limited update capacity and may not cover the full evaluation distribution, whereas prompt tuning optimizes each instance individually without risking catastrophic forgetting. The result also suggests that current generators can already produce strong outputs when given better instructions.

Practical takeaway: Before investing in expensive RL retraining, try improving your prompts with structured critique feedback. You might get similar results at a fraction of the cost.

Qualitative Results

Figure 6: Qualitative improvements from RL and prompt tuning across diverse editing and generation tasks. Each row shows the source image, base output, and outputs after RL and prompt tuning stages.

Evaluation

Pointwise Scoring in Action

Figure 5: Example pointwise scores across multiple generators on an editing task. RationalRewards evaluates each generated image across multiple quality dimensions (Text Faithfulness, Image Faithfulness, Physical/Visual Quality, Text Rendering) with numerical scores. This multi-dimensional scoring enables targeted feedback for both RL training and prompt refinement.

Applications

Four Applications of RationalRewards

🔍

Data Filtering

Automated quality control for data curation using multi-dimensional scores and explainable rationales. Low-quality training examples are identified and filtered with transparency.

🎯

RL Reward Signal

Dense, semantically decomposed reward signals drive fine-grained RL optimization. Each quality dimension provides targeted gradient information instead of a single opaque scalar.

✏️

Prompt Rewriting

The GCR loop trades test-time compute for generation quality. Applicable to any frozen generator without parameter updates or risk of catastrophic forgetting.

👁

Critique Visualization

Integration with Grounding DINO+SAM localizes identified defects in generated images, providing visual evidence that grounds each critique in specific image regions.
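The data-filtering use case listed above reduces to a threshold check once each example carries per-dimension scores. This is a hedged sketch, not the paper's pipeline: the record layout, the `split_by_quality` name, and the 3.0 cutoff on an assumed 1-to-5 scale are all hypothetical.

```python
from typing import Dict, List, Tuple

Example = Dict[str, object]  # assumed layout: {"id": str, "scores": {dimension: float}}


def split_by_quality(
    examples: List[Example], min_score: float = 3.0
) -> Tuple[List[str], List[Tuple[str, List[str]]]]:
    """Keep examples whose every dimension meets min_score; report why others fail."""
    kept: List[str] = []
    rejected: List[Tuple[str, List[str]]] = []
    for ex in examples:
        scores: Dict[str, float] = ex["scores"]  # type: ignore[assignment]
        failing = [dim for dim, value in scores.items() if value < min_score]
        if failing:
            rejected.append((str(ex["id"]), failing))  # failing dimensions act as the audit trail
        else:
            kept.append(str(ex["id"]))
    return kept, rejected


examples = [
    {"id": "ex1", "scores": {"text_faithfulness": 4.5, "visual_quality": 4.0}},
    {"id": "ex2", "scores": {"text_faithfulness": 2.0, "visual_quality": 4.5}},
    {"id": "ex3", "scores": {"text_faithfulness": 3.0, "visual_quality": 1.5}},
]
kept, rejected = split_by_quality(examples)
```

Recording which dimensions failed, rather than only dropping the example, is what makes the filter explainable: a curator can see that `ex2` was removed for text faithfulness and `ex3` for visual quality.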

Context

Related Work

Reward Models for Visual Generation

The standard paradigm relies on scalar reward models trained on large-scale human preference datasets. Models such as ImageReward, VideoReward, PickScore, and UnifiedReward output a single scalar score trained with binary or ranking losses. While effective for basic preference prediction, these models discard the reasoning behind their judgments and are vulnerable to reward hacking. Recent works like VLMRM, RM-R1, and Video-SALMONN have begun incorporating reasoning capabilities, but none provide the variational framework and dual-space optimization that RationalRewards introduces.

Training and Test-Time Scaling in Visual Generation

Recent efforts like FlowGRPO, DanceGRPO, BLIP3o-NEXT, and DiffusionNFT have successfully integrated RL into visual generation training. On the test-time side, compute scaling has been extensively explored in language models but remains nascent for visual generation. RationalRewards bridges this gap by demonstrating that structured reasoning enables effective test-time optimization for image generators, complementing parameter-space training with a prompt-space approach.

Keywords

Reward Models, Visual Generation, Reinforcement Learning, Reasoning, PARROT, Test-Time Scaling, Prompt Tuning, Multi-Dimensional Evaluation, Image Editing, Text-to-Image

References (23 key citations)

Bai, S. et al. Qwen2.5-VL Technical Report. arXiv:2502.13923, 2025.
Chen, J. et al. BLIP3o-NEXT: Next Frontier of Native Image Generation. arXiv:2510.15857, 2025.
Chen, X. et al. RM-R1: Reward Modeling as Reasoning. arXiv:2505.02387, 2026.
Comanici, G. et al. Gemini 2.5: Pushing the Frontier with Advanced Reasoning. arXiv:2503.21218, 2025.
Deng, C. et al. Emerging Properties in Unified Multimodal Pretraining. arXiv:2505.14683, 2025.
Esser, P. et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICML, 2024.
Google DeepMind. Imagen 3. Technical Report, 2025.
Guo, D. et al. DeepSeek-R1: Incentivizing Reasoning Capability. arXiv:2501.12948, 2025.
Hu, Y. et al. Multimodal Reward Bench 2. arXiv, 2025.
Jiang, Z. et al. GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation. CVPR, 2024.
Kirstain, Y. et al. Pick-a-Pic: An Open Dataset of User Preferences for T2I Generation. NeurIPS, 2023.
Liu, Z. et al. FlowGRPO: Guided Reward Policy Optimization on Flow Matching Models. arXiv, 2025a.
Liu, Z. et al. VideoReward: A Reward Model for Video Generation. arXiv, 2025b.
Mahan, F. et al. Generative Verifiers for Self-Improving Rewards. arXiv, 2024.
OpenAI. GPT-Image-1. Technical Report, 2025.
Snell, C. et al. Scaling LLM Test-Time Compute Optimally. arXiv, 2024.
Wang, H. et al. Reward Models for Visual Generation (various). arXiv, 2025d.
Wang, H. et al. Prompt Enhancement (various). arXiv, 2025g.
Wu, X. et al. EditReward: Preference-Based Reward for Image Editing. arXiv, 2025e.
Xu, J. et al. ImageReward: Learning and Evaluating Human Preferences for T2I. NeurIPS, 2023.
Xue, Y. et al. DanceGRPO: Group-Relative Policy Optimization for Dance Generation. arXiv, 2025.
Zelikman, E. et al. STaR: Self-Taught Reasoner. NeurIPS, 2022.
Zheng, K. et al. DiffusionNFT: Aligning Diffusion with Human Feedback via RL. arXiv, 2025.