---
arxiv_id: 2603.25502
title: "RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models"
authors:
  - Yufeng Yang
  - Xianfang Zeng
  - Zhangqi Jiang
  - Fukun Yin
  - Jianzhuang Liu
  - Wei Cheng
  - Jinghong Lan
  - Shiyu Liu
  - Yuqi Peng
  - Gang Yu
  - Shifeng Chen
difficulty: Intermediate
tags:
  - Vision
  - Diffusion
published_at: 2026-03-26
flecto_url: https://flecto.zer0ai.dev/papers/2603.25502/
lang: en
---

## Abstract

> Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate that our model ranks first among open-source methods, achieving state-of-the-art performance.

#### Why is real-world image restoration hard?

Most restoration models are trained on synthetic degradations — e.g., applying Gaussian blur to a clean photo — then evaluated on the same synthetic test set. Real-world degradations (camera shake, compression from social media, outdoor haze) follow distributions that synthetic pipelines can't fully replicate. The synthetic-to-real domain gap means a model that achieves excellent PSNR on benchmarks can still fail visibly on authentic photographs.
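To make the synthetic side of this gap concrete, here is a minimal toy degradation pipeline in NumPy (box blur plus additive Gaussian noise). It is a sketch of the general pattern, far simpler than the paper's nine-type pipeline, and all parameters are illustrative:

```python
import numpy as np

def degrade(img: np.ndarray, sigma_noise: float = 0.05, blur_ksize: int = 5) -> np.ndarray:
    """Toy synthetic degradation: box blur followed by Gaussian noise.

    Real pipelines are far richer, but the structure is the same:
    clean image in, degraded image out, yielding (degraded, clean) pairs.
    """
    # Separable box blur: convolve each column, then each row.
    k = np.ones(blur_ksize) / blur_ksize
    blurred = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 0, img)
    blurred = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 1, blurred)
    # Additive Gaussian noise, clipped back to [0, 1].
    noisy = blurred + np.random.default_rng(0).normal(0.0, sigma_noise, img.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = np.random.default_rng(1).random((32, 32))  # stand-in "clean photo"
pair = (degrade(clean), clean)                     # one (degraded, clean) training sample
```

Real camera shake, social-media recompression, and haze do not follow such clean parametric recipes, which is exactly the domain gap the paper targets.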

## Key Contributions

### RealRestorer Model

An open-source real-world image restoration model that sets a new state of the art, achieving performance highly comparable to closed-source systems. Fine-tuned from Step1X-Edit across nine degradation tasks.

### Large-Scale Data Pipeline

A high-quality degradation synthesis pipeline covering 9 degradation types with 1.65M+ paired training samples, combining synthetic and real-world degradation data with granular noise modeling and segment-aware perturbations.

### RealIR-Bench

A new benchmark with 464 real-world degraded images spanning 9 degradation categories, with tailored non-reference evaluation metrics that measure both degradation removal capability and content consistency preservation.

## Method: RealRestorer

### Architecture & Training Strategy

RealRestorer fine-tunes Step1X-Edit, a practical general image editing framework built on a Diffusion Transformer (DiT) backbone. The model uses a QwenVL text encoder to inject high-level semantic information into the denoising pathway, with a dual-stream design that processes semantic information alongside the noise and the conditional input image. Reference and output images are both encoded via Flux-VAE.

#### What is a DiT (Diffusion Transformer) backbone?

Traditional diffusion models use a U-Net for the denoising network. DiT replaces it with a Vision Transformer — the image is split into patches, positional embeddings are added, and a stack of transformer self-attention blocks refines the noisy patches. Transformers scale much better with model size and allow better integration of multimodal conditioning (text, reference image), which is why recent frontier image models (FLUX, SD3, Step1X) all use DiT backbones.
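The patch-sequence front end described above can be sketched in a few lines of NumPy. This is a minimal illustration of patchification only; a real DiT adds a learned linear projection and learned positional embeddings:

```python
import numpy as np

def patchify(img: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened p x p patches,
    as in a ViT/DiT front end (minimal version, no learned projection)."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)

img = np.zeros((64, 64, 3))
tokens = patchify(img, 16)            # 16 patch tokens, each of dimension 768
pos = np.arange(tokens.shape[0])[:, None]  # stand-in for positional embeddings
```

Each row of `tokens` then plays the role of one transformer token, refined jointly with the text and reference-image conditioning.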

Training proceeds in two stages:

- Transfer Training Stage — Uses 1.5M synthetic paired samples to transfer high-level knowledge and priors from image editing to image restoration. Learning rate is kept constant at 1×10⁻⁵ with a global batch size of 16. Resolution fixed at 1024×1024.

- Supervised Fine-tuning Stage — Incorporates 80K real-world degradation data pairs to further enhance restoration fidelity. Uses a cosine annealing learning rate schedule, a Progressively-Mixed training strategy (2:8 real/synthetic ratio), and freezes the first ¼ of the SingleStreamBlocks for stability.

#### Why two stages? Transfer learning for restoration

- Stage 1 (1.5M synthetic pairs): Teaches the model what "degraded → clean" looks like across all 9 degradation types. The editing model's generative priors are repurposed: instead of "change the style of this image," the model learns "remove this specific degradation."

- Stage 2 (80K real pairs): Fine-tunes on authentic real-world degraded photos. Real degradations are messier and often multi-modal (e.g., dark + blurry), so this stage teaches robustness that synthetic data alone cannot provide.

- Why not just Stage 2? 80K samples would cause severe overfitting without the broad prior from Stage 1. The ablation confirms this: real-only training causes object deformation and unrealistic enhancements.

All experiments are conducted on 8 NVIDIA H800 GPUs. The entire training process takes approximately one day.

Degradation types: Rain · Blur · Low-light · Haze · Reflection · Flare · Moiré · Noise · Compression

### Two-Stage Training Analysis

The chart shows Final Score (FS) performance on RealIR-Bench across training steps for both stages. In the Transfer Training Stage (blue), the model rapidly acquires basic restoration capability, peaking at FS ≈ 0.122 around 2,000 steps, then declining due to limited synthetic data diversity.

The Supervised Fine-tuning Stage with real-world data (purple) quickly surpasses the transfer-training peak and continues improving, reaching FS ≈ 0.145 at ~2,500 steps. Beyond this point, overfitting on the real-world data motivates early stopping.

The Progressively-Mixed training strategy (combining real and synthetic data at a 2:8 ratio) prevents overfitting while preserving cross-task robustness. Ablation confirms that removing this strategy reduces FS by 0.004 points.

#### What is "Progressively-Mixed" training?

Rather than training on only real data in Stage 2, the model sees a 2:8 mix of real:synthetic per batch. This prevents forgetting the general-purpose restoration capability learned in Stage 1 (a phenomenon called catastrophic forgetting). The 2:8 ratio was chosen via ablation — more synthetic data preserves generalization but slows real-world adaptation; less risks overfitting.
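A minimal sketch of such batch mixing follows. The pool contents, batch size, and sampling helper are illustrative; only the 2:8 real:synthetic ratio comes from the paper:

```python
import random

def mixed_batch(real_pool, synth_pool, batch_size=16, real_frac=0.2, rng=None):
    """Sample one Stage-2 batch with a fixed real:synthetic mix.

    real_frac=0.2 encodes the paper's 2:8 real:synthetic ratio;
    everything else (pools, batch size) is a toy stand-in.
    """
    rng = rng or random.Random(0)
    n_real = round(batch_size * real_frac)
    batch = (rng.choices(real_pool, k=n_real)
             + rng.choices(synth_pool, k=batch_size - n_real))
    rng.shuffle(batch)  # avoid a fixed real/synthetic ordering within the batch
    return batch

batch = mixed_batch([("real", i) for i in range(100)],
                    [("synth", i) for i in range(100)])
```

Keeping synthetic samples in every batch is what anchors the Stage-1 prior while the real samples drive adaptation.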

## RealIR-Bench: Evaluation Benchmark

Traditional image restoration benchmarks primarily focus on single-degradation tasks with synthetic corruptions, which are insufficient for evaluating model performance in real-world applications. To properly evaluate restoration under real-world degradations, we construct RealIR-Bench:

- 464 non-reference degraded images curated entirely from internet sources — not synthesized

- Spans 9 degradation categories: blur, rain, noise, low-light, moiré patterns, haze, compression artifacts, reflection, and flare

- Manual filtering ensures both quality control and diversity across scene content and severity levels

### Evaluation Metrics

Two complementary metrics characterize both restoration effectiveness and content fidelity:

- Restoration Score (RS ↑) — VLM-based (Qwen3-VL-8B-Instruct) degradation severity assessment on a 0–5 scale. Computed as the improvement in degradation level after restoration.

- LPIPS (LPS ↓) — Perceptual similarity metric measuring content consistency between degraded input and restored output.

FS = 0.2 × (1 − LPS) × RS

FS jointly reflects restoration improvement and content preservation. Poor performance in either aspect leads to a lower overall score.

Unlike PSNR/SSIM, RealIR-Bench evaluates on authentic real-world images without clean reference pairs, enabling a more practical and comprehensive assessment of restoration models.

#### Why PSNR and SSIM fall short for real-world restoration

PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity) both require a clean reference image to compare against. For real-world degraded photos, no such reference exists. Additionally, high PSNR can be achieved by simply returning the input unchanged — a model that does nothing scores well on PSNR if the degradation is subtle. RealIR-Bench's VLM-based Restoration Score directly judges whether degradation was actually removed, and LPIPS checks that content wasn't hallucinated or altered.

## Experimental Results

#### Understanding the Final Score formula: FS = 0.2 × (1 − LPS) × RS

LPS is LPIPS (0 = identical to input, 1 = completely different). (1 − LPS) rewards content preservation. RS is the VLM-rated degradation removal score on a 0–5 scale. The product means both must be good: a model that removes all degradation but warps the image is penalized by a low (1 − LPS); a model that preserves content perfectly but does nothing is penalized by a low RS. The 0.2 scalar normalizes the score into a convenient range around 0.1–0.15 for typical models.
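The formula is simple enough to sanity-check directly. The RS/LPS values below are illustrative, chosen to land in the regimes the text describes:

```python
def final_score(rs: float, lps: float) -> float:
    """RealIR-Bench's Final Score: FS = 0.2 * (1 - LPS) * RS."""
    return 0.2 * (1.0 - lps) * rs

# A model that returns the input unchanged: perfect consistency, zero removal.
identity = final_score(rs=0.0, lps=0.0)   # -> 0.0
# A model that removes degradation but rewrites the content.
warped = final_score(rs=4.0, lps=0.9)     # 0.2 * 0.1 * 4 = 0.08
# Illustrative values in the regime of the reported top scores (~0.146).
typical = final_score(rs=1.0, lps=0.27)   # 0.2 * 0.73 * 1 = 0.146
```

The multiplicative form, rather than a weighted sum, is what makes it impossible to game the score by excelling at only one axis.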

RealRestorer ranks #1 among all open-source models on RealIR-Bench (9-task average FS = 0.146), narrowing the gap with Nano Banana Pro (FS = 0.153) to just 0.007 points — performance comparable to leading closed-source commercial systems.

### Qualitative Comparison on RealIR-Bench

### Table 1: Quantitative Results — Rain, Deblurring, Low-light, Haze, Reflection

### Table 2: Quantitative Results — Deflare, Moiré, Denoise, Compression + 9-Task Average

### Table 3: Zero-Shot Generalization on FoundIR Dataset

## More Qualitative Results

### Training Data: Synthetic and Real-World Pairs

### Restoration Results: Blur · Haze · Flare · Low-light · Compression

### Restoration Results: Rain · Noise · Moiré · Reflection + Zero-Shot Tasks

## Ablation Study

To examine the contribution of the proposed two-stage training strategy, the authors train models using only synthetic degradation data, only real-world degradation data, and the full proposed strategy.

### Key Findings

- Transfer Training only peaks at FS = 0.122 but then degrades due to the limited diversity of synthetic data distributions — highlighting that synthetic data alone is insufficient.

- Real-World Fine-tuning only tends to overfit and harm structural consistency, causing object deformation, body shifting, and unrealistic enhancement effects.

- Two-stage Progressively-Mixed strategy effectively balances restoration capability and content consistency; removing this component reduces FS by 0.004 points (confirmed by ablation).

- A user study with 32 participants rating 3,200 groups shows Nano Banana Pro at a 32.02% first-ranking rate vs. RealRestorer at 21.54%, while the proposed FS metric achieves moderate statistical alignment with human judgments (p < 0.01).

The two-stage Progressively-Mixed strategy is the key to balancing restoration capability and content consistency, leading to more visually stable and coherent restoration results.

## Conclusion

We introduce RealRestorer, a robust open-source model for complex real-world image restoration, adapted from a large-scale image editing framework. To reduce the synthetic-to-real domain gap, we propose a comprehensive data generation pipeline and a two-stage Progressively-Mixed training strategy that combines synthetic and real degraded-to-clean pairs.

We further present RealIR-Bench, a non-reference benchmark with authentic degraded images and a VLM-based evaluation framework for real-world restoration. Extensive experiments demonstrate that RealRestorer achieves open-source state-of-the-art performance across nine restoration tasks, with results highly comparable to leading closed-source commercial systems, and exhibits strong zero-shot generalization to unseen degradations.

We will release our model, data synthesis pipeline, and benchmark to support future research in real-world image restoration.

### Limitations

The base model relies on a 28-step denoising process, making it computationally more expensive than smaller specialized networks. In cases with strong semantic ambiguity (e.g., mirror selfies), the model may fail to distinguish true scene content from undesired reflections. The model also struggles with extremely severe degradations where reliable pixel evidence is largely missing.

#### Why 28 denoising steps and why is that expensive?

Diffusion models generate images by iteratively denoising from random noise. Each step requires a full forward pass through the DiT backbone (billions of parameters for FLUX-scale models). 28 steps × one full forward pass ≈ 28× the cost of a single-step model. Smaller specialized restorers (like NAFNet or Restormer) solve a direct regression in one pass. The trade-off: diffusion models produce more realistic textures and generalize better, but run 10–50× slower than single-step networks.
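The cost argument reduces to counting forward passes. A skeleton of the sampling loop (the `model` stand-in is hypothetical; only the 28-step count comes from the text):

```python
import numpy as np

def denoise(x_t, model, steps=28):
    """Skeleton of iterative diffusion sampling: each of the `steps`
    iterations is one full forward pass of `model`, which dominates cost."""
    for t in reversed(range(steps)):
        x_t = model(x_t, t)  # one full backbone forward pass per step
    return x_t

calls = []
# Toy stand-in "model" that just records each call and returns its input.
out = denoise(np.zeros(4), lambda x, t: calls.append(t) or x)
```

A single-pass regressor like NAFNet corresponds to `steps=1` here, which is the entire source of the per-image speed gap (distillation to fewer steps is the usual mitigation, though the paper does not claim to apply it).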
