arXiv:2603.25502v1 [cs.CV]  ·  26 Mar 2026

RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models

Yufeng Yang1,2   Xianfang Zeng2,†   Zhangqi Jiang2   Fukun Yin2   Jianzhuang Liu3   Wei Cheng2

Jinghong Lan2   Shiyu Liu2   Yuqi Peng3   Gang Yu2,‡   Shifeng Chen3,4,‡

1Southern University of Science and Technology   2StepFun   3Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences   4Shenzhen University of Advanced Technology

† Project lead; ‡ Corresponding authors.

Abstract

Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization in restoration tasks; closed-source models such as Nano Banana Pro, in particular, can remove degradations while preserving content consistency. Nevertheless, matching this performance with large universal models requires substantial data and computational cost. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate that our model achieves state-of-the-art performance among open-source methods.

Why is real-world image restoration hard?

Most restoration models are trained on synthetic degradations — e.g., applying Gaussian blur to a clean photo — then evaluated on the same synthetic test set. Real-world degradations (camera shake, compression from social media, outdoor haze) follow distributions that synthetic pipelines can't fully replicate. The synthetic-to-real domain gap means a model that achieves excellent PSNR on benchmarks can still fail visibly on authentic photographs.
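To make the synthetic side of this gap concrete, a paired training sample is manufactured by corrupting a clean image. The function below is a minimal numpy sketch, not the paper's pipeline: additive Gaussian noise and coarse quantization stand in for the full nine-type degradation taxonomy.

```python
import numpy as np

def synthesize_pair(clean: np.ndarray, noise_sigma: float = 10.0,
                    quant_step: int = 16, seed: int = 0) -> np.ndarray:
    """Return a degraded copy of `clean` for paired training.

    Two toy corruptions stand in for a real synthesis pipeline:
    additive Gaussian noise (sensor noise) and coarse quantization
    (a crude proxy for compression banding).
    """
    rng = np.random.default_rng(seed)
    img = clean.astype(np.float32)
    img += rng.normal(0.0, noise_sigma, size=img.shape)   # sensor-style noise
    img = np.round(img / quant_step) * quant_step         # banding artifacts
    return np.clip(img, 0.0, 255.0).astype(np.uint8)

clean = np.full((8, 8), 128, dtype=np.uint8)   # flat gray test patch
degraded = synthesize_pair(clean)              # (degraded, clean) = one pair
```

The point of the domain gap is that real degradations are never this tidy: camera shake, recompression chains, and haze compound in ways this kind of recipe cannot fully replicate.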

Key Contributions

RealRestorer Model

An open-source real-world image restoration model that sets a new open-source state of the art, with performance comparable to closed-source systems. Fine-tuned from Step1X-Edit across nine degradation tasks.

Large-Scale Data Pipeline

A high-quality degradation synthesis pipeline covering 9 degradation types with 1.65M+ paired training samples, combining synthetic and real-world degradation data with granular noise modeling and segment-aware perturbations.

RealIR-Bench

A new benchmark with 464 real-world degraded images spanning 9 degradation categories, with tailored non-reference evaluation metrics that measure both degradation removal capability and content consistency preservation.

Method: RealRestorer

Architecture & Training Strategy

RealRestorer fine-tunes Step1X-Edit, a practical general image editing framework built on a Diffusion Transformer (DiT) backbone. The model uses a QwenVL text encoder to inject high-level semantic information into the denoising pathway, with a dual-stream design that processes semantic information alongside the noise and the conditional input image. Reference and output images are both encoded via Flux-VAE.

What is a DiT (Diffusion Transformer) backbone?

Traditional diffusion models use a U-Net for the denoising network. DiT replaces it with a Vision Transformer — the image is split into patches, positional embeddings are added, and a stack of transformer self-attention blocks refines the noisy patches. Transformers scale much better with model size and allow better integration of multimodal conditioning (text, reference image), which is why recent frontier image models (FLUX, SD3, Step1X) all use DiT backbones.
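The patch-token view can be sketched in a few lines. The numpy toy below is illustrative only, not the Step1X-Edit architecture: it uses identity Q/K/V projections, a single head, and omits positional embeddings and the learned MLP.

```python
import numpy as np

def patchify(img: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into (N, p*p*C) patch tokens, DiT-style."""
    H, W, C = img.shape
    grid = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return grid.reshape(-1, p * p * C)

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head attention with identity projections (toy only):
    every patch token aggregates information from every other token."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

img = np.random.default_rng(0).random((16, 16, 3)).astype(np.float32)
tokens = patchify(img, p=4)        # 4x4 grid -> 16 tokens of dim 48
mixed = self_attention(tokens)     # same shape; global receptive field
```

The global receptive field per block is what makes it easy to fold extra token streams (text conditioning, a reference image) into the same attention, which U-Nets handle less naturally.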

Training proceeds in two stages:

  1. Transfer Training Stage — Uses 1.5M synthetic paired samples to transfer high-level knowledge and priors from image editing to image restoration. Learning rate is held constant at 1×10⁻⁵ with a global batch size of 16; resolution is fixed at 1024×1024.
  2. Supervised Fine-tuning Stage — Incorporates 80K real-world degradation data pairs to further enhance restoration fidelity. Uses a cosine annealing learning rate schedule, Progressively-Mixed training strategy (2:8 real/synthetic ratio), and freezes the first ¼ of SingleStreamBlocks for stability.
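The two learning-rate regimes above can be sketched as one function. The Stage-2 annealing horizon (`total_sft_steps=2500`) is our assumption for illustration, matched to the ~2,500-step sweet spot reported later, not a value the paper specifies.

```python
import math

def lr_at(step: int, stage: int, base_lr: float = 1e-5,
          total_sft_steps: int = 2500) -> float:
    """Learning rate per the two-stage recipe: Stage 1 holds a constant
    1e-5; Stage 2 cosine-anneals from base_lr toward zero. The annealing
    horizon here is an assumed value, for illustration only."""
    if stage == 1:
        return base_lr                                    # constant LR
    t = min(step, total_sft_steps) / total_sft_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))  # cosine decay

stage1_lr = lr_at(step=1000, stage=1)   # constant 1e-5 throughout
sft_start = lr_at(step=0, stage=2)      # SFT starts at base_lr
sft_end = lr_at(step=2500, stage=2)     # annealed to ~0
```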

Why two stages? Transfer learning for restoration

  • Stage 1 (1.5M synthetic pairs): Teaches the model what "degraded → clean" looks like across all 9 degradation types. The editing model's generative priors are repurposed: instead of "change the style of this image," the model learns "remove this specific degradation."
  • Stage 2 (80K real pairs): Fine-tunes on authentic real-world degraded photos. Real degradations are messier and often multi-modal (e.g., dark + blurry), so this stage teaches robustness that synthetic data alone cannot provide.
  • Why not just Stage 2? 80K samples would cause severe overfitting without the broad prior from Stage 1. The ablation confirms this: real-only training causes object deformation and unrealistic enhancements.

All experiments are conducted on 8 NVIDIA H800 GPUs. The entire training process takes approximately one day.

Overview of the large-scale Synthetic Degradation Data pipeline showing 9 degradation types (Blur, Compression, Moiré, Low-light, Noise, Flare, Reflection, Haze, Rain) with their synthesis flow from clean images to degraded images using various tools including VLMs Filter, UniDemoire, Retinexformer, Real-ESRGAN, SAM3, SynNet, and Intel Labs depth estimation.
Figure 2. Overview of the large-scale Synthetic Degradation Data pipeline. Nine representative degradation types are covered. Compared with previous synthetic-only pipelines, the framework incorporates granular noise modeling, segment-aware perturbations, and web-style degradation processes.
Training data at a glance: 1.65M+ total pairs (synthetic plus 80K real-world) across 9 degradation types.

Degradation types: Rain · Blur · Low-light · Haze · Reflection · Flare · Moiré · Noise · Compression

Two-Stage Training Analysis

The chart shows Final Score (FS) performance on RealIR-Bench across training steps for both stages. In the Transfer Training Stage (blue), the model rapidly acquires basic restoration capability, peaking at FS ≈ 0.122 around 2,000 steps, then declining due to limited synthetic data diversity.

The Supervised Fine-tuning Stage with real-world data (purple) quickly surpasses the transfer training peak and continues improving, reaching FS ≈ 0.145 at ~2,500 steps. Beyond this point, overfitting on the real-world data motivates early stopping.

The Progressively-Mixed training strategy (combining synthetic and real data at 2:8 ratio) prevents overfitting while preserving cross-task robustness. Ablation confirms removing this strategy reduces FS by 0.004 points.

What is "Progressively-Mixed" training?

Rather than training on only real data in Stage 2, the model sees a 2:8 mix of real:synthetic per batch. This prevents forgetting the general-purpose restoration capability learned in Stage 1 (a phenomenon called catastrophic forgetting). The ratio 2:8 was chosen via ablation — more synthetic keeps generalization but slows real-world adaptation; less synthetic risks overfitting.
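A per-batch mixer of this kind can be sketched as below. This is a hypothetical sampler, not the paper's implementation; `batch_size=10` is chosen only so the 2:8 ratio divides evenly (the paper reports a global batch of 16 for Stage 1 only).

```python
import random

def mixed_batch(real_pool, synth_pool, batch_size=10, real_ratio=0.2, seed=0):
    """One Stage-2 batch with a 2:8 real:synthetic mix.

    Keeping 80% synthetic samples in every batch guards against
    catastrophic forgetting of the Stage-1 restoration prior.
    """
    rng = random.Random(seed)
    n_real = round(batch_size * real_ratio)       # 2 real samples of 10
    batch = (rng.sample(real_pool, n_real)
             + rng.sample(synth_pool, batch_size - n_real))
    rng.shuffle(batch)                            # interleave the two sources
    return batch

real = [("real", i) for i in range(80)]        # stand-ins for 80K real pairs
synth = [("synth", i) for i in range(1500)]    # stand-ins for 1.5M synthetic
batch = mixed_batch(real, synth)               # 2 real + 8 synthetic, shuffled
```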

Line chart showing Final Score (FS) performance on RealIR-Bench across training steps. Blue line: Synthetic Data Transfer Training peaks at FS 0.122 then declines. Purple line: Real-World Degradation Data SFT rises to FS 0.145.
Figure 4. Model performance (FS) with varying training steps. Blue: Transfer Training on synthetic data. Purple: Supervised Fine-tuning with real-world data. Dashed segments indicate overfitting onset.

RealIR-Bench: Evaluation Benchmark

Traditional image restoration benchmarks primarily focus on single-degradation tasks with synthetic corruptions, which are insufficient for evaluating model performance in real-world applications. To properly evaluate restoration under real-world degradations, we construct RealIR-Bench:

  • 464 non-reference degraded images curated entirely from internet sources — not synthesized
  • Spans 9 degradation categories: blur, rain, noise, low-light, moiré patterns, haze, compression artifacts, reflection, and flare
  • Manual filtering ensures both quality control and diversity across scene content and severity levels

Evaluation Metrics

Two complementary metrics characterize both restoration effectiveness and content fidelity:

  • Restoration Score (RS ↑) — VLM-based (Qwen3-VL-8B-Instruct) degradation severity assessment on a 0–5 scale. Computed as the improvement in degradation level after restoration.
  • LPIPS (LPS ↓) — Perceptual similarity metric measuring content consistency between degraded input and restored output.

FS = 0.2 × (1 − LPS) × RS

FS jointly reflects restoration improvement and content preservation. Poor performance in either aspect leads to a lower overall score.
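The formula above is a one-liner; the sketch below implements it directly, with range checks and the two failure modes it is designed to punish.

```python
def final_score(lps: float, rs: float) -> float:
    """RealIR-Bench Final Score: FS = 0.2 * (1 - LPS) * RS.

    lps: LPIPS distance between input and output (0 = unchanged).
    rs:  VLM-rated degradation-removal improvement on a 0-5 scale.
    """
    if not (0.0 <= lps <= 1.0 and 0.0 <= rs <= 5.0):
        raise ValueError("LPS must lie in [0, 1] and RS in [0, 5]")
    return 0.2 * (1.0 - lps) * rs

# Both failure modes collapse the score:
identity_model = final_score(lps=0.0, rs=0.0)  # preserves content, removes nothing
hallucinator = final_score(lps=0.95, rs=5.0)   # removes everything, warps content
good_model = final_score(lps=0.1, rs=0.8)      # ≈ 0.144, near the reported SOTA range
```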

Unlike PSNR/SSIM, RealIR-Bench evaluates on authentic real-world images without clean reference pairs, enabling a more practical and comprehensive assessment of restoration models.

Why PSNR and SSIM fall short for real-world restoration

PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity) both require a clean reference image to compare against. For real-world degraded photos, no such reference exists. Additionally, high PSNR can be achieved by simply returning the input unchanged — a model that does nothing scores well on PSNR if the degradation is subtle. RealIR-Bench's VLM-based Restoration Score directly judges whether degradation was actually removed, and LPIPS checks that content wasn't hallucinated or altered.
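The two objections above are easy to verify numerically. The sketch below is a textbook PSNR implementation (not tied to any particular library): it cannot even be evaluated without a reference, and a do-nothing model still earns a high score when the degradation is subtle.

```python
import numpy as np

def psnr(ref: np.ndarray, img: np.ndarray, peak: float = 255.0) -> float:
    """PSNR requires a clean reference `ref` -- exactly what authentic
    real-world degraded photos lack."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

clean = np.full((4, 4), 100.0)
subtly_degraded = clean + 2.0          # faint uniform brightness shift
# A model that returns its input unchanged is still scored against the
# reference -- and this subtle degradation already yields ~42 dB:
do_nothing_score = psnr(clean, subtly_degraded)
```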

Examples from RealIR-Bench showing degraded images from 9 categories (Blur, Compression, Moiré, Low-light, Noise, Flare, Haze, Rain, Reflection) with their bilingual English and Chinese fixed evaluation prompts
Figure 8. Examples from RealIR-Bench. Each degradation category is evaluated using a fixed bilingual (English/Chinese) prompt. The benchmark covers diverse real-world scenes across all 9 degradation types.

Experimental Results

Understanding the Final Score formula: FS = 0.2 × (1 − LPS) × RS

LPS is LPIPS (0 = identical to input, 1 = completely different), so (1 − LPS) rewards content preservation; RS is the VLM-rated degradation-removal score on a 0–5 scale. The product means both must be good: a model that removes all degradation but warps the image is penalized by a low (1 − LPS), while a model that preserves content perfectly but does nothing is penalized by a low RS. The 0.2 scalar normalizes the score into a convenient range, around 0.1–0.15 for typical models.

RealRestorer ranks #1 among all open-source models on RealIR-Bench (9-task average FS = 0.146), narrowing the gap with Nano Banana Pro (FS = 0.153) to just 0.007 points — achieving performance comparable to leading closed-source commercial systems.

Qualitative Comparison on RealIR-Bench

Side-by-side comparison of 9 image editing models across 9 degradation types. Rows: Blur, Compression, Moiré Patterns, Low-light, Noise, Flare, Reflection, Haze, Rain. Columns: Degraded Input, RealRestorer (Ours), Seedream 4.5, Nano Banana Pro, GPT-Image-1.5, Step1X-Edit, FLUX.1-Kontext-dev, Qwen-Image-Edit-2511, LongCat-Image-Edit.
Figure 3. Comparison with state-of-the-art image editing models across nine real-world degradations. RealRestorer (Ours) produces visually cleaner and more consistent restoration results compared to other open-source methods, achieving quality competitive with leading closed-source systems. Zoom in for details.

Table 1: Quantitative Results — Rain, Deblurring, Low-light, Haze, Reflection

Quantitative comparison table showing LPIPS, RS, and FS scores for Rain Removal, Deblurring, Low-light Enhancement, Haze Removal, and Reflection Removal tasks. Models: Nano Banana Pro, GPT-Image-1.5, Seedream 4.5, LongCat-Image-Edit, Qwen-Image-Edit-2511, FLUX.1-Kontext-dev, Step1X-Edit, RealRestorer. Best results bold, second-best underlined. Open-source best/second highlighted in yellow/blue.
Table 1. Quantitative comparison on Rain Removal, Deblurring, Low-light Enhancement, Haze Removal, and Reflection Removal (RealIR-Bench). Best result: bold; second-best: underlined. Open-source best/second highlighted in yellow/blue respectively.

Table 2: Quantitative Results — Deflare, Moiré, Denoise, Compression + 9-Task Average

Quantitative comparison table for Deflare, Moiré Patterns Removal, Denoise, and Compression Restoration tasks, plus 9-task average (Avg Total). RealRestorer achieves FS 0.146 average, ranking first among open-source models.
Table 2. Quantitative comparison on Deflare, Moiré Pattern Removal, Denoise, and Compression Restoration, plus the 9-task overall average. RealRestorer achieves FS = 0.146 average, ranking #1 among open-source methods.

Table 3: Zero-Shot Generalization on FoundIR Dataset

Quantitative comparison on FoundIR dataset showing PSNR and SSIM scores for Blur, Rain, Raindrops, Noise, Low-light, Haze, and Compression tasks. RealRestorer achieves best PSNR on 5 out of 7 degradations.
Table 3. Quantitative comparison on the FoundIR dataset across various real-world degradations (PSNR ↑, SSIM ↑). RealRestorer achieves the best PSNR on 5 out of 7 degradations, demonstrating strong zero-shot generalization.

Ablation Study

To examine the contribution of the proposed two-stage training strategy, the authors train models using only synthetic degradation data, only real-world degradation data, and the full proposed strategy.

Key Findings

  1. Transfer Training only peaks at FS = 0.122 but then degrades due to the limited diversity of synthetic data distributions — highlighting that synthetic data alone is insufficient.
  2. Real-World Fine-tuning only tends to overfit and harm structural consistency, causing object deformation, body shifting, and unrealistic enhancement effects.
  3. Two-stage Progressively-Mixed strategy effectively balances restoration capability and content consistency. Removing this component reduces FS by 0.004 points (confirmed ablation).
  4. User study with 32 participants rating 3,200 groups shows Nano Banana Pro at a 32.02% first-ranking rate vs. RealRestorer at 21.54% — while the proposed FS metric shows moderate but statistically significant alignment with human judgments (p < 0.01).

The two-stage Progressively-Mixed strategy is the key to balancing restoration capability and content consistency, leading to more visually stable and coherent restoration results.

Conclusion

We introduce RealRestorer, a robust open-source image editing model for complex real-world image restoration. To reduce the synthetic-to-real domain gap, we propose a comprehensive data generation pipeline and a two-stage Progressively-Mixed training strategy that combines synthetic and real-world degraded-to-clean pairs.

We further present RealIR-Bench, a non-reference benchmark with authentic degraded images and a VLM-based evaluation framework for real-world restoration. Extensive experiments demonstrate that RealRestorer achieves open-source state-of-the-art performance across nine restoration tasks, with results highly comparable to leading closed-source commercial systems, and exhibits strong zero-shot generalization to unseen degradations.

We will release our model, data synthesis pipeline, and benchmark to support future research in real-world image restoration.

Limitations

The base model relies on a 28-step denoising process, making it computationally more expensive than smaller specialized networks. In cases with strong semantic ambiguity (e.g., mirror selfies), the model may fail to distinguish true scene content from undesired reflections. The model also struggles with extremely severe degradations where reliable pixel evidence is largely missing.

Why 28 denoising steps and why is that expensive?

Diffusion models generate images by iteratively denoising from random noise. Each step requires a full forward pass through the DiT backbone (billions of parameters at FLUX scale), so 28 steps cost roughly 28× one forward pass. Smaller specialized restorers (such as NAFNet or Restormer) instead solve a direct regression in a single pass. The trade-off: diffusion models produce more realistic textures and generalize better, but run 10–50× slower than single-step networks.
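A back-of-envelope cost model makes the multiplier explicit. All constants below are illustrative assumptions, not measurements: 12B parameters (roughly FLUX.1 scale), 4096 latent tokens, and ~2 FLOPs per parameter per token per forward pass.

```python
def sampling_flops(steps: int, params: float = 12e9, tokens: int = 4096,
                   flops_per_param_token: float = 2.0) -> float:
    """Back-of-envelope FLOPs for diffusion sampling. Every constant here
    is an assumed illustrative value, not a measured one."""
    return steps * params * tokens * flops_per_param_token

diffusion = sampling_flops(steps=28)   # 28 full DiT forward passes
one_shot = sampling_flops(steps=1)     # single-pass regression restorer
ratio = diffusion / one_shot           # 28x the per-image compute
```

Whatever the exact constants, the ratio between the two regimes is fixed by the step count alone, which is why step reduction (distillation, fewer-step samplers) is the usual lever for speeding up diffusion-based restorers.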

View Services Contact Us