RealRestorer Model
An open-source real-world image restoration model that sets a new state of the art among open-source methods, achieving performance highly comparable to closed-source systems. Fine-tuned from Step1X-Edit across nine degradation tasks.
Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization in restoration tasks, especially closed-source models such as Nano Banana Pro, which can restore images while preserving content consistency. Nevertheless, matching that performance with such large universal models requires substantial data and computational cost. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train an open-source model that narrows the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate that our model achieves state-of-the-art performance among open-source methods.
Most restoration models are trained on synthetic degradations — e.g., applying Gaussian blur to a clean photo — then evaluated on the same synthetic test set. Real-world degradations (camera shake, compression from social media, outdoor haze) follow distributions that synthetic pipelines can't fully replicate. The synthetic-to-real domain gap means a model that achieves excellent PSNR on benchmarks can still fail visibly on authentic photographs.
A high-quality degradation synthesis pipeline covering 9 degradation types with 1.65M+ paired training samples, combining synthetic and real-world degradation data with granular noise modeling and segment-aware perturbations.
A new benchmark with 464 real-world degraded images spanning 9 degradation categories, with tailored non-reference evaluation metrics that measure both degradation removal capability and content consistency preservation.
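The paired-data idea behind the synthesis pipeline can be sketched in a few lines. This is an illustrative stand-in, not the released pipeline: the actual pipeline covers all nine degradation types with granular noise modeling and segment-aware perturbations, while only blur and additive noise are shown here, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_gaussian_noise(img, sigma=0.05):
    """Additive Gaussian sensor noise, clipped to the valid [0, 1] range."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def box_blur(img, k=3):
    """Cheap box blur standing in for defocus/motion blur."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    acc = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            acc += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return acc / (k * k)

def make_pair(clean):
    """Return one (degraded, clean) supervision pair with a random corruption."""
    degrade = [add_gaussian_noise, box_blur][rng.integers(0, 2)]
    return degrade(clean), clean

clean = rng.random((32, 32, 3))
degraded, target = make_pair(clean)
```

At 1.65M+ pairs, the real pipeline runs variants of this idea at scale, mixing synthesized corruptions with collected real-world degraded/clean pairs.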
RealRestorer fine-tunes Step1X-Edit, a practical general image editing framework built on a Diffusion Transformer (DiT) backbone. The model uses a QwenVL text encoder to inject high-level semantic information into the denoising pathway, with a dual-stream design that processes semantic information alongside the noise and the conditional input image. Both the reference and output images are encoded via Flux-VAE.
Traditional diffusion models use a U-Net for the denoising network. DiT replaces it with a Vision Transformer — the image is split into patches, positional embeddings are added, and a stack of transformer self-attention blocks refines the noisy patches. Transformers scale much better with model size and allow better integration of multimodal conditioning (text, reference image), which is why recent frontier image models (FLUX, SD3, Step1X) all use DiT backbones.
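The patch-token mechanism described above can be sketched in a few lines of NumPy. This is a toy, weight-free stand-in for a real DiT block, not Step1X-Edit's actual code: `patchify` splits the image into flat patch tokens, and a single unparameterized self-attention step plays the role of one transformer block.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into (H*W/p^2) flat patch tokens of dim p*p*C."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

def self_attention(x):
    """Toy single-head self-attention over patch tokens (no learned weights)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # softmax stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 8, 3))            # a tiny "noisy" image
tokens = patchify(latent, p=2)                 # 16 tokens of dim 12
pos = 0.1 * rng.normal(size=tokens.shape)      # stand-in positional embeddings
refined = self_attention(tokens + pos)         # one attention "block"
print(tokens.shape, refined.shape)             # (16, 12) (16, 12)
```

A real DiT stacks many such blocks with learned query/key/value projections, MLPs, and conditioning (timestep, text, reference image) injected into each block.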
Training proceeds in two stages:
All experiments are conducted on 8 NVIDIA H800 GPUs. The entire training process takes approximately one day.
The chart shows Final Score (FS) performance on RealIR-Bench across training steps for both stages. In the Transfer Training Stage (blue), the model rapidly acquires basic restoration capability, peaking at FS ≈ 0.122 around 2,000 steps, then declining due to limited synthetic data diversity.
The Supervised Fine-tuning Stage with real-world data (purple) quickly surpasses the transfer training peak and continues improving, reaching FS ≈ 0.145 at ~2,500 steps. Beyond this point, overfitting on the real-world data motivates early stopping.
The Progressively-Mixed training strategy (combining synthetic and real data at 2:8 ratio) prevents overfitting while preserving cross-task robustness. Ablation confirms removing this strategy reduces FS by 0.004 points.
Rather than training on only real data in Stage 2, the model sees a 2:8 mix of synthetic:real pairs per batch. This prevents forgetting the general-purpose restoration capability learned in Stage 1 (a phenomenon called catastrophic forgetting). The 2:8 ratio was chosen via ablation: more synthetic data preserves generalization but slows real-world adaptation, while less synthetic data risks overfitting to the real set.
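A minimal sketch of such a mixed batch sampler (hypothetical helper names; the paper does not describe its training code at this level of detail):

```python
import random

def mixed_batch(real_pool, synth_pool, batch_size=10, synth_frac=0.2, seed=0):
    """Draw one progressively-mixed batch: ~20% synthetic, ~80% real pairs.

    Illustrative sketch only. Pools are lists of (degraded, clean) pair
    identifiers here; in practice they would be dataset indices.
    """
    rng = random.Random(seed)
    n_synth = round(batch_size * synth_frac)
    batch = rng.sample(synth_pool, n_synth) + rng.sample(real_pool, batch_size - n_synth)
    rng.shuffle(batch)  # avoid a fixed synthetic-first ordering
    return batch

real = [("real", i) for i in range(100)]
synth = [("synth", i) for i in range(100)]
batch = mixed_batch(real, synth)  # 2 synthetic + 8 real samples
```

Keeping a fixed per-batch ratio (rather than mixing at the epoch level) ensures every gradient step sees both distributions.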
Traditional image restoration benchmarks primarily focus on single-degradation tasks with synthetic corruptions, which are insufficient for evaluating model performance in real-world applications. To properly evaluate restoration under real-world degradations, we construct RealIR-Bench:
Two complementary metrics characterize both restoration effectiveness and content fidelity:
FS = 0.2 × (1 − LPS) × RS
FS jointly reflects restoration improvement and content preservation. Poor performance in either aspect leads to a lower overall score.
Unlike PSNR/SSIM, RealIR-Bench evaluates on authentic real-world images without clean reference pairs, enabling a more practical and comprehensive assessment of restoration models.
PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity) both require a clean reference image to compare against. For real-world degraded photos, no such reference exists. Additionally, high PSNR can be achieved by simply returning the input unchanged — a model that does nothing scores well on PSNR if the degradation is subtle. RealIR-Bench's VLM-based Restoration Score directly judges whether degradation was actually removed, and LPIPS checks that content wasn't hallucinated or altered.
LPS is the LPIPS distance to the input (0 = identical, 1 = completely different), so (1 − LPS) rewards content preservation. RS is the VLM-rated degradation-removal score on a 0–5 scale. The product means both must be good: a model that removes all degradation but warps the image is penalized by a low (1 − LPS); a model that preserves content perfectly but does nothing is penalized by a low RS. The 0.2 factor rescales the 0–5 RS toward a 0–1 range; typical models land around 0.1–0.15.
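The scoring logic is a direct transcription of the FS formula; in practice the LPIPS distance and the VLM rating would come from external models, so the inputs below are illustrative values.

```python
def final_score(lps, rs):
    """RealIR-Bench Final Score: FS = 0.2 * (1 - LPS) * RS.

    lps: LPIPS distance to the degraded input, in [0, 1].
    rs:  VLM-rated degradation-removal score, in [0, 5].
    """
    return 0.2 * (1.0 - lps) * rs

print(final_score(0.1, 4))  # faithful and effective: high FS (~0.72)
print(final_score(0.1, 0))  # does nothing, RS = 0: FS = 0
print(final_score(0.9, 5))  # removes degradation but warps content: FS ~0.1
```

Note that FS collapses to zero if either factor fails, which is exactly the "both must be good" behavior described above.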
RealRestorer ranks #1 among all open-source models on RealIR-Bench (9-task average FS = 0.146), narrowing the gap with Nano Banana Pro (FS = 0.153) to just 0.007 points — achieving performance comparable to leading closed-source commercial systems.
To examine the contribution of the proposed two-stage training strategy, we train models using only synthetic degradation data, only real-world degradation data, and the full proposed strategy.
The two-stage Progressively-Mixed strategy is the key to balancing restoration capability and content consistency, leading to more visually stable and coherent restoration results.
We introduce RealRestorer, a robust open-source image editing model for complex real-world image restoration. To reduce the synthetic-to-real domain gap, we propose a comprehensive data generation pipeline and a two-stage progressively mixed training strategy that combines synthetic and real-to-clean pairs.
We further present RealIR-Bench, a non-reference benchmark with authentic degraded images and a VLM-based evaluation framework for real-world restoration. Extensive experiments demonstrate that RealRestorer achieves open-source state-of-the-art performance across nine restoration tasks, with results highly comparable to leading closed-source commercial systems, and exhibits strong zero-shot generalization to unseen degradations.
We will release our model, data synthesis pipeline, and benchmark to support future research in real-world image restoration.
The base model relies on a 28-step denoising process, making it computationally more expensive than smaller specialized networks. In cases with strong semantic ambiguity (e.g., mirror selfies), the model may fail to distinguish true scene content from undesired reflections. The model also struggles with extremely severe degradations where reliable pixel evidence is largely missing.
Diffusion models generate images by iteratively denoising from random noise. Each step requires a full forward pass through the DiT backbone (on the order of 12B parameters for FLUX-scale models), so 28 steps cost roughly 28× a single forward pass. Smaller specialized restorers (such as NAFNet or Restormer) solve a direct regression in one pass. The trade-off: diffusion models produce more realistic textures and generalize better, but run 10–50× slower than single-step networks.
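The 10–50× range can be sanity-checked with back-of-envelope arithmetic. The per-pass cost ratio below is an assumed illustrative range (one DiT denoising pass vs. one full pass of a specialized restorer), not a measured figure:

```python
# Back-of-envelope for the diffusion vs. single-step cost gap.
diffusion_steps = 28
# Assumption: one DiT pass costs 0.5x-2x a specialized restorer's full pass.
per_pass_ratio = (0.5, 2.0)
slowdown = tuple(diffusion_steps * r for r in per_pass_ratio)
print(slowdown)  # (14.0, 56.0)
```

Even at the optimistic end, the step count alone dominates the runtime gap, which is why few-step distillation is the usual route to speeding up diffusion restorers.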