In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly improve performance on generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter per block can substantially improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward-optimization problem, solved efficiently with an evolutionary algorithm while modifying just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the number of inference steps required for image generation while maintaining high-quality outputs.
Traditional image generators used convolutional networks for the denoising backbone. A Diffusion Transformer replaces that with a Transformer architecture — the same kind of attention-based network used in LLMs. The image is split into patches (like tokens in text), and the Transformer processes all patches simultaneously. FLUX and Stable Diffusion 3 are prominent examples. Because attention sees every patch at once, DiTs handle global composition better but come with higher compute cost.
Stable Flow identified “vital layers” within the transformer whose exclusion produces significant shifts in model outputs. Building on this, the authors systematically analyzed each DiT block’s contribution using the FLUX model and 64 diverse text prompts.
For each DiT block l ∈ L, they bypassed its residual output (setting γ=0) and measured the effect on Image Reward score. Surprisingly, removing certain blocks can occasionally enhance the quality of generated images rather than degrade it.
In a Transformer, each block adds its output to its own input (a "residual connection"). Setting γ=0 means the block contributes nothing — its output is zeroed out. The surprising finding is that some blocks are actually hurting the image quality, because they were never optimally trained for the specific generation task. This is a diagnostic result that motivates Calibri: the default weights are suboptimal, and a simple scalar per block can fix it post-hoc without retraining.
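The residual gating described above can be sketched in a few lines; `dit_block_output` and `block_fn` are hypothetical names, with `block_fn` a toy stand-in for a block's Attention + MLP computation:

```python
import numpy as np

def dit_block_output(x, block_fn, gamma=1.0):
    """Residual update of one DiT block with a calibration scalar gamma.

    gamma = 1.0 reproduces the original model; gamma = 0.0 bypasses the
    block entirely (the ablation described above).
    """
    return x + gamma * block_fn(x)

# Toy stand-in for a block's Attention + MLP computation.
block_fn = lambda x: 0.5 * x

x = np.ones(4)
assert np.allclose(dit_block_output(x, block_fn, gamma=0.0), x)        # bypassed
assert np.allclose(dit_block_output(x, block_fn, gamma=1.0), 1.5 * x)  # original
```

Because the gate multiplies only the residual branch, setting it to zero leaves the skip path intact, so the rest of the network still receives a valid input.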
In a second experiment, they scaled each block’s output by a scalar s ∈ {0, 0.25, 0.5, 0.75, 1.25, 1.5}. The result: for each DiT block, there exists an optimal scaling factor that improves performance over its original configuration.
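A minimal sketch of that per-block scan, with a toy quadratic reward standing in for Image Reward (the peak at 0.75 is illustrative, mimicking a block whose default weight is not optimal):

```python
# Hypothetical per-block scan: try each candidate scale and keep the one
# that maximizes a reward. The quadratic below is a toy stand-in for
# Image Reward; 1.0 represents the block's original configuration.
candidates = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]

def toy_reward(scale):
    return -(scale - 0.75) ** 2

best = max(candidates, key=toy_reward)
assert best == 0.75                       # the default scale 1.0 loses
assert toy_reward(best) > toy_reward(1.0)
```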
The standard DiT architecture is sub-optimally weighted, and its performance can be significantly improved through a simple post-hoc calibration of its blocks.
Calibri defines calibration parameters c = ω ∪ {sᵢ}, where ω denotes output-level weights and sᵢ denotes internal-layer scaling parameters. Three granularities are introduced:
- **Block-wise.** Uniformly scales the Attention and MLP outputs within a block using a single shared scalar s. Coarsest granularity: 57 parameters for FLUX, converging in 200 iterations (~32 GPU-hours).
- **Layer-wise.** Scales individual layers within each block using distinct coefficients. More flexible: 76 parameters for FLUX. Best balance of performance and training speed, with the most consistent improvements across all reward functions.
- **Gate-wise.** Calibration for MM-DiT models with separate scaling for the visual and text gates. 114 parameters for FLUX. Highest on the HPSv3 target metric but slowest to converge (960 iterations, ~150 GPU-hours).
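One plausible accounting of the three parameter counts, assuming FLUX's architecture of 19 double-stream (MM-DiT) blocks and 38 single-stream blocks (the per-granularity breakdown is my reading, not stated in the source):

```python
# FLUX: 19 double-stream (MM-DiT) blocks + 38 single-stream blocks.
double_blocks, single_blocks = 19, 38

# Block-wise: one shared scalar per block.
block_wise = double_blocks + single_blocks
# Layer-wise: separate scalars for Attention and MLP in double blocks.
layer_wise = double_blocks * 2 + single_blocks
# Gate-wise: additionally split visual/text gates in double blocks.
gate_wise = double_blocks * 2 * 2 + single_blocks

assert (block_wise, layer_wise, gate_wise) == (57, 76, 114)
```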
To find optimal calibration coefficients, Calibri uses the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), a gradient-free optimization algorithm. At each iteration, candidate solutions are sampled from a multivariate Gaussian N(μ, σ²C), evaluated with a reward model, and used to update μ, σ, and C.
Standard neural network training uses backpropagation — computing gradients of a loss function to update weights. But here the objective is an image reward model (HPSv3, Image Reward) that scores thousands of generated images, and there is no clean differentiable path from the scaling scalars to that score. CMA-ES treats the whole problem as a black box: it proposes many candidate scalar sets, generates images with each, measures the reward, then uses the top performers to update a Gaussian distribution from which the next generation of candidates is drawn. No gradients required — only image evaluations.
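This loop can be sketched with a stripped-down evolution strategy: an isotropic Gaussian with a decaying step size (real CMA-ES also adapts the full covariance matrix C and uses rank-weighted updates). The quadratic reward here is a toy stand-in for scoring generated images:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(c):
    # Toy reward peaking when every calibration scalar equals 0.8;
    # in Calibri this would be HPSv3/Image Reward over generated images.
    return -np.sum((c - 0.8) ** 2)

dim, pop, elite, sigma = 8, 16, 4, 0.3
mu = np.ones(dim)  # start from the uncalibrated model (all scales = 1)

for _ in range(60):
    candidates = mu + sigma * rng.standard_normal((pop, dim))
    scores = np.array([reward(c) for c in candidates])
    top = candidates[np.argsort(scores)[-elite:]]  # keep the best performers
    mu = top.mean(axis=0)                          # move the search mean
    sigma *= 0.95                                  # shrink the search radius

assert reward(mu) > reward(np.ones(dim))  # better than the uncalibrated start
assert np.all(np.abs(mu - 0.8) < 0.1)
```

Each "reward call" in the real setting means generating and scoring a batch of images, which is why the parameter count and iteration budget matter so much.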
Calibri Ensemble aggregates N differently calibrated models into a single sampler:
F_c(x, t, p) = ∑_{i=1}^{N} ω_i · f_{θ,s_i}(x, t, p | ∅)
For N=2 with block scaling, Calibri Ensemble generalizes Skip Layer Guidance (Spatiotemporal Guidance), making it a training-free case of Auto-guidance. Ensemble calibration consistently increases HPSv3 reward across all inference steps and shifts the optimal number of sampling steps from 30–50 to only 10–15.
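A minimal sketch of the ensemble formula, with toy lambdas standing in for the calibrated denoisers f_{θ,s_i} (the function names and scale values are illustrative, not from the paper):

```python
import numpy as np

def calibri_ensemble(x, models, weights):
    """Weighted sum of N calibrated denoisers' predictions (toy sketch)."""
    return sum(w * f(x) for f, w in zip(models, weights))

# Two toy "calibrated models" standing in for f_{θ,s_i}.
f_base = lambda x: 0.9 * x
f_scaled = lambda x: 0.8 * x

x = np.ones(3)
w = 0.5
# With weights (1 + w, -w), the N = 2 case reduces to a
# guidance-style extrapolation between the two models.
out = calibri_ensemble(x, [f_base, f_scaled], [1 + w, -w])
assert np.allclose(out, (1.5 * 0.9 - 0.5 * 0.8) * x)
```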
Diffusion models generate images by gradually denoising random noise across many steps. Each step runs the full neural network (an NFE — Number of Function Evaluations). More steps generally produce higher quality, but also take proportionally more time. FLUX's default is 30 steps; SD-3.5M uses 80. Calibri Ensemble shifts the quality-vs-steps curve so that only 10–15 steps are needed to match what 30–50 steps previously produced, effectively delivering a 2–3× speedup for free.
All experiments use the FLUX model, with optimization guided by the HPSv3 reward. Train and test prompts come from T2I-CompBench++. Bucket size: 16; image resolution: 512; inference steps during optimization: 15.
HPSv3 is a learned reward model trained on large-scale human pairwise preference data. Given two images for the same prompt, human annotators vote which they prefer. A neural network learns to predict those votes, producing a scalar score. HPSv3 is the third version of this metric, refined to better align with human aesthetics. Image Reward (IR) and Q-Align are alternative reward models: IR is also preference-trained, while Q-Align is a quality assessment model that outputs scores on a 1–5 scale. Using multiple reward metrics reduces the risk of over-optimizing for one metric's idiosyncrasies.
| Scaling | N params | Iters | HPSv3 | IR | Q-Align |
|---|---|---|---|---|---|
| None (baseline) | — | — | 11.41 | 1.15 | 4.85 |
| Block | 57 | 200 | 13.29 | 1.17 | 4.91 |
| Layer | 76 | 410 | 13.41 | 1.24 | 4.90 |
| Gate | 114 | 960 | 13.48 | 1.18 | 4.88 |
Gate scaling achieves the highest HPSv3 but underperforms on alternative rewards. Layer scaling yields the most consistent improvements and fastest training speed.
| Model | Calibri | HPSv3 | IR | Q-Align | NFE |
|---|---|---|---|---|---|
| FLUX | × | 11.41 | 1.15 | 4.85 | 30 |
| FLUX | ✓ | 13.48 | 1.18 | 4.88 | 15 |
| SD-3.5M | × | 11.15 | 1.10 | 4.74 | 80 |
| SD-3.5M | ✓ | 14.10 | 1.17 | 4.91 | 30 |
| Qwen Image | × | 11.26 | 1.16 | 4.55 | 100 |
| Qwen Image | ✓ | 12.95 | 1.18 | 4.73 | 30 |
Reward models like HPSv3 can be "gamed" — a model could learn patterns that score highly on the metric without genuinely looking better to people. Human evaluation with real users (here 200 evaluators, 5,600 pairwise comparisons) provides ground truth on whether the improvement is perceptually real. A win rate above 50% means evaluators preferred Calibri's output more than the baseline when shown both side-by-side. The fact that Calibri achieves 51.9% (FLUX) and 54.6% (Qwen-Image) win rates confirms the metric gains are not artifacts.
| Model | Overall: Calibri | Overall: Equal | Overall: Original | Text Align: Calibri | Text Align: Equal | Text Align: Original |
|---|---|---|---|---|---|---|
| FLUX | 51.87 | 7.33 | 40.80 | 38.71 | 37.68 | 23.61 |
| Qwen-Image | 54.62 | 7.91 | 37.47 | 40.29 | 37.65 | 22.06 |
Evaluators consistently prefer Calibri over the original models in both Overall Preference and Text Alignment, confirming genuine perceptual gains rather than reward artifacts. Calibrated models are also 2–3× faster than their baselines.
Qualitative comparisons across diverse prompts and model architectures confirm the consistent visual improvements reported in the quantitative evaluation. All models use the same NFE as in Table 2.
Flow-GRPO is a reinforcement learning-based alignment method for flow matching diffusion models. It uses Group Relative Policy Optimization (GRPO) — a technique from LLM RLHF — to fine-tune the model's full weights toward a reward signal. Unlike Calibri, Flow-GRPO modifies millions to billions of parameters through gradient-based optimization and requires substantial GPU compute. The comparison here shows that Calibri (~100 scalars, gradient-free) can match Flow-GRPO's perceptual quality gains, and the two methods are complementary and composable.
In this work, we introduced Calibri, a novel and parameter-efficient approach to enhance the generative capabilities of Diffusion Transformers (DiTs). By uncovering the potential of a single learned scaling parameter to optimize the contributions of DiT components, we demonstrated that significant performance improvements can be achieved with minimal parameter modifications.
Framing the calibration process as a black-box optimization problem solved via the CMA-ES evolutionary strategy, Calibri adjusts only ~100 parameters while delivering consistently improved generation quality. Additionally, the proposed inference-time scaling technique, Calibri Ensemble, effectively combines calibrated models to further enhance results.
Our extensive empirical evaluation across a range of text-to-image diffusion models confirmed the effectiveness and efficiency of Calibri, highlighting its ability to achieve superior generative quality with reduced computational costs. Notably, Calibri successfully reduces the number of inference steps required for image generation while retaining high-quality outputs, making it a practical solution for real-world applications where computational efficiency is critical.