In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly improve performance on generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter per block can substantially improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward-optimization problem, solved efficiently with an evolutionary algorithm while modifying just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the number of inference steps required for image generation while maintaining high-quality outputs.
Traditional image generators used convolutional networks for the denoising backbone. A Diffusion Transformer replaces that with a Transformer architecture — the same kind of attention-based network used in LLMs. The image is split into patches (like tokens in text), and the Transformer processes all patches simultaneously. FLUX and Stable Diffusion 3 are prominent examples. Because attention sees every patch at once, DiTs handle global composition better but come with higher compute cost.
Stable Flow identified “vital layers” within the transformer whose exclusion produces significant shifts in model outputs. Building on this, the authors systematically analyzed each DiT block’s contribution using the FLUX model and 64 diverse text prompts.
For each DiT block l ∈ L, they bypassed its residual output (setting γ=0) and measured the effect on Image Reward score. Surprisingly, removing certain blocks can occasionally enhance the quality of generated images rather than degrade it.
In a Transformer, each block adds its output to its own input (a "residual connection"). Setting γ=0 means the block contributes nothing — its output is zeroed out. The surprising finding is that some blocks are actually hurting the image quality, because they were never optimally trained for the specific generation task. This is a diagnostic result that motivates Calibri: the default weights are suboptimal, and a simple scalar per block can fix it post-hoc without retraining.
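The residual gating described above can be sketched in a few lines; `dit_block_output` and `block_fn` are hypothetical names, with `block_fn` a toy stand-in for a block's Attention + MLP computation:

```python
import numpy as np

def dit_block_output(x, block_fn, gamma=1.0):
    """Residual update of one DiT block with a calibration scalar gamma.

    gamma = 1.0 reproduces the original model; gamma = 0.0 bypasses the
    block entirely (the ablation described above).
    """
    return x + gamma * block_fn(x)

# Toy stand-in for a block's Attention + MLP computation.
block_fn = lambda x: 0.5 * x

x = np.ones(4)
assert np.allclose(dit_block_output(x, block_fn, gamma=0.0), x)        # bypassed
assert np.allclose(dit_block_output(x, block_fn, gamma=1.0), 1.5 * x)  # original
```

Because the gate multiplies only the residual branch, setting it to zero leaves the skip path intact, so the rest of the network still receives a valid input.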
In a second experiment, they scaled each block’s output by a scalar s ∈ {0, 0.25, 0.5, 0.75, 1.25, 1.5}. The result: for each DiT block, there exists an optimal scaling factor that improves performance over its original configuration.
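A minimal sketch of that per-block scan, with a toy quadratic reward standing in for Image Reward (the peak at 0.75 is illustrative, mimicking a block whose default weight is not optimal):

```python
# Hypothetical per-block scan: try each candidate scale and keep the one
# that maximizes a reward. The quadratic below is a toy stand-in for
# Image Reward; 1.0 represents the block's original configuration.
candidates = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]

def toy_reward(scale):
    return -(scale - 0.75) ** 2

best = max(candidates, key=toy_reward)
assert best == 0.75                       # the default scale 1.0 loses
assert toy_reward(best) > toy_reward(1.0)
```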
The standard DiT architecture is sub-optimally weighted, and its performance can be significantly improved through a simple post-hoc calibration of its blocks.
Calibri defines calibration parameters c = ω ∪ {sᵢ}, where ω denotes output-level weights and sᵢ denotes internal-layer scaling parameters. Three granularities are introduced:
- **Block-wise.** Uniformly scales the Attention and MLP outputs within a block using a single shared scalar s. Coarsest granularity: 57 parameters for FLUX, converging in 200 iterations (~32 GPU-hours).
- **Layer-wise.** Scales individual layers within each block using distinct coefficients. More flexible: 76 parameters for FLUX. Best balance of performance and training speed, with the most consistent improvements across all reward functions.
- **Gate-wise.** Calibration for MM-DiT models with separate scaling for the visual and text gates. 114 parameters for FLUX. Highest on the HPSv3 target metric but slowest to converge (960 iterations, ~150 GPU-hours).
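One plausible accounting of the three parameter counts, assuming FLUX's architecture of 19 double-stream (MM-DiT) blocks and 38 single-stream blocks (the per-granularity breakdown is my reading, not stated in the source):

```python
# FLUX: 19 double-stream (MM-DiT) blocks + 38 single-stream blocks.
double_blocks, single_blocks = 19, 38

# Block-wise: one shared scalar per block.
block_wise = double_blocks + single_blocks
# Layer-wise: separate scalars for Attention and MLP in double blocks.
layer_wise = double_blocks * 2 + single_blocks
# Gate-wise: additionally split visual/text gates in double blocks.
gate_wise = double_blocks * 2 * 2 + single_blocks

assert (block_wise, layer_wise, gate_wise) == (57, 76, 114)
```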
To find optimal calibration coefficients, Calibri uses the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), a gradient-free optimization algorithm. At each iteration, candidate solutions are sampled from a multivariate Gaussian N(μ, σ²C), evaluated with a reward model, and used to update μ, σ, and C.
Standard neural network training uses backpropagation — computing gradients of a loss function to update weights. But here the objective is an image reward model (HPSv3, Image Reward) that scores thousands of generated images, and there is no clean differentiable path from the scaling scalars to that score. CMA-ES treats the whole problem as a black box: it proposes many candidate scalar sets, generates images with each, measures the reward, then uses the top performers to update a Gaussian distribution from which the next generation of candidates is drawn. No gradients required — only image evaluations.
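This loop can be sketched with a stripped-down evolution strategy: an isotropic Gaussian with a decaying step size (real CMA-ES also adapts the full covariance matrix C and uses rank-weighted updates). The quadratic reward here is a toy stand-in for scoring generated images:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(c):
    # Toy reward peaking when every calibration scalar equals 0.8;
    # in Calibri this would be HPSv3/Image Reward over generated images.
    return -np.sum((c - 0.8) ** 2)

dim, pop, elite, sigma = 8, 16, 4, 0.3
mu = np.ones(dim)  # start from the uncalibrated model (all scales = 1)

for _ in range(60):
    candidates = mu + sigma * rng.standard_normal((pop, dim))
    scores = np.array([reward(c) for c in candidates])
    top = candidates[np.argsort(scores)[-elite:]]  # keep the best performers
    mu = top.mean(axis=0)                          # move the search mean
    sigma *= 0.95                                  # shrink the search radius

assert reward(mu) > reward(np.ones(dim))  # better than the uncalibrated start
assert np.all(np.abs(mu - 0.8) < 0.1)
```

Each "reward call" in the real setting means generating and scoring a batch of images, which is why the parameter count and iteration budget matter so much.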
Calibri Ensemble aggregates N differently calibrated models into a single sampler:
F_c(x, t, p) = ∑_{i=1}^{N} ω_i · f_{θ,s_i}(x, t, p | ∅)
For N=2 with block scaling, Calibri Ensemble generalizes Skip Layer Guidance (Spatiotemporal Guidance), making it a training-free case of Auto-guidance. Ensemble calibration consistently increases HPSv3 reward across all inference steps and shifts the optimal number of sampling steps from 30–50 to only 10–15.
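A minimal sketch of the ensemble formula, with toy lambdas standing in for the calibrated denoisers f_{θ,s_i} (the function names and scale values are illustrative, not from the paper):

```python
import numpy as np

def calibri_ensemble(x, models, weights):
    """Weighted sum of N calibrated denoisers' predictions (toy sketch)."""
    return sum(w * f(x) for f, w in zip(models, weights))

# Two toy "calibrated models" standing in for f_{θ,s_i}.
f_base = lambda x: 0.9 * x
f_scaled = lambda x: 0.8 * x

x = np.ones(3)
w = 0.5
# With weights (1 + w, -w), the N = 2 case reduces to a
# guidance-style extrapolation between the two models.
out = calibri_ensemble(x, [f_base, f_scaled], [1 + w, -w])
assert np.allclose(out, (1.5 * 0.9 - 0.5 * 0.8) * x)
```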
Diffusion models generate images by gradually denoising random noise across many steps. Each step runs the full neural network (an NFE — Number of Function Evaluations). More steps generally produce higher quality, but also take proportionally more time. FLUX's default is 30 steps; SD-3.5M uses 80. Calibri Ensemble shifts the quality-vs-steps curve so that only 10–15 steps are needed to match what 30–50 steps previously produced, effectively delivering a 2–3× speedup for free.
All experiments use the FLUX model, with optimization guided by the HPSv3 reward. Train and test prompts come from T2I-CompBench++. Bucket size: 16; image resolution: 512; inference steps during optimization: 15.
HPSv3 is a learned reward model trained on large-scale human pairwise preference data. Given two images for the same prompt, human annotators vote which they prefer. A neural network learns to predict those votes, producing a scalar score. HPSv3 is the third version of this metric, refined to better align with human aesthetics. Image Reward (IR) and Q-Align are alternative reward models: IR is also preference-trained, while Q-Align is a quality assessment model that outputs scores on a 1–5 scale. Using multiple reward metrics reduces the risk of over-optimizing for one metric's idiosyncrasies.
| Scaling | N params | Iters | HPSv3 | IR | Q-Align |
|---|---|---|---|---|---|
| None (baseline) | — | — | 11.41 | 1.15 | 4.85 |
| Block | 57 | 200 | 13.29 | 1.17 | 4.91 |
| Layer | 76 | 410 | 13.41 | 1.24 | 4.90 |
| Gate | 114 | 960 | 13.48 | 1.18 | 4.88 |
Gate scaling achieves the highest HPSv3 but underperforms on alternative rewards. Layer scaling yields the most consistent improvements and fastest training speed.
| Model | Calibri | HPSv3 | IR | Q-Align | NFE |
|---|---|---|---|---|---|
| FLUX | × | 11.41 | 1.15 | 4.85 | 30 |
| FLUX | ✓ | 13.48 | 1.18 | 4.88 | 15 |
| SD-3.5M | × | 11.15 | 1.10 | 4.74 | 80 |
| SD-3.5M | ✓ | 14.10 | 1.17 | 4.91 | 30 |
| Qwen Image | × | 11.26 | 1.16 | 4.55 | 100 |
| Qwen Image | ✓ | 12.95 | 1.18 | 4.73 | 30 |
Reward models like HPSv3 can be "gamed" — a model could learn patterns that score highly on the metric without genuinely looking better to people. Human evaluation with real users (here 200 evaluators, 5,600 pairwise comparisons) provides ground truth on whether the improvement is perceptually real. A win rate above 50% means evaluators preferred Calibri's output more than the baseline when shown both side-by-side. The fact that Calibri achieves 51.9% (FLUX) and 54.6% (Qwen-Image) win rates confirms the metric gains are not artifacts.
| Model | Overall: Calibri | Overall: Equal | Overall: Original | Text Align: Calibri | Text Align: Equal | Text Align: Original |
|---|---|---|---|---|---|---|
| FLUX | 51.87 | 7.33 | 40.80 | 38.71 | 37.68 | 23.61 |
| Qwen-Image | 54.62 | 7.91 | 37.47 | 40.29 | 37.65 | 22.06 |
Evaluators consistently prefer Calibri over the original models in both Overall Preference and Text Alignment, confirming genuine perceptual gains rather than reward artifacts. Calibrated models are also 2–3× faster than their baselines.
Qualitative comparisons across diverse prompts and model architectures confirm the consistent visual improvements reported in the quantitative evaluation. All models use the same NFE as in Table 2.
Flow-GRPO is a reinforcement learning-based alignment method for flow matching diffusion models. It uses Group Relative Policy Optimization (GRPO) — a technique from LLM RLHF — to fine-tune the model's full weights toward a reward signal. Unlike Calibri, Flow-GRPO modifies millions to billions of parameters through gradient-based optimization and requires substantial GPU compute. The comparison here shows that Calibri (~100 scalars, gradient-free) can match Flow-GRPO's perceptual quality gains, and the two methods are complementary and composable.
In this work, we introduced Calibri, a novel and parameter-efficient approach to enhance the generative capabilities of Diffusion Transformers (DiTs). By uncovering the potential of a single learned scaling parameter to optimize the contributions of DiT components, we demonstrated that significant performance improvements can be achieved with minimal parameter modifications.
Framing the calibration process as a black-box optimization problem solved via the CMA-ES evolutionary strategy, Calibri adjusts only ~100 parameters while delivering consistently improved generation quality. Additionally, the proposed inference-time scaling technique, Calibri Ensemble, effectively combines calibrated models to further enhance results.
Our extensive empirical evaluation across a range of text-to-image diffusion models confirmed the effectiveness and efficiency of Calibri, highlighting its ability to achieve superior generative quality with reduced computational costs. Notably, Calibri successfully reduces the number of inference steps required for image generation while retaining high-quality outputs, making it a practical solution for real-world applications where computational efficiency is critical.