---
arxiv_id: 2603.24800
title: "Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration"
authors:
  - Danil Tokhchukov
  - Aysel Mirzoeva
  - Andrey Kuznetsov
  - Konstantin Sobolev
difficulty: Advanced
tags:
  - Diffusion
  - LLM
published_at: 2026-03-25
flecto_url: https://flecto.zer0ai.dev/papers/2603.24800/
lang: en
---

## Abstract

> In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.

#### What is a Diffusion Transformer (DiT)?

Traditional image generators used convolutional networks for the denoising backbone. A Diffusion Transformer replaces that with a Transformer architecture — the same kind of attention-based network used in LLMs. The image is split into patches (like tokens in text), and the Transformer processes all patches simultaneously. FLUX and Stable Diffusion 3 are prominent examples. Because attention sees every patch at once, DiTs handle global composition better but come with higher compute cost.
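To make "patches as tokens" concrete, here is a minimal sketch of DiT-style patchification in plain Python (real implementations operate on tensors and add a linear projection plus positional embeddings, which are omitted here):

```python
# Minimal sketch of DiT-style patchification. The image is a nested list of
# shape (H, W, C); each patch becomes one flattened "token" for the Transformer.

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened patch tokens."""
    H, W = len(image), len(image[0])
    assert H % patch_size == 0 and W % patch_size == 0
    tokens = []
    for py in range(0, H, patch_size):
        for px in range(0, W, patch_size):
            patch = []
            for y in range(py, py + patch_size):
                for x in range(px, px + patch_size):
                    patch.extend(image[y][x])  # flatten the channel values
            tokens.append(patch)
    return tokens

# A 4x4 RGB image with patch size 2 yields 4 tokens of length 2*2*3 = 12.
img = [[[0.0, 0.0, 0.0] for _ in range(4)] for _ in range(4)]
tokens = patchify(img, 2)
print(len(tokens), len(tokens[0]))  # → 4 12
```

The attention layers then mix information across all of these tokens at once, which is what gives DiTs their global view of the composition.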

## Motivation: DiT Blocks Are Sub-Optimally Weighted

Stable Flow identified “vital layers” within the transformer whose exclusion produces significant shifts in model outputs. Building on this, the authors systematically analyzed each DiT block’s contribution on the FLUX model, using 64 diverse text prompts generated with the Qwen3 model.

For each DiT block l ∈ L, they bypassed its residual output (setting γ = 0) and measured the effect on the Image Reward score. Surprisingly, removing certain blocks can occasionally enhance the quality of generated images rather than degrade it.

#### Residual connections and why bypassing a block can help

In a Transformer, each block adds its output to its own input (a "residual connection"). Setting γ=0 means the block contributes nothing — its output is zeroed out. The surprising finding is that some blocks are actually hurting the image quality, because they were never optimally trained for the specific generation task. This is a diagnostic result that motivates Calibri: the default weights are suboptimal, and a simple scalar per block can fix it post-hoc without retraining.
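The gated residual update can be sketched in a few lines (toy vectors and a stand-in block function; real DiT blocks are Attention/MLP layers on tensors):

```python
# Toy gated residual block. With gamma = 1.0 this is the standard residual
# update x + f(x); with gamma = 0.0 the block is fully bypassed, as in the
# ablation described above.

def gated_residual(x, block_fn, gamma):
    out = block_fn(x)
    return [xi + gamma * oi for xi, oi in zip(x, out)]

double = lambda v: [2.0 * vi for vi in v]  # stand-in for an Attention/MLP layer

x = [1.0, -2.0]
print(gated_residual(x, double, 1.0))  # standard residual: x + f(x)
print(gated_residual(x, double, 0.0))  # bypassed block: identity
```

Because the block's contribution enters only through this addition, scaling it by a single scalar is a well-defined, cheap intervention on the trained model.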

In a second experiment, they scaled each block’s output by a scalar s ∈ {0, 0.25, 0.5, 0.75, 1.25, 1.5}. The result: for each DiT block, there exists an optimal scaling factor that improves performance over its original configuration.
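The per-block sweep amounts to a tiny grid search; a sketch with a made-up reward function (a stand-in for Image Reward / HPSv3 scoring of generated images):

```python
# Sketch of the per-block scaling sweep: try each candidate scale s for one
# block and keep the one with the highest reward. The reward here is a toy
# stand-in for scoring generated images with Image Reward / HPSv3.

def sweep_block_scale(reward_fn, candidates):
    best_s = max(candidates, key=reward_fn)
    return best_s, reward_fn(best_s)

# Toy reward peaking at s = 0.75 rather than the default s = 1.0, mirroring
# the finding that the trained weighting is sub-optimal.
toy_reward = lambda s: -(s - 0.75) ** 2

s_star, r_star = sweep_block_scale(toy_reward, [0.0, 0.25, 0.5, 0.75, 1.25, 1.5])
print(s_star)  # → 0.75
```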

> The standard DiT architecture is sub-optimally weighted, and its performance can be significantly improved through a simple post-hoc calibration of its blocks.

## Method: How Calibri Works

### Calibration Search Space: Three Levels of Granularity

Calibri defines calibration parameters c = ω ∪ {s_i}, where ω denotes output-level weights and s_i denotes internal-layer parameters. Three granularities are introduced:

#### Block Scaling

Uniformly scales the Attention and MLP outputs within the same block using a single shared scalar s. Coarsest granularity. 57 parameters for FLUX; converges in 200 iterations / 32 GPU-hours.

#### Layer Scaling

Scales individual layers within each block using distinct coefficients. More flexible. 76 parameters for FLUX. Best balance of performance and training speed, with the most consistent improvements across all reward functions.

#### Gate Scaling

Gate-wise calibration for MM-DiT models, with separate scaling for visual and text gates. 114 parameters for FLUX. Highest HPSv3 target metric but slower convergence (960 iterations / 150 GPU-hours).
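One plausible way the quoted FLUX parameter counts decompose, assuming FLUX's 19 double-stream (MM-DiT) and 38 single-stream blocks — the per-block scalar split below is this summary's schematic accounting, not a formula from the paper:

```python
# Schematic accounting of the three granularities' parameter counts on FLUX,
# assuming 19 double-stream and 38 single-stream blocks. The per-block scalar
# counts are illustrative assumptions, not the paper's exact definitions.

DOUBLE, SINGLE = 19, 38

block_scaling = (DOUBLE + SINGLE) * 1    # one shared scalar per block
layer_scaling = DOUBLE * 2 + SINGLE * 1  # attn + mlp scalars vs one fused layer
gate_scaling  = DOUBLE * 4 + SINGLE * 1  # separate visual/text gates per layer

print(block_scaling, layer_scaling, gate_scaling)  # → 57 76 114
```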

### CMA-ES Parameter Search

To find optimal calibration coefficients, Calibri uses the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) — a gradient-free optimization algorithm. At each iteration, candidate solutions are sampled from a multivariate Gaussian N(μ, σ²C), evaluated via a reward model, and used to update μ, σ, and C.

#### Why CMA-ES instead of gradient descent?

Standard neural network training uses backpropagation — computing gradients of a loss function to update weights. But here the objective is an image reward model (HPSv3, Image Reward) that scores thousands of generated images, and there is no clean differentiable path from the scaling scalars to that score. CMA-ES treats the whole problem as a black box: it proposes many candidate scalar sets, generates images with each, measures the reward, then uses the top performers to update a Gaussian distribution from which the next generation of candidates is drawn. No gradients required — only image evaluations.
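The ask/evaluate/update loop can be sketched with a stripped-down evolution strategy. This is not full CMA-ES (no covariance adaptation; production code would use a library such as pycma), and the reward below is a toy stand-in for generating and scoring images:

```python
# Simplified (mu, lambda)-style evolution strategy with an isotropic Gaussian,
# standing in for full CMA-ES (which also adapts the covariance matrix C).
import random

def simple_es(reward, dim, iters=60, popsize=16, elite=4, sigma=0.3, seed=0):
    rng = random.Random(seed)
    mu = [1.0] * dim  # start from the uncalibrated model (all scales = 1)
    for _ in range(iters):
        # Sample candidate calibrations around the current mean.
        pop = [[m + sigma * rng.gauss(0, 1) for m in mu] for _ in range(popsize)]
        pop.sort(key=reward, reverse=True)  # black-box reward evaluations only
        top = pop[:elite]
        # Recombine: move the mean toward the best candidates.
        mu = [sum(c[i] for c in top) / elite for i in range(dim)]
        sigma *= 0.97  # crude fixed decay; CMA-ES adapts the step size instead
    return mu

# Toy black-box reward, maximized when every scale equals 0.8.
toy_reward = lambda c: -sum((ci - 0.8) ** 2 for ci in c)
best = simple_es(toy_reward, dim=5)
print([round(b, 2) for b in best])
```

With only ~100 dimensions and a cheap population update, the expensive part of each iteration is the reward evaluation (image generation), not the optimizer itself.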

### Calibri Ensemble

Calibri Ensemble aggregates N differently calibrated models into a single sampler:

For N = 2 with block scaling, Calibri Ensemble generalizes Skip Layer Guidance (Spatiotemporal Guidance), making it a training-free case of Auto-guidance. Ensemble calibration consistently increases HPSv3 reward across all inference steps and shifts the optimal number of sampling steps from 30–50 down to only 10–15.
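For intuition, guidance-style methods of this family typically combine two predictions by extrapolating from a "weak" model toward a "strong" one at each denoising step. The aggregation rule and weight below are illustrative assumptions, not the paper's exact formula:

```python
# Hedged sketch of an N = 2 ensemble step in the style of guidance methods:
# extrapolate from one calibrated model's prediction toward another's.
# The rule v_weak + w * (v_strong - v_weak) and the weight w are assumptions
# for illustration only.

def ensemble_prediction(v_strong, v_weak, w):
    return [vw + w * (vs - vw) for vs, vw in zip(v_strong, v_weak)]

v_a = [0.2, -0.4]  # prediction from calibration A
v_b = [0.1, -0.5]  # prediction from calibration B
print(ensemble_prediction(v_a, v_b, 2.0))  # extrapolated combination
```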

#### What is "inference steps" in a diffusion model?

Diffusion models generate images by gradually denoising random noise across many steps. Each step runs the full neural network (an NFE — Number of Function Evaluations). More steps generally produce higher quality, but also take proportionally more time. FLUX's default is 30 steps; SD-3.5M uses 80. Calibri Ensemble shifts the quality-vs-steps curve so that only 10–15 steps are needed to match what 30–50 steps previously produced, effectively delivering a 2–3× speedup for free.
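As a quick sanity check on that speedup, here is the NFE arithmetic, assuming one full model forward pass per sampling step (any extra passes an ensemble member might add per step are ignored here):

```python
# NFE arithmetic for the quoted speedup, assuming one forward pass (NFE)
# per sampling step.

def total_nfe(steps, passes_per_step=1):
    return steps * passes_per_step

baseline = total_nfe(30)  # FLUX default step count
calibri  = total_nfe(12)  # within the 10-15 step range reported above
print(baseline / calibri)  # → 2.5
```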

## Experiments: Design Decisions & Ablations

All experiments use the FLUX model, with optimization guided by the HPSv3 reward. Train and test prompts come from T2I-CompBench++. Bucket size: 16; image resolution: 512; inference steps during training: 15.

#### What is HPSv3? (Human Preference Score)

HPSv3 is a learned reward model trained on large-scale human pairwise preference data. Given two images for the same prompt, human annotators vote which they prefer. A neural network learns to predict those votes, producing a scalar score. HPSv3 is the third version of this metric, refined to better align with human aesthetics. Image Reward (IR) and Q-Align are alternative reward models: IR is also preference-trained, while Q-Align is a quality assessment model that outputs scores on a 1–5 scale. Using multiple reward metrics reduces the risk of over-optimizing for one metric's idiosyncrasies.

### Table 1. Granularity Comparison on FLUX

Gate scaling achieves the highest HPSv3 but underperforms on alternative rewards. Layer scaling
            yields the most consistent improvements and fastest training speed.

## Results: Consistent Gains Across All Models

### Table 2. Cross-Model Performance

#### Why does human evaluation matter alongside automated metrics?

Reward models like HPSv3 can be "gamed" — a model could learn patterns that score highly on the metric without genuinely looking better to people. Human evaluation with real users (here 200 evaluators, 5,600 pairwise comparisons) provides ground truth on whether the improvement is perceptually real. A win rate above 50% means evaluators preferred Calibri's output more than the baseline when shown both side-by-side. The fact that Calibri achieves 51.9% (FLUX) and 54.6% (Qwen-Image) win rates confirms the metric gains are not artifacts.
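A back-of-the-envelope check that win rates this close to 50% are still meaningful at this sample size, using a normal approximation to the binomial. The even split of the 5,600 comparisons across the two models is an assumption of this sketch:

```python
# Normal-approximation standard error of a win rate estimated from n pairwise
# comparisons, and the z-score against the 50% null. The 2,800-per-model split
# is an assumption for illustration.
import math

def win_rate_stderr(p, n):
    return math.sqrt(p * (1 - p) / n)

n = 2800  # assumed comparisons per model
for p in (0.519, 0.546):
    se = win_rate_stderr(p, n)
    z = (p - 0.5) / se
    print(f"win rate {p:.1%}: stderr {se:.4f}, z = {z:.2f}")
```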

### Table 3. Human Evaluation Win Rates

Evaluators prefer Calibri in both Overall Preference and Text Alignment, confirming genuine perceptual gains rather than reward artifacts. Calibrated models are also 2–3× faster than baselines.

## Qualitative Comparison: Visual Quality Improvements

Qualitative comparisons across diverse prompts and model architectures confirm the consistent visual improvements reported in the quantitative evaluation. All models use the same NFE as in Table 2.

#### What is Flow-GRPO?

Flow-GRPO is a reinforcement learning-based alignment method for flow matching diffusion models. It uses Group Relative Policy Optimization (GRPO) — a technique from LLM RLHF — to fine-tune the model's full weights toward a reward signal. Unlike Calibri, Flow-GRPO modifies millions to billions of parameters through gradient-based optimization and requires substantial GPU compute. The comparison here shows that Calibri (~100 scalars, gradient-free) can match Flow-GRPO's perceptual quality gains, and the two methods are complementary and composable.

## Conclusion

In this work, we introduced Calibri, a novel and parameter-efficient approach to enhance the generative capabilities of Diffusion Transformers (DiTs). By uncovering the potential of a single learned scaling parameter to optimize the contributions of DiT components, we demonstrated that significant performance improvements can be achieved with minimal parameter modifications.

Framing the calibration process as a black-box optimization problem solved via the CMA-ES evolutionary strategy, Calibri adjusts only ~10² parameters while delivering consistently improved generation quality. Additionally, the proposed inference-time scaling technique, Calibri Ensemble, effectively combines calibrated models to further enhance results.

Our extensive empirical evaluation across a range of text-to-image diffusion models confirmed the effectiveness and efficiency of Calibri, highlighting its ability to achieve superior generative quality with reduced computational costs. Notably, Calibri successfully reduces the number of inference steps required for image generation while retaining high-quality outputs, making it a practical solution for real-world applications where computational efficiency is critical.

### Key Takeaways

- Parameter-Efficiency: Calibri adjusts only ~100 scalars — no gradient-based fine-tuning required. One-time offline cost of 32–356 GPU-hours (H100).

- Universal Improvement: Consistent gains across FLUX, SD-3.5M, and Qwen-Image with HPSv3, Q-Align, and IR metrics simultaneously improved. Integrates with alignment methods like Flow-GRPO.

- Inference Speed: Calibri Ensemble enables 2–3× faster inference while maintaining or exceeding baseline quality, shifting the optimal step count from 30–50 to 10–15 steps.
