Institute of Information Engineering, CAS & JD.COM
Formal proof shows that distribution matching under information asymmetry induces an irreducible mutual information gap — making privileged-information leakage structurally unavoidable in OPSD.
RLSD repurposes the self-distillation teacher as a magnitude evaluator: environment reward governs direction, privileged teacher governs per-token update magnitude — eliminating leakage while keeping rich signal.
RLSD achieves the best average accuracy across 5 multimodal reasoning benchmarks (MMMU, MathVista, MathVision, ZeroBench, Wemath), outperforming GRPO by +2.32% and Base LLM by +4.69% on average.
On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes. Recently, on-policy self-distillation (OPSD) was explored, where the same model serves as both teacher and student, with the teacher receiving privileged information to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. We identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation): self-distillation determines token-level update magnitudes, while RLVR provides reliable update directions from environmental feedback. RLSD simultaneously harnesses the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
Reinforcement learning with verifiable rewards (RLVR) methods like GRPO have become central for training large reasoning models. Each trajectory receives only a single scalar reward from a verifier — a sparse signal — and all tokens within a response share the same advantage estimate, providing no token-level discrimination.
In reinforcement learning for LLMs, a "sparse reward" means the model only receives one feedback signal per entire response — a single ✓ or ✗ for correctness. All tokens in the response (potentially hundreds) share the same advantage score, so the model cannot tell which specific tokens were responsible for getting the answer right or wrong. It's like grading an essay only with a final score, never with per-sentence feedback.
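The sparsity can be made concrete with a few lines of NumPy (toy numbers, not from the paper): a GRPO-style group of four sampled responses, each graded with a single 0/1 reward, where every token inherits the same normalized advantage.

```python
import numpy as np

# Hypothetical verifier rewards for a group of 4 responses to one question.
rewards = np.array([1.0, 0.0, 0.0, 1.0])

# Sequence-level advantage: normalize within the group (GRPO-style).
advantages = (rewards - rewards.mean()) / rewards.std()

# Every token of a response inherits the SAME advantage -- this is the
# sparsity: no token-level discrimination within a response.
num_tokens = 6  # pretend each response has 6 tokens
token_advantages = np.repeat(advantages[:, None], num_tokens, axis=1)
```

All six tokens of the first (correct) response share the advantage +1.0, and all six tokens of the second (wrong) response share -1.0, regardless of which tokens actually mattered.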
On-Policy Self-Distillation (OPSD) attempts to solve this by using the same model as both teacher (receiving reference answers) and student (generating answers independently). However, this creates a fundamental asymmetry: the teacher sees privileged information the student cannot access at inference time.
OPSD-trained models systematically reference privileged information unavailable at inference time. For example, a model trained with OPSD might produce: "I need to determine if the sample mean is within $1 of the population mean ... Given that the reference solution uses 9 values ..." — explicitly using the reference answer it should not know. This leakage increases monotonically during training, causing performance to peak within 10–20 steps and subsequently decline.
In OPSD, the same model plays two roles simultaneously:
- **Teacher mode:** conditioned on the question *and* the reference answer r, producing the target token distributions.
- **Student mode:** conditioned on the question alone, trained to match the teacher's distributions.
This is "information asymmetry": teacher and student have different information. The problem is that the training objective tries to make the student match the teacher's token-by-token probability distributions — but the teacher's distributions encode the answer. The student can never achieve this match without secretly memorizing answer patterns, causing "leakage."
| Method | Trajectory | Efficiency | Leakage Risk | Signal | Direction Anchoring |
|---|---|---|---|---|---|
| SFT | Off-policy | High | N/A | Rich | Teacher |
| RLVR (GRPO) | On-policy | High | N/A | Weak | Environment |
| OPD | On-policy | Low | N/A | Rich | Teacher |
| OPSD | On-policy | High | Severe | Rich | Teacher |
| RLSD (Ours) | On-policy | High | N/A | Rich | Environment |
Combining the directional reliability of RLVR with the token-level richness of self-distillation
For each token in the trajectory, compute the difference in log-probabilities between Teacher Mode (sees reference answer r) and Student Mode (sees only question x): Δ_t = sg(log π_θ(y_t|x,r,y<t) − log π_θ(y_t|x,y<t)). The stop-gradient ensures this serves purely as a weighting signal.
The stop-gradient operator sg(·) prevents backpropagation through the teacher's computation. Without it, the gradient signal from the teacher's logits would flow back and update model weights toward encoding the reference answer r, causing leakage. With sg(·), Δ_t acts purely as a weight that scales gradient magnitude — the gradient direction still comes entirely from RLVR's environment reward, which has no knowledge of r.
Weight each token by w_t = exp(sign(A)·Δ_t), where A is the sequence-level advantage from RLVR. When the trajectory was correct (A > 0), tokens the teacher supports get larger weight; when incorrect (A < 0), those tokens receive stronger blame. This implements Bayesian-style credit assignment.
The weight formula w_t = exp(sign(A)·Δ_t) has a natural Bayesian reading:
When A > 0 (correct trajectory), tokens the teacher supports get w_t > 1 and thus more credit; when A < 0 (incorrect), those tokens receive stronger blame. This is exactly fine-grained credit assignment.
Following PPO's clipping philosophy, bound the token weight: w̃_t = A·(1−λ) + λ·clip(w_t, 1−ε_w, 1+ε_w). This prevents any single token from receiving excessive credit and avoids gradient explosion. No auxiliary distillation loss is added — only the internal redistribution of sequence-level credit.
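A worked numerical example may help; the values of Δ_t, λ, and ε_w below are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical values for one token of a correct trajectory:
delta = 0.5            # Δ_t: teacher assigns this token a higher log-prob than student
A = 1.0                # sequence-level advantage from the verifier
lam, eps_w = 0.5, 0.2  # mixing coefficient and clip bound (assumed)

w = np.exp(np.sign(A) * delta)                # direction-aware weight, ~1.649
w_clipped = np.clip(w, 1 - eps_w, 1 + eps_w)  # bounded to 1.2
w_tilde = A * (1 - lam) + lam * w_clipped     # 0.5 + 0.6 = 1.1
```

The clip step is what keeps a single teacher-favored token from dominating the update: the raw weight 1.649 is capped at 1.2 before mixing with the advantage.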
Require: Initial policy π_θ, dataset S={(xᵢ,rᵢ)}, verifier R(·,·), group size G, λ, ε_w
1: for each training step do
2: for each question x with privileged info r do
3: Sample G responses {y⁽¹⁾,...,y⁽ᴳ⁾} ~ π_θ(·|x)
4: // Sequence-level advantage from environment
5: Aᵢ = (R(y⁽ⁱ⁾, rᵢ) − μ_G) / σ_G [GRPO reward normalization]
6: // Token-level credit assignment via self-distillation
7: for each response y⁽ⁱ⁾ do
8: Compute teacher logits via forward pass with (x, r, y⁽ⁱ⁾)
9: Δₜ = sg(log π_θ(yₜ|x,r,y<ₜ) − log π_θ(yₜ|x,y<ₜ))
10: wₜ = exp(sign(Aᵢ)·Δₜ) [direction-aware weight]
11: w̃ₜ = Aᵢ·(1−λ) + λ·clip(wₜ, 1−ε_w, 1+ε_w)
12: end for
13: // Update policy maximizing E[∑ᵢ∑ₜ w̃ₜ·log π_θ(yₜ|x,y<ₜ)]
14: end for
15: end for
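Steps 8–11 of the pseudocode can be sketched as a vectorized NumPy function (the function name and default hyperparameters are my own; the stop-gradient is implicit here because the inputs are plain arrays, detached from any autograd graph):

```python
import numpy as np

def rlsd_token_coeffs(logp_teacher, logp_student, advantage, lam=0.5, eps_w=0.2):
    """Per-token update coefficients w̃_t for one response (pseudocode steps 8-11).

    logp_teacher: log π_θ(y_t | x, r, y_<t) per token (teacher mode, sees r)
    logp_student: log π_θ(y_t | x, y_<t)    per token (student mode)
    advantage:    scalar GRPO advantage A_i for the whole response
    """
    delta = logp_teacher - logp_student            # Δ_t (constant w.r.t. θ: stop-gradient)
    w = np.exp(np.sign(advantage) * delta)         # direction-aware weight
    w_clip = np.clip(w, 1.0 - eps_w, 1.0 + eps_w)  # PPO-style bound
    return advantage * (1.0 - lam) + lam * w_clip  # w̃_t

# Toy usage with made-up log-probs for a 3-token response:
lt = np.array([-1.2, -0.3, -2.0])
ls = np.array([-1.5, -0.9, -1.8])
coeffs = rlsd_token_coeffs(lt, ls, advantage=1.0)
```

Note that with λ = 0 the coefficients collapse to the plain GRPO advantage, so RLSD degrades gracefully to the RLVR baseline.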
The empirical failures of OPSD (leakage, performance degradation, KL stagnation) are not accidental — they follow from a structural deficiency in the distribution-matching objective. Two key results formalize this.
The OPSD objective and the ideal marginal objective differ by exactly an irreducible mutual information term:
The equation L_OPSD = L* + I(Y_t; R | X, Y_<t) says: the OPSD training objective is always strictly larger than the ideal objective by exactly the mutual information between the current token and the privileged answer.
Why does this matter? Think of I(Y_t; R | X, Y_<t) as a "knowledge gap tax": it measures how much the teacher's next-token prediction depends on knowing the answer. This gap is mathematically irreducible; the student hits a floor it cannot break through, because it can never condition on R.
Business analogy: Imagine training a sales rep to match an experienced colleague who secretly knows the client's budget. The rep can copy communication style, but the expert's pricing decisions inherently encode private knowledge — the rep can never fully close the gap without access to that private info.
The term I(Y_t; R | X, Y_{<t}) is the conditional mutual information between the current token and the privileged information under the teacher. Since the student cannot condition on R, this gap is irreducible — the student can never reach the teacher's objective. This explains why KL divergence stagnates: once the student approaches the marginal distribution, the residual gap I(·) forms an impassable floor.
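One way to verify the identity is to add and subtract the teacher's marginal log-probability inside the expected KL (notation as in the text; expectations taken over the teacher's sampling distribution, so the decomposition is exact, not a bound):

```latex
\begin{align}
\mathcal{L}_{\mathrm{OPSD}}(\theta)
 &= \mathbb{E}_{r}\,\mathbb{E}_{y_t \sim p(\cdot \mid x, r, y_{<t})}
    \left[\log \frac{p(y_t \mid x, r, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})}\right] \\
 &= \underbrace{\mathbb{E}\!\left[\log \frac{p(y_t \mid x, r, y_{<t})}{p(y_t \mid x, y_{<t})}\right]}_{I(Y_t;\, R \mid X,\, Y_{<t})}
  \;+\; \underbrace{\mathbb{E}\!\left[\log \frac{p(y_t \mid x, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})}\right]}_{\mathcal{L}^{*}(\theta)}
\end{align}
```

Only the second term depends on θ, so the best any student can do is drive L*(θ) to zero; the mutual information term remains as the irreducible floor.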
For any concrete realization of privileged information r, the per-sample gradient decomposes into a beneficial marginal-matching component g*(θ) and an r-specific deviation δ(θ; r):
The gradient decomposition g(θ;r) = g*(θ) + δ(θ;r) splits training into two regimes:
- Early training: the beneficial marginal-matching term g*(θ) dominates and the r-dependent noise is benign.
- Late training: the zero-mean deviations δ(θ;r) accumulate path-dependently, steering the model toward encoding the x → r correlation.
Practical implication: OPSD appears to work early on (first 10–20 steps), then degrades. This creates a dangerous illusion that the method is sound when it's actually poisoning long-term training.
While the expected deviation is zero (E_r[δ] = 0), stochastic gradient descent operates on individual samples. Each mini-batch injects r-dependent noise. Early in training this is benign (the beneficial term dominates), but as training progresses, path-dependent accumulation drives the model toward regions that encode the x → r correlation — triggering information leakage.
RLSD breaks the connection by not using the privileged teacher to set the gradient direction. Instead, the teacher's assessment is squashed into per-token weights (magnitudes) through the stop-gradient operation, while RLVR's environment reward remains the sole source of gradient direction. This eliminates the r-specific deviation δ(θ; r) from the gradient.
| Method | MMMU | MathVista | MathVision | ZeroBench | Wemath | Avg. |
|---|---|---|---|---|---|---|
| Base LLM | 62.44 | 73.80 | 47.37 | 19.76 | 54.10 | 51.49 |
| GRPO | 65.11 | 76.20 | 48.82 | 22.60 | 56.57 | 53.86 |
| OPSD | 63.82 | 75.10 | 47.53 | 21.06 | 54.95 | 52.49 |
| SDPO | 65.11 | 74.00 | 47.27 | 25.15 | 52.19 | 52.74 |
| GRPO+OPSD | 63.22 | 75.90 | 48.52 | 22.16 | 54.76 | 52.91 |
| RLSD (Ours) | 67.22 | 78.10 | 52.73 | 24.85 | 58.00 | 56.18 |
RLSD achieves the highest average accuracy at 56.18% (4K context), outperforming Base LLM by +4.69% and GRPO by +2.32%. Notably, OPSD actually performs worse than GRPO (52.49% vs 53.86%), confirming that naive self-distillation degrades performance. The simple linear combination GRPO+OPSD (52.91%) also fails to recover, demonstrating that the fix requires a fundamental redesign — not just mixing objectives.
This is counter-intuitive: OPSD (52.49% avg) underperforms GRPO (53.86% avg) despite having access to richer information (reference answers). The Proposition 1 analysis explains why: the r-specific gradient deviation δ(θ;r) accumulates over training, corrupting the model's decision boundaries. The model learns to implicitly rely on reference answer patterns that don't exist at inference time — so performance degrades relative to GRPO, which never had this corruption. More information + wrong training objective = worse outcome.
This work identifies a fundamental limitation of on-policy self-distillation (OPSD) and proposes RLSD as a principled solution. Three key takeaways:
1. Distribution matching under information asymmetry induces an irreducible mutual information gap, making privileged-information leakage structurally unavoidable in OPSD.
2. Repurposing the self-distillation teacher as a magnitude evaluator (environment reward sets the update direction, the privileged teacher sets per-token magnitude) eliminates leakage while retaining rich signal.
3. RLSD achieves the best average accuracy across five multimodal reasoning benchmarks, outperforming GRPO by +2.32% and the base model by +4.69%.
This paper's experiments focus on Qwen3-VL-8B-Instruct with multimodal reasoning tasks. Future work should evaluate RLSD on pure language reasoning models, larger scales, and diverse domains. The mixing coefficient λ and clip bound ε_w require tuning — adaptive methods for these hyperparameters would improve accessibility.
| Property | OPSD (Frozen) | OPSD (Online) | RLSD |
|---|---|---|---|
| (a) Objective stability | ✓ | ✗ | ✓ |
| (b) Sustained improvement | ✗ | ✓ | ✓ |
| (c) Leakage-free training | ✗ | ✗ | ✓ |