Institute of Information Engineering, CAS & JD.COM
Formal proof shows that distribution matching under information asymmetry induces an irreducible mutual information gap — making privileged-information leakage structurally unavoidable in OPSD.
RLSD repurposes the self-distillation teacher as a magnitude evaluator: environment reward governs direction, privileged teacher governs per-token update magnitude — eliminating leakage while keeping rich signal.
RLSD achieves the best average accuracy across 5 multimodal reasoning benchmarks (MMMU, MathVista, MathVision, ZeroBench, Wemath), outperforming GRPO by +2.32% and Base LLM by +4.69% on average.
On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes. Recently, on-policy self-distillation (OPSD) was explored, where the same model serves as both teacher and student, with the teacher receiving privileged information to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. We identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation): self-distillation determines token-level update magnitudes, while RLVR provides reliable update directions from environmental feedback. RLSD simultaneously harnesses the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
Reinforcement learning with verifiable rewards (RLVR) methods like GRPO have become central for training large reasoning models. Each trajectory receives only a single scalar reward from a verifier — a sparse signal — and all tokens within a response share the same advantage estimate, providing no token-level discrimination.
In reinforcement learning for LLMs, a "sparse reward" means the model only receives one feedback signal per entire response — a single ✓ or ✗ for correctness. All tokens in the response (potentially hundreds) share the same advantage score, so the model cannot tell which specific tokens were responsible for getting the answer right or wrong. It's like grading an essay only with a final score, never with per-sentence feedback.
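The sparsity can be made concrete with a few lines of NumPy (toy numbers, not from the paper): a GRPO-style group of four sampled responses, each graded with a single 0/1 reward, where every token inherits the same normalized advantage.

```python
import numpy as np

# Hypothetical verifier rewards for a group of 4 responses to one question.
rewards = np.array([1.0, 0.0, 0.0, 1.0])

# Sequence-level advantage: normalize within the group (GRPO-style).
advantages = (rewards - rewards.mean()) / rewards.std()

# Every token of a response inherits the SAME advantage -- this is the
# sparsity: no token-level discrimination within a response.
num_tokens = 6  # pretend each response has 6 tokens
token_advantages = np.repeat(advantages[:, None], num_tokens, axis=1)
```

All six tokens of the first (correct) response share the advantage +1.0, and all six tokens of the second (wrong) response share -1.0, regardless of which tokens actually mattered.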
On-Policy Self-Distillation (OPSD) attempts to solve this by using the same model as both teacher (receiving reference answers) and student (generating answers independently). However, this creates a fundamental asymmetry: the teacher sees privileged information the student cannot access at inference time.
OPSD-trained models systematically reference privileged information unavailable at inference time. For example, a model trained with OPSD might produce: "I need to determine if the sample mean is within $1 of the population mean ... Given that the reference solution uses 9 values ..." — explicitly using the reference answer it should not know. This leakage increases monotonically during training, causing performance to peak within 10–20 steps and subsequently decline.
In OPSD, the same model plays two roles simultaneously:
- **Teacher mode:** conditioned on the question *and* the reference answer r, producing the target token distributions.
- **Student mode:** conditioned on the question alone, trained to match the teacher's distributions.
This is "information asymmetry": teacher and student have different information. The problem is that the training objective tries to make the student match the teacher's token-by-token probability distributions — but the teacher's distributions encode the answer. The student can never achieve this match without secretly memorizing answer patterns, causing "leakage."
| Method | Trajectory | Efficiency | Leakage Risk | Signal | Direction Anchoring |
|---|---|---|---|---|---|
| SFT | Off-policy | High | N/A | Rich | Teacher |
| RLVR (GRPO) | On-policy | High | N/A | Weak | Environment |
| OPD | On-policy | Low | N/A | Rich | Teacher |
| OPSD | On-policy | High | Severe | Rich | Teacher |
| RLSD (Ours) | On-policy | High | N/A | Rich | Environment |
Combining the directional reliability of RLVR with the token-level richness of self-distillation
For each token in the trajectory, compute the difference in log-probabilities between Teacher Mode (sees reference answer r) and Student Mode (sees only question x): Δ_t = sg(log π_θ(y_t|x,r,y<t) − log π_θ(y_t|x,y<t)). The stop-gradient ensures this serves purely as a weighting signal.
The stop-gradient operator sg(·) prevents backpropagation through the teacher's computation. Without it, the gradient signal from the teacher's logits would flow back and update model weights toward encoding the reference answer r, causing leakage. With sg(·), Δ_t acts purely as a weight that scales gradient magnitude — the gradient direction still comes entirely from RLVR's environment reward, which has no knowledge of r.
Weight each token by w_t = exp(sign(A)·Δ_t), where A is the sequence-level advantage from RLVR. When the trajectory was correct (A > 0), tokens the teacher supports get larger weight; when incorrect (A < 0), those tokens receive stronger blame. This implements Bayesian-style credit assignment.
The weight formula w_t = exp(sign(A)·Δ_t) has a natural Bayesian reading:
When A > 0 (correct trajectory), tokens the teacher supports get w_t > 1 and thus more credit; when A < 0 (incorrect), those tokens receive stronger blame. This is exactly fine-grained credit assignment.
Following PPO's clipping philosophy, bound the token weight: w̃_t = A·(1−λ) + λ·clip(w_t, 1−ε_w, 1+ε_w). This prevents any single token from receiving excessive credit and avoids gradient explosion. No auxiliary distillation loss is added — only the internal redistribution of sequence-level credit.
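A worked numerical example may help; the values of Δ_t, λ, and ε_w below are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical values for one token of a correct trajectory:
delta = 0.5            # Δ_t: teacher assigns this token a higher log-prob than student
A = 1.0                # sequence-level advantage from the verifier
lam, eps_w = 0.5, 0.2  # mixing coefficient and clip bound (assumed)

w = np.exp(np.sign(A) * delta)                # direction-aware weight, ~1.649
w_clipped = np.clip(w, 1 - eps_w, 1 + eps_w)  # bounded to 1.2
w_tilde = A * (1 - lam) + lam * w_clipped     # 0.5 + 0.6 = 1.1
```

The clip step is what keeps a single teacher-favored token from dominating the update: the raw weight 1.649 is capped at 1.2 before mixing with the advantage.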
Require: Initial policy π_θ, dataset S={(xᵢ,rᵢ)}, verifier R(·,·), group size G, λ, ε_w
1: for each training step do
2: for each question x with privileged info r do
3: Sample G responses {y⁽¹⁾,...,y⁽ᴳ⁾} ~ π_θ(·|x)
4: // Sequence-level advantage from environment
5: Aᵢ = (R(y⁽ⁱ⁾, rᵢ) − μ_G) / σ_G [GRPO reward normalization]
6: // Token-level credit assignment via self-distillation
7: for each response y⁽ⁱ⁾ do
8: Compute teacher logits via forward pass with (x, r, y⁽ⁱ⁾)
9: Δₜ = sg(log π_θ(yₜ|x,r,y<ₜ) − log π_θ(yₜ|x,y<ₜ))
10: wₜ = exp(sign(Aᵢ)·Δₜ) [direction-aware weight]
11: w̃ₜ = Aᵢ·(1−λ) + λ·clip(wₜ, 1−ε_w, 1+ε_w)
12: end for
13: // Update policy maximizing E[∑ᵢ∑ₜ w̃ₜ·log π_θ(yₜ|x,y<ₜ)]
14: end for
15: end for
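Steps 8–11 of the pseudocode can be sketched as a vectorized NumPy function (the function name and default hyperparameters are my own; the stop-gradient is implicit here because the inputs are plain arrays, detached from any autograd graph):

```python
import numpy as np

def rlsd_token_coeffs(logp_teacher, logp_student, advantage, lam=0.5, eps_w=0.2):
    """Per-token update coefficients w̃_t for one response (pseudocode steps 8-11).

    logp_teacher: log π_θ(y_t | x, r, y_<t) per token (teacher mode, sees r)
    logp_student: log π_θ(y_t | x, y_<t)    per token (student mode)
    advantage:    scalar GRPO advantage A_i for the whole response
    """
    delta = logp_teacher - logp_student            # Δ_t (constant w.r.t. θ: stop-gradient)
    w = np.exp(np.sign(advantage) * delta)         # direction-aware weight
    w_clip = np.clip(w, 1.0 - eps_w, 1.0 + eps_w)  # PPO-style bound
    return advantage * (1.0 - lam) + lam * w_clip  # w̃_t

# Toy usage with made-up log-probs for a 3-token response:
lt = np.array([-1.2, -0.3, -2.0])
ls = np.array([-1.5, -0.9, -1.8])
coeffs = rlsd_token_coeffs(lt, ls, advantage=1.0)
```

Note that with λ = 0 the coefficients collapse to the plain GRPO advantage, so RLSD degrades gracefully to the RLVR baseline.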
The empirical failures of OPSD (leakage, performance degradation, KL stagnation) are not accidental — they follow from a structural deficiency in the distribution-matching objective. Two key results formalize this.
The OPSD objective and the ideal marginal objective differ by exactly an irreducible mutual information term:
The equation L_OPSD = L* + I(Y_t; R | X, Y_<t) says: the OPSD training objective is always strictly larger than the ideal objective by exactly the mutual information between the current token and the privileged answer.
Why does this matter? Think of I(Y_t; R | X, Y_<t) as a "knowledge gap tax": it measures how much the teacher's next-token prediction depends on knowing the answer. This gap is mathematically irreducible; the student hits a floor it cannot break through, because it can never condition on R.
Business analogy: Imagine training a sales rep to match an experienced colleague who secretly knows the client's budget. The rep can copy communication style, but the expert's pricing decisions inherently encode private knowledge — the rep can never fully close the gap without access to that private info.
The term I(Y_t; R | X, Y_{<t}) is the conditional mutual information between the current token and the privileged information under the teacher. Since the student cannot condition on R, this gap is irreducible — the student can never reach the teacher's objective. This explains why KL divergence stagnates: once the student approaches the marginal distribution, the residual gap I(·) forms an impassable floor.
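One way to verify the identity is to add and subtract the teacher's marginal log-probability inside the expected KL (notation as in the text; expectations taken over the teacher's sampling distribution, so the decomposition is exact, not a bound):

```latex
\begin{align}
\mathcal{L}_{\mathrm{OPSD}}(\theta)
 &= \mathbb{E}_{r}\,\mathbb{E}_{y_t \sim p(\cdot \mid x, r, y_{<t})}
    \left[\log \frac{p(y_t \mid x, r, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})}\right] \\
 &= \underbrace{\mathbb{E}\!\left[\log \frac{p(y_t \mid x, r, y_{<t})}{p(y_t \mid x, y_{<t})}\right]}_{I(Y_t;\, R \mid X,\, Y_{<t})}
  \;+\; \underbrace{\mathbb{E}\!\left[\log \frac{p(y_t \mid x, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})}\right]}_{\mathcal{L}^{*}(\theta)}
\end{align}
```

Only the second term depends on θ, so the best any student can do is drive L*(θ) to zero; the mutual information term remains as the irreducible floor.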
For any concrete realization of privileged information r, the per-sample gradient decomposes into a beneficial marginal-matching component g*(θ) and an r-specific deviation δ(θ; r):
The gradient decomposition g(θ;r) = g*(θ) + δ(θ;r) splits training into two regimes:
- Early training: the beneficial marginal-matching term g*(θ) dominates and the r-dependent noise is benign.
- Late training: the zero-mean deviations δ(θ;r) accumulate path-dependently, steering the model toward encoding the x → r correlation.
Practical implication: OPSD appears to work early on (first 10–20 steps), then degrades. This creates a dangerous illusion that the method is sound when it's actually poisoning long-term training.
While the expected deviation is zero (E_r[δ] = 0), stochastic gradient descent operates on individual samples. Each mini-batch injects r-dependent noise. Early in training this is benign (the beneficial term dominates), but as training progresses, path-dependent accumulation drives the model toward regions that encode the x → r correlation — triggering information leakage.
RLSD breaks the connection by not using the privileged teacher to set the gradient direction. Instead, the teacher's assessment is squashed into per-token weights (magnitudes) through the stop-gradient operation, while RLVR's environment reward remains the sole source of gradient direction. This eliminates the r-specific deviation δ(θ; r) from the gradient.
| Method | MMMU | MathVista | MathVision | ZeroBench | Wemath | Avg. |
|---|---|---|---|---|---|---|
| Base LLM | 62.44 | 73.80 | 47.37 | 19.76 | 54.10 | 51.49 |
| GRPO | 65.11 | 76.20 | 48.82 | 22.60 | 56.57 | 53.86 |
| OPSD | 63.82 | 75.10 | 47.53 | 21.06 | 54.95 | 52.49 |
| SDPO | 65.11 | 74.00 | 47.27 | 25.15 | 52.19 | 52.74 |
| GRPO+OPSD | 63.22 | 75.90 | 48.52 | 22.16 | 54.76 | 52.91 |
| RLSD (Ours) | 67.22 | 78.10 | 52.73 | 24.85 | 58.00 | 56.18 |
RLSD achieves the highest average accuracy at 56.18% (4K context), outperforming Base LLM by +4.69% and GRPO by +2.32%. Notably, OPSD actually performs worse than GRPO (52.49% vs 53.86%), confirming that naive self-distillation degrades performance. The simple linear combination GRPO+OPSD (52.91%) also fails to recover, demonstrating that the fix requires a fundamental redesign — not just mixing objectives.
This is counter-intuitive: OPSD (52.49% avg) underperforms GRPO (53.86% avg) despite having access to richer information (reference answers). The Proposition 1 analysis explains why: the r-specific gradient deviation δ(θ;r) accumulates over training, corrupting the model's decision boundaries. The model learns to implicitly rely on reference answer patterns that don't exist at inference time — so performance degrades relative to GRPO, which never had this corruption. More information + wrong training objective = worse outcome.
This work identifies a fundamental limitation of on-policy self-distillation (OPSD) and proposes RLSD as a principled solution. Three key takeaways:
1. Distribution matching under information asymmetry induces an irreducible mutual information gap, making privileged-information leakage structurally unavoidable in OPSD.
2. Repurposing the self-distillation teacher as a magnitude evaluator (environment reward sets the update direction, the privileged teacher sets per-token magnitude) eliminates leakage while retaining rich signal.
3. RLSD achieves the best average accuracy across five multimodal reasoning benchmarks, outperforming GRPO by +2.32% and the base model by +4.69%.
This paper's experiments focus on Qwen3-VL-8B-Instruct with multimodal reasoning tasks. Future work should evaluate RLSD on pure language reasoning models, larger scales, and diverse domains. The mixing coefficient λ and clip bound ε_w require tuning — adaptive methods for these hyperparameters would improve accessibility.
| Property | OPSD (Frozen) | OPSD (Online) | RLSD |
|---|---|---|---|
| (a) Objective stability | ✓ | ✗ | ✓ |
| (b) Sustained improvement | ✗ | ✓ | ✓ |
| (c) Leakage-free training | ✗ | ✗ | ✓ |