---
arxiv_id: 2604.03128
title: "Self-Distilled RLVR"
authors:
  - Chenxu Yang
  - Chuanyu Qin
  - Qingyi Si
  - Minghui Chen
  - Naibin Gu
  - Dingyu Yao
  - Zheng Lin
  - Weiping Wang
  - Jiaqi Wang
  - Nan Duan
difficulty: Advanced
tags:
  - LLM
  - Agent
  - Reasoning
published_at: 2026-04-03
flecto_url: https://flecto.zer0ai.dev/papers/2604.03128/
lang: en
---

> Self-Distilled RLVR

**Authors**: Chenxu Yang*, Chuanyu Qin*, Qingyi Si*, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan

## Abstract

### Abstract

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes. Recently, on-policy self-distillation (OPSD) was explored, where the same model serves as both teacher and student, with the teacher receiving privileged information to enable self-evolution. This paper demonstrates that learning signals derived solely from the privileged teacher result in severe information leakage and unstable long-term training. We identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation): self-distillation determines token-level update magnitudes, while RLVR provides reliable update directions from environmental feedback. RLSD simultaneously harnesses the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.

## Results

### Experimental Results

## Conclusion

### Conclusion

## References

### References

## Head Title

### Self-Distilled RLVR | Flecto

## Head Meta

RLSD combines RLVR and self-distillation to achieve token-level credit assignment, solving the information leakage problem of OPSD for better LLM training.

## Hero Button

### Read on arXiv ↗

### Jump to Results

## Contributions

### Key Contributions

## Contributions Card1

### Root Cause of OPSD Failure Identified

Formal proof shows that distribution matching under information asymmetry induces an irreducible mutual information gap — making privileged-information leakage structurally unavoidable in OPSD.

## Contributions Card2

### Token-Level Credit Assignment via Self-Distillation

RLSD repurposes the self-distillation teacher as a magnitude evaluator: the environment reward governs direction, while the privileged teacher governs per-token update magnitude — eliminating leakage while keeping the rich signal.

## Contributions Card3

### State-of-the-Art Multimodal Reasoning

RLSD achieves the best average accuracy across 5 multimodal reasoning benchmarks (MMMU, MathVista, MathVision, ZeroBench, Wemath), outperforming GRPO by +2.32% and Base LLM by +4.69% on average.

## Background

### Background & The Problem with OPSD

## Background Intro1

Reinforcement learning with verifiable rewards (RLVR) methods like GRPO have become central for training large reasoning models. Each trajectory receives only a single scalar reward from a verifier — a sparse signal — and all tokens within a response share the same advantage estimate, providing no token-level discrimination.
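To make the sparsity concrete, here is a minimal sketch of GRPO-style group-relative advantage estimation (an illustrative simplification, not the paper's implementation): each trajectory's scalar reward is normalized against its sampling group, and that single value is then shared by every token in the response.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each trajectory's scalar
    reward against the group mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled trajectories for one question, scored 0/1 by a verifier.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # → [1.0, -1.0, -1.0, 1.0]
# Every token in a trajectory inherits the same scalar, so GRPO cannot
# distinguish a decisive reasoning step from a throwaway one.
```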

## Background Intro2

On-Policy Self-Distillation (OPSD) attempts to solve this by using the same model as both teacher (receiving reference answers) and student (generating answers independently). However, this creates a fundamental asymmetry: the teacher sees privileged information the student cannot access at inference time.

## Background Callout

### &#x26A0;&#xFE0F; Why OPSD Fails: Information Leakage

OPSD-trained models systematically reference privileged information unavailable at inference time. For example, a model trained with OPSD might produce: "I need to determine if the sample mean is within $1 of the population mean ... Given that the reference solution uses 9 values ..." — explicitly using the reference answer it should not know. This leakage increases monotonically during training, causing performance to peak within 10–20 steps and subsequently decline.

## Background Figure2

Figure 1. Performance on Qwen3-VL-8B-Instruct. (a) OPSD peaks early and degrades; RLSD inherits GRPO's stable optimization direction and OPSD's rich signal. (b) RLSD achieves best accuracy across all reasoning benchmarks.

## Background Table1

### Comparison of Training Paradigms

## Background Figure3

Figure 2. Leakage occurrence, validation performance, and KL divergence of OPSD and ablated variants. OPSD shows monotonically increasing leakage, declining performance, and stagnating KL divergence.

## Method

### The RLSD Method

### Combining the directional reliability of RLVR with the token-level richness of self-distillation

## Method Figure4

Figure 3. RLSD architecture. Left: the policy model runs in both Student Mode and Teacher Mode. Center: RLSD uses token-level log-probability differences (privileged info gain) to compute update magnitudes. Right: GRPO environment feedback determines update direction.

## Method Step1

### Step 1: Privileged Information Gain

For each token in the trajectory, compute the difference in log-probabilities between Teacher Mode (sees reference answer r) and Student Mode (sees only question x): Δ_t = sg(log π_θ(y_t | x, r, y_{<t}) − log π_θ(y_t | x, y_{<t})). The stop-gradient sg(·) ensures this serves purely as a weighting signal.
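A minimal sketch of this computation, assuming the two per-token log-probability sequences have already been gathered from a teacher-mode and a student-mode forward pass (function and variable names here are illustrative):

```python
def privileged_gain(logp_teacher, logp_student):
    """Per-token privileged information gain Δ_t.

    logp_teacher[t] ≈ log π_θ(y_t | x, r, y_<t)  (teacher mode, sees reference r)
    logp_student[t] ≈ log π_θ(y_t | x, y_<t)     (student mode, question only)

    In an autograd framework both passes share the same weights and the
    result is wrapped in a stop-gradient (e.g. tensor.detach()), so Δ_t
    acts only as a weighting signal, never as a training target.
    """
    return [lt - ls for lt, ls in zip(logp_teacher, logp_student)]
```

A positive Δ_t marks a token the model finds more plausible once it can see the reference answer; a negative Δ_t marks a token the privileged view argues against.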

## Method Step2

### Step 2: Direction-Aware Evidence Reweighting

Weight each token by w_t = exp(sign(A)·Δ_t), where A is the sequence-level advantage from RLVR. When the trajectory was correct (A > 0), tokens the teacher supports get larger weight; when incorrect (A < 0), those tokens receive stronger blame. This implements Bayesian-style credit assignment.
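A sketch that directly transcribes the reweighting rule w_t = exp(sign(A)·Δ_t):

```python
import math

def token_weights(gains, advantage):
    """w_t = exp(sign(A) * Δ_t).

    When A > 0, tokens the privileged teacher assigns higher probability
    (Δ_t > 0) get weight > 1; when A < 0, those same tokens absorb more blame.
    """
    sign = (advantage > 0) - (advantage < 0)  # sign(A) without numpy
    return [math.exp(sign * d) for d in gains]

# Correct trajectory: teacher-supported token amplified, opposed token damped.
token_weights([0.5, -0.5], advantage=1.0)  # → [e^0.5 ≈ 1.65, e^-0.5 ≈ 0.61]
```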

## Method Step3

### Step 3: Clipped Credit Assignment

Following PPO's clipping philosophy, bound the token weight: w̃_t = A·(1−λ) + λ·clip(w_t, 1−ε_w, 1+ε_w). This prevents any single token from receiving excessive credit and avoids gradient explosion. No auxiliary distillation loss is added — only the internal redistribution of sequence-level credit.
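A sketch of the clipped rule exactly as stated above; the default λ and ε_w values here are illustrative placeholders, not the paper's settings:

```python
def clipped_credit(weights, advantage, lam=0.5, eps_w=0.2):
    """w̃_t = A·(1−λ) + λ·clip(w_t, 1−ε_w, 1+ε_w).

    Clipping bounds how much credit any single token can absorb,
    mirroring PPO's ratio clipping, while λ interpolates toward the
    uniform sequence-level advantage.
    """
    lo, hi = 1.0 - eps_w, 1.0 + eps_w
    return [advantage * (1.0 - lam) + lam * min(max(w, lo), hi) for w in weights]
```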

## Method Algo

### Algorithm 1 — RLSD: Reinforcement Learning with Self-Distillation

## Theory

### Why OPSD Fails — Theoretical Analysis

## Theory Intro

The empirical failures of OPSD (leakage, performance degradation, KL stagnation) are not accidental — they follow from a structural deficiency in the distribution-matching objective. Two key results formalize this.

## Theory Theorem1

The OPSD objective and the ideal marginal objective differ by exactly an irreducible mutual information term:
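The page omits the equation itself. One standard decomposition consistent with this statement (a sketch under the assumption that OPSD performs forward-KL matching toward the privileged teacher p(y | x, r); not necessarily the paper's exact notation) is:

```latex
\mathbb{E}_{x,r}\!\left[\mathrm{KL}\!\left(p(y \mid x, r)\,\|\,q_\theta(y \mid x)\right)\right]
  = \underbrace{\mathbb{E}_{x}\!\left[\mathrm{KL}\!\left(p(y \mid x)\,\|\,q_\theta(y \mid x)\right)\right]}_{\text{marginal matching}}
  + \underbrace{I(y; r \mid x)}_{\text{irreducible gap}}
```

Even a perfect student q_θ(y | x) = p(y | x) leaves the conditional mutual information I(y; r | x) behind, so the objective keeps pressuring the student to imitate r-dependent behavior it cannot reproduce without access to r.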

## Theory Prop1

For any concrete realization of privileged information r, the per-sample gradient decomposes into a beneficial marginal-matching component g*(θ) and an r-specific deviation δ(θ; r):
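In symbols, one way to write the decomposition described above (a sketch; the near-zero mean of the deviation is an assumption consistent with calling g*(θ) the "beneficial marginal-matching component"):

```latex
\nabla_\theta \mathcal{L}(\theta; x, r) = g^{*}(\theta) + \delta(\theta; r),
\qquad \mathbb{E}_{r \mid x}\!\left[\delta(\theta; r)\right] \approx 0
```

Each individual update still carries the r-specific term δ(θ; r), which pulls the policy toward behavior conditioned on that particular reference answer; this is the mechanism behind the observed leakage.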

## Theory Insight

### &#x1F4A1; The RLSD Fix

RLSD breaks the connection by not using the privileged teacher to set the gradient direction. Instead, the teacher's assessment is squashed into per-token weights (magnitudes) through the stop-gradient operation, while RLVR's environment reward remains the sole source of gradient direction. This eliminates the r-specific deviation δ(θ; r) from the gradient.

## Results Metric Mmmu

### Best result (+2.11 vs GRPO)

## Results Metric Mathvista

### Best result (+1.90 vs GRPO)

## Results Metric Mathvision

### Best result (+3.91 vs GRPO)

## Results Metric Avg

### Best average (+2.32 vs GRPO)

## Results Table2

### Multimodal Reasoning Benchmark Results (Qwen3-VL-8B-Instruct)

## Results Analysis

RLSD achieves the highest average accuracy at 56.18% (4K context), outperforming Base LLM by +4.69% and GRPO by +2.32%. Notably, OPSD actually performs worse than GRPO (52.49% vs 53.86%), confirming that naive self-distillation degrades performance. The simple linear combination GRPO+OPSD (52.91%) also fails to recover, demonstrating that the fix requires a fundamental redesign — not just mixing objectives.

## Results Figure5

Figure 4. Training dynamics over 200 steps. (a) RLSD reaches a higher reward ceiling. (b) RLSD maintains higher entropy (exploration diversity). (c) RLSD's clipping on credit assignment stays stable throughout training.

## Results Figure6

Figure 5. Token-level credit heatmaps. Top (correct trajectory): RLSD concentrates credit on decisive counting/subtraction steps. Bottom (incorrect trajectory): RLSD concentrates blame on the misread relation "3x = 28.5", correctly identifying the source of error.

## Related

### Related Work

## Related Rlvr

### Credit Assignment in RLVR

## Related Rlvr Body

GRPO and similar RLVR methods assign uniform sequence-level advantages to all tokens. Recent work seeks finer granularity using model-internal proxies such as entropy, uncertainty, key-token statistics, and attention weights. RLSD differs by using privileged information from a self-distillation teacher — but strictly as a magnitude signal, not a direction signal, to avoid leakage.

## Related Opd

### On-Policy Distillation

## Related Opd Body

OPD uses a separate larger teacher to provide dense token-level supervision, trading efficiency for richer signals. OPSD collapses teacher and student into the same model using privileged information. Self-Distilled Reasoner (Zhao et al., 2026) and MIMO-v2-Flash are contemporaneous works in this space. RLSD departs from all of these by abandoning distribution matching as the training objective for self-distillation.

## Conclusion Intro

This work identifies a fundamental limitation of on-policy self-distillation (OPSD) and proposes RLSD as a principled solution. Three key takeaways:

## Conclusion Bullet1

Root cause identified: OPSD's distribution-matching objective is structurally ill-posed under information asymmetry, inducing an irreducible mutual information gap and per-sample gradient deviation that drives leakage.

## Conclusion Bullet2

RLSD resolves the tension: By using environment rewards for direction and self-distillation only for per-token magnitude weights (via stop-gradient), RLSD avoids leakage entirely while retaining dense credit assignment.

## Conclusion Bullet3

Strong empirical results: RLSD achieves state-of-the-art on 5 multimodal reasoning benchmarks, with the highest average accuracy of 56.18% — a +4.69% improvement over the base LLM and +2.32% over GRPO.

## Conclusion Limitations

### Limitations & Future Work

## Conclusion Limitations Body

This paper's experiments focus on Qwen3-VL-8B-Instruct with multimodal reasoning tasks. Future work should evaluate RLSD on pure language reasoning models, larger scales, and diverse domains. The mixing coefficient λ and clip bound ε_w require tuning — adaptive methods for these hyperparameters would improve accessibility.

## Appendix

### Appendix — Theoretical Properties

## Appendix Table

### Table A1: Properties of OPSD Variants and RLSD
