arXiv:2604.12627 · LLM Reasoning · RL

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

Linhao Yu, Tianmeng Yang, Siyu Ding, Renren Jin, Naibin Gu, Xiangzhao Hao, Shuaiyi Nie, Deyi Xiong, Weichong Yin, Yu Sun, Hua Wu — TJUNLP Lab (Tianjin University) · Baidu Inc. · Institute of Information Engineering, CAS

RLVR keeps stalling on hard reasoning problems because rewards are sparse, and adding longer prefixes or richer abstractions just introduces redundancy and confusion. KnowRL reframes hinting as a minimal-sufficient guidance problem — decompose hints into atomic Knowledge Points (KPs) and pick a compact, interaction-aware subset with Constrained Subset Search (CSS). At 1.5B scale, KnowRL-Nemotron hits 70.08 average accuracy on eight reasoning benchmarks without KPs at inference (+9.63 vs Nemotron-1.5B) and 74.16 with selected KPs — a new SOTA.


70.08 · Avg Acc w/o KP (+9.63)

74.16 · Avg Acc w/ CSS KP (SOTA)

8 · Reasoning Benchmarks

1.5B · Model Parameters

Abstract

What the paper proposes

RLVR (Reinforcement Learning with Verifiable Rewards) improves the reasoning ability of large language models, but it stalls on hard problems because reward signals are sparse: when every rollout for a problem fails, group-based objectives like GRPO compute zero advantage, so that problem contributes no gradient. Existing hint-based RL methods patch this by injecting partial solutions or abstract templates, but they treat guidance as a quantity-expansion problem: longer prefixes, richer templates. This introduces redundancy, cross-hint inconsistency, and extra inference overhead.

KnowRL (Knowledge-Guided Reinforcement Learning) reframes hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes hints into atomic Knowledge Points (KPs) — the indispensable mathematical principles for each problem — and uses Constrained Subset Search (CSS) to pick a compact, interaction-aware subset. The paper also identifies a pruning interaction paradox: removing any single KP may help, but removing several at once can hurt — so robust subset curation matters.

Training KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B on CSS-curated data reaches 70.08 average accuracy across eight reasoning benchmarks without KP hints at inference time — already +9.63 over Nemotron-1.5B and +1.50 over JustRL. With selected KPs at inference it further improves to 74.16, a new state of the art at the 1.5B scale. Model, curated data, and code are open-sourced at github.com/Hasuer/KnowRL.

Section 01 · Motivation

Three Limitations of Existing Hint-Based RL

Recent methods — fixed-ratio prefix hints (QuestA, POPE), adaptive solution hints (StepHint, UFT), and abstraction-based hints (TAPO, Guide, Scaf-GRPO) — all treat stronger guidance as longer prefixes or richer abstractions. Figure 1 illustrates why this axis is the wrong one: it reveals three failure modes that share a common root.

Gap 01

Critical-segment effect

Most of a prefix-style hint is redundant. Accuracy stays flat as you add more tokens until a short decisive segment — the 'critical segment' — is included, and then it jumps. Scaling prefix length thus wastes tokens and obscures which piece of the hint actually carried the signal.

Gap 02

Cross-hint inconsistency

Different hints can suggest different solution routes for the same problem, e.g., the Cayley–Menger determinant vs. 3D coordinate embedding vs. the disphenoid volume formula for computing the volume of a tetrahedron. Layering several of these simultaneously confuses the model rather than helping it.

Gap 03

Guidance-efficiency trade-off

Abstraction-based hints require an external teacher to generate rich templates at inference time. The resulting extra compute and latency buy only marginal accuracy — a poor trade when the goal is a deployable small model.

Section 02 · Reframing

From 'More Guidance' to 'Minimal-Sufficient Guidance'

RLVR improves LLM reasoning by optimizing rule-based correctness without human preference labels. It is scalable and cheap to supervise, but it suffers from a key bottleneck: reward sparsity on hard problems. When a complex question yields uniformly incorrect rollouts, group-based methods such as GRPO produce zero advantage, so a large fraction of training data contributes no gradient.
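The zero-advantage failure can be seen in a few lines. This is a minimal sketch of group-relative advantage normalization in the style of GRPO, assuming binary rewards and a small stabilizing constant `eps`; the function name `group_advantages` is hypothetical and not taken from the paper's code.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard problem: all 8 rollouts are wrong, so every advantage is exactly zero.
all_wrong = group_advantages([0.0] * 8)

# One correct rollout is enough to produce a non-zero learning signal.
mixed = group_advantages([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
```

With uniformly incorrect rollouts the numerator is zero for every sample, which is why such prompts contribute nothing to the policy update.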

The community's response has been hint-based RL: inject auxiliary guidance into prompts to raise the chance of producing a reward-yielding rollout. Three flavors exist — fixed-ratio solution-prefix hints, adaptive solution hints, and abstraction-based hints — but all scale guidance along a token-quantity axis. As Figure 1 shows, that axis hits diminishing returns, layering conflicts, and inference-cost walls.

KnowRL asks a different question: not 'how much guidance?' but 'what is the smallest sufficient set?' It decomposes guidance into atomic Knowledge Points (KPs) and builds a problem-specific subset that is minimal but interaction-aware. This reframing is what ties the three motivation failures above into one design principle.

Contributions

Reformulate hint design as a minimal-sufficient guidance problem — not a quantity-expansion problem.
Introduce Knowledge Points (KPs) as atomic units of guidance, extracted per problem from verified correct solutions.
Identify and quantify the pruning interaction paradox: removing one KP can help, but removing several at once can hurt more than expected.
Propose two interaction-aware selection strategies, Constrained Subset Search (CSS) and Consensus-Based Robust Selection (CBRS), and analyze their trade-offs.
Train KnowRL-Nemotron-1.5B with CSS-selected hints, reaching 70.08 / 74.16 average accuracy on eight reasoning benchmarks — a new SOTA at 1.5B scale.

Section 04 · Framework

The KnowRL Framework

At a high level, KnowRL follows a simple end-to-end workflow. For each training problem it (1) constructs candidate Knowledge Points, (2) removes leakage and redundancy to obtain a compact problem-specific subset, and (3) uses the curated subset as hint data for RL training only when guidance is needed. Guidance is attached in the prompt under a ## Hint header (see the augmented prompt example in Appendix C.2 below).

Generate correct solutions

For each problem, sample responses from DeepSeek-R1 until at least one verified correct solution is obtained. This grounds subsequent KP extraction in a real, working chain of reasoning.

Extract raw Knowledge Points

Given the problem and a verified correct solution, prompt DeepSeek-R1 to extract only the indispensable mathematical principles required to solve the problem. This yields an initial candidate KP set K = {k₁, k₂, …, kₙ}.

Leakage verification

To prevent information leakage, DeepSeek-R1 verifies each KP as an automated reviewer. Failed cases are manually revised so that every retained KP is generalizable and not instance-bound.

**Appendix C.2.** An augmented prompt used during RL training. The original problem is followed by a `## Hint` section listing curated Knowledge Points (unit conversion, key considerations, knowledge-point references, etc.).
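A prompt of this shape could be assembled as sketched below. Only the `## Hint` header comes from the description above; the helper name `build_prompt`, the numbered-list layout, and the blank-line separator are assumptions for illustration, not the exact template from Appendix C.2.

```python
def build_prompt(problem, kps):
    """Append selected Knowledge Points under a '## Hint' header.

    When no KPs are selected, the problem is returned unchanged,
    matching the idea that guidance is attached only when needed.
    """
    if not kps:
        return problem
    lines = [problem, "", "## Hint"]
    # Numbered layout is an assumption; the paper's template may differ.
    lines += [f"{i}. {kp}" for i, kp in enumerate(kps, 1)]
    return "\n".join(lines)

example = build_prompt(
    "Find the volume of the tetrahedron.",
    ["Cayley-Menger determinant", "Convert all lengths to the same unit"],
)
```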

Takeaway

With all raw KPs injected (avg 5.86 KPs per problem), OpenMath-Nemotron-1.5B only moves from 60.46 → 61.03 average accuracy across eight benchmarks. Raw KP construction alone is not enough — we must pick a problem-specific subset carefully.

Section 05 · Subset Selection

Problem-wise KP Subset Selection

For each problem with candidate set K, KnowRL estimates accuracies offline under three kinds of configuration: the empty set ∅, the full set K, and each leave-one-out set K \ {kᵢ}. The goal is a parameterized decision operator D_ε : K → K* ⊆ K that picks the most helpful subset for each problem. A naive 'Max-Score' baseline chooses the best-scoring option among {∅, K, K \ {kᵢ}}; it is simple but restricts each problem to these three kinds of configuration.
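The Max-Score baseline can be sketched as a single argmax over the offline estimates. The names `max_score`, `acc_empty`, `acc_full`, and `acc_loo` are hypothetical, the accuracies are made-up toy values, and ties are resolved by candidate order, which is an assumption.

```python
def max_score(kps, acc_empty, acc_full, acc_loo):
    """Pick the best of: no KPs, all KPs, or any single leave-one-out set.

    acc_loo[k] is the estimated accuracy when only KP k is removed.
    Returns (chosen_subset, accuracy).
    """
    candidates = [((), acc_empty), (tuple(kps), acc_full)]
    for k in kps:
        subset = tuple(x for x in kps if x != k)
        candidates.append((subset, acc_loo[k]))
    # max() keeps the first candidate on ties (assumed tie-breaking).
    return max(candidates, key=lambda c: c[1])

choice = max_score(
    ["a", "b", "c"],
    acc_empty=0.25,
    acc_full=0.5,
    acc_loo={"a": 0.4, "b": 0.75, "c": 0.5},
)
```

Here removing `b` alone scores best, so the baseline returns `("a", "c")`; it can never return a subset with two or more KPs removed, which is the restriction noted above.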

S-LOO

Strict Leave-One-Out

With tolerance ε = 0, remove every kᵢ whose single-KP ablation strictly improves over the full-set accuracy. Aggressive pruning, but sensitive to finite-sample noise and prone to over-pruning.

T-LOO

Tolerant Leave-One-Out

With tolerance ε = 1/32, relax the threshold to absorb sampling randomness. T-LOO retains more KPs than S-LOO but still under-performs CSS and CBRS, because jointly removing 'non-essential' KPs can introduce conflicts and cause larger-than-expected accuracy drops.
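Both leave-one-out rules can be read as one operator with a tolerance parameter. The sketch below assumes that a KP is removed when its ablation beats the full-set accuracy by more than ε, which is consistent with T-LOO retaining more KPs than S-LOO; the function name and toy accuracies are hypothetical.

```python
def loo_prune(kps, acc_full, acc_loo, eps=0.0):
    """Keep every KP whose removal does not improve accuracy by more than eps.

    eps = 0    corresponds to S-LOO (strict).
    eps = 1/32 corresponds to T-LOO (tolerant), under the assumed rule.
    """
    return [k for k in kps if not (acc_loo[k] > acc_full + eps)]

kps = ["a", "b", "c"]
acc_loo = {"a": 0.5, "b": 0.53125, "c": 0.625}  # toy values, acc_full = 0.5

s_loo = loo_prune(kps, 0.5, acc_loo, eps=0.0)     # removes b and c
t_loo = loo_prune(kps, 0.5, acc_loo, eps=1 / 32)  # removes only c
```

Note that each KP is judged independently here, so nothing checks what happens when `b` and `c` are removed together; that gap is what the interaction-aware strategies address.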

**Figure 2.** (Left) Probability of non-additive interaction and average performance drop as the number of removed KPs m grows: both climb sharply, showing that joint removals compound in unexpected ways. (Right) Overall intersection rate and average selected KP count versus tie tolerance: loose tolerance inflates KP count while collapsing cross-run consistency.
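One way to make "non-additive interaction" concrete is to compare the accuracy drop from a joint removal with the sum of the individual leave-one-out drops. This is an illustrative definition assumed here, not necessarily the exact measure plotted in Figure 2; all names and numbers are hypothetical.

```python
def interaction_gap(acc_full, acc_loo, acc_joint, removed):
    """Joint drop minus the sum of individual drops for the removed KPs.

    A positive gap means removing the KPs together hurts more than
    their individual ablations would predict.
    """
    individual = sum(acc_full - acc_loo[k] for k in removed)
    joint = acc_full - acc_joint
    return joint - individual

# Toy case: removing b or c alone helps, removing both together hurts.
gap = interaction_gap(
    acc_full=0.5,
    acc_loo={"b": 0.5625, "c": 0.625},
    acc_joint=0.375,
    removed=["b", "c"],
)
```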

CSS — Constrained Subset Search

Exhaustive search over 2^|K| subsets is infeasible, so CSS restricts the search space to non-degrading KPs H = {kᵢ | A₋ᵢ ≥ max(A_K, A_∅)}. Within H, near-optimal removals N = {kᵢ ∈ H | A₋ᵢ ≥ A_max − δ} are pruned directly; the remaining candidates C = H \ N are enumerated exhaustively (|C| stays small in practice). The final configuration is chosen as argmax_S A(S) over all enumerated subsets plus ∅ and K. CSS achieves the best offline trade-off in Table 1 — highest average accuracy with compact hint sets.
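The search described above can be sketched as follows. Several details are assumptions: `A_max` is taken to be the best leave-one-out accuracy, the enumerated subsets are K with N and every subset of C removed, and `evaluate` is a hypothetical callback standing in for offline rollout evaluation. The toy accuracy table is invented for illustration.

```python
from itertools import combinations

def css_select(kps, acc_empty, acc_full, acc_loo, evaluate, delta=0.0):
    """Constrained subset search over removable KPs. Returns (subset, accuracy)."""
    floor = max(acc_full, acc_empty)
    # H: KPs whose single removal does not degrade below the better baseline.
    H = [k for k in kps if acc_loo[k] >= floor]
    # Assumption: A_max is the best leave-one-out accuracy.
    a_max = max(acc_loo.values(), default=floor)
    # N: near-optimal removals, pruned directly.
    N = [k for k in H if acc_loo[k] >= a_max - delta]
    # C: remaining removable candidates, enumerated exhaustively.
    C = [k for k in H if k not in N]
    base = [k for k in kps if k not in N]

    scored = {(): acc_empty, tuple(kps): acc_full}
    for r in range(len(C) + 1):
        for removed in combinations(C, r):
            subset = tuple(k for k in base if k not in removed)
            if subset not in scored:
                scored[subset] = evaluate(subset)
    return max(scored.items(), key=lambda kv: kv[1])

# Toy accuracies: dropping b and c together is worse than dropping either alone.
toy = {("a", "b", "c"): 0.75, ("a", "c"): 0.8125, ("a", "b"): 0.875, ("a",): 0.5}
best = css_select(
    ["a", "b", "c", "d"],
    acc_empty=0.25,
    acc_full=0.5,
    acc_loo={"a": 0.25, "b": 0.5, "c": 0.625, "d": 0.75},
    evaluate=lambda s: toy[s],
)
```

In this toy run `d` is pruned directly, `a` is never a removal candidate, and only the four subsets over C = {b, c} need evaluation instead of all 16 subsets of K.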

CBRS — Consensus-Based Robust Selection

Instead of averaging over 8 × 32 samples, CBRS treats each of the 8 independent evaluation runs as a separate vote. For each run it ranks candidate configurations; ties are broken by smallest score variance across the 8 runs, then by smaller hint-set size. CBRS gives strong performance with a compact KP set, but as the ablations show, it slightly under-performs CSS on the hardest benchmarks.
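One plausible reading of this voting scheme is sketched below: each run votes for its top-scoring configuration(s), the configuration with the most votes wins, and ties fall back to lower variance across runs and then smaller hint-set size. The vote-counting rule and all names are assumptions; the paper's exact procedure may differ.

```python
from collections import Counter
from statistics import pvariance

def cbrs_select(run_scores):
    """run_scores maps a configuration (tuple of KPs) to its per-run accuracies."""
    n_runs = len(next(iter(run_scores.values())))
    votes = Counter()
    for r in range(n_runs):
        best = max(scores[r] for scores in run_scores.values())
        for cfg, scores in run_scores.items():
            if scores[r] == best:
                votes[cfg] += 1  # every top-scoring configuration gets a vote
    # Most votes first, then lowest variance across runs, then fewest KPs.
    return min(
        run_scores,
        key=lambda c: (-votes[c], pvariance(run_scores[c]), len(c)),
    )

winner = cbrs_select({
    ("a", "b"): [0.75, 0.75, 0.5, 0.75, 0.5, 0.75, 0.5, 0.75],
    ("a",): [0.5, 0.5, 0.75, 0.5, 0.75, 0.5, 0.75, 0.5],
    (): [0.25] * 8,
})

# Equal votes and equal variance: the smaller hint set wins.
tied = cbrs_select({
    ("a", "b"): [0.75, 0.5] * 4,
    ("a",): [0.5, 0.75] * 4,
})
```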

**Figure 3.** Difficulty-bucket analysis (buckets defined by no-KP accuracy), shown as violin plots of the accuracy distribution. Injecting all KPs lifts the mean but leaves regressing tails in several buckets; CSS and CBRS selection tighten the distribution and lift the mean together on both test (a) and training (b) splits.

Section 06 · Experiments

Training KnowRL-Nemotron-1.5B

This section covers four topics: training data construction, training setup, evaluation protocol, and final benchmark results. Training uses QuestA (8.8k instances after dedup, CSS-selected KPs) with OpenMath-Nemotron-1.5B as the backbone; RL runs for 2,960 steps on 8 × H100 nodes (64 GPUs total) for about 13 days of wall-clock time.

Headline Numbers

70.08

Avg Acc without KPs at inference

+9.63 over Nemotron-1.5B, +1.50 over JustRL

74.16

Avg Acc with CSS-selected KPs

New state of the art at the 1.5B scale

~13d

Training wall-clock

8 × H100 nodes · 64 GPUs · 2,960 steps

Training Setup

Training: batch size 256 with four updates per step; constant learning rate 1e-6; clip-ratio range [0.8, 1.28]; 8 samples per question at top_p = 1.0, T = 1.0; max response length 24k; token-mean loss; no KL loss and no entropy bonus; dynamic sampling enabled. Entropy annealing is used: with clip_high = 0.28, entropy rises early (encouraging exploration), then begins to decrease at step 2,590 as the model searches for optimal paths.

Evaluation follows the JustRL protocol: rule-based verifier (mathverify 0.8.0) with fallback to Compass Verifier-3B; 32k max tokens, top_p = 0.7, T = 0.9, 8 samples per problem on AIME24/25, HMMT25, CMIMC25, BrumoS25, MATH-500, Olympiad-Bench.
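To illustrate how the asymmetric clip range [0.8, 1.28] and the token-mean loss from the training setup interact, here is a minimal pure-Python sketch of a PPO-style clipped surrogate. It assumes the standard clipped objective with a lower bound of 1 - 0.2 and an upper bound of 1 + 0.28; the function name and toy inputs are hypothetical, and a real trainer would operate on tensors.

```python
def clipped_token_mean_loss(ratios, advantages, clip_low=0.2, clip_high=0.28):
    """Clipped surrogate loss averaged over all tokens in the batch.

    ratios / advantages: one list per response, one entry per token.
    Token-mean: every token has equal weight, regardless of response length.
    """
    total, count = 0.0, 0
    for r_seq, a_seq in zip(ratios, advantages):
        for r, a in zip(r_seq, a_seq):
            clipped = min(max(r, 1.0 - clip_low), 1.0 + clip_high)
            total += min(r * a, clipped * a)  # pessimistic (clipped) objective
            count += 1
    return -total / count

loss = clipped_token_mean_loss(
    ratios=[[1.5], [0.5, 1.0]],
    advantages=[[1.0], [-1.0, -1.0]],
)
```

In the toy batch, the first token's ratio of 1.5 is capped at 1.28 for a positive advantage, and the ratio of 0.5 is floored at 0.8 for a negative one; the wider upper bound leaves more room to raise the probability of rewarded tokens.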

Main Results

Across all eight benchmarks, KnowRL-Nemotron-1.5B consistently posts the strongest overall numbers. Even without KP hints at inference it reaches a 70.08 average — clearly surpassing Nemotron-1.5B by +9.63 and JustRL by +1.50. With CBRS-selected KPs the average rises to 73.46 and with CSS-selected KPs to 74.16. Gains are particularly pronounced on competition-style benchmarks: under CSS, +15.11 on AIME25, +12.98 on HMMT25, and +15.49 on CMIMC25.

**Figure 4.** Distribution of per-query correct counts (out of 8 rollouts) over the training set, under three conditions. The backbone suffers severe reward sparsity: 41.21% of queries produce zero correct answers and only 1.35% are solved consistently. KnowRL training alone collapses the zero-correct bucket to 13.00% and raises the 8/8-correct bucket to 34.28%. Adding KPs at inference pushes 8/8-correct further to 51.07%.
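A distribution like the one in Figure 4 can be computed from raw rollout outcomes as sketched below. The input layout (one list of 0/1 outcomes per query) and the function name are assumptions, and the toy data is invented rather than taken from the paper.

```python
from collections import Counter

def correct_count_distribution(outcomes, n_rollouts=8):
    """Percentage of queries with exactly k correct rollouts, for k = 0..n_rollouts."""
    counts = Counter(sum(o) for o in outcomes)
    total = len(outcomes)
    return {k: 100.0 * counts.get(k, 0) / total for k in range(n_rollouts + 1)}

# Four toy queries: two never solved, one always solved, one solved once.
dist = correct_count_distribution([
    [0] * 8,
    [1] * 8,
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0] * 8,
])
```

The `dist[0]` bucket is the share of queries that yield no reward signal under group-based training, and `dist[8]` is the share solved consistently.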

Takeaway

Even without KP hints at inference, KnowRL-Nemotron-1.5B's policy is meaningfully stronger (70.08, +9.63 over the backbone). The selected-KP setting (74.16) adds an inference-time boost on top. The largest gains appear on the hardest competition-style benchmarks — exactly where reward sparsity used to kill the learning signal — confirming that KnowRL improves policy quality itself, not just prompt-time scaffolding.

Section 07 · Selection Strategy

CSS vs CBRS — Which Selection Strategy Wins?

CSS and CBRS select comparable numbers of KPs per problem, so comparing them isolates selection quality from guidance volume. Section 5 of the paper studies their training dynamics and downstream accuracy under matched training budgets.

**Figure 5.** (Left) Training accuracy: CSS stays above CBRS throughout the run and converges slightly higher. (Right) Clip ratio: CBRS shows a sharp late-training spike while CSS stays smoother, indicating CBRS triggers more aggressive policy updates while CSS enables more stable refinement.

CSS

CSS — stable, global search

Prunes low-value candidates first, then enumerates over a small constrained set — a broader search for high-quality global configurations. Consistently leads under matched budgets (65.00 vs 64.68 at step 400; 66.46 vs 65.72 at step 900) and wins the hardest benchmarks (HMMT25, CMIMC25).

CBRS

CBRS — noise-robust consensus

Aggregates votes across independent evaluation runs, so it is robust to sampling noise when the candidate pool is small. Induces slightly more aggressive policy updates (higher clip ratio late in training), which can help in some regimes but tends to be less stable than CSS on the hardest reasoning splits.

**Appendix B.** Entropy-annealing ablation on AIME24, AIME25, and BrumoS25. Entropy peaks around step 1,500, then decays when annealing kicks in at step 2,590. Accuracy on all three benchmarks converges to a higher plateau with entropy annealing enabled.

Section 08 · Conclusion

What KnowRL Changes About RLVR

KnowRL reframes hint design for RLVR as a minimal-sufficient guidance problem rather than a quantity-expansion one. By decomposing prefixes and abstractions into atomic Knowledge Points and selecting robust subsets via Constrained Subset Search, it sidesteps the critical-segment effect, cross-hint inconsistency, and guidance-efficiency trade-off all at once.

The paper also isolates a pruning interaction paradox — removing one KP can help, while removing several can hurt more than expected — and shows that CSS handles it explicitly. KnowRL-Nemotron-1.5B sets a new state of the art at the 1.5B scale (70.08 / 74.16 average) and ships with publicly available model, training data, and code, inviting reproducible follow-up work on minimal-sufficient guidance for reasoning RL.

References


M. Balunović, J. Dekoninck, I. Petrov, N. Jovanović, M. Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions, 2025.
Y. Chen, J. Sheng, W. Zhang, T. Liu. Improving reasoning capabilities in small models through mixture-of-layers distillation. EMNLP 2025.
D. Guo et al. DeepSeek-R1 technical report, 2025a.
K. Li et al. QuestA: Curriculum-based solution-prefix hint RL, 2025.
Y. Liu et al. UFT: Unified fine-tuning with adaptive hint ratios, 2025a.
Y. Liu et al. Compass Verifier-3B, 2025b.
Z. Moshkov et al. OpenMath-Nemotron-1.5B technical report, 2025.
S. Nath et al. Guide: Template-based reasoning guidance for RL, 2025.
N. Nie et al. Scaling RL with verifiable rewards, 2026.
Y. Qu et al. POPE: Prefix-optimized policy exploration, 2026.
Z. Shao et al. GRPO: Group Relative Policy Optimization in DeepSeekMath, 2024. arXiv:2402.03300.
B. Team et al. Open-source reasoning model release, 2025.
R. Team et al. RLVR foundations, 2026.
Y. Wang et al. RL for reasoning: a survey, 2026.
Z. Wu et al. TAPO: Template-augmented policy optimization, 2025.
Q. Yu et al. Dynamic sampling for RL training, 2025.
H. Zhang, Math-AI. AIME24 benchmark, 2024.
H. Zhang, Math-AI. AIME25 benchmark, 2025.
K. Zhang et al. StepHint: Adaptive step-level hints for RL, 2025b.
L. Zhang et al. Scaf-GRPO: Scaffolded GRPO with abstract hints, 2025c.
HMMT25, CMIMC25, BrumoS25, MATH-500, Olympiad-Bench: standard reasoning evaluation suites used in this paper.
Full reference list available in the arXiv PDF.