RAGEN-2: Reasoning Collapse in Agentic RL

Abstract & Overview

Reinforcement learning (RL) training of multi-turn large language model (LLM) agents has shown great promise, yet it faces a critical challenge: reasoning collapse, where a model generates responses that are superficially diverse but semantically repetitive. Previous work used entropy to track training stability, but entropy can be misleadingly high even when the model has stopped attending to different inputs.

This paper introduces the concept of template collapse — a failure mode where agents produce input-agnostic responses despite high entropy — and proposes SNR-Aware Filtering based on reward-variance signal-to-noise ratio to restore diverse, task-grounded reasoning across planning, math, web navigation, and code execution environments.

Key Contributions

Template Collapse identification: Formally defines a new failure mode where entropy appears high but reasoning is input-agnostic — indistinguishable by entropy alone.
MI-based diagnostics: Shows mutual information (MI) outperforms entropy as a collapse diagnostic, with Spearman correlation +0.39 (MI-ZScore) vs −0.14 (entropy) with task success rate.
SNR-Aware Filtering: A gradient-level remedy that filters low-reward-variance prompts (low SNR) before gradient updates, preventing regularization from dominating training.
Cross-domain validation: Experiments across Sokoban (planning), SearchQA (math reasoning), WebShop (web navigation), and DeepCoder (code execution) all show consistent improvement.

Section 1

Introduction

Multi-turn reinforcement learning for LLM agents promises to unlock complex, sequential decision-making capabilities. However, training instability is a persistent problem. Practitioners have historically tracked entropy of the agent's output distribution as a proxy for training health — high entropy was assumed to indicate diverse, healthy reasoning.

RAGEN-2 challenges this assumption. Through systematic analysis of multiple environments and RL algorithms (PPO and GRPO), the authors show that high entropy is neither sufficient nor necessary for healthy training. An agent can exhibit high entropy while producing semantically identical, input-agnostic responses — the definition of template collapse.

Research Questions

Q1 — Diagnosis

How can we diagnose reasoning collapse beyond entropy? What metric reliably distinguishes genuine diverse reasoning from template collapse?

Q2 — Remedy

What is the underlying mechanism causing template collapse, and how can it be mitigated without adding complex regularization?

The paper's answer to Q1 is mutual information (MI) between inputs and outputs — a metric that directly measures whether the agent's responses are actually conditioned on the input. For Q2, they identify a signal-to-noise ratio (SNR) mechanism: when a batch contains many low-reward-variance prompts, the regularization gradient dominates the task gradient, pushing the model toward input-agnostic templates.

Section 2

Template Collapse — The Failure Mode

**Figure 1:** Template collapse (left) vs. diverse reasoning (right). Despite high entropy in both cases, template collapse produces fixed-pattern responses regardless of input (e.g., "That's a good question…", "I need to solve the task…"), while diverse reasoning generates input-specific thoughts (e.g., "Move up agent twice…", "I see two boxes…").

2.1 What is Template Collapse?

Template collapse occurs when an RL-trained LLM agent converges to a fixed set of response templates that are applied regardless of the actual input. The model learns to output phrases like "I need to solve the task step by step" or "Let me think about this carefully" as boilerplate preambles that precede any reasoning, effectively ignoring input-specific information.

What makes this insidious is that standard entropy metrics fail to detect it. Because the templates themselves may vary superficially (the model selects from a pool of template phrases), the overall token distribution can appear diverse. Only when you ask whether the output depends on the input does the collapse become visible.

Why is template collapse so hard to catch? Imagine a chatbot that learned to always start its answers with "That's a great question! Let me think step by step..." regardless of what it was asked. If you sample its outputs multiple times for the same prompt, you'll see variety (different step-by-step continuations) — high entropy. But across different questions, it always opens the same way — low MI. Traditional monitoring only checks entropy, so the collapse goes completely unnoticed. This is exactly the trap that many RL-trained LLM systems have fallen into.

2.2 Mutual Information vs Entropy

The key insight is the distinction between two related but different quantities: conditional entropy H(Y|X) measures the diversity of outputs for a single input (how variable are the N samples for prompt x?), while mutual information MI(X;Y) measures whether outputs change across inputs (do different prompts produce meaningfully different responses?). The marginal entropy H(Y), pooled over all prompts, is high whenever either kind of diversity is present, so on its own it cannot tell the two apart.

Definition: Mutual Information (MI)

Mutual information between input X and output Y quantifies how much knowing the input reduces uncertainty about the output:

\text{MI}(X;Y) = H(Y) - H(Y|X)

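Here H(Y) is the entropy of outputs pooled over all inputs, and H(Y|X) is the average entropy of outputs for a fixed input. When MI is high, different inputs lead to genuinely different outputs. When MI is low despite high H(Y), almost all of the observed diversity is within-input sampling variation: the model has collapsed into input-agnostic behavior, i.e., template collapse.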

This distinction is crucial: conditional entropy H(Y|X) measures within-input diversity (variation across rollouts of the same prompt), while MI measures cross-input discriminability (whether the agent responds differently to different situations). Template collapse exhibits high entropy but low MI: the agent is "creative" within a fixed template space but not actually responsive to the task.

Concrete example: When entropy lies. Suppose a model is trained on Sokoban (box-pushing puzzles) and collapses to always generating: "I need to analyze the board carefully. Let me move systematically..." as a preamble regardless of the actual board layout. Sampling 8 rollouts per prompt, the outputs are all different in their continuations — entropy is high. But comparing outputs across different board states, the opening is identical — MI is near zero. High entropy told you "the model is exploring." MI tells you the truth: "the model isn't even looking at the board."
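The Sokoban example can be made numeric with a small sketch. It is not the paper's estimator: it uses plain plug-in entropy estimates on invented toy data, treats each distinct output string as one symbol, and assumes prompts are equally likely with the same number of rollouts each. The function names `entropy_bits` and `entropy_report` are hypothetical.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Shannon entropy (bits) of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_report(samples):
    """Plug-in estimates of H(Y), H(Y|X) and MI(X;Y) in bits.

    `samples` maps each prompt to its sampled outputs. Prompts are assumed
    equally likely and to have the same number of samples.
    """
    pooled = Counter()
    h_cond = 0.0
    for outputs in samples.values():
        pooled.update(outputs)
        h_cond += entropy_bits(Counter(outputs)) / len(samples)
    h_y = entropy_bits(pooled)
    return h_y, h_cond, h_y - h_cond

# Toy data (invented): two board states, four rollouts each.
collapsed = {
    "board_1": ["template_a", "template_b", "template_a", "template_b"],
    "board_2": ["template_a", "template_b", "template_a", "template_b"],
}
grounded = {
    "board_1": ["push_up", "push_left", "push_up", "push_left"],
    "board_2": ["push_down", "push_right", "push_down", "push_right"],
}

for name, data in [("collapsed", collapsed), ("grounded", grounded)]:
    h_y, h_cond, mi = entropy_report(data)
    print(f"{name}: H(Y)={h_y:.2f}  H(Y|X)={h_cond:.2f}  MI={mi:.2f}")
```

Both toy agents have the same within-prompt entropy (1 bit), so a monitor that only looks at rollout diversity cannot separate them. Only the MI term differs: 0 bits for the collapsed agent, 1 bit for the grounded one.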

2.3 Online MI Proxy Metrics

Since exact MI computation is expensive, RAGEN-2 proposes a family of online MI proxies that can be computed efficiently during training:

MI-ZScore

Standardizes the mean reward across prompts using a Z-score. High cross-prompt variance in reward signal indicates input-dependent behavior (high MI). The simplest and most effective proxy.

MI Seq Estimate

Estimates MI directly from the sequential structure of outputs, using the frequency of distinct response prefixes across different prompts as an information-theoretic estimate.

MI-ZScore (Seq)

Combines the Z-score approach with sequence-level estimation for a more robust proxy that handles variable-length responses and noisy reward signals.

Section 3

SNR Mechanism — Why Collapse Happens

Understanding why template collapse occurs is essential for designing effective remedies. RAGEN-2 proposes a gradient-level explanation: the signal-to-noise ratio (SNR) of the gradient updates determines whether the model learns task-specific behavior or converges to templates.

**Figure 3:** High vs. low task-related gradient. With high-reward-variance prompts (top), the task gradient dominates and the model learns task-appropriate responses. With low-reward-variance prompts (bottom), the regularization gradient dominates, pushing the model toward input-agnostic templates.

3.1 Empirical Observation

The authors group prompts by their in-batch reward variance (RV) into quantile buckets Q1–Q6. For each bucket, they measure three quantities: reward variance, task gradient norm, and regularization gradient norm.

**Figure 4:** Reward variance quantile analysis (Q1 = low RV, Q6 = high RV) for PPO (top) and GRPO (bottom). As RV increases: (a) reward variance grows rapidly, (b) task gradient norm grows proportionally, but (c) regularization gradient norm remains nearly constant. Low-RV prompts therefore have a low SNR: their updates are dominated by the regularization gradient.
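The bucketing step in this analysis can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes RV is the population variance of a prompt's rollout rewards and that the buckets are equal-size rank groups. The name `reward_variance_buckets` and the toy rewards are made up.

```python
from statistics import pvariance

def reward_variance_buckets(rewards_by_prompt, n_buckets=6):
    """Assign each prompt to a rank bucket by in-batch reward variance.

    Bucket 1 holds the lowest-RV prompts and bucket `n_buckets` the highest.
    """
    rv = {prompt: pvariance(r) for prompt, r in rewards_by_prompt.items()}
    ranked = sorted(rv, key=rv.get)  # ascending RV
    n = len(ranked)
    return {prompt: 1 + (i * n_buckets) // n for i, prompt in enumerate(ranked)}

# Toy batch (invented): six prompts, four rollouts each.
batch = {
    "p_mixed":  [0.0, 0.0, 1.0, 1.0],   # RV = 0.25
    "p_solved": [1.0, 1.0, 1.0, 1.0],   # RV = 0
    "p_wide":   [0.0, 0.0, 0.0, 4.0],   # RV = 3
    "p_rare":   [0.0, 0.0, 0.0, 1.0],   # RV = 0.1875
    "p_split":  [0.0, 0.0, 2.0, 2.0],   # RV = 1
    "p_spike":  [0.0, 0.0, 0.0, 2.0],   # RV = 0.75
}
buckets = reward_variance_buckets(batch)
for prompt in sorted(buckets, key=buckets.get):
    print(f"Q{buckets[prompt]}: {prompt}")
```

Once prompts are bucketed this way, per-bucket averages of task and regularization gradient norms give the kind of comparison shown in the figure.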

3.2 Gradient Decomposition

The RL training objective can be decomposed into a task-specific component and a regularization component (e.g., KL divergence from a reference model). The total gradient is the sum of these two components:

Gradient Decomposition

\nabla \mathcal{L} = \underbrace{\nabla \mathcal{L}_\text{task}}_{\text{task gradient}} + \underbrace{\nabla \mathcal{L}_\text{reg}}_{\text{regularization gradient}}

The task gradient is proportional to the reward variance within a prompt's rollouts. When reward variance is near zero (all rollouts get the same reward), the task gradient vanishes, and only the regularization gradient drives weight updates.

\text{SNR} = \frac{\|\nabla \mathcal{L}_\text{task}\|}{\|\nabla \mathcal{L}_\text{reg}\|}

The SNR is defined as the ratio of task gradient magnitude to regularization gradient magnitude. Low-RV prompts have SNR ≪ 1, meaning regularization dominates. The regularization term (KL from reference) pushes the model toward the average behavior over all inputs — i.e., input-agnostic templates.

Step-by-step: Why low-RV prompts break training

Think of RL training as a tug-of-war between two forces pulling on the model's weights:

Task gradient: "Update weights to get higher reward on this specific task." — Strength is proportional to how much rewards vary across rollouts (reward variance).
Regularization gradient: "Stay close to the reference model (KL divergence)." — Strength is roughly constant, independent of reward variance.

For a prompt where all rollouts receive reward ≈ 0 (the model can't solve it at all) or reward ≈ 1 (the model already solves it perfectly), reward variance is near zero. The task gradient vanishes. Only regularization remains, and it pulls the model toward the average behavior across all inputs — input-agnostic templates. Over thousands of gradient steps, these low-RV prompts accumulate template behavior until it spreads through the whole model.
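A toy calculation makes the tug-of-war concrete. The sketch below assumes a REINFORCE-style estimator with a group-mean baseline, as in GRPO-like methods, where each rollout's score vector is weighted by its reward minus the group mean. The score vectors and the constant regularization gradient are invented numbers, and the helper names are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def task_gradient(rewards, score_vectors):
    """Mean-baseline policy gradient: average of (r_i - mean(r)) * score_i."""
    n = len(rewards)
    baseline = sum(rewards) / n
    dim = len(score_vectors[0])
    return [
        sum((r - baseline) * s[d] for r, s in zip(rewards, score_vectors)) / n
        for d in range(dim)
    ]

def snr(rewards, score_vectors, reg_gradient):
    """Ratio of task-gradient norm to regularization-gradient norm."""
    return norm(task_gradient(rewards, score_vectors)) / norm(reg_gradient)

# Invented toy quantities: four rollouts, two parameters.
scores = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
reg_grad = [0.1, 0.1]  # assumed roughly constant across prompts

snr_solved = snr([1.0, 1.0, 1.0, 1.0], scores, reg_grad)  # zero reward variance
snr_mixed = snr([1.0, 0.0, 0.0, 0.0], scores, reg_grad)   # mixed outcomes

print(f"always solved: SNR = {snr_solved:.2f}")
print(f"mixed rewards: SNR = {snr_mixed:.2f}")
```

When every rollout earns the same reward, each advantage is exactly zero, so the task gradient vanishes and the update for that prompt is pure regularization. The mixed-reward prompt, by contrast, has an SNR above 1 in this toy setup.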

3.3 SNR-Aware Filtering

The remedy is elegantly simple: filter out low-RV (low-SNR) prompts before computing gradients. By keeping only prompts with sufficiently high reward variance, every gradient update is dominated by meaningful task signal. The Top-P variant ranks prompts by RV and cumulatively selects the top fraction:

SNR-Aware Filtering (Top-P)

1. Sample N rollouts per prompt and evaluate rewards.

2. Compute per-prompt reward variance (RV). Sort prompts by RV descending.

3. Accumulate prompts from highest RV until the cumulative sample count reaches Top-P fraction of the batch. Use only these prompts for the gradient update.
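The three steps above can be sketched in a few lines. This follows the procedure as stated here, accumulating by rollout count, and the paper's exact selection rule and tie handling may differ. The function name `top_p_filter`, the use of population variance for RV, and the toy rewards are assumptions.

```python
from statistics import pvariance

def top_p_filter(rewards_by_prompt, top_p):
    """Keep the highest-RV prompts until their rollouts cover `top_p` of the batch.

    `rewards_by_prompt` maps a prompt id to the rewards of its sampled rollouts.
    Returns the kept prompt ids, highest reward variance first.
    """
    total_rollouts = sum(len(r) for r in rewards_by_prompt.values())
    budget = top_p * total_rollouts
    ranked = sorted(
        rewards_by_prompt,
        key=lambda p: pvariance(rewards_by_prompt[p]),
        reverse=True,
    )
    kept, used = [], 0
    for prompt in ranked:
        if used >= budget:
            break
        kept.append(prompt)
        used += len(rewards_by_prompt[prompt])
    return kept

# Toy batch (invented rewards) with RV = 9, 1 and 5 for prompts A, B and C.
batch = {
    "A": [-3.0, -3.0, 3.0, 3.0],
    "B": [-1.0, -1.0, 1.0, 1.0],
    "C": [-3.0, -1.0, 1.0, 3.0],
}
kept = top_p_filter(batch, top_p=0.6)
print(kept)
```

With `top_p=0.6`, the budget is 7.2 of the 12 rollouts, so the two highest-RV prompts are kept and the low-RV prompt is dropped. Only the kept prompts' rollouts would then be passed to the gradient update.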

Top-P vs Top-K: why cumulative selection beats a fixed count. Top-K filtering keeps the K prompts with the highest reward variance. But if K=10 and there are 100 prompts, you always use exactly 10, whether prompt #11 has almost the same RV as prompt #10 or only a handful of prompts carry any meaningful variance. Top-P instead ranks prompts by RV descending and accumulates until the total number of training samples (not prompts) reaches P×batch_size. This adapts to the distribution: if a few prompts have very high RV, Top-P takes fewer prompts but more representative ones. If RV is spread evenly, Top-P behaves similarly to Top-K.

**Figure 5:** SNR-Aware Filtering algorithm visualization. Prompt A has RV=9.0 (high), prompt B has RV=1.0 (low), prompt C has RV=5.0 (medium). After Top-P filtering, low-RV prompt B is excluded, and only A and C contribute to the gradient update.

Section 4

Experiments

4.1 Evaluation Testbed

To validate both the universality of template collapse and the effectiveness of SNR-Aware Filtering, the authors construct a diverse testbed spanning four distinct task types:

🧩 Sokoban Puzzle Planning

🔍 SearchQA Math Reasoning

🛒 WebShop Web Navigation

💻 DeepCoder Code Execution

4.2 Template Collapse as a Consistent Failure Mode

Across all four environments, training without SNR-Aware Filtering consistently produces collapse signatures: the MI-ZScore metric drops, success rate stagnates, and output length sharply decreases — a behavioral indicator that the model is generating short, template-like responses.

**Figure 6 (SearchQA):** Training curves comparing Top-P filtering, entropy regularization, KL regularization, and no filtering. (a) Success rate, (b) retrieval accuracy, (c) reasoning entropy. Top-P filtering achieves the highest performance while maintaining healthy entropy. Entropy and KL regularization alone cannot prevent collapse.

The output length collapse is particularly informative: when a model falls into template collapse, it learns to front-load a fixed preamble and skip problem-specific reasoning, resulting in shorter total responses. This behavioral signal can serve as an early warning indicator of collapse onset.

**Figure 8:** Output length evolution across multiple environments during training. Template collapse conditions (no filtering) show characteristic sharp drops in output length, as the model converges to short, fixed-template responses. SNR-Aware Filtering (Top-P) maintains consistent output length.

4.3 SNR-Aware Filtering Consistently Improves Performance

Across all four environments, SNR-Aware Filtering (Top-P) outperforms or matches the best baseline. The results show a clear ordering: Top-P ≥ Top-K ≥ No Filtering in terms of final success rate.

**Figure 7:** Success rate across all 4 environments (Sokoban, SearchQA, WebShop, DeepCoder) comparing Top-P filtering (blue), Top-K filtering (green), and no filtering (gray dashed). Top-P filtering achieves the highest or competitive performance in all environments. Gaps are especially pronounced in Sokoban and WebShop.

The consistency across PPO and GRPO algorithms, and across planning, reasoning, navigation, and coding tasks, suggests that template collapse is a fundamental challenge in agentic RL training — not an artifact of any particular environment or algorithm.

Section 5

Analysis

5.1 MI Diagnoses Collapse Better Than Entropy

To quantify how well different metrics track task success, the authors compute Spearman rank correlation between each metric and final task success rate across multiple training runs. The results are striking:

**Figure 9:** Spearman correlation between diagnostic metrics and task success rate. MI-based metrics (blue) are all positively correlated with success, with MI-ZScore highest at +0.39. Entropy-based metrics (orange) show weak negative correlation (−0.11 to −0.14): during collapse, higher entropy goes with slightly *lower* task performance.

Metric	Spearman ρ	Type
MI-ZScore	+0.39	MI-based
MI Seq Estimate	+0.22	MI-based
MI-ZScore (Seq)	+0.09	MI-based
Cond. Entropy	−0.14	Entropy-based
Reasoning Entropy	−0.11	Entropy-based

What does Spearman correlation mean here? The paper measures Spearman rank correlation (ρ) between each metric and task success rate across many training runs. ρ = +1 means ranking runs by the metric reproduces their ranking by success exactly; ρ = −1 means the ranking is perfectly reversed (the highest metric values go with the lowest success); ρ = 0 means no monotonic relationship. The key finding: MI-ZScore ρ = +0.39 means runs with higher MI tend to be more successful. Entropy ρ = −0.14 means runs with higher entropy tend, weakly, to do worse: as a health metric, entropy is not just uninformative but points in the wrong direction.
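For readers who want to compute the same statistic on their own runs, here is a minimal sketch of Spearman's ρ: the Pearson correlation of the two rank vectors, with ties given their average rank. The metric and success values are invented for illustration and are not the paper's data. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented per-run values, one entry per training run.
metric_per_run = [0.1, 0.4, 0.2, 0.9, 0.5]
success_per_run = [0.2, 0.3, 0.5, 0.8, 0.6]
rho = spearman(metric_per_run, success_per_run)
print(f"rho = {rho:.2f}")
```

Because only ranks matter, ρ is insensitive to the scale of the metric, which makes it a reasonable way to compare diagnostics measured in different units.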

The negative correlation of entropy metrics with task success is the central empirical result of the paper: entropy is not just uninformative — it actively misleads. High entropy can co-occur with collapsed training, making it a dangerous monitoring metric. Practitioners relying solely on entropy to detect training problems may actually miss the most severe form of collapse.

5.2 When Does Filtering Help?

SNR-Aware Filtering is most effective when training has a non-trivial degree of stochasticity — i.e., when reward signals are uncertain and variable across rollouts. The authors test performance at different stochasticity levels (0–100%):

**Figure 10:** Success rate vs. stochasticity fraction (0% = deterministic, 100% = fully random rewards). Top-P filtering shows the largest advantage at 5–50% stochasticity. At 0% (deterministic), both methods perform similarly; at 80–100% (near-random), the gap narrows again.

This result is consistent with the SNR account: when rewards are fully deterministic, all prompts have either RV=0 (model already converged) or high RV (still learning), and filtering becomes trivial. The sweet spot is partially stochastic environments, which is exactly the condition of realistic multi-step agent training.

Why does stochasticity matter? Reward variance (RV) only exists when rollouts for the same prompt sometimes succeed and sometimes fail, i.e., when the environment has some randomness or the policy's own sampling still produces mixed outcomes. In real agent training (web navigation, planning), this partial stochasticity is the norm, and SNR-Aware Filtering is designed for exactly this regime. At 0% stochasticity, the only source of RV is the policy itself, so prompts separate cleanly into a zero-RV group (always-pass or always-fail) and a clearly high-RV group, and filtering has little left to disambiguate. At 80–100%, rewards are nearly random regardless of action, so RV no longer reflects task signal and the task gradients are mostly noise anyway.

5.3 Training Dynamics

Deeper analysis reveals how filtering shapes training dynamics. Without filtering, the number of zero-variance (ZV) prompts grows over training as the model collapses to deterministic templates. With Top-P filtering, ZV prompts are actively excluded, maintaining high-quality gradient signal throughout training.
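Two of the quantities tracked in this analysis are cheap to log per training step. The sketch below is one plausible way to compute them, assuming "zero-variance" means a prompt whose rollout rewards are all equal and "kept ratio" means the fraction of prompts that survive filtering. The paper's exact definitions may differ, and the name `batch_diagnostics` is hypothetical.

```python
from statistics import pvariance

def batch_diagnostics(rewards_by_prompt, kept_prompts, eps=1e-12):
    """Per-step counters for monitoring filtering behavior.

    `rewards_by_prompt` maps prompt id to its rollout rewards;
    `kept_prompts` is the list of prompt ids that survived filtering.
    """
    zero_var_count = sum(
        1 for rewards in rewards_by_prompt.values() if pvariance(rewards) <= eps
    )
    kept_ratio = len(kept_prompts) / len(rewards_by_prompt)
    return {"zero_var_count": zero_var_count, "kept_ratio": kept_ratio}

# Toy batch (invented): two prompts have identical rewards across rollouts.
batch = {
    "a": [1.0, 1.0, 1.0, 1.0],
    "b": [0.0, 1.0, 0.0, 1.0],
    "c": [0.0, 0.0, 0.0, 0.0],
    "d": [0.0, 0.0, 1.0, 1.0],
}
stats = batch_diagnostics(batch, kept_prompts=["b", "d"])
print(stats)
```

Logging these counters over training gives a direct view of whether zero-variance prompts are accumulating in the batch.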

**Figure 11 (Sokoban):** Training dynamics for Top-P vs Top-K vs No Filtering. (a) Kept ratio, (b) Zero-variance count, (c) Reward variance, (d) Success rate. Top-P filtering maintains a healthy kept ratio while suppressing ZV growth, which goes along with higher final success.

**Figure 12:** Heatmaps of prompt-level reward and RV evolution during training (early/mid/late). As training progresses, prompts polarize into high-RV (healthy learning) and low-RV (collapsed) clusters. Template collapse preferentially affects prompts where the model has reached near-deterministic policy.

5.4 MI vs Task-Solving Scatter

Individual-run scatter plots confirm the correlation: training runs with high MI (measured by MI-ZScore) cluster at high task-solving rates, while low-MI runs cluster at the bottom. Conditional entropy shows no such pattern — confirming MI's diagnostic superiority.

**Figure 13:** Scatter plots of MI (left) and conditional entropy (right) vs. task-solving ratio. Each point represents one training run at one checkpoint. MI shows clear positive correlation, while conditional entropy is uncorrelated, demonstrating MI's advantage as a monitoring metric.

A key practical implication: MI-ZScore can serve as a real-time training monitor. If MI-ZScore begins to fall during training, it is a reliable early warning sign of imminent template collapse — more reliable than entropy declining or output length changes.

**Figure 14:** Hyperparameter sensitivity sweep comparing entropy coefficient, KL coefficient, and Top-P ratio. RV filter sweep (SNR-Aware Filtering) consistently produces trajectories toward higher success rates regardless of parameter value, while entropy and KL sweeps show inconsistent directions, demonstrating the robustness of the filtering approach.

Section 7

Conclusions and Limitations

RAGEN-2 makes three interconnected contributions to the practice of agentic RL training for LLMs:

Template Collapse is Real

A formally defined failure mode in agentic RL where agents produce input-agnostic responses that entropy metrics cannot detect. Consistent across PPO and GRPO, across four environments spanning planning, math, web, and code.

MI Outperforms Entropy

Mutual information (MI-ZScore) achieves +0.39 Spearman correlation with task success while entropy achieves −0.14 — making entropy not just ineffective but actively misleading as a training health monitor.

SNR-Aware Filtering Works

A simple, gradient-level intervention — filtering low-reward-variance prompts before gradient updates — consistently improves performance across all tested environments with minimal computational overhead.

Limitations

SNR-Aware Filtering requires a reward signal to compute reward variance. It cannot be applied in reward-free or purely imitation-learning settings.
The filtering approach introduces computational overhead (requiring N rollouts per prompt to estimate RV) and may reduce effective batch size, which could slow convergence in some settings.
Experiments focus on text-based environments. Extension to multimodal agents (vision, audio) and continuous action spaces remains future work.