Reinforcement learning (RL) training of multi-turn large language model (LLM) agents has shown great promise, yet it faces a critical challenge: reasoning collapse, where a model generates responses that are superficially diverse but semantically repetitive. Previous work used entropy to track training stability, but entropy can be misleadingly high even when the model has stopped attending to different inputs.
This paper introduces the concept of template collapse — a failure mode where agents produce input-agnostic responses despite high entropy — and proposes SNR-Aware Filtering based on reward-variance signal-to-noise ratio to restore diverse, task-grounded reasoning across planning, math, web navigation, and code execution environments.
Multi-turn reinforcement learning for LLM agents promises to unlock complex, sequential decision-making capabilities. However, training instability is a persistent problem. Practitioners have historically tracked entropy of the agent's output distribution as a proxy for training health — high entropy was assumed to indicate diverse, healthy reasoning.
RAGEN-2 challenges this assumption. Through systematic analysis of multiple environments and RL algorithms (PPO and GRPO), the authors show that high entropy is neither sufficient nor necessary for healthy training. An agent can exhibit high entropy while producing semantically identical, input-agnostic responses — the definition of template collapse.
Q1: How can we diagnose reasoning collapse beyond entropy? What metric reliably distinguishes genuinely diverse reasoning from template collapse?
Q2: What is the underlying mechanism causing template collapse, and how can it be mitigated without adding complex regularization?
The paper's answer to Q1 is mutual information (MI) between inputs and outputs — a metric that directly measures whether the agent's responses are actually conditioned on the input. For Q2, they identify a signal-to-noise ratio (SNR) mechanism: when a batch contains many low-reward-variance prompts, the regularization gradient dominates the task gradient, pushing the model toward input-agnostic templates.
Template collapse occurs when an RL-trained LLM agent converges to a fixed set of response templates that are applied regardless of the actual input. The model learns to output phrases like "I need to solve the task step by step" or "Let me think about this carefully" as boilerplate preambles that precede any reasoning, effectively ignoring input-specific information.
What makes this insidious is that standard entropy metrics fail to detect it. Because the templates themselves may vary superficially (the model selects from a pool of template phrases), the overall token distribution can appear diverse. Only when you ask whether the output depends on the input does the collapse become visible.
The key insight is the distinction between two related but different quantities: the conditional entropy H(Y|X) measures the diversity of outputs for a fixed input (how variable are the N samples for prompt x?), while mutual information MI(X;Y) measures whether outputs change across inputs (do different prompts produce meaningfully different responses?).
Mutual information between input X and output Y quantifies how much knowing the input reduces uncertainty about the output:
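For discrete inputs and outputs with joint distribution p(x, y), the standard textbook definition reads as follows. The symbols are generic and may differ from the paper's own notation.

```latex
\mathrm{MI}(X;Y) \;=\; \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
```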
Equivalently, MI = H(Y) − H(Y|X). When MI is high, different inputs lead to genuinely different outputs. When MI is low despite high H(Y), the model has collapsed into input-agnostic behavior — template collapse.
This distinction is crucial: the conditional entropy H(Y|X) measures within-input diversity (variance across rollouts of the same prompt), while MI measures cross-input discriminability (whether the agent responds differently to different situations). Template collapse exhibits high entropy but low MI: the agent is "creative" within a fixed template space but not actually responsive to the task.
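A toy calculation can make the gap between entropy and MI concrete. The sketch below assumes a uniform distribution over four prompts and two made-up agents (the names `grounded` and `collapsed` and all probabilities are illustrative, not taken from the paper); it only evaluates the identity MI = H(Y) − H(Y|X).

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def info_stats(cond):
    """cond[x] is p(y | x) as a list; prompts x are assumed equally likely.

    Returns (H(Y), H(Y|X), MI) in bits.
    """
    n_x = len(cond)
    n_y = len(cond[0])
    # Marginal p(y) is the average of the conditionals under a uniform prompt prior.
    marginal = [sum(row[y] for row in cond) / n_x for y in range(n_y)]
    h_y = entropy(marginal)
    h_y_given_x = sum(entropy(row) for row in cond) / n_x
    return h_y, h_y_given_x, h_y - h_y_given_x

# Hypothetical agent A: every prompt gets its own response (input-dependent).
grounded = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Hypothetical agent B: samples uniformly from four templates for every prompt.
collapsed = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]

for name, cond in [("grounded", grounded), ("collapsed", collapsed)]:
    h_y, h_cond, mi = info_stats(cond)
    print(f"{name}: H(Y)={h_y:.2f}  H(Y|X)={h_cond:.2f}  MI={mi:.2f}")
```

Both toy agents have the same marginal entropy H(Y) of 2 bits, yet the collapsed one has MI of 0 bits: its outputs look diverse but carry no information about the prompt.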
Since exact MI computation is expensive, RAGEN-2 proposes a family of online MI proxies that can be computed efficiently during training:
MI-ZScore: Standardizes the mean reward across prompts using a Z-score. High cross-prompt variance in reward signal indicates input-dependent behavior (high MI). The simplest and most effective proxy.
MI Seq Estimate: Estimates MI directly from the sequential structure of outputs, using the frequency of distinct response prefixes across different prompts as an information-theoretic estimate.
MI-ZScore (Seq): Combines the Z-score approach with sequence-level estimation for a more robust proxy that handles variable-length responses and noisy reward signals.
Understanding why template collapse occurs is essential for designing effective remedies. RAGEN-2 proposes a gradient-level explanation: the signal-to-noise ratio (SNR) of the gradient updates determines whether the model learns task-specific behavior or converges to templates.
The authors group prompts by their in-batch reward variance (RV) into quantile buckets Q1–Q6. For each bucket, they measure three quantities: reward variance, task gradient norm, and regularization gradient norm.
The RL training objective can be decomposed into a task-specific component and a regularization component (e.g., KL divergence from a reference model). The total gradient is the sum of these two components:
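Written out in illustrative notation (the symbols g_task and g_reg are chosen here for convenience and are not necessarily the paper's), the decomposition is:

```latex
\nabla_\theta \mathcal{L}(\theta) \;=\; \underbrace{g_{\text{task}}(\theta)}_{\text{reward-driven}} \;+\; \underbrace{g_{\text{reg}}(\theta)}_{\text{e.g. KL to the reference model}}
```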
The task gradient is proportional to the reward variance within a prompt's rollouts. When reward variance is near zero (all rollouts get the same reward), the task gradient vanishes, and only the regularization gradient drives weight updates.
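The vanishing of the task signal can be seen in a few lines. The sketch below assumes a simplified mean-baseline advantage (reward minus the per-prompt mean), which is a common form but not necessarily the exact estimator used in the paper; the prompt names and rewards are hypothetical.

```python
from statistics import mean, pvariance

def centered_advantages(rewards):
    """Mean-baseline advantages for one prompt's rollouts (assumed, simplified form)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Hypothetical rollout rewards for three prompts, four rollouts each.
batch = {
    "always_fails": [0.0, 0.0, 0.0, 0.0],
    "always_solves": [1.0, 1.0, 1.0, 1.0],
    "mixed": [1.0, 0.0, 1.0, 0.0],
}

for prompt, rewards in batch.items():
    adv = centered_advantages(rewards)
    # The sum of squared advantages equals N times the reward variance.
    signal = sum(a * a for a in adv)
    print(prompt, "RV =", pvariance(rewards), "advantage energy =", signal)
```

Only the mixed prompt produces non-zero advantages. For the other two, every rollout weight is exactly zero, so any update attributed to those prompts comes from the regularization term alone.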
The SNR is defined as the ratio of task gradient magnitude to regularization gradient magnitude. Low-RV prompts have SNR ≪ 1, meaning regularization dominates. The regularization term (KL from reference) pushes the model toward the average behavior over all inputs — i.e., input-agnostic templates.
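As a formula, with g_task(x) and g_reg(x) denoting the task and regularization gradients attributed to prompt x (illustrative symbols, not necessarily the paper's notation), the definition above reads:

```latex
\mathrm{SNR}(x) \;=\; \frac{\lVert g_{\text{task}}(x) \rVert}{\lVert g_{\text{reg}}(x) \rVert},
\qquad
\mathrm{SNR}(x) \ll 1 \ \text{ when the reward variance of prompt } x \text{ is near zero}
```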
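Think of RL training as a tug-of-war between two forces pulling on the model's weights: the task gradient, which pulls toward input-specific behavior that earns reward, and the regularization gradient, which pulls toward the reference model.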
For a prompt where all rollouts receive reward ≈ 0 (the model can't solve it at all) or reward ≈ 1 (the model already solves it perfectly), reward variance is near zero. The task gradient vanishes. Only regularization remains, and it pulls the model toward the average behavior across all inputs — input-agnostic templates. Over thousands of gradient steps, these low-RV prompts accumulate template behavior until it spreads through the whole model.
The remedy is elegantly simple: filter out low-RV (low-SNR) prompts before computing gradients. By keeping only prompts with sufficiently high reward variance, every gradient update is dominated by meaningful task signal. The Top-P variant ranks prompts by RV and cumulatively selects the top fraction:
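One plausible reading of this rule, by analogy with nucleus sampling, is sketched below: rank prompts by reward variance and keep the highest-ranked ones until they account for a fraction p of the batch's total variance. The function name `top_p_filter`, the default p = 0.9, and the example rewards are assumptions for illustration; the paper's exact selection rule may differ.

```python
from statistics import pvariance

def top_p_filter(rewards_by_prompt, p=0.9):
    """Keep the highest-variance prompts whose RV adds up to a fraction p of the total.

    rewards_by_prompt maps a prompt id to the rewards of its rollouts.
    Zero-variance prompts are never kept. The threshold p is illustrative.
    """
    rv = {pid: pvariance(r) for pid, r in rewards_by_prompt.items()}
    total = sum(rv.values())
    if total == 0:
        return []  # no prompt carries task signal in this batch
    kept, cumulative = [], 0.0
    for pid in sorted(rv, key=rv.get, reverse=True):
        if rv[pid] == 0 or cumulative >= p * total:
            break
        kept.append(pid)
        cumulative += rv[pid]
    return kept

batch = {
    "a": [1.0, 0.0, 1.0, 0.0],  # RV = 0.25
    "b": [1.0, 1.0, 1.0, 0.0],  # RV = 0.1875
    "c": [0.0, 0.0, 0.0, 0.0],  # RV = 0 (never solved)
    "d": [1.0, 1.0, 1.0, 1.0],  # RV = 0 (always solved)
}
print(top_p_filter(batch, p=0.9))  # ['a', 'b']
```

Gradients would then be computed only on the rollouts of the kept prompts. A fixed-count variant would simply take the first k entries of the same ranking.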
To validate both the universality of template collapse and the effectiveness of SNR-Aware Filtering, the authors construct a diverse testbed spanning four distinct task types: planning, math, web navigation, and code execution.
Across all four environments, training without SNR-Aware Filtering consistently produces collapse signatures: the MI-ZScore metric drops, success rate stagnates, and output length sharply decreases — a behavioral indicator that the model is generating short, template-like responses.
The output length collapse is particularly informative: when a model falls into template collapse, it learns to front-load a fixed preamble and skip problem-specific reasoning, resulting in shorter total responses. This behavioral signal can serve as an early warning indicator of collapse onset.
Across all four environments, SNR-Aware Filtering (Top-P) outperforms or matches the best baseline. The results show a clear ordering: Top-P ≥ Top-K ≥ No Filtering in terms of final success rate.
The consistency across PPO and GRPO algorithms, and across planning, reasoning, navigation, and coding tasks, suggests that template collapse is a fundamental challenge in agentic RL training — not an artifact of any particular environment or algorithm.
To quantify how well different metrics track task success, the authors compute Spearman rank correlation between each metric and final task success rate across multiple training runs. The results are striking:
| Metric | Spearman ρ | Type |
|---|---|---|
| MI-ZScore | +0.39 | MI-based |
| MI Seq Estimate | +0.22 | MI-based |
| MI-ZScore (Seq) | +0.09 | MI-based |
| Cond. Entropy | −0.14 | Entropy-based |
| Reasoning Entropy | −0.11 | Entropy-based |
The negative correlation of entropy metrics with task success is the central empirical result of the paper: entropy is not just uninformative — it actively misleads. High entropy can co-occur with collapsed training, making it a dangerous monitoring metric. Practitioners relying solely on entropy to detect training problems may actually miss the most severe form of collapse.
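For readers who want to run the same kind of analysis on their own training runs, Spearman's ρ is the Pearson correlation of rank-transformed values. The sketch below is a dependency-free implementation; the per-run numbers are invented for illustration and are not the paper's data.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(a), ranks(b))

# Made-up values: one monitoring metric and one final success rate per run.
metric = [0.1, 0.4, 0.2, 0.9, 0.7]
success = [0.2, 0.5, 0.1, 0.8, 0.9]
print(round(spearman(metric, success), 2))  # 0.8
```

A metric that tracks task success yields a positive ρ; a metric that rises as training degrades would show up with a negative sign, as the entropy rows in the table do.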
SNR-Aware Filtering is most effective when training has a non-trivial degree of stochasticity, i.e., when reward signals are uncertain and variable across rollouts. The authors test performance at different stochasticity levels (0–100%).
This result is theoretically grounded: when rewards are fully deterministic, all prompts have either RV=0 (model already converged) or high RV (still learning), and filtering becomes trivial. The sweet spot is partially stochastic environments — exactly the condition of realistic multi-step agent training.
Deeper analysis reveals how filtering shapes training dynamics. Without filtering, the number of zero-variance (ZV) prompts grows over training as the model collapses to deterministic templates. With Top-P filtering, ZV prompts are actively excluded, maintaining high-quality gradient signal throughout training.
Individual-run scatter plots confirm the correlation: training runs with high MI (measured by MI-ZScore) cluster at high task-solving rates, while low-MI runs cluster at the bottom. Conditional entropy shows no such pattern — confirming MI's diagnostic superiority.
A key practical implication: MI-ZScore can serve as a real-time training monitor. If MI-ZScore begins to fall during training, it is a reliable early warning sign of imminent template collapse, and a more reliable one than a decline in entropy or a change in output length.
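A monitor of this kind could be as simple as comparing a trailing average of the logged metric against its best value so far. The sketch below is one hypothetical way to do that; the function name `collapse_warning`, the window size, the 25% drop threshold, and the example series are all assumptions, and the metric values would come from whatever MI proxy is logged during training.

```python
def collapse_warning(history, window=2, drop=0.25):
    """Return the index of the first logged value at which the trailing-window
    mean falls more than `drop` (as a fraction) below the best trailing-window
    mean seen so far, or None if that never happens."""
    best = None
    for step in range(window, len(history) + 1):
        recent = sum(history[step - window:step]) / window
        if best is None or recent > best:
            best = recent
        elif best > 0 and recent < (1 - drop) * best:
            return step - 1  # index of the last value in the offending window
    return None

# Hypothetical per-checkpoint metric values: rise, plateau, then decline.
mi_proxy = [4.0, 4.0, 5.0, 5.0, 4.0, 3.0, 2.0, 1.0]
print(collapse_warning(mi_proxy))  # 5
```

Averaging over a window trades a little detection latency for robustness to noisy checkpoints; both parameters would need tuning per environment.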
RAGEN-2 makes three interconnected contributions to the practice of agentic RL training for LLMs:
Template collapse: A formally defined failure mode in agentic RL where agents produce input-agnostic responses that entropy metrics cannot detect. Consistent across PPO and GRPO, across four environments spanning planning, math, web, and code.
Mutual information as a diagnostic: Mutual information (MI-ZScore) achieves +0.39 Spearman correlation with task success while entropy achieves −0.14, making entropy not just ineffective but actively misleading as a training health monitor.
SNR-Aware Filtering: A simple, gradient-level intervention (filtering low-reward-variance prompts before gradient updates) consistently improves performance across all tested environments with minimal computational overhead.