arXiv:2604.11297 · April 2026
Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes.
As the fundamental capabilities of Large Language Models (LLMs) have advanced, reinforcement learning has achieved significant success across various domains. Training incorporates reward signals, whether derived from rule-based evaluation or proxy models, and iteratively alternates between a sampling phase and a gradient-based optimization phase. Because the policy is optimized to maximize expected return, the design of the reward becomes the primary way to guide the model's behavior.
The Core Problem: As RL training progresses, the policy often collapses into a narrow, stereotyped set of behaviors. This degeneration produces highly repetitive responses that waste on-policy samples and entrench the model in self-reinforcing erroneous reasoning trajectories. Classical entropy regularization promotes randomness at the distribution level, but fails to address the fundamental issue of recurring behavioral patterns across rollouts.
In reinforcement learning, entropy regularization is a technique that adds a bonus reward for being "random." Think of it like a teacher giving extra credit to a student who tries different approaches to solving a math problem, instead of always using the same method. The problem? This randomness is at the word level—the model might vary which words it picks, but still follow the same flawed reasoning strategy underneath. It's like a student who rearranges the words in their wrong answer instead of actually trying a different solving method.
The challenge is that distribution-level stochastic exploration often cannot disambiguate between the randomness that discovers genuinely novel strategies and the randomness that merely shuffles between the same set of failed approaches. A model may sample diverse tokens yet still follow identical flawed reasoning paths—for instance, repeatedly misreading the problem setup or applying an incorrect formula, even as the surface-level text varies.
MEDS addresses this gap by operating at the behavioral pattern level rather than the token distribution level. Instead of encouraging generic randomness, it identifies and penalizes the specific error patterns that recur across rollouts, directly incentivizing the model to explore genuinely different reasoning strategies.
The MEDS framework augments standard reinforcement learning with a memory-based penalty that targets frequently recurring error patterns. Given an input \(x \sim \mathcal{D}\), the LLM policy \(\pi_\theta\) generates a response \(y \sim \pi_\theta(y|x)\). The standard RL objective maximizes the expected reward \(\mathbb{E}[r(x,y)]\). MEDS modifies this by introducing a shaped reward: \(r_s(x,y) = r(x,y) - \text{penalty}(c_i)\), where \(c_i\) is the cluster assignment based on the response's behavioral pattern.
When an LLM generates text, it doesn't just produce words: at every step it internally computes a raw score for each possible next token, and these scores are later turned into probabilities. These internal signals (logits) are like the model's "thought process." MEDS uses them as a fingerprint for each response's reasoning strategy. Imagine two students solving a math problem: even if their written answers look different on the surface, if they're both making the same conceptual mistake, their internal reasoning patterns would be similar. MEDS captures exactly this by grouping responses according to how the model thinks, not just what it writes.
With the standard reward \(r(x,y)\), the updated policy \(q_1\) converges toward patterns that maximize return. When the reward is replaced by the penalized reward \(r(x,y) - \lambda c(y)\), where \(c(y)\) marks membership in an error cluster and \(\lambda\) scales the penalty, the updated policy \(q_2\) is provably encouraged to move probability mass away from large error clusters.
The key theoretical result (Theorem 2) shows that under the shaped reward, the updated policy \(q_2\) achieves higher entropy \(H(q_2) \geq H(q_1)\) while maintaining expected performance. This means MEDS provably increases exploration diversity without sacrificing quality.
Theorem 2 (Informal): Let \(q_1\) and \(q_2\) be the updated policies under the original reward \(r(x,y)\) and the shaped reward \(r(x,y) - \lambda c(y)\) respectively. Then \(H(q_2) \geq H(q_1)\), meaning the shaped reward provably increases output diversity.
Theorem 2 provides a mathematical guarantee that MEDS works as intended. In plain terms: if you penalize the most common error patterns, the model is provably forced to explore more diverse strategies. The key insight is that this happens without sacrificing quality—the model doesn't become more random in a harmful way. Instead, it specifically avoids repeating the same mistakes, which naturally leads it to discover new and potentially correct approaches. Think of it as: rather than randomly trying doors in a maze, the model remembers which doors led to dead ends and avoids them, guaranteeing it explores more of the maze.
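To make the entropy comparison concrete, here is a toy calculation rather than a proof. It rests on several assumptions that are not stated in the text above: the policy update is modeled as exponential tilting, \(q(y) \propto \pi(y)\exp(r(y)/\beta)\), the form that arises for KL-regularized objectives; there are only three behavior groups; \(c(y)\) is taken to be the current prevalence of each error cluster; and \(\lambda = \beta = 1\). All names and numbers are hypothetical.

```python
import math

def tilt(prior, rewards, beta=1.0):
    """Assumed update rule: q(y) proportional to prior(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(prior, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(q):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in q if p > 0)

# Three behavior groups: a correct strategy, a dominant error cluster, a rare error cluster.
prior = [0.1, 0.8, 0.1]
reward = [1.0, 0.0, 0.0]

# Assumption: c(y) is the prevalence of the error cluster (0 for the correct strategy).
cluster_prevalence = [0.0, 0.8, 0.1]
lam = 1.0
shaped = [r - lam * c for r, c in zip(reward, cluster_prevalence)]

q1 = tilt(prior, reward)   # update under the original reward
q2 = tilt(prior, shaped)   # update under the shaped reward

print("H(q1) =", round(entropy(q1), 3))
print("H(q2) =", round(entropy(q2), 3))
```

In this particular setup the dominant error cluster loses probability mass and the entropy of \(q_2\) exceeds that of \(q_1\). Other priors or penalty definitions could behave differently; the conditions under which the inequality holds in general are those of the theorem itself.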
To implement the indicator function \(c(y)\), MEDS directly leverages the model's own intermediate representations. For each response \(y\) generated by the policy, the method collects logit vectors from a specific intermediate layer. These vectors are pooled (e.g., via mean pooling over sequence positions) to form a fixed-dimensional feature vector that captures the response's reasoning logic. This approach is computationally efficient since the representations are already computed during the standard forward pass—no additional inference is needed.
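The pooling step can be sketched in a few lines. This is a minimal illustration, assuming the per-token vectors have already been read out of the forward pass as plain lists of floats; the function name `mean_pool` and the optional padding mask are hypothetical, and a real pipeline would more likely use a tensor library.

```python
def mean_pool(token_vectors, mask=None):
    """Average per-token vectors over sequence positions.

    token_vectors: list of equal-length float lists, one per token position.
    mask: optional list of 0/1 flags; positions flagged 0 (e.g. padding) are skipped.
    """
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("no unmasked positions to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Three positions of a 2-dimensional representation; the last one is padding.
feature = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], mask=[1, 1, 0])
print(feature)
```

Whatever the response length, the result has the same dimensionality, which is what lets responses be compared and clustered.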
Logits are the raw scores that a language model computes before converting them into probabilities for the next token. For example, if the model is deciding what word comes next, it might assign a score of 5.2 to "the," 3.1 to "and," and -1.7 to "banana." These scores (logits) reveal the model's internal preferences. MEDS collects these from an intermediate layer—not the final output layer—to capture the underlying reasoning logic rather than just the surface-level word choices.
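For reference, the conversion from logits to probabilities is the softmax function. The sketch below applies it to the three example scores above, treating them as the entire vocabulary for the sake of illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

probs = softmax({"the": 5.2, "and": 3.1, "banana": -1.7})
for token, p in probs.items():
    print(f"{token}: {p:.4f}")
```

With only these three candidates, "the" receives roughly 0.89 of the probability mass and "and" roughly 0.11, while "banana" is left with about a tenth of a percent.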
Based on the constructed response representations, MEDS computes cluster assignments using HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). A memory buffer stores feature vectors from past rollouts. For each new batch, the method clusters the batch's feature vectors against this history and assigns each response to a cluster \(c_i\).
The final shaped reward is: \(r_s(x,y) = r(x,y) - \text{penalty}(c_i)\), where the penalty function increases with the size of the assigned cluster, creating a direct pressure against the most common failure modes.
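One way to picture the shaped reward is the sketch below. It makes several assumptions beyond what is stated here: cluster labels are taken as given (for example from a density-based clusterer that marks unclustered points with `-1`), a rollout counts as an error when its reward is at most zero, and the penalty is linear in the cluster's share of all rollouts with a hypothetical strength `lam`. The exact penalty function used by MEDS is not specified in this text.

```python
from collections import Counter

NOISE = -1  # label assumed for responses that belong to no cluster

def shaped_rewards(rewards, labels, lam=0.5):
    """Return r_s = r - penalty(c_i) for each rollout.

    Assumed penalty: lam * (number of failed rollouts in the cluster) / (total rollouts).
    Correct rollouts and noise points receive no penalty.
    """
    n = len(rewards)
    error_counts = Counter(
        c for r, c in zip(rewards, labels) if r <= 0 and c != NOISE
    )
    shaped = []
    for r, c in zip(rewards, labels):
        is_clustered_error = r <= 0 and c != NOISE
        penalty = lam * error_counts[c] / n if is_clustered_error else 0.0
        shaped.append(r - penalty)
    return shaped

# One correct rollout, a 3-member error cluster, a 2-member error cluster, one noise point.
rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
labels = [0, 1, 1, 1, 2, 2, NOISE]
shaped = shaped_rewards(rewards, labels)
print(shaped)
```

Members of the larger error cluster end up with the lowest shaped reward, the smaller cluster is penalized less, and the unclustered failure keeps its original reward, which matches the intended pressure against the most common failure modes.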
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups data points by how densely they're packed together. Unlike K-means, which requires you to specify the number of clusters in advance, HDBSCAN discovers clusters automatically and can handle "noise" (responses that don't fit any pattern). This is ideal for MEDS because: (1) We don't know how many error patterns exist in advance, (2) Some responses might be genuinely unique and shouldn't be forced into a cluster, and (3) Error patterns can have irregular shapes in the feature space.
Table 1 summarizes the performance across three base models and five mathematical reasoning benchmarks. MEDS consistently achieves the best overall average performance, demonstrating strong generalization across models with different levels of prior mathematical training.
Across all configurations, MEDS delivers the highest average pass@1 and pass@128 scores. The improvements are most pronounced on challenging benchmarks like AIME24 and OlympiadBench, where the diversity of explored reasoning strategies matters most. Notably, the gains hold for both pass@1 (accuracy of a single sampled attempt) and pass@128 (at least one correct answer among 128 attempts), indicating that MEDS improves both the quality and breadth of generated solutions.
pass@k measures the probability that at least one of k generated solutions is correct. pass@1 is the accuracy of a single attempt, like getting one shot at an exam question. pass@128 gives the model 128 attempts and checks whether any one of them is correct. Improving pass@1 means a typical single attempt is more likely to be right; improving pass@128 means the model covers a wider range of strategies, so that at least one of them succeeds. MEDS improves both, showing it makes the model both more accurate and more varied in its problem-solving.
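pass@k is commonly estimated from n sampled solutions of which c are correct, using the unbiased estimator \(1 - \binom{n-c}{k} / \binom{n}{k}\). The sketch below implements that standard estimator; whether the reported numbers were computed this way is an assumption, since the text does not say.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones.

    Equals the probability that a random size-k subset of the n samples
    contains at least one correct solution.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n, which is why pass@1 reads as plain single-attempt accuracy.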
To understand how MEDS influences the model's exploration during reasoning, we conducted a detailed analysis from both behavioral and representational perspectives. Using Claude-Haiku-4.5 as a proxy annotator, we evaluated the semantic diversity of sampled responses. MEDS achieves a diversity score of 61.2, substantially higher than DAPO (45.16) and GRPO w/ Entropy Adj. (52.52).
| Method | Diversity Score |
|---|---|
| DAPO | 45.16 |
| GRPO w/ Entropy Adj. | 52.52 |
| MEDS-v1 | 54.71 |
| MEDS-v2 | 53.87 |
| MEDS (Full) | 61.2 |
From a representational perspective, we analyze the Top-1 Eigen Ratio—a measure of representation collapse in the output space. A higher ratio indicates that the model's outputs are concentrating in fewer dimensions, signaling reduced diversity. MEDS maintains a consistently lower eigen ratio throughout training, confirming that it preserves representational diversity at a fundamental level.
The Top-1 Eigen Ratio measures how much the model's outputs concentrate in a single direction in the representation space. Imagine the model's outputs as arrows in a high-dimensional space. If all arrows point roughly the same way, the eigen ratio is high (close to 1.0)—this signals representation collapse, where the model has lost diversity. A lower ratio means the arrows spread out in many directions, indicating diverse reasoning strategies. MEDS keeps this ratio lower than DAPO throughout training, meaning it better preserves the model's ability to think in varied ways.
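A rough sketch of such a ratio follows. The exact definition used in the analysis is not given here, so this assumes one plausible reading: the largest eigenvalue of the (uncentered) second-moment matrix of the feature vectors divided by the sum of all its eigenvalues, which equals the matrix trace. The largest eigenvalue is found by power iteration to keep the example dependency-free; the function names are hypothetical.

```python
import math

def second_moment(features):
    """Uncentered second-moment matrix (1/n) * sum of outer products f f^T."""
    n, d = len(features), len(features[0])
    return [[sum(f[i] * f[j] for f in features) / n for j in range(d)]
            for i in range(d)]

def top1_eigen_ratio(features, iters=200):
    """Assumed definition: largest eigenvalue / sum of eigenvalues (the trace)."""
    m = second_moment(features)
    d = len(m)
    trace = sum(m[i][i] for i in range(d))
    if trace == 0:
        return 0.0
    # Power iteration from a non-uniform start vector; it can fail only if the
    # start happens to be orthogonal to the leading eigenvector.
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            return 0.0
        v = [x / norm for x in w]
    top = sum(v[i] * sum(m[i][j] * v[j] for j in range(d)) for i in range(d))
    return top / trace

collapsed = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]             # all along one direction
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # evenly spread in 2D
print(top1_eigen_ratio(collapsed), top1_eigen_ratio(spread))
```

Vectors that all lie along one direction give a ratio of 1, while vectors spread evenly over two dimensions give 0.5, the minimum possible in 2D.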
A fundamental premise of MEDS is that logit vectors from intermediate layers capture the underlying logical reasoning structure—not just surface-level token predictions. We validate this through both qualitative case studies and quantitative analysis at scale. The logit representations of different responses to the same problem form distinct clusters that correspond to semantically meaningful reasoning strategies (correct vs. incorrect approaches).
t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique for visualizing high-dimensional data in 2D. Think of it as taking a complex 3D sculpture and photographing it from the angle that best shows its structure. Each dot in the figure represents one model response, and responses with similar internal reasoning patterns appear close together. The fact that distinct clusters emerge shows that logit features genuinely capture different reasoning strategies—they're not just random noise.
To validate at scale, we used Claude-Haiku-4.5 as a proxy annotator to label the reasoning strategy of randomly selected responses. The annotation procedure confirmed that logit-based clusters correspond to semantically coherent reasoning patterns: responses in the same cluster tend to follow the same reasoning approach (e.g., attempting prime factorization vs. trial division), regardless of whether they reach the correct answer.
We investigate how different feature construction and clustering methods affect performance. The ablation compares: random cluster assignments (control), semantic features (from the model's text output), and logit features with various clustering algorithms. Results demonstrate that clustering quality matters significantly—logit-based features with HDBSCAN provide the best performance, while random clustering and semantic features are notably inferior.
If MEDS clusters responses randomly (assigning penalties without regard to actual reasoning patterns), the penalties become meaningless noise—hurting good responses as often as bad ones. If it uses only surface-level text features (semantic clustering), it might group responses that look similar but actually use different reasoning strategies. Only logit-based features capture the underlying reasoning logic, enabling HDBSCAN to form clusters that correspond to genuinely shared error patterns. This is why the full MEDS configuration significantly outperforms both random and semantic baselines.
MEDS demonstrates that incorporating historical behavioral signals into reward design can effectively combat the recurring error patterns that plague reinforcement learning for LLM reasoning. The key contributions of this work are:
The main limitation is that the methods explored for utilizing logits are relatively simple and do not incorporate more sophisticated aggregation techniques. Future work could explore more advanced feature extraction from intermediate representations, different clustering algorithms, and application to non-mathematical reasoning tasks such as code generation, multi-step planning, and open-ended creative writing.