The first comprehensive survey covering 200+ papers on how transformers concentrate attention on uninformative tokens, and what we can do about it
Tsinghua University · Meituan LongCat Team · The University of Hong Kong · Xiamen University · University of Michigan · The Ohio State University
As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affects training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on Attention Sink, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field.
This survey organizes the attention sink literature into a clear three-pillar framework. Fundamental Utilization covers how practitioners leverage attention sink patterns for efficient inference (KV cache compression, sparse attention). Mechanistic Interpretation explores why attention sink emerges through theories about softmax constraints, outlier circuits, and geometric properties. Strategic Mitigation presents architectural modifications to reduce or eliminate unwanted attention concentration.
The survey comprehensively classifies over 200 papers into a hierarchical taxonomy. Each branch connects specific research contributions to their respective categories across utilization strategies, interpretation theories, mitigation approaches, and practical applications.
Transformers, grounded in the multi-head self-attention mechanism, have emerged as a foundational architecture in machine learning with unparalleled ability to capture long-range dependencies. However, they exhibit a puzzling behavior: Attention Sink, where certain tokens (typically the first token or special tokens like [CLS]) receive disproportionately high attention regardless of their semantic content. This phenomenon affects model interpretability, inference efficiency, and can contribute to hallucinations.
First comprehensive survey systematically consolidating all AS-related research across Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation
Unified framework that clarifies key concepts, maps the evolution and trends of the field, and establishes connections between different research directions
Practical guidelines for researchers and practitioners covering applications in pre-training, tuning, inference, interpretability, hallucination reduction, safety, and more
Attention Sink refers to the phenomenon where a disproportionate amount of attention weight is concentrated on a small subset of specific yet semantically uninformative tokens. In autoregressive LLMs, this typically manifests as the first token (or BOS token) receiving overwhelmingly high attention scores across most attention heads and layers, regardless of the input content.
Think of attention sink like a meeting where everyone keeps looking at the same person sitting in the corner, even though that person hasn't said anything useful. In a transformer model, tokens (words or image patches) "attend" to each other to understand context. But for some reason, a huge chunk of that attention goes to the very first token, which is often just a formatting marker like <BOS> (Beginning of Sequence). It's as if the model is wasting its processing power staring at a blank spot instead of focusing on the actual content. This survey explores why this happens and what we can do about it.
The concept was first formally identified in autoregressive LLMs, where initial tokens were observed to dominate the attention distribution after Softmax normalization. Because the Softmax function requires attention weights to sum to one, when an attention head has no strong preference for any particular token, it "dumps" excess attention onto easily accessible tokens like the first position. This creates a persistent attention pattern visible as a bright vertical stripe in attention heatmaps.
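One rough way to quantify the "bright vertical stripe" is to average how much attention each query places on the candidate sink position. The sketch below is illustrative only: the function name `sink_mass` and the toy attention matrix are assumptions made for this example, not a metric defined by the survey.

```python
def sink_mass(attn, sink_positions=(0,)):
    """Mean attention mass that queries place on the given sink positions.

    `attn` is a list of rows (one row per query, one column per key),
    where each row sums to 1. The first query is skipped because, under
    a causal mask, it can only attend to itself.
    """
    rows = attn[1:]
    per_query = [sum(row[j] for j in sink_positions) for row in rows]
    return sum(per_query) / len(per_query)


# Toy causal attention matrix (hypothetical values): column 0 is the sink.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0],
    [0.7, 0.1, 0.1, 0.1],
]

print("mass on token 0:", round(sink_mass(attn), 3))
print("mass on token 1:", round(sink_mass(attn, (1,)), 3))
```

A head whose mass on token 0 stays high across unrelated inputs is a candidate sink head; a content-driven head would show that mass moving with the input.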
Attention sink behavior varies across layers and heads. Early layers tend to show strong sink patterns, while deeper layers exhibit more diverse attention distributions. The phenomenon is not limited to the first token; special tokens like [CLS] in BERT and [SEP] can also act as attention sinks in bidirectional models.
Attention sink is not limited to standard autoregressive LLMs. The phenomenon manifests across virtually all transformer-based architectures, from classical masked language models to vision transformers and even video generation models. Each architecture exhibits unique attention sink characteristics.
In bidirectional models like BERT, [CLS] and [SEP] tokens act as attention sinks. [CLS] receives high attention in early layers, while [SEP] dominates in later layers. This pattern was one of the earliest observations of attention concentration on special tokens.
In MoE architectures like DeepSeek and Mixtral, attention sink interacts with expert routing. Sink tokens activate different distributions of experts compared to non-sink tokens, suggesting that the MoE routing mechanism is influenced by and potentially reinforces the attention sink phenomenon.
In vision-language models, visual tokens (<img>) interact with text tokens and the attention sink. The BOS token often absorbs attention that should go to visual content, potentially degrading visual understanding. This has led to attention redistribution techniques that redirect attention from sink tokens to image tokens.
Vision Transformers (ViTs) also exhibit attention sink, where certain patch tokens (often [CLS] or corner patches) receive disproportionate attention. This manifests as artifacts in attention maps and can degrade feature quality. Register tokens have been proposed to absorb excess attention and produce cleaner feature representations.
Attention sink has been observed in video generation transformers, diffusion models, speech models, and other specialized architectures. In video generation, removing attention sink handling leads to temporal inconsistency and visual quality degradation across generated frames.
Rather than treating attention sink as purely a problem, researchers have developed strategies to leverage the phenomenon for practical benefits. Four fundamental approaches have emerged: preserving sink tokens for stable inference, redistributing attention for better content focus, introducing learnable prefix tokens as explicit sinks, and repurposing sink tokens for new functionalities.
Key Takeaway: Keeping a few initial sink tokens in the KV cache is essential for stable long-context inference. StreamingLLM demonstrated that a sliding window plus preserved sink tokens dramatically reduces perplexity compared to naive window-based approaches.
Sink Token Preservation is a widely adopted strategy in LLM inference, particularly in token pruning, KV cache compression, and sparse attention mechanisms. The core insight is simple but powerful: because certain tokens reliably absorb attention across all heads and layers, removing them from the KV cache causes catastrophic performance degradation. By always retaining these critical sink tokens alongside a sliding window of recent tokens, models can process arbitrarily long sequences with bounded memory.
When you chat with an LLM like ChatGPT, the model needs to remember everything said so far. It does this with a KV (Key-Value) cache that stores processed representations of all previous tokens. As conversations get longer, this cache grows and eats up expensive GPU memory. StreamingLLM's insight is brilliant in its simplicity: instead of keeping everything (too expensive) or keeping only recent tokens (output quality collapses once the initial tokens are evicted), just keep the first few "sink" tokens plus a sliding window of recent ones. This small change lets LLMs stream through very long conversations with fixed memory, and it's already used in production systems.
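The retention rule described above can be sketched in a few lines. This is a minimal illustration of "sink tokens plus sliding window", not StreamingLLM's actual implementation: the function name and the default sizes are assumptions, and the sketch only decides which cache entries are kept, leaving out how positions are handled inside the cache.

```python
def streaming_keep_indices(seq_len, num_sink=4, window=8):
    """Return the cache positions to keep: the first `num_sink` tokens
    plus the most recent `window` tokens. Everything in between is evicted."""
    if seq_len <= num_sink + window:
        # Cache still fits within the budget, so nothing is evicted.
        return list(range(seq_len))
    sinks = list(range(num_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


for n in (6, 12, 1000):
    kept = streaming_keep_indices(n)
    print(n, "tokens seen ->", len(kept), "cached:", kept)
```

The cache size is bounded by `num_sink + window` no matter how many tokens have been seen, which is where the fixed memory footprint comes from.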
Building on this insight, researchers have identified different attention head types that inform efficient sparse computation strategies. Lambda-shape heads show the classic attention sink pattern, vertical-slash heads exhibit columnar attention, and block-sparse heads show scattered attention blocks. Understanding these patterns enables targeted optimization of attention computation.
Key Takeaway: Instead of passively accepting attention sink, redistribution actively redirects attention mass from uninformative sink tokens to semantically relevant content tokens, improving model performance without retraining.
Attention Redistribution aims to mitigate the adverse effects of attention sink by reallocating the disproportionate attention mass held by sink tokens to semantically relevant tokens. Unlike preservation, which passively retains sink tokens as stable anchors, redistribution actively reshapes the attention distribution. This is particularly valuable in multi-modal LLMs, where attention absorbed by the BOS token can be redirected to visual content tokens, improving image understanding.
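As a rough illustration of the idea, the sketch below moves part of the sink token's attention weight onto a chosen set of target tokens (for example, image tokens), in proportion to the weight they already hold. The function name, the `fraction` parameter, and the proportional rule are assumptions made for this example; published redistribution methods differ in how they pick heads, targets, and amounts.

```python
def redistribute(weights, sink_idx, target_idx, fraction=0.5):
    """Move `fraction` of the sink's attention weight to the target tokens,
    split in proportion to their current weights. The row still sums to 1.
    Assumes `target_idx` has unique entries and excludes `sink_idx`."""
    out = list(weights)
    moved = out[sink_idx] * fraction
    out[sink_idx] -= moved
    target_total = sum(out[j] for j in target_idx)
    for j in target_idx:
        if target_total > 0:
            share = out[j] / target_total
        else:
            share = 1.0 / len(target_idx)  # fall back to an even split
        out[j] += moved * share
    return out


# Hypothetical attention row: token 0 is the sink, tokens 1-2 are image tokens.
row = [0.6, 0.1, 0.3]
print(redistribute(row, sink_idx=0, target_idx=[1, 2]))
```

Because the weights are edited after Softmax, an intervention of this kind can be applied at inference time without retraining.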
Key Takeaway: Introducing dedicated trainable tokens as explicit attention sinks during pre-training leads to cleaner attention distributions and better model performance than relying on emergent sink behavior.
Learnable Prefix Tokens introduce dedicated, trainable tokens that serve as explicit attention sinks. Unlike natural attention sinks that emerge from the first token or BOS, these tokens are model parameters optimized during training to absorb excess attention mass. Pre-training with explicit sink tokens produces cleaner attention distributions with well-defined sink behavior, reducing interference with content processing.
Key Takeaway: Register tokens in Vision Transformers absorb attention sink artifacts, producing cleaner feature maps. This repurposing converts a liability into a design tool for better representations.
Rather than simply preserving or redistributing sink behavior, some approaches repurpose the sink mechanism itself. In Vision Transformers, register tokens are added that serve as explicit attention sinks, absorbing artifacts that would otherwise corrupt feature maps. Models like DINOv2 with registers show dramatically cleaner attention maps and better downstream performance compared to models without registers.
Understanding why attention sink emerges is crucial for developing principled solutions. Five major theoretical frameworks have been proposed, each offering unique insights into the mechanisms driving this phenomenon. These theories are complementary rather than competing, illuminating different aspects of a complex, multi-faceted behavior.
Key Takeaway: The Softmax function's sum-to-one constraint forces attention heads to allocate weight somewhere, even when no token is truly relevant. Sink tokens serve as "attention dumps" for heads performing near-identity (no-op) operations.
Among the earliest explanations, this theory attributes attention sink to an inherent limitation of the Softmax function. In standard attention, the sum-to-one constraint requires that attention weights over all keys normalize to unity for every query. When an attention head has learned that no meaningful interaction exists for certain query positions, it cannot assign zero attention everywhere. Instead, it concentrates residual probability mass on a convenient dump target, typically the first token, creating the characteristic sink pattern.
Softmax is the function that converts raw attention scores into probabilities. Its key property is that all output values must sum to exactly 1.0 (100%). Here's the problem: imagine you have 100 tokens and an attention head that genuinely doesn't need to focus on any of them for a particular computation. With softmax, it must distribute 100% of its attention somewhere. It can't say "I don't care about any of these." So what does it do? It dumps most of that forced attention onto the first token, a convenient "trash bin" for unwanted attention weight. This is the no-op theory: some attention heads are essentially doing nothing (a "no operation"), but softmax forces them to pretend they're attending to something.
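The sum-to-one constraint is easy to see numerically. In the sketch below, the scores are made-up values chosen for illustration: even when every token gets the same low score, the weights still add up to 1, and a modestly higher score on the first token is enough for it to absorb nearly all of the mass.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# A head with no preference: 100 tokens, all with the same low score.
flat = softmax([-5.0] * 100)
print("sum of weights:", round(sum(flat), 6), "per token:", round(flat[0], 4))

# The same head with a higher score on token 0 (the would-be sink).
sink = softmax([4.0] + [-5.0] * 99)
print("weight on token 0:", round(sink[0], 4))
```

The head cannot output "nothing": the low scores are renormalized into a full probability distribution either way.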
Key Takeaway: Massive activation outliers in specific hidden dimensions create the numerical conditions that sustain attention sink. These outliers form interconnected circuits across layers that amplify and maintain the sink pattern.
The Outlier Circuits perspective addresses a gap left by the Softmax theory: how are attention sinks numerically sustained? This framework identifies systematic outlier activations, specific hidden dimensions with extreme magnitudes, that form interconnected circuits across transformer layers. These outliers emerge in the FFN down-projection, propagate through residual connections, and influence the Q/K dot products that determine attention scores, creating a self-reinforcing loop that maintains the sink pattern.
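Imagine a specific neuron in the model (say, channel #256 in a 4096-dimensional hidden state) that has learned to produce extremely large values, maybe 1000x larger than its neighbors. This "outlier" channel sets off a domino effect: the large value emerges in the FFN down-projection, is carried forward by the residual connections, and skews the Q/K dot products that determine attention scores.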
The lifecycle in Figure 29 shows this process beautifully: the outlier emerges at Layer 1, stabilizes through most of the network, then dissipates near the final layer.
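A toy calculation shows why a single extreme channel matters for attention scores. All numbers below are hypothetical, and the sketch skips the projections and normalization layers of a real model; it only illustrates that one large-magnitude dimension can dominate a query-key dot product.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


DIM = 8        # toy hidden size (real models use thousands of channels)
OUTLIER = 3    # hypothetical index of the outlier channel

query = [0.1] * DIM
query[OUTLIER] = 1.0          # the query reads from the outlier channel

key_content = [0.1] * DIM     # an ordinary content token
key_sink = [0.1] * DIM
key_sink[OUTLIER] = 50.0      # the sink token carries the extreme activation

score_content = dot(query, key_content)
score_sink = dot(query, key_sink)
print("content score:", round(score_content, 2))
print("sink score:", round(score_sink, 2))
```

After Softmax, a gap of this size between the two scores leaves almost no weight for the content token.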
Key Takeaway: Attention to the sink token effectively acts as a learned bias term in the attention output. The value updates from the sink token are nearly constant across all positions, functioning as a global bias rather than content-dependent processing.
This interpretation views attention sink from a functional perspective: the attention weight allocated to the sink token produces a constant value update across all query positions. Since the value vector associated with the sink token is effectively the same regardless of what the rest of the sequence contains, the resulting contribution is a fixed bias added to every position's representation. This elegant theory explains why removing sink tokens is so disruptive, as it removes a learned bias that the model has come to depend on.
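The bias reading can be made explicit by splitting the attention output into a sink term and a content term. The notation here is an assumption introduced for illustration: $o_i$ is the output at query position $i$, $\alpha_{ij}$ the attention weight from query $i$ to key $j$, $v_j$ the value vector, and $s$ the index of the sink token.

```latex
o_i \;=\; \sum_{j} \alpha_{ij}\, v_j
    \;=\; \underbrace{\alpha_{is}\, v_s}_{\text{sink term}}
    \;+\; \underbrace{\sum_{j \neq s} \alpha_{ij}\, v_j}_{\text{content term}}
```

If $v_s$ does not depend on the rest of the sequence and $\alpha_{is}$ is roughly the same for every $i$, the sink term is approximately a constant vector $b \approx \alpha_{is} v_s$ added at every position, which is how a bias behaves.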
Key Takeaway: Initial tokens occupy a distinctive geometric position in the embedding space, forming clusters that act as stable "anchors" attracting attention from all other positions.
This theory examines attention sink through the lens of representation geometry. PCA reveals that initial tokens form distinctive geometric clusters in the embedding space, separate from the manifold occupied by content tokens. With RoPE positional encoding, this separation is even more pronounced, as the encoding creates a natural ordering where initial positions become geometric anchors. The angular proximity of initial token representations to all query vectors explains why they consistently attract high attention scores.
Beyond the four theories above, a fifth group of emerging interpretations explores attention sink through information-theoretic perspectives, training dynamics analysis, and connections to loss landscape geometry. These complementary viewpoints continue to enrich our understanding of why transformers consistently develop this behavior pattern.
While utilization strategies work with attention sink, mitigation strategies aim to reduce or eliminate unwanted attention concentration through architectural modifications. Four main approaches have emerged, each targeting different aspects of the mechanism that produces attention sink.
Key Takeaway: Adding a learnable gate vector G alongside Q, K, V allows the model to explicitly suppress attention sink behavior. The gate controls how much attention information flows through, decoupling the no-op function from attention allocation.
Gated Attention Mechanisms directly respond to the Softmax/No-Op theory. Since attention sink emerges because heads learn to perform no-op operations through the attention mechanism, adding a gate allows the model to achieve the same no-op effect by simply closing the gate, freeing the attention weights to focus on semantically meaningful content. Variants include input-state gating, value-state gating, and attention output gating, each applying the gate at different points in the attention computation.
If you're building an LLM service and want to reduce attention sink, gated attention is one of the most practical options. The idea is to add a small learned vector G (same size as Q, K, V) that acts like a volume knob for each attention head. When the head wants to do a no-op (dump attention on the first token), it can now just "turn down the volume" via the gate instead. This frees the attention weights to focus on actual content. The overhead is minimal: one extra linear projection per layer, roughly a 3% parameter increase for substantial quality gains.
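The "volume knob" can be sketched as an element-wise sigmoid gate on a head's output, which corresponds to the attention output gating variant. This is a simplified illustration under stated assumptions: the gate logits are passed in directly here, whereas in a real model they would come from a learned projection of the input, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_head_output(attn_output, gate_logits):
    """Scale each dimension of a head's attention output by a sigmoid gate."""
    return [sigmoid(g) * o for o, g in zip(attn_output, gate_logits)]


head_out = [1.0, -2.0, 0.5]

# Gate open: the head's output passes through almost unchanged.
print(gated_head_output(head_out, [20.0, 20.0, 20.0]))

# Gate closed: the head performs a no-op without needing a sink token.
print(gated_head_output(head_out, [-20.0, -20.0, -20.0]))
```

With the gate available, a head that has nothing to contribute can close it instead of routing its attention weights to the first token.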
Key Takeaway: Replacing the standard Softmax with alternatives like Softpick or SigSoftmax breaks the sum-to-one constraint that forces attention sink, allowing heads to express "no strong preference" without dumping weight on a single token.
Modified Softmax Functions offer another direct approach to mitigating attention sink by intervening in the Softmax normalization itself. Unlike gated mechanisms that decouple no-op behavior via an additional pathway, these approaches directly address the root cause: the sum-to-one constraint. Alternatives like Softpick allow attention weights to be truly sparse, Softmax1 adds a bias unit that can absorb excess probability, and SigSoftmax combines sigmoid and softmax for more flexible distributions.
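Of the alternatives named above, Softmax1 is the simplest to write down: it adds 1 to the Softmax denominator, which is equivalent to including an extra logit fixed at zero whose probability is discarded. The sketch below is a minimal illustration with made-up scores, not code from any particular paper.

```python
import math


def softmax1(scores):
    """Softmax with an extra 1 in the denominator:
    exp(x_i) / (1 + sum_j exp(x_j)).
    The outputs may sum to less than 1."""
    m = max(max(scores), 0.0)            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = math.exp(-m) + sum(exps)         # exp(-m) is the shifted "+1" term
    return [e / z for e in exps]


# A head with no preference: all 100 tokens get the same low score.
weights = softmax1([-5.0] * 100)
print("total attention assigned:", round(sum(weights), 4))

# With very negative scores the head attends to almost nothing.
print("near-zero total:", sum(softmax1([-20.0] * 10)))
```

Because the total can fall below 1, the head no longer needs a dump target when none of the tokens are relevant.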
Learnable Attention Bias adds trainable bias terms directly to the attention scores before Softmax normalization. By providing an explicit learnable parameter to capture positional preferences, the model no longer needs to use the first token as an implicit bias mechanism. This approach is simple to implement, adds minimal parameters, and can be applied to existing architectures with fine-tuning.
Key Takeaway: The choice of optimizer during pre-training significantly impacts attention sink formation. Muon optimizer produces more uniform activation distributions, reducing the outlier spikes that drive attention sink, compared to Adam which creates extreme channel-specific activations.
Pre-training interventions address attention sink at its origin, during model training. The Muon optimizer, for example, produces dramatically more uniform activation distributions compared to Adam, which tends to create extreme outlier spikes in specific channels. By preventing the formation of outlier circuits during training, these interventions can reduce attention sink without any architectural modifications.
Attention sink knowledge has practical implications across nine key areas of transformer model development and deployment. Understanding and managing attention sink can improve model quality, efficiency, safety, and capability.
Design training procedures that account for attention sink emergence, including optimizer selection and explicit sink token strategies
Fine-tune attention patterns post-training through LoRA on attention weights, bias injection, or attention redistribution
Optimize KV cache management, sparse attention, and token pruning strategies that preserve sink tokens for stable inference
Use attention sink patterns as diagnostic tools for understanding model behavior and identifying attention head specialization
Redirect attention from sink tokens to factual content to reduce hallucinated outputs in both text and multi-modal generation
Detect backdoor attacks and adversarial inputs by analyzing attention sink disruption patterns
Improve overall model quality through better attention distribution across semantically relevant tokens
Enable efficient processing of long sequences through sink-aware KV cache compression and streaming attention
Improve cross-modal understanding by redistributing attention from text sink tokens to visual or audio content
Excessive attention to sink tokens diverts the model's focus from actual content. In vision-language models, this means the model attends to the BOS token instead of the image, generating descriptions of things that aren't there. The attention map below shows how the sink token (bright column) correlates with hallucinated text output.
Attention sink analysis enables new approaches to AI safety. By examining how attention patterns shift around potential trigger tokens, researchers can identify and localize backdoor attacks. The attention sink helps identify where the backdoor is embedded, while value norm analysis reveals how it operates.
This survey presents the first comprehensive review of Attention Sink in Transformer architectures, systematically synthesizing over 200 studies across three dimensions: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Attention sink profoundly influences training dynamics, inference efficiency, and model behavior across LLMs, Vision Transformers, MoE models, and multi-modal architectures. By mapping the landscape of existing research and identifying open challenges, we aim to empower researchers and practitioners to effectively manage attention sink within the current transformer paradigm while inspiring the development of next-generation architectures.