Attention Sink in Transformers: A Survey

Abstract

As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affects training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on Attention Sink, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field.

Survey Overview

This survey organizes the attention sink literature into a clear three-pillar framework. Fundamental Utilization covers how practitioners leverage attention sink patterns for efficient inference (KV cache compression, sparse attention). Mechanistic Interpretation explores why attention sink emerges through theories about softmax constraints, outlier circuits, and geometric properties. Strategic Mitigation presents architectural modifications to reduce or eliminate unwanted attention concentration.

**Figure 1:** Overview of the survey structure, mapping three research pillars and their subcategories

Paper Taxonomy & Research Landscape

The survey comprehensively classifies over 200 papers into a hierarchical taxonomy. Each branch connects specific research contributions to their respective categories across utilization strategies, interpretation theories, mitigation approaches, and practical applications.

**Figure 2:** Complete taxonomy tree of all surveyed papers organized by research dimension

**Figure 3:** Cumulative publication count from 2023 to early 2026, showing explosive growth across all three research pillars

Introduction

Transformers, grounded in the multi-head self-attention mechanism, have emerged as a foundational architecture in machine learning with unparalleled ability to capture long-range dependencies. However, they exhibit a puzzling behavior: Attention Sink, where certain tokens (typically the first token or special tokens like [CLS]) receive disproportionately high attention regardless of their semantic content. This phenomenon affects model interpretability, inference efficiency, and can contribute to hallucinations.

1. First comprehensive survey systematically consolidating all AS-related research across Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation

2. Unified framework that clarifies key concepts, maps the evolution and trends of the field, and establishes connections between different research directions

3. Practical guidelines for researchers and practitioners covering applications in pre-training, tuning, inference, interpretability, hallucination reduction, safety, and more

What is Attention Sink?

Definition

Attention Sink refers to the phenomenon where a disproportionate amount of attention weight is concentrated on a small subset of specific yet semantically uninformative tokens. In autoregressive LLMs, this typically manifests as the first token (or BOS token) receiving overwhelmingly high attention scores across most attention heads and layers, regardless of the input content.

Why Does This Matter?

Think of attention sink like a meeting where everyone keeps looking at the same person sitting in the corner, even though that person hasn't said anything useful. In a transformer model, tokens (words or image patches) "attend" to each other to understand context. But for some reason, a huge chunk of that attention goes to the very first token, which is often just a formatting marker like <BOS> (Beginning of Sequence). It's as if the model is wasting its processing power staring at a blank spot instead of focusing on the actual content. This survey explores why this happens and what we can do about it.

The concept was first formally identified in autoregressive LLMs, where initial tokens were observed to dominate the attention distribution after Softmax normalization. Because the Softmax function requires attention weights to sum to one, when an attention head has no strong preference for any particular token, it "dumps" excess attention onto easily accessible tokens like the first position. This creates a persistent attention pattern visible as a bright vertical stripe in attention heatmaps.
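A minimal sketch of how the "bright vertical stripe" could be quantified, assuming a single head with a causal mask. The `sink_mass` metric, the random scores, and the `+5.0` boost are illustrative assumptions, not measurements from the survey.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax where each query only sees keys at or before its position."""
    n = scores.shape[-1]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def sink_mass(attn, sink_idx=0):
    """Average weight that queries place on the sink position.
    The first query is skipped: under a causal mask it can only attend to itself."""
    return float(attn[1:, sink_idx].mean())

rng = np.random.default_rng(0)
n = 8
scores = rng.normal(size=(n, n))
plain = causal_softmax(scores)

boosted = scores.copy()
boosted[:, 0] += 5.0  # hypothetical head whose raw scores favour the first key
sinky = causal_softmax(boosted)

print(f"sink mass without boost: {sink_mass(plain):.2f}")
print(f"sink mass with boost:    {sink_mass(sinky):.2f}")
```

Every row still sums to one in both cases; the boost only changes where that fixed budget goes, which is the stripe seen in the heatmaps.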

**Figure 4:** Standard Transformer architecture (left) and the attention sink phenomenon shown as concentrated attention on the first column of the attention matrix (right)

Attention sink behavior varies across layers and heads. Early layers tend to show strong sink patterns, while deeper layers exhibit more diverse attention distributions. The phenomenon is not limited to the first token; special tokens like [CLS] in BERT and [SEP] can also act as attention sinks in bidirectional models.

**Figure 6:** Attention heatmaps across different layers and heads of an LLM, showing the characteristic first-column concentration pattern (attention sink) that varies in intensity across the network

**Figure 7:** Modern LLM decoder block architecture (LLaMA-style) with LayerNorm, RoPE positional encoding, multi-head attention, and gated FFN

RoPE (Rotary Position Embedding) is a popular method for telling the transformer where each token is in the sequence. It works by rotating the query and key vectors by an angle proportional to their position, which naturally encodes distance between tokens. The fact that RoPE amplifies attention sink (by making initial positions geometrically distinct) is one of the key insights from this survey.
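The rotation described above can be sketched as follows, assuming the interleaved-pair formulation with the commonly used base of 10000. The function name `rope` and the toy dimensions are illustrative. The sketch only shows that the query-key score depends on relative distance and that position 0 is left unrotated; it does not by itself demonstrate the sink amplification.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE pairs up dimensions, so d must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The same offset (3) at two different absolute positions gives the same score.
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_a, score_b))

# Position 0 receives a zero-angle rotation, i.e. the vector is unchanged.
print(np.allclose(rope(k, 0), k))
```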

Attention Sink Across Model Types

Attention sink is not limited to standard autoregressive LLMs. The phenomenon manifests across virtually all transformer-based architectures, from classical masked language models to vision transformers and even video generation models. Each architecture exhibits unique attention sink characteristics.

Classical Language Models (BERT)

In bidirectional models like BERT, [CLS] and [SEP] tokens act as attention sinks. [CLS] receives high attention in early layers, while [SEP] dominates in later layers. This pattern was one of the earliest observations of attention concentration on special tokens.

**Figure 5:** Attention patterns in BERT showing [CLS] and [SEP] tokens receiving disproportionate attention across layers

Mixture-of-Experts LLMs

In MoE architectures like DeepSeek and Mixtral, attention sink interacts with expert routing. Sink tokens activate different distributions of experts compared to non-sink tokens, suggesting that the MoE routing mechanism is influenced by and potentially reinforces the attention sink phenomenon.

**Figure 9:** Expert activation distribution comparing sink tokens vs non-sink tokens in Qwen3-30B and DeepSeek-V2-Lite

Multi-Modal LLMs

In vision-language models, visual tokens (<img>) interact with text tokens and the attention sink. The BOS token often absorbs attention that should go to visual content, potentially degrading visual understanding. This has led to attention redistribution techniques that redirect attention from sink tokens to image tokens.

**Figure 10:** Multi-modal LLM processing a visual question with attention weights showing sink behavior on the BOS token

Vision Transformers

Vision Transformers (ViTs) also exhibit attention sink, where certain patch tokens (often [CLS] or corner patches) receive disproportionate attention. This manifests as artifacts in attention maps and can degrade feature quality. Register tokens have been proposed to absorb excess attention and produce cleaner feature representations.

**Figure 11:** ViT attention sink visualization showing disproportionate attention on specific patch tokens

Other Transformer Architectures

Attention sink has been observed in video generation transformers, diffusion models, speech models, and other specialized architectures. In video generation, removing attention sink handling leads to temporal inconsistency and visual quality degradation across generated frames.

**Figure 12:** Attention sink effects in video generation, comparing quality with and without attention sink handling across time steps

Pillar 1: Utilization

Fundamental Utilization of Attention Sink

Rather than treating attention sink as purely a problem, researchers have developed strategies to leverage the phenomenon for practical benefits. Four fundamental approaches have emerged: preserving sink tokens for stable inference, redistributing attention for better content focus, introducing learnable prefix tokens as explicit sinks, and repurposing sink tokens for new functionalities.

Sink Token Preservation

Key Takeaway: Keeping a few initial sink tokens in the KV cache is essential for stable long-context inference. StreamingLLM demonstrated that a sliding window plus preserved sink tokens dramatically reduces perplexity compared to naive window-based approaches.

Sink Token Preservation is a widely adopted strategy in LLM inference, particularly in token pruning, KV cache compression, and sparse attention mechanisms. The core insight is simple but powerful: because certain tokens reliably absorb attention across all heads and layers, removing them from the KV cache causes catastrophic performance degradation. By always retaining these critical sink tokens alongside a sliding window of recent tokens, models can process arbitrarily long sequences with bounded memory.

KV Cache: Why This Matters for Real-World LLM Services

When you chat with an LLM like ChatGPT, the model needs to remember everything said so far. It does this with a KV (Key-Value) cache that stores processed representations of all previous tokens. As conversations get longer, this cache grows and eats up expensive GPU memory. StreamingLLM's insight is brilliant in its simplicity: instead of keeping everything (too expensive) or only keeping recent tokens (output quality collapses once the initial tokens are evicted), just keep the first few "sink" tokens plus a sliding window of recent ones. This small change lets LLMs keep generating stably over very long conversations with fixed memory, although tokens that fall out of the window are no longer visible to the model, and it's already used in production systems.
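The eviction policy described here can be sketched in a few lines. This is a simplified illustration, not StreamingLLM's actual code: the class name `SinkWindowCache` and its defaults are hypothetical, integers stand in for per-token key/value tensors, and the reassignment of positions inside the cache is left out.

```python
from collections import deque

class SinkWindowCache:
    """Keeps the first `n_sink` entries forever plus the most recent `window` entries."""

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sinks = []                      # never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops out automatically

    def append(self, kv):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def contents(self):
        return self.sinks + list(self.recent)

cache = SinkWindowCache(n_sink=2, window=3)
for token_pos in range(10):
    cache.append(token_pos)   # stand-in for that position's (key, value) pair

print(cache.contents())  # [0, 1, 7, 8, 9]
```

Memory stays bounded at `n_sink + window` entries no matter how many tokens are appended, at the cost of dropping everything between the sinks and the window.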

**Figure 13:** Comparison of four attention strategies: (a) Dense with full KV cache, (b) Window Attention, (c) Sliding Window with re-computation, (d) StreamingLLM preserving sink tokens. StreamingLLM achieves dramatically better perplexity by keeping attention sink tokens.

Building on this insight, researchers have identified different attention head types that inform efficient sparse computation strategies. Lambda-shape heads show the classic attention sink pattern, vertical-slash heads exhibit columnar attention, and block-sparse heads show scattered attention blocks. Understanding these patterns enables targeted optimization of attention computation.

**Figure 14:** Three attention head types for sparse computation: Lambda-shape (attention sink), vertical-slash, and block-sparse patterns
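As an illustration of the Lambda-shape pattern, the sketch below builds a boolean mask under the assumption that such a head attends to a few initial (sink) keys plus a local band of recent keys. The helper `lambda_mask` and its parameters are hypothetical names, and the vertical-slash and block-sparse patterns are not covered.

```python
import numpy as np

def lambda_mask(n, n_sink, window):
    """True where query i may attend to key j: causal, and either a sink key or a recent key."""
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    causal = k <= q
    is_sink = k < n_sink
    is_local = (q - k) < window
    return causal & (is_sink | is_local)

mask = lambda_mask(n=8, n_sink=1, window=3)
print(mask.astype(int))
```

Each row has at most `n_sink + window` active entries, so the attention cost per query stays constant as the sequence grows.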

Attention Redistribution

Key Takeaway: Instead of passively accepting attention sink, redistribution actively redirects attention mass from uninformative sink tokens to semantically relevant content tokens, improving model performance without retraining.

Attention Redistribution aims to mitigate the adverse effects of attention sink by reallocating their disproportionate attention mass to semantically relevant tokens. Unlike preservation which passively retains sink tokens as stable anchors, redistribution actively reshapes the attention distribution. This is particularly valuable in multi-modal LLMs, where attention absorbed by the BOS token can be redirected to visual content tokens, improving image understanding.
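One simple way to picture redistribution is sketched below. The survey does not prescribe a specific rule here, so the proportional reallocation, the `fraction` parameter, and the function name `redistribute` are all assumptions made for illustration.

```python
import numpy as np

def redistribute(attn_row, sink_idx, target_idx, fraction=0.5):
    """Move `fraction` of the sink's weight onto the target tokens, in proportion
    to the weight those targets already hold. The row total is unchanged."""
    out = np.asarray(attn_row, dtype=float).copy()
    moved = fraction * out[sink_idx]
    target_weights = out[target_idx]
    total = target_weights.sum()
    if total == 0:
        share = np.full(len(target_idx), 1.0 / len(target_idx))  # fall back to an even split
    else:
        share = target_weights / total
    out[sink_idx] -= moved
    out[target_idx] += moved * share
    return out

row = np.array([0.70, 0.05, 0.15, 0.10])   # position 0 is the sink
new_row = redistribute(row, sink_idx=0, target_idx=[1, 2], fraction=0.5)
print(new_row, new_row.sum())
```

In a multi-modal setting, `target_idx` would hold the positions of the visual tokens. Because the edit is applied to post-softmax weights, it needs no retraining.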

Learnable Prefix Tokens

Key Takeaway: Introducing dedicated trainable tokens as explicit attention sinks during pre-training leads to cleaner attention distributions and better model performance than relying on emergent sink behavior.

Learnable Prefix Tokens introduce dedicated, trainable tokens that serve as explicit attention sinks. Unlike natural attention sinks that emerge from the first token or BOS, these tokens are model parameters optimized during training to absorb excess attention mass. Pre-training with explicit sink tokens produces cleaner attention distributions with well-defined sink behavior, reducing interference with content processing.
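A minimal sketch of the mechanics, assuming the sink is a single embedding row prepended to every input sequence. In a real model `sink_embedding` would be a trained parameter; here it is a random placeholder, and the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
sink_embedding = rng.normal(size=(1, d_model)) * 0.02   # placeholder for a trained parameter

def prepend_sink(token_embeddings, sink=sink_embedding):
    """Place the dedicated sink embedding at position 0 of the sequence."""
    return np.concatenate([sink, token_embeddings], axis=0)

tokens = rng.normal(size=(5, d_model))
with_sink = prepend_sink(tokens)
print(with_sink.shape)  # (6, 16)
```

Because the same vector sits at position 0 for every input, heads have a content-independent place to park excess attention.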

**Figure 17:** Comparison of attention patterns when pre-trained without (left) vs with (right) an explicit sink token. The model with dedicated sink tokens shows cleaner, more organized attention distributions.

Sink Token Repurposing

Key Takeaway: Register tokens in Vision Transformers absorb attention sink artifacts, producing cleaner feature maps. This repurposing converts a liability into a design tool for better representations.

Rather than simply preserving or redistributing sink behavior, some approaches repurpose the sink mechanism itself. In Vision Transformers, register tokens are added that serve as explicit attention sinks, absorbing artifacts that would otherwise corrupt feature maps. Models like DINOv2 with registers show dramatically cleaner attention maps and better downstream performance compared to models without registers.

**Figure 19:** Vision register tokens in DeiT-III, OpenCLIP, and DINOv2. Without registers (left), attention maps show artifacts. With registers (right), attention maps are clean and semantically meaningful.

Pillar 2: Interpretation

Mechanistic Interpretation of Attention Sink

Understanding why attention sink emerges is crucial for developing principled solutions. Four major theoretical frameworks have been proposed, alongside a group of emerging interpretations, each offering unique insights into the mechanisms driving this phenomenon. These theories are complementary rather than competing, illuminating different aspects of a complex, multi-faceted behavior.

Softmax Limitations & No-Op Theory

Key Takeaway: The Softmax function's sum-to-one constraint forces attention heads to allocate weight somewhere, even when no token is truly relevant. Sink tokens serve as "attention dumps" for heads performing near-identity (no-op) operations.

Among the earliest explanations, this theory attributes attention sink to an inherent limitation of the Softmax function. In standard attention, the sum-to-one constraint requires that attention weights over all keys normalize to unity for every query. When an attention head has learned that no meaningful interaction exists for certain query positions, it cannot assign zero attention everywhere. Instead, it concentrates residual probability mass on a convenient dump target, typically the first token, creating the characteristic sink pattern.

The Softmax Sum-to-One Problem, Explained Simply

Softmax is the function that converts raw attention scores into probabilities. Its key property is that all output values must sum to exactly 1.0 (100%). Here's the problem: imagine you have 100 tokens and an attention head that genuinely doesn't need to focus on any of them for a particular computation. With softmax, it must distribute 100% of its attention somewhere. It can't say "I don't care about any of these." So what does it do? It dumps most of that forced attention onto the first token, a convenient "trash bin" for unwanted attention weight. This is the no-op theory: some attention heads are essentially doing nothing (a "no operation"), but softmax forces them to pretend they're attending to something.
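The "trash bin" effect can be reproduced with a toy example. The sketch assumes, purely for illustration, that the sink token carries a near-zero value vector, so parking weight on it yields an output close to zero; the sizes and the score of `10.0` are arbitrary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
values = rng.normal(size=(100, 16))
values[0] = 0.0   # assumption: the sink token's value vector is (near) zero

flat_scores = np.zeros(100)     # a head with no preference among 100 tokens
sink_scores = flat_scores.copy()
sink_scores[0] = 10.0           # a head that parks its weight on token 0

w_flat, w_sink = softmax(flat_scores), softmax(sink_scores)
out_flat, out_sink = w_flat @ values, w_sink @ values

print("weights sum:", w_flat.sum(), w_sink.sum())
print("output norm (flat):", np.linalg.norm(out_flat))
print("output norm (sink):", np.linalg.norm(out_sink))
```

Both heads spend exactly 100% of their attention, but the sink-dominated head writes far less into the residual stream, which is what a no-op looks like under softmax.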

**Figure 22:** Detailed attention pattern analysis providing evidence for the no-op theory. Attention weight heatmaps (left) and value state visualizations (right) show near-identity operations in sink-dominated heads.

Outlier Circuits

Key Takeaway: Massive activation outliers in specific hidden dimensions create the numerical conditions that sustain attention sink. These outliers form interconnected circuits across layers that amplify and maintain the sink pattern.

The Outlier Circuits perspective addresses a gap left by the Softmax theory: how are attention sinks numerically sustained? This framework identifies systematic outlier activations, specific hidden dimensions with extreme magnitudes, that form interconnected circuits across transformer layers. These outliers emerge in the FFN down-projection, propagate through residual connections, and influence the Q/K dot products that determine attention scores, creating a self-reinforcing loop that maintains the sink pattern.

Outlier Circuits: A Concrete Example

Imagine a specific neuron in the model (say, channel #256 in a 4096-dimensional hidden state) that has learned to produce extremely large values, maybe 1000x larger than its neighbors. This "outlier" channel creates a domino effect:

The FFN (Feed-Forward Network) produces a huge spike in that channel for the first token
This spike propagates through the residual connection to all subsequent layers
When computing attention, the Query and Key vectors inherit this outlier, making the first token's key vector uniquely large
The dot product between any query and this oversized key produces a huge score
After softmax, this score dominates, creating the attention sink

The lifecycle in Figure 29 shows this process beautifully: the outlier emerges at Layer 1, stabilizes through most of the network, then dissipates near the final layer.
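The last three steps of the domino effect can be sketched numerically. This toy example rests on stated assumptions: the first token's key carries a spike in one channel, and every query has a small positive component in that same channel. Normalization layers and learned projections are ignored, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, outlier_channel = 6, 64, 7   # hypothetical sizes

keys = rng.normal(size=(n_tokens, d))
queries = rng.normal(size=(n_tokens, d))

keys[0, outlier_channel] = 100.0     # the first token's key inherits the spike
queries[:, outlier_channel] = 1.0    # assumption: queries align with that channel

scores = queries @ keys.T / np.sqrt(d)
last = scores[-1]                    # the last query sees every key under a causal mask
weights = np.exp(last - last.max())
weights /= weights.sum()
print(np.round(weights, 3))
```

A single oversized channel is enough to push nearly all of the softmax mass onto the first key, even though the remaining 63 channels are ordinary noise.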

**Figure 25:** 3D activation visualizations in LLaMA-2 showing extreme outlier spikes in specific channels that drive the attention sink phenomenon

**Figure 29:** Complete lifecycle of attention sink in LLaMA2-7B across all layers: Initial (Layer 0), Emergence (Layer 1), Stabilization (Layers 2-29), Dissipation (Layer 30), and Final (Layer 31). Shows how sink emerges, stabilizes, and eventually dissipates.

Implicit Attention Bias

Key Takeaway: Attention to the sink token effectively acts as a learned bias term in the attention output. The value updates from the sink token are nearly constant across all positions, functioning as a global bias rather than content-dependent processing.

This interpretation views attention sink from a functional perspective: the attention weight allocated to the sink token produces a constant value update across all query positions. Since the value vector associated with the sink token is effectively the same regardless of what the rest of the sequence contains, the resulting contribution is a fixed bias added to every position's representation. This elegant theory explains why removing sink tokens is so disruptive, as it removes a learned bias that the model has come to depend on.

In simpler terms: Think of the sink token's value vector as a "baseline setting" for the model. Every position gets this same baseline added to its representation. It's like a camera's white balance: it doesn't change based on what's in the photo, but removing it makes everything look wrong.
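The bias reading can be written out as a short decomposition. The notation is introduced here for illustration and is not taken from the survey: $o_i$ is the attention output at query position $i$, $a_{ij}$ the post-softmax weight on key $j$, $v_j$ the value vector of token $j$, and $s$ the sink position.

```latex
o_i \;=\; \sum_{j} a_{ij}\, v_j
    \;=\; \underbrace{a_{is}\, v_s}_{\text{sink term}}
    \;+\; \sum_{j \neq s} a_{ij}\, v_j ,
\qquad
\text{if } a_{is} \approx \alpha \text{ for all } i:\quad
a_{is}\, v_s \;\approx\; \alpha\, v_s \;=:\; b .
```

Under the assumption that the sink weight is roughly the same $\alpha$ at every position and that $v_s$ does not depend on the rest of the sequence, the sink term reduces to one fixed vector $b$ added everywhere, which is exactly the form of a bias.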

**Figure 30:** Value update decomposition showing that contributions from the sink token are nearly constant across all positions, functioning as an implicit bias

Geometric Anchoring

Key Takeaway: Initial tokens occupy a distinctive geometric position in the embedding space, forming clusters that act as stable "anchors" attracting attention from all other positions.

This theory examines attention sink through the lens of representation geometry. PCA analysis reveals that initial tokens form distinctive geometric clusters in the embedding space, separate from the manifold occupied by content tokens. With RoPE positional encoding, this separation is even more pronounced, as the encoding creates a natural ordering where initial positions become geometric anchors. The angular proximity of initial token representations to all query vectors explains why they consistently attract high attention scores.

**Figure 32:** PCA projections showing token embeddings at different layers. Initial tokens form distinctive geometric clusters that act as attention anchors. RoPE amplifies this separation.

Other Mechanistic Interpretations

Beyond the four major theories, emerging interpretations explore attention sink through information-theoretic perspectives, training dynamics analysis, and connections to loss landscape geometry. These complementary viewpoints continue to enrich our understanding of why transformers consistently develop this behavior pattern.

Pillar 3: Mitigation

Strategic Mitigation of Attention Sink

While utilization strategies work with attention sink, mitigation strategies aim to reduce or eliminate unwanted attention concentration through architectural modifications. Four main approaches have emerged, each targeting different aspects of the mechanism that produces attention sink.

Gated Attention Mechanisms

Key Takeaway: Adding a learnable gate vector G alongside Q, K, V allows the model to explicitly suppress attention sink behavior. The gate controls how much attention information flows through, decoupling the no-op function from attention allocation.

Gated Attention Mechanisms directly respond to the Softmax/No-Op theory. Since attention sink emerges because heads learn to perform no-op operations through the attention mechanism, adding a gate allows the model to achieve the same no-op effect by simply closing the gate, freeing the attention weights to focus on semantically meaningful content. Variants include input-state gating, value-state gating, and attention output gating, each applying the gate at different points in the attention computation.

Gating in Practice: A Production Example

If you're building an LLM service and want to reduce attention sink, gated attention is one of the most practical options. The idea is adding a small learned vector G (same size as Q, K, V) that acts like a volume knob for each attention head. When the head wants to do a no-op (dump attention on the first token), it can now just "turn down the volume" via the gate instead. This means the attention weights are freed to focus on actual content. The overhead is minimal: one extra linear projection per layer, roughly a 3% parameter increase for substantial quality gains.
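A minimal single-head sketch of the "volume knob", assuming the attention-output variant in which a sigmoid gate computed from the layer input scales the attention output elementwise. The weight names, sizes, and random initialization are illustrative, not a specific model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention(x, w_q, w_k, w_v, w_g):
    """Single-head causal attention whose output is scaled elementwise by a sigmoid gate."""
    n, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    attn_out = softmax(scores) @ v
    gate = sigmoid(x @ w_g)          # G: one extra projection, same shape as Q, K, V
    return gate * attn_out, gate

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
w_q, w_k, w_v, w_g = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
out, gate = gated_attention(x, w_q, w_k, w_v, w_g)
print(out.shape, float(gate.min()), float(gate.max()))
```

When a gate entry approaches zero, the head contributes almost nothing at that position regardless of where its attention weights point, so it no longer needs a dump target to implement a no-op.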

**Figure 36:** Gated attention mechanism with an additional gate vector G that controls the flow of attention output

**Figure 39:** Three attention designs compared: vanilla attention (ungated), input-state gated, and value-state gated. The gated variants apply sigmoid gates at different points in the computation.

Modified Softmax Functions

Key Takeaway: Replacing the standard Softmax with alternatives like Softpick or SigSoftmax breaks the sum-to-one constraint that forces attention sink, allowing heads to express "no strong preference" without dumping weight on a single token.

Modified Softmax Functions offer another direct approach to mitigating attention sink by intervening in the Softmax normalization itself. Unlike gated mechanisms that decouple no-op behavior via an additional pathway, these approaches directly address the root cause: the sum-to-one constraint. Alternatives like Softpick allow attention weights to be truly sparse, Softmax1 adds a bias unit that can absorb excess probability, and SigSoftmax combines sigmoid and softmax for more flexible distributions.

Softpick vs Softmax: Standard softmax forces all attention weights to sum to 1. Softpick relaxes this constraint: individual weights can be exactly zero, and a row's weights may sum to less than 1. This means a head can genuinely assign low attention to all tokens when it has nothing useful to compute, eliminating the need for a dump target.
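The difference is easy to see on a row of uniformly low scores. The sketch below assumes the commonly cited forms: Softmax1 adds a constant 1 to the softmax denominator, and Softpick is the rectified form relu(e^x - 1) / (sum |e^x - 1| + eps). The Softpick version here is not numerically stabilized, and SigSoftmax is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax1(x):
    """Softmax with an extra implicit logit fixed at 0 in the denominator."""
    m = max(x.max(), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum())

def softpick(x, eps=1e-6):
    """Rectified variant (assumed form): relu(e^x - 1) / (sum |e^x - 1| + eps)."""
    t = np.exp(x) - 1.0
    return np.maximum(t, 0.0) / (np.abs(t).sum() + eps)

scores = np.array([-4.0, -3.0, -5.0, -4.5])   # a head with nothing to attend to
for fn in (softmax, softmax1, softpick):
    w = fn(scores)
    print(f"{fn.__name__:9s} weights={np.round(w, 3)} sum={w.sum():.3f}")
```

Softmax still hands out a full unit of attention, while the two alternatives let the total fall well below one.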

**Figure 41:** Softmax vs Softpick comparison. Softmax (red boxes) shows strong sink patterns while Softpick (green boxes) produces more distributed attention without the first-column concentration.

Learnable Attention Bias

Learnable Attention Bias adds trainable bias terms directly to the attention scores before Softmax normalization. By providing an explicit learnable parameter to capture positional preferences, the model no longer needs to use the first token as an implicit bias mechanism. This approach is simple to implement, adds minimal parameters, and can be applied to existing architectures with fine-tuning.

Pre-training Interventions

Key Takeaway: The choice of optimizer during pre-training significantly impacts attention sink formation. Muon optimizer produces more uniform activation distributions, reducing the outlier spikes that drive attention sink, compared to Adam which creates extreme channel-specific activations.

Pre-training interventions address attention sink at its origin, during model training. The Muon optimizer, for example, produces dramatically more uniform activation distributions compared to Adam, which tends to create extreme outlier spikes in specific channels. By preventing the formation of outlier circuits during training, these interventions can reduce attention sink without any architectural modifications.

**Figure 42:** Comparison of FFN input activations with different optimizers: (a) Adam creates extreme outlier spikes, (b) Muon produces more uniform distributions, (c) Muon with OSP further smooths activations

Applications & Practical Guidelines

Attention sink knowledge has practical implications across nine key areas of transformer model development and deployment. Understanding and managing attention sink can improve model quality, efficiency, safety, and capability.

Model Pre-training

Design training procedures that account for attention sink emergence, including optimizer selection and explicit sink token strategies

Model Tuning

Fine-tune attention patterns post-training through LoRA on attention weights, bias injection, or attention redistribution

Model Inference

Optimize KV cache management, sparse attention, and token pruning strategies that preserve sink tokens for stable inference

Interpretability

Use attention sink patterns as diagnostic tools for understanding model behavior and identifying attention head specialization

Reducing Hallucination

Redirect attention from sink tokens to factual content to reduce hallucinated outputs in both text and multi-modal generation

Safety & Robustness

Detect backdoor attacks and adversarial inputs by analyzing attention sink disruption patterns

General Capability

Improve overall model quality through better attention distribution across semantically relevant tokens

Long-Context Enhancement

Enable efficient processing of long sequences through sink-aware KV cache compression and streaming attention

Multi-Modal Enhancement

Improve cross-modal understanding by redistributing attention from text sink tokens to visual or audio content

Practical Checklist: When to Care About Attention Sink

Not every ML project needs to worry about attention sink. Here's a quick guide:

Deploying long-context LLMs? Attention sink awareness is critical for KV cache management
Building vision-language models? Attention redistribution can significantly improve visual understanding
Fine-tuning for reduced hallucination? Redirecting attention from sink tokens to content tokens helps
Training from scratch? Consider Muon optimizer and explicit sink tokens for cleaner attention patterns
Short-context classification tasks? Attention sink impact is minimal; standard approaches work fine

Spotlight: Attention Sink & Hallucination

Excessive attention to sink tokens diverts the model's focus from actual content. In vision-language models, this means the model attends to the BOS token instead of the image, generating descriptions of things that aren't there. The attention map below shows how the sink token (bright column) correlates with hallucinated text output.

**Figure 20:** Attention map showing the relationship between sink tokens and hallucination. The bright column indicates attention sink, with the model generating hallucinated content.

Spotlight: Safety & Backdoor Detection

Attention sink analysis enables new approaches to AI safety. By examining how attention patterns shift around potential trigger tokens, researchers can identify and localize backdoor attacks. The attention sink helps identify where the backdoor is embedded, while value norm analysis reveals how it operates.

**Figure 21:** Attention sink in machine unlearning and backdoor detection, showing how attention patterns help identify and neutralize planted backdoors

Challenges & Future Directions

Current Challenges

Computational overhead: Efficient and accurate detection of dynamic sinks remains an open challenge, as dynamic identification incurs additional computational overhead
Kernel compatibility: Many mitigation techniques operate on attention scores after Softmax, limiting compatibility with hardware-optimized attention kernels like FlashAttention
Theory unification: The mechanistic theories remain largely independent; a unified framework explaining all aspects of attention sink is still lacking
Cross-architecture generalization: Techniques developed for one architecture may not transfer well to others (LLMs vs ViTs vs MoE models)
Evaluation standardization: There is no standardized benchmark for measuring attention sink intensity and mitigation effectiveness across models

Future Directions

Efficient AS handling: Lightweight detection of dynamic sinks, efficient attention redistribution, and low-latency gated attention implementations
Hardware-native solutions: Designing attention sink mitigation that works within FlashAttention and other optimized kernels rather than around them
Unified mechanistic theory: Combining softmax constraints, outlier circuits, implicit bias, and geometric anchoring into a comprehensive framework
Sink-free architectures: Designing next-generation transformers that inherently avoid attention sink through architectural innovations
Multi-modal optimization: Developing attention sink management strategies specifically designed for vision-language and other multi-modal architectures

Conclusion

This survey presents the first comprehensive review of Attention Sink in Transformer architectures, systematically synthesizing over 200 studies across three dimensions: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Attention sink profoundly influences training dynamics, inference efficiency, and model behavior across LLMs, Vision Transformers, MoE models, and multi-modal architectures. By mapping the landscape of existing research and identifying open challenges, we aim to empower researchers and practitioners to effectively manage attention sink within the current transformer paradigm while inspiring the development of next-generation architectures.

Keywords

Attention Sink, Transformer, Large Language Model, Attention Mechanism, KV Cache, Vision Transformer, Softmax, Survey

References

A. Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.
J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL, 2019.
A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ICLR, 2021.
T. Brown et al., "Language Models are Few-Shot Learners," NeurIPS, 2020.
H. Touvron et al., "LLaMA: Open and Efficient Foundation Language Models," arXiv, 2023.
A. Jiang et al., "Mistral 7B," arXiv, 2023.
G. Xiao et al., "Efficient Streaming Language Models with Attention Sinks," ICLR, 2024.
T. Darcet et al., "Vision Transformers Need Registers," ICLR, 2024.
Full reference list available in the original paper.