---
arxiv_id: 2603.15031
title: "Attention Residuals"
authors:
  - Kimi Team
  - Guangyu Chen
  - Yu Zhang
  - Jianlin Su
  - Weixin Xu
  - Siyuan Pan
  - Yaoyu Wang
  - Yucheng Wang
  - Guanduo Chen
  - Bohong Yin
  - Yutian Chen
  - Junjie Yan
  - Ming Wei
  - Y. Zhang
  - Fanqing Meng
  - Chao Hong
  - Xiaotong Xie
  - Shaowei Liu
  - Enzhe Lu
  - Yunpeng Tai
  - Yanru Chen
  - Xin Men
  - Haiqing Guo
  - Y. Charles
  - Haoyu Lu
  - Lin Sui
  - Jinguo Zhu
  - Zaida Zhou
  - Weiran He
  - Weixiao Huang
  - Xinran Xu
  - Yuzhi Wang
  - Guokun Lai
  - Yulun Du
  - Yuxin Wu
  - Zhilin Yang
  - Xinyu Zhou
difficulty: Advanced
tags:
  - LLM
  - Reasoning
published_at: 2026-03-16
flecto_url: https://flecto.zer0ai.dev/papers/2603.15031/
lang: en
---

## Head Title

### Attention Residuals: Replacing Fixed Accumulation with Learned Depth-wise Attention

## Hero H1

### Attention Residuals

## Hero Subtitle

### Replacing Fixed Accumulation with Learned, Input-Dependent Depth-wise Attention in LLMs

## Hero Authors

### Kimi Team (MoonshotAI)

## Hero Abstract

Standard residual connections in LLMs accumulate all layer outputs with fixed unit weights, causing uncontrolled growth and diluting each layer's contribution. Attention Residuals (AttnRes) replaces this fixed accumulation with softmax attention over preceding layer outputs, enabling each layer to selectively aggregate earlier representations with learned, input-dependent weights. Block AttnRes makes this practical at scale with minimal overhead.

## Key Idea H2

### The Core Idea

## Key Idea P1

Standard residual connections are the backbone of modern LLMs. The update rule h_l = h_{l-1} + f(h_{l-1}) provides a gradient highway that enables stable training. However, with PreNorm (the dominant paradigm), this fixed accumulation causes hidden-state magnitudes to grow as O(L) with depth, progressively diluting each layer's relative contribution.

## Key Idea P2

AttnRes draws on a fundamental insight: depth-wise accumulation in residual networks is formally dual to sequential recurrence in RNNs. Just as Transformers improved upon RNNs by replacing fixed recurrence with attention over sequence positions, AttnRes replaces fixed depth-wise accumulation with attention over layer outputs.
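To make the contrast concrete, here is a tiny NumPy sketch of fixed accumulation versus softmax-weighted aggregation over layer outputs. The shapes, random layer outputs, and random attention logits are purely illustrative, not the paper's actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 8, 16                                      # toy depth and hidden size
outputs = [rng.normal(size=d) for _ in range(L)]  # stand-ins for f_j(h_j)

# Standard residual: fixed unit weights, so every layer contributes
# equally (1/L in relative terms) and the state magnitude grows with depth.
h_fixed = np.sum(outputs, axis=0)

# AttnRes-style aggregation: softmax weights over the same outputs, so the
# final state can emphasize some layers and suppress others.
scores = rng.normal(size=L)       # stand-in for the q^T RMSNorm(k_j) logits
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()              # normalized depth-attention weights
h_attn = sum(a * v for a, v in zip(alpha, outputs))
```

The only change is the weighting: the fixed all-ones weights become a learned, normalized distribution over depth.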

## Key Idea Figcaption

Figure 1: Overview of Attention Residuals. (a) Standard Residuals: fixed additive accumulation. (b) Full AttnRes: each layer attends to all previous layer outputs. (c) Block AttnRes: layers grouped into blocks for practical scalability.

## Key Idea Card1 H3

### Standard Residuals

## Key Idea Card1 P

Fixed unit weights accumulate all layer outputs uniformly. Hidden states grow as O(L) with depth, diluting each layer's contribution. No mechanism to adapt mixing across depth.

## Key Idea Card2 H3

### Full AttnRes

## Key Idea Card2 P

Softmax attention over all preceding layer outputs. Input-dependent, learned weights via a pseudo-query. Optimal performance but O(Ld) memory.

## Key Idea Card3 H3

### Block AttnRes

## Key Idea Card3 P

Layers partitioned into blocks, attending over block-level representations. Reduces memory from O(L) to O(N). Practical drop-in replacement with minimal overhead.

## Introduction H2

### Introduction

## Introduction P1

Standard residual connections are the de facto building block of modern LLMs. The update h_l = h_{l-1} + f_{l-1}(h_{l-1}) provides a gradient highway that lets gradients bypass transformations via identity mappings, enabling stable training at depth. Yet residuals also play a second, less-discussed role: they define how each layer's output is aggregated into a single, growing hidden state.

## Introduction P2

In practice, PreNorm has become the dominant paradigm, yet its unweighted accumulation causes hidden-state magnitudes to grow as O(L) with depth. This progressively dilutes each layer's relative contribution: early-layer information is buried and cannot be selectively retrieved. Empirically, the authors observe that the first and last layers often have outsized influence while middle layers contribute little.

## Introduction P3

The paper observes a formal duality between depth-wise accumulation and the sequential recurrence in RNNs. Building on this duality, the authors propose Attention Residuals (AttnRes), which replaces the fixed accumulation with h_l = Σ_j α_{l→j} · v_j, where the α are softmax attention weights computed from a single-head dot-product between a learned per-layer query and the preceding layer outputs.

## Introduction P4

In standard training, Full AttnRes adds negligible overhead since the required layer outputs are already retained for backpropagation. At scale, however, activation recomputation and pipeline parallelism are routinely employed. Block AttnRes addresses this by partitioning layers into blocks, using cache-based P2P communication and a two-phase inference strategy.

## Introduction Contributions H3

### Key Contributions

## Introduction Contribution1

Attention Residuals: Replaces fixed residual accumulation with learned softmax attention over depth, plus Block AttnRes that reduces memory from O(Ld) to O(Nd).

## Introduction Contribution2

Infrastructure for scale: Cross-stage caching eliminates redundant transfers under pipeline parallelism; two-phase inference amortizes cross-block attention via online softmax merge.

## Introduction Contribution3

I/O analysis: Block AttnRes achieves only 5.5d total I/O per layer (vs 3d for standard residuals and 34d for mHC).

## Introduction Contribution4

Comprehensive evaluation: Consistent scaling law improvements, gains on all 15 downstream benchmarks in Kimi Linear 48B, and stabilized training dynamics.

## Motivation H2

### Motivation: Why Fixed Residuals Fall Short

## Motivation P1

Residual learning is critical for training deep networks. Each layer updates the hidden state as h_l = h_{l-1} + f_{l-1}(h_{l-1}). Expanding this recurrence, the hidden state at layer l equals the sum of the embedding and all preceding layer outputs: h_l = h_1 + Σ_i f_i(h_i). The identity mapping provides a direct gradient path from the loss to any layer.
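A quick sanity check of this unrolled form, using toy tanh layers as stand-ins for each layer's transformation f_i (the weights and dimensions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 6, 8
W = [rng.normal(scale=0.1, size=(d, d)) for _ in range(L)]

def f(i, h):
    # toy stand-in for the i-th layer's transformation f_i
    return np.tanh(W[i] @ h)

h = rng.normal(size=d)            # h_1: the token embedding
embedding = h.copy()
layer_outputs = []
for i in range(L):
    out = f(i, h)
    layer_outputs.append(out)
    h = h + out                   # h_l = h_{l-1} + f_{l-1}(h_{l-1})

# Unrolled form: final state = embedding + sum of all layer outputs
assert np.allclose(h, embedding + np.sum(layer_outputs, axis=0))
```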

## Motivation P2

However, the fixed unit coefficients treat every layer's contribution uniformly. Highway networks relax this with learned element-wise gates, interpolating between transformation and identity. But both approaches share a fundamental constraint: each layer can only access its immediate input h_{l-1}, a single compressed state that conflates all earlier outputs.

## Motivation P3

This means there is (1) no selective retrieval of specific earlier-layer features, (2) no direct gradient pathway from deeper layers to individual earlier layers, and (3) a representational bottleneck, since all prior computation is compressed into a single state vector.

## Motivation P4

These limitations mirror the well-known bottlenecks of RNNs in sequence modeling, where the fixed sequential recurrence was eventually replaced by attention. This parallel motivates the core proposal: replace fixed depth-wise accumulation with attention-based aggregation.

## Methodology H2

### Attention Residuals: Method

## Methodology Unified P1

The key insight is a duality between time and depth. Like RNNs over time, residual connections compress all prior information into a single state over depth. For sequence modeling, the Transformer improved upon RNNs by replacing recurrence with attention, allowing each position to selectively access all previous positions. AttnRes applies the same principle to the depth dimension.

## Methodology Unified P2

The general form replaces the fixed accumulation with h_l = Σ_j α_{l→j} · v_j, where the α are layer-specific attention weights satisfying Σ_j α_{l→j} = 1. Unlike sequence length (which can reach millions), network depth is typically modest (L < 1000), making O(L^2) attention over depth computationally feasible.

## Methodology Full H3

### Full Attention Residuals

## Methodology Full P1

The attention weights are computed as α_{l→j} = φ(q_l, k_j) using a kernel function φ. The authors adopt φ(q, k) = exp(q^T RMSNorm(k)) with softmax normalization. The query q_l = w_l is a layer-specific learnable parameter (not input-dependent), a deliberate design choice that enables parallel computation.
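A minimal NumPy sketch of this kernel, treating preceding layer outputs as keys; the dimensions, the fixed pseudo-query, and the key magnitudes are illustrative assumptions:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # normalize a vector to unit root-mean-square magnitude
    return x / np.sqrt(np.mean(x * x) + eps)

def attnres_weights(q, keys):
    """Softmax over exp(q^T RMSNorm(k_j)) for the preceding outputs k_j."""
    logits = np.array([q @ rmsnorm(k) for k in keys])
    logits -= logits.max()            # numerical stability
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(2)
d = 16
q = rng.normal(size=d)                # learned per-layer pseudo-query w_l
# keys at very different scales: RMSNorm keeps the large-magnitude
# output from automatically dominating the attention weights
keys = [rng.normal(scale=s, size=d) for s in (0.1, 1.0, 10.0)]
alpha = attnres_weights(q, keys)      # normalized weights, one per source
```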

## Methodology Full P2

The RMSNorm inside φ prevents layers with large-magnitude outputs from dominating the attention weights. For each token, Full AttnRes requires O(L^2 d) arithmetic and O(Ld) memory. Since depth is far smaller than sequence length, the cost is modest.

## Methodology Full P3

Zero overhead in standard training: The O(Ld) memory overlaps entirely with activations already retained for backpropagation. The pseudo-query independence also means attention weights for any group of layers can be computed in parallel without waiting for sequential layer execution.

## Methodology Block H3

### Block Attention Residuals

## Methodology Block P1

Block AttnRes partitions the L layers into N blocks of S = L/N layers each. Within each block, layer outputs are reduced to a single representation via summation. Across blocks, full attention is applied over only N block-level representations plus the token embedding. This reduces memory from O(L) to O(N) and computation from O(L^2) to O(N^2).

## Methodology Block P2

The block count N interpolates between two extremes: N = L recovers Full AttnRes, while N = 1 reduces to standard residual connections. In practice, S = 4 (i.e., 4 layers per block) captures most of the benefit while keeping overhead minimal.
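The block-level reduction and attention can be sketched as follows; this simplified version omits the token embedding and uses random stand-ins for the layer outputs and query:

```python
import numpy as np

rng = np.random.default_rng(3)
L, d, S = 12, 16, 4               # 12 layers, block size 4 -> N = 3 blocks
N = L // S
outputs = rng.normal(size=(L, d)) # stand-ins for per-layer outputs f_j(h_j)

# Within each block, reduce S layer outputs to one representation by summation.
block_reps = outputs.reshape(N, S, d).sum(axis=1)

# Depth attention now runs over only N block representations
# (the full method also includes the token embedding as a source).
q = rng.normal(size=d)            # stand-in for the per-layer pseudo-query
logits = block_reps @ q
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()
h = alpha @ block_reps            # memory: O(N*d) instead of O(L*d)
```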

## Methodology Block P3

The two-phase computation strategy enables efficient inference: Phase 1 computes inter-block attention for all S layers simultaneously via a batched query against cached block representations. Phase 2 computes intra-block attention sequentially, then merges with Phase 1 results through online softmax. This amortizes memory access costs across the block.
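The online softmax merge at the heart of this strategy can be illustrated in isolation. The sketch below, with made-up logits and values standing in for Phase 1 (inter-block) and Phase 2 (intra-block) sources, merges two partial softmax states and checks the result against a single-pass softmax:

```python
import numpy as np

def softmax_partial(logits, values):
    """Return (max, normalizer, weighted sum) for one chunk of sources."""
    m = logits.max()
    e = np.exp(logits - m)
    return m, e.sum(), e @ values

def online_merge(p1, p2):
    """Merge two partial softmax states, rescaling to a common max."""
    (m1, z1, s1), (m2, z2, s2) = p1, p2
    m = max(m1, m2)
    z = z1 * np.exp(m1 - m) + z2 * np.exp(m2 - m)
    s = s1 * np.exp(m1 - m) + s2 * np.exp(m2 - m)
    return m, z, s

rng = np.random.default_rng(4)
d = 8
logits = rng.normal(size=6)
values = rng.normal(size=(6, d))

# Reference: one softmax over all six sources at once.
w = np.exp(logits - logits.max()); w /= w.sum()
ref = w @ values

# Two-phase: first four sources (Phase 1) merged with the
# remaining two (Phase 2) via online softmax.
m, z, s = online_merge(softmax_partial(logits[:4], values[:4]),
                       softmax_partial(logits[4:], values[4:]))
assert np.allclose(s / z, ref)    # identical to the single-pass result
```

Because the merge is exact, Phase 1 can be computed once per block and combined with each layer's intra-block lookback without redoing the full pass.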

## Methodology Algorithm Figcaption

Algorithm 1: Two-phase computation for Block AttnRes. Phase 1 batches inter-block queries; Phase 2 handles sequential intra-block attention with online softmax merge.

## Infrastructure H2

### Infrastructure Design

## Infrastructure Training H3

### Training at Scale

## Infrastructure Training P1

For small-scale training, AttnRes adds negligible computation overhead and no extra memory usage. Under large-scale distributed training, pipeline parallelism poses the primary infrastructure challenge: Full AttnRes requires every pipeline stage to access all preceding stages' layer outputs, which are not locally available under pipeline parallelism.

## Infrastructure Training P2

Cross-stage caching solves this: since each physical stage processes multiple virtual stages in succession, blocks received during earlier virtual stages are cached locally and need not be re-transmitted. This reduces peak per-transition cost from O(C) to O(P), a V× improvement that enables full overlap with computation. The measured end-to-end overhead is less than 4%.

## Infrastructure Pipeline Figcaption

Figure 2: Cache-based pipeline communication. Hatched boxes denote end-of-block boundaries. Each rank caches previously received blocks, so stage transitions only transmit incremental blocks.

## Infrastructure Training P3

Memory overhead is negligible: with cross-stage caching, each block is stored exactly once across all virtual stages, which is tiny relative to standard per-layer activation cache.

## Infrastructure Inference H3

### Inference Optimization

## Infrastructure Inference P1

The two-phase computation strategy applies to both Full and Block AttnRes. A naive implementation would compute attention at every layer, requiring a full pass over all block representations each time. Instead, Phase 1 batches all S queries within a block into a single pass, then Phase 2 handles the sequential intra-block lookback with online softmax merge.

## Infrastructure Inference P2

With this design, the total per-layer I/O cost for Block AttnRes is only 5.5d (reads + writes), compared to 3d for standard residuals and a massive 34d for mHC. Phase 1 can also partially overlap with computation, further hiding its cost.

## Infrastructure Io Figcaption

Table 1: Memory access cost per token per layer. Block AttnRes achieves near-standard overhead at 5.5d total I/O, while mHC requires 34d.

## Experiments H2

### Experiments

## Experiments Scaling H3

### Scaling Laws

## Experiments Scaling P1

Five model sizes (194M to 528M activated parameters) were trained with three variants each: PreNorm baseline, Full AttnRes, and Block AttnRes with ~8 blocks. All variants share identical hyperparameters and data within each size group, isolating the effect of the residual mechanism.

## Experiments Scaling Figcaption

Figure 3: Scaling law curves. Both Full and Block AttnRes consistently outperform the baseline across all compute budgets. Block AttnRes matches the loss of a baseline trained with 1.25&times; more compute.

## Experiments Scaling P2

The fitted scaling curves show: the baseline follows L = 1.891 × C^(-0.057), Block AttnRes fits L = 1.870 × C^(-0.058), and Full AttnRes fits L = 1.865 × C^(-0.057). All three exhibit similar slopes, but AttnRes consistently achieves lower loss. At the largest scale, the gap between Full and Block AttnRes narrows to just 0.001.

## Experiments Scaling Table Figcaption

Table 2: Model configurations and validation loss across five sizes. Full AttnRes achieves the best loss at every scale (bolded values).

## Experiments Main H3

### Main Results: Kimi Linear 48B

## Experiments Main P1

The full Kimi Linear 48B configuration uses 27 Transformer blocks (54 layers) with MoE, yielding 48B total and 3B activated parameters. Block AttnRes is applied with 6 layers per block, producing 9 blocks. The model is pre-trained on 1.4T tokens with a 4096-token context window.

## Experiments Main P2

Training dynamics analysis reveals three key benefits: (1) lower validation loss throughout training, with the gap widening during the decay phase; (2) uniform output magnitudes across depth, eliminating the PreNorm dilution where deeper layers must learn increasingly large outputs; and (3) stabilized gradient distribution, preventing disproportionately large gradients in early layers.

## Experiments Dynamics Figcaption

Figure 4: Training dynamics comparison. (a) Validation loss over training steps. (b) Output magnitude per transformer block. (c) Gradient magnitude per block. AttnRes achieves uniform magnitudes across depth.

## Experiments Benchmark Figcaption

Table 3: Downstream benchmark results on 15 tasks across General, Math & Code, and Chinese categories. AttnRes matches or outperforms the baseline on all benchmarks.

## Experiments Benchmark P

Block AttnRes matches or outperforms the baseline on all 15 benchmarks. Improvements are particularly pronounced on multi-step reasoning tasks such as GPQA-Diamond (+7.5) and Math (+3.6), as well as on code generation benchmarks like HumanEval (+3.1). Knowledge-oriented benchmarks like MMLU and HellaSwag also show modest gains.

## Experiments Ablation H3

### Ablation Study

## Experiments Ablation P

Ablation studies on the 436M model validate key design choices. All variants share identical hyperparameters and compute budget, isolating each component's contribution.

## Experiments Ablation Figcaption

Table 4: Ablation results. Full AttnRes achieves 1.737, and input-dependent query further improves to 1.731. DenseFormer (1.767) performs no better than the baseline (1.766).

## Experiments Ablation Li1

Input-dependent query further lowers loss to 1.731, but introduces extra computation per layer, so the default uses a learned (static) query.

## Experiments Ablation Li2

Input-independent mixing (removing query/key, using learnable scalars) hurts performance significantly (1.749 vs 1.737), confirming the importance of content-dependent aggregation.

## Experiments Ablation Li3

Softmax vs sigmoid: Sigmoid degrades performance (1.741) due to the lack of competitive normalization that forces sharper selection among sources.

## Experiments Ablation Li4

Block size S=4 nearly matches Full AttnRes (1.746 vs 1.737), offering an excellent trade-off between performance and memory overhead.

## Experiments Ablation Li5

RMSNorm on keys is essential: removing it degrades both Full (1.743) and Block (1.750) AttnRes, since the normalization is what prevents large-magnitude layers from dominating the attention.

## Analysis H2

### Analysis

## Analysis Optimal H3

### Optimal Architecture

## Analysis Optimal P

A controlled architecture sweep under fixed compute (~6.5 × 10^19 FLOPs) explores how AttnRes reshapes the optimal depth-width trade-off. Both the baseline and AttnRes reach their optima at H/L_b ≈ 0.3, but AttnRes achieves lower loss in all 25 configurations tested, with improvements ranging from 0.019 to 0.063. AttnRes favors deeper, narrower models, suggesting it can exploit additional depth more effectively.

## Analysis Blocksize Figcaption

Figure 5: Block size (S) vs validation loss. Smaller blocks approach Full AttnRes performance, with S=4 capturing most of the gain.

## Analysis Heatmap Figcaption

Figure 6: Architecture sweep heatmaps under fixed compute. Left: Baseline. Right: AttnRes. AttnRes achieves lower loss across all configurations, especially for deeper models.

## Analysis Optimal P2

Notably, a lower d_model/L_b ratio corresponds to a deeper, narrower network. AttnRes's preference for depth aligns with its mechanism: deeper models generate more layer outputs for the attention to select from, increasing the expressiveness of depth-wise aggregation. However, deeper models generally incur higher inference latency.

## Analysis Patterns H3

### Learned AttnRes Patterns

## Analysis Patterns P

Visualization of the learned attention weights reveals how AttnRes distributes attention over previous sources. Each heatmap shows how the l-th attention or MLP layer (rows) allocates its attention over previous sources (columns), with pre-attention and pre-MLP layers shown separately.

## Analysis Patterns Figcaption

Figure 7: Learned AttnRes attention patterns for full (top) and block (bottom) variants, averaged over tokens. Pre-attention and pre-MLP layers are shown separately.

## Analysis Patterns Li1

Preserved locality: Each layer attends most strongly to its immediate predecessor, yet selective off-diagonal concentrations emerge, indicating learned skip connections beyond the standard residual path.

## Analysis Patterns Li2

Layer specialization: The embedding h_1 retains non-trivial weight throughout. Pre-MLP inputs show sharper diagonal reliance, while pre-attention inputs maintain broader receptive fields.

## Analysis Patterns Li3

Block AttnRes preserves structure: Diagonal dominance, embedding persistence, and layer specialization all transfer from the full to block variant, suggesting block-wise compression acts as implicit regularization.

## Discussions H2

### Discussions

## Discussions Duality H3

### Sequence-Depth Duality

## Discussions Duality P1

Residual connections propagate information over depth via a fixed recurrence, much as RNNs propagate over time. This duality extends to richer variants: data-dependent gates on the sequence side correspond to Highway networks on the depth side; the delta rule corresponds to DDL; MRLA mirrors gated linear attention. All these methods treat layers as if they were time steps, sharing the same algebraic structure.

## Discussions Duality P2

AttnRes completes this analogy by bringing full softmax attention to the depth dimension, just as Transformers brought it to the sequence dimension. Block AttnRes corresponds to block-sparse attention, trading some expressiveness for computational efficiency.

## Discussions Matrix Figcaption

Table 6: Depth mixing matrices for Full AttnRes (left, full lower-triangular) and Block AttnRes (right, block-structured). Background colors group entries that share the same source block.

## Discussions Matrices H3

### Residual Connections as Structured Matrices

## Discussions Matrices P1

All residual connection variants can be unified as a depth mixing matrix M ∈ R^(L×L), where M_{l→j} is the weight that layer l assigns to the output of layer j. Standard residuals produce an all-ones lower-triangular M. Highway networks yield rank-1 factors. AttnRes produces an input-dependent, dense lower-triangular M with softmax normalization.
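A small sketch of the two mixing matrices, with random logits standing in for the learned attention scores:

```python
import numpy as np

rng = np.random.default_rng(5)
L = 5

# Standard residuals: all-ones lower-triangular mixing matrix M.
M_std = np.tril(np.ones((L, L)))

# AttnRes: each row l is a softmax over sources j <= l, so rows sum to 1
# and the matrix is dense, lower-triangular, and (in the real model)
# input-dependent.
logits = rng.normal(size=(L, L))
M_attn = np.zeros((L, L))
for l in range(L):
    row = np.exp(logits[l, :l + 1] - logits[l, :l + 1].max())
    M_attn[l, :l + 1] = row / row.sum()
```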

## Discussions Matrices P2

This perspective reveals that existing residual variants are instances of linear attention over the depth axis. The unrolled (m)HC weight is mathematically equivalent to a gated linear attention transition. AttnRes goes further by using full softmax attention, which enables competitive normalization and sharper selection among source layers.

## Related Work H2

### Related Work

## Related Work P

AttnRes is unique among residual connection methods in providing dynamic, input-dependent weights with access to all preceding layer outputs. Previous methods either use fixed/static weights (standard residuals, DenseFormer) or only access the immediately preceding layer (Highway, ReZero). Multi-state recurrence methods like mHC expand the recurrence width but add significant I/O overhead (34d vs 5.5d).

## Related Work Figcaption

Table 5: Comprehensive comparison of residual update mechanisms. AttnRes provides dynamic weights with full cross-layer access, unique among all methods listed.

## Conclusion H2

### Conclusion

## Conclusion P

Inspired by the duality between sequence and depth, AttnRes replaces fixed, uniform residual accumulation with learned, input-dependent depth-wise attention. The method is validated through ablation studies, scaling law experiments, and integration into a production-scale 48B-parameter model pre-trained on 1.4T tokens. Block AttnRes emerges as the practical variant, achieving most of the gains with minimal overhead.

## Conclusion Summary H3

### Key Takeaways

## Conclusion Summary Li1

1.25&times; compute advantage in scaling law experiments: Block AttnRes matches the loss of a baseline trained with 25% more compute.

## Conclusion Summary Li2

All 15 benchmarks improved in Kimi Linear 48B, with standout gains on GPQA-Diamond (+7.5), Math (+3.6), and HumanEval (+3.1).

## Conclusion Summary Li3

Stabilized training dynamics: Uniform output magnitudes and gradient distribution across depth, eliminating the PreNorm dilution problem.

## Conclusion Summary Li4

Minimal overhead: Only 5.5d I/O per layer (vs 3d baseline), less than 4% wall-clock training overhead, drop-in replacement for standard residuals.

## References Summary

### References (selected)

## Hero Arxiv Button

### Read on arXiv ↗
