Replacing Fixed Accumulation with Learned, Input-Dependent Depth-wise Attention in LLMs
Standard residual connections in LLMs accumulate all layer outputs with fixed unit weights, causing uncontrolled growth and diluting each layer's contribution. Attention Residuals (AttnRes) replaces this fixed accumulation with softmax attention over preceding layer outputs, enabling each layer to selectively aggregate earlier representations with learned, input-dependent weights. Block AttnRes makes this practical at scale with minimal overhead.
Standard residual connections are the backbone of modern LLMs. The update rule h_l = h_{l-1} + f(h_{l-1}) provides a gradient highway that enables stable training. However, with PreNorm (the dominant paradigm), this fixed accumulation causes hidden-state magnitudes to grow as O(L) with depth, progressively diluting each layer's relative contribution.
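The dilution effect is easy to see in a toy simulation. Below is a minimal sketch (random vectors stand in for layer outputs; this is not the paper's code): with fixed unit-weight accumulation, the residual stream's magnitude keeps growing, so each new layer's output becomes a smaller and smaller fraction of the state it is added to.

```python
import numpy as np

# Toy demo of residual-stream growth under fixed unit-weight accumulation.
# Each "layer output" is a random vector with unit-variance entries; the
# PreNorm residual stream simply sums them, so its variance grows linearly
# with depth and each layer's relative contribution shrinks.
rng = np.random.default_rng(0)
d, L = 512, 48
h = rng.normal(size=d)          # token embedding
norms = []
for _ in range(L):
    f_out = rng.normal(size=d)  # stand-in for a layer's output f(h)
    h = h + f_out               # fixed unit-weight residual update
    norms.append(np.linalg.norm(h))

# The stream's norm grows with depth; a late layer's unit-scale output is
# added to a state several times larger than it was early in the network.
print(norms[0], norms[-1])
```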
AttnRes draws on a fundamental insight: depth-wise accumulation in residual networks is formally dual to sequential recurrence in RNNs. Just as Transformers improved upon RNNs by replacing fixed recurrence with attention over sequence positions, AttnRes replaces fixed depth-wise accumulation with attention over layer outputs.
Think of a deep neural network as a chain of processing steps (layers). In a residual connection, each layer's output is added to its input before passing to the next layer. It's like having a shortcut that lets information skip over a layer unchanged. This makes training much more stable, but the original design treats every layer's contribution equally — like averaging all employees' opinions with equal weight regardless of expertise. AttnRes instead lets the network learn which layers are most useful, like assigning weight to each employee's opinion based on relevance to the current question.
Fixed unit weights accumulate all layer outputs uniformly. Hidden states grow as O(L) with depth, diluting each layer's contribution. No mechanism to adapt mixing across depth.
h_l = h_{l-1} + f(h_{l-1})
Softmax attention over all preceding layer outputs. Input-dependent, learned weights via a pseudo-query. Optimal performance but O(Ld) memory.
α_{l→j} = softmax_j(φ(w_l, k_j))
Layers partitioned into blocks, attending over block-level representations. Reduces memory from O(L) to O(N). Practical drop-in replacement with minimal overhead.
O(N) memory, N << L
Standard residual connections are the de facto building block of modern LLMs. The update h_l = h_{l-1} + f_{l-1}(h_{l-1}) provides a gradient highway that lets gradients bypass transformations via identity mappings, enabling stable training at depth. Yet residuals also play a second, less-discussed role: they define how each layer's output is aggregated into a single, growing hidden state.
In practice, PreNorm has become the dominant paradigm, yet its unweighted accumulation causes hidden-state magnitudes to grow as O(L) with depth. This progressively dilutes each layer's relative contribution: early-layer information is buried and cannot be selectively retrieved. Empirically, the authors observe that the first and last layers often have outsized influence while middle layers contribute little.
The paper observes a formal duality between depth-wise accumulation and the sequential recurrence in RNNs. Building on this duality, they propose Attention Residuals (AttnRes), which replaces the fixed accumulation with h_l = Σ_j α_{l→j} · v_j, where the α_{l→j} are softmax attention weights computed from a single-head dot product between a learned per-layer query and the preceding layer outputs.
In standard training, Full AttnRes adds negligible overhead since the required layer outputs are already retained for backpropagation. At scale, however, activation recomputation and pipeline parallelism are routinely employed. Block AttnRes addresses this by partitioning layers into blocks, using cache-based P2P communication and a two-phase inference strategy.
Residual learning is critical for training deep networks. Each layer updates the hidden state as h_l = h_{l-1} + f_{l-1}(h_{l-1}). Expanding this recurrence, the hidden state at layer l equals the sum of the embedding and all preceding layer outputs: h_l = h_1 + Σ_{i=1}^{l-1} f_i(h_i). The identity mapping provides a direct gradient path from the loss to any layer.
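The unrolled form can be checked numerically. This sketch (with a tanh toy layer standing in for a Transformer block) runs the recurrence step by step and confirms it matches the closed-form sum of the embedding and all layer outputs:

```python
import numpy as np

# Sanity check of the unrolled residual recurrence (illustrative sketch).
# Running h_l = h_{l-1} + f_{l-1}(h_{l-1}) step by step should give the
# same state as the closed form h_l = h_1 + sum_i f_i(h_i).
rng = np.random.default_rng(1)
d, L = 8, 5
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(L)]
f = lambda i, x: np.tanh(W[i] @ x)   # stand-in layer transformation

h = rng.normal(size=d)               # h_1: the embedding
h1 = h.copy()
outputs = []
for i in range(L):
    out = f(i, h)
    outputs.append(out)
    h = h + out                      # recurrent form

unrolled = h1 + np.sum(outputs, axis=0)  # closed form
print(np.allclose(h, unrolled))          # True
```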
However, the fixed unit coefficients treat every layer's contribution uniformly. Highway networks relax this with learned element-wise gates, interpolating between transformation and identity. But both approaches share a fundamental constraint: each layer can only access its immediate input h_{l-1}, a single compressed state that conflates all earlier outputs.
This means there is (1) no selective retrieval of specific earlier-layer features, (2) no direct gradient pathway from deeper layers to individual earlier layers, and (3) a representational bottleneck, in which all prior computation is compressed into a single state vector.
These limitations mirror the well-known bottlenecks of RNNs in sequence modeling, where the fixed sequential recurrence was eventually replaced by attention. This parallel motivates the core proposal: replace fixed depth-wise accumulation with attention-based aggregation.
RNNs (Recurrent Neural Networks) process sequences one step at a time, compressing all previous information into a single hidden state. This bottleneck was a well-known limitation — old information gets "forgotten" as new information arrives. Transformers solved this for sequences by letting each position look back at all previous positions using attention.
The paper's key insight is that residual connections have exactly the same bottleneck, but across depth instead of time. Each layer can only see the compressed sum of all previous layers' outputs, just like an RNN can only see the compressed state. AttnRes applies the same fix: let each layer attend to all previous layers individually.
The key insight is a duality between time and depth. Like RNNs over time, residual connections compress all prior information into a single state over depth. For sequence modeling, the Transformer improved upon RNNs by replacing recurrence with attention, allowing each position to selectively access all previous positions. AttnRes applies the same principle to the depth dimension.
The general form replaces the fixed accumulation with h_l = Σ_j α_{l→j} · v_j, where the α_{l→j} are layer-specific attention weights satisfying Σ_j α_{l→j} = 1. Unlike sequence length (which can reach millions), network depth is typically modest (L < 1000), making O(L²) attention over depth computationally feasible.
The attention weights are computed as α_{l→j} = φ(q_l, k_j) / Σ_{j'} φ(q_l, k_{j'}) for a kernel function φ. The authors adopt φ(q, k) = exp(qᵀ RMSNorm(k)), i.e., softmax over the dot products. The query q_l = w_l is a layer-specific learnable pseudo-query (not input-dependent), which is a deliberate design choice enabling parallel computation.
The RMSNorm inside φ prevents layers with large-magnitude outputs from dominating the attention weights. For each token, Full AttnRes requires O(L²d) arithmetic and O(Ld) memory. Since depth is far smaller than sequence length, the cost is modest.
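A minimal numpy sketch of this aggregation step (shapes and the helper name `attnres_aggregate` are assumptions, not the paper's implementation): a learned pseudo-query scores the RMS-normalized outputs of all earlier sources, and a softmax over those scores mixes the raw outputs.

```python
import numpy as np

# Sketch of the Full AttnRes aggregation described above:
#   alpha_{l->j} = softmax_j( w_l^T RMSNorm(v_j) ),  h_l = sum_j alpha * v_j

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attnres_aggregate(w_l, values):
    # values: (j, d) outputs of the embedding and preceding layers
    scores = rms_norm(values) @ w_l              # dot-product kernel, (j,)
    scores -= scores.max()                       # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ values, alpha

rng = np.random.default_rng(2)
d = 16
values = rng.normal(size=(5, d))                 # 5 earlier sources
values[3] *= 100.0                               # one huge-magnitude layer
w_l = rng.normal(size=d) * 0.1                   # learned pseudo-query
h_l, alpha = attnres_aggregate(w_l, values)
# Because keys are RMS-normalized inside the kernel, the large-magnitude
# layer does not automatically dominate the attention weights.
print(alpha.round(3), alpha.sum())
```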
Zero overhead in standard training: The O(Ld) memory overlaps entirely with activations already retained for backpropagation. The pseudo-query independence also means attention weights for any group of layers can be computed in parallel without waiting for sequential layer execution.
In standard attention (like in a Transformer), the query comes from the current input data. In Full AttnRes, the query wl is a learned parameter — a fixed vector that the model learns during training, not derived from the input. This is a deliberate choice: it means attention weights for different layers can be computed in parallel, since they don't depend on each other's results. The trade-off is slightly less expressiveness (the query doesn't adapt to the specific input), but the ablation study shows this cost is small.
Block AttnRes partitions the L layers into N blocks of S = L/N layers each. Within each block, layer outputs are reduced to a single representation via summation. Across blocks, full attention is applied over only N block-level representations plus the token embedding. This reduces memory from O(L) to O(N) and computation from O(L²) to O(N²).
The block count N interpolates between two extremes: N = L recovers Full AttnRes, while N = 1 reduces to standard residual connections. In practice, S = 4 (i.e., 4 layers per block) captures most of the benefit while keeping overhead minimal.
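The block partitioning can be sketched in a few lines (an illustrative sketch; the reduction-by-summation follows the description above, everything else is assumed):

```python
import numpy as np

# Block AttnRes sketch: L layer outputs are partitioned into blocks of
# S layers, each block is reduced to one representation by summation, and
# depth-wise attention then runs over only the N block representations
# plus the token embedding.

def block_reduce(layer_outputs, S):
    # layer_outputs: (L, d) -> (L // S, d) block representations
    L, d = layer_outputs.shape
    return layer_outputs.reshape(L // S, S, d).sum(axis=1)

rng = np.random.default_rng(3)
L, S, d = 12, 4, 8
layer_outputs = rng.normal(size=(L, d))
embedding = rng.normal(size=(1, d))

blocks = block_reduce(layer_outputs, S)          # (3, d): N = 3 blocks
sources = np.concatenate([embedding, blocks])    # attention sources
print(sources.shape)                             # (4, 8): N + 1, not L + 1
```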
The two-phase computation strategy enables efficient inference: Phase 1 computes inter-block attention for all S layers simultaneously via a batched query against cached block representations. Phase 2 computes intra-block attention sequentially, then merges with Phase 1 results through online softmax. This amortizes memory access costs across the block.
Imagine you manage a company with 54 employees (layers) organized into 9 departments (blocks), and every employee needs a briefing that draws on the other departments. Phase 1 compiles one summary per department and hands the same packet to every employee in a department at once; Phase 2 lets each employee additionally consult the colleagues in their own department. The beauty is that Phase 1's cost is amortized across all layers in a block, so each individual layer pays only a fraction of the inter-block attention cost.
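The merge step in Phase 2 uses standard online-softmax algebra. The sketch below (generic online softmax; the phase split and names are assumptions) shows that combining two partial attention results via their running max, normalizer, and weighted sum reproduces full softmax attention over all sources:

```python
import numpy as np

# Online-softmax merge: two partial attention passes are combined into the
# exact full-softmax result, so Phase 1 (inter-block) and Phase 2
# (intra-block) outputs can be merged without a second full pass.

def partial(scores, values):
    m = scores.max()
    e = np.exp(scores - m)
    return m, e.sum(), e @ values           # running max, denom, numerator

def merge(m1, s1, o1, m2, s2, o2):
    m = max(m1, m2)
    s = s1 * np.exp(m1 - m) + s2 * np.exp(m2 - m)
    o = o1 * np.exp(m1 - m) + o2 * np.exp(m2 - m)
    return o / s                            # merged attention output

rng = np.random.default_rng(4)
d = 6
scores = rng.normal(size=10)
values = rng.normal(size=(10, d))

# Full softmax attention in one pass...
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ values
# ...equals the two partial passes merged with online softmax.
out = merge(*partial(scores[:7], values[:7]), *partial(scores[7:], values[7:]))
print(np.allclose(ref, out))                # True
```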
For small-scale training, AttnRes adds negligible computation overhead and no extra memory usage. Under large-scale distributed training, pipeline parallelism poses the primary infrastructure challenge: Full AttnRes requires every pipeline stage to access all preceding stages' layer outputs, which are not locally available under pipeline parallelism.
Cross-stage caching solves this: since each physical stage processes multiple virtual stages in succession, blocks received during earlier virtual stages are cached locally and need not be re-transmitted. This reduces peak per-transition cost from O(C) to O(P), a V× improvement that enables full overlap with computation. The measured end-to-end overhead is less than 4%.
Pipeline parallelism splits a model across multiple GPUs so that each GPU handles a subset of layers. Data flows through them like a factory assembly line. The challenge for AttnRes is that each "station" (GPU) needs to know the outputs from previous stations, which requires extra communication. Cross-stage caching reduces this by remembering what was already sent, so only new information needs to be transmitted.
Memory overhead is negligible: with cross-stage caching, each block is stored exactly once across all virtual stages, which is tiny relative to the standard per-layer activation cache.
The two-phase computation strategy applies to both Full and Block AttnRes. A naive implementation would compute attention at every layer, requiring a full pass over all block representations each time. Instead, Phase 1 batches all S queries within a block into a single pass, then Phase 2 handles the sequential intra-block lookback with online softmax merge.
With this design, the total per-layer I/O cost for Block AttnRes is only 5.5d (reads + writes), compared to 3d for standard residuals and a massive 34d for mHC. Phase 1 can also partially overlap with computation, further hiding its cost.
In modern GPUs, the bottleneck is often not computation but memory bandwidth — how fast data can be read from and written to memory. "I/O cost" measures the total amount of data each layer needs to read and write. Block AttnRes achieves 5.5d per layer (where d is the model dimension, typically ~1024-4096), which is remarkably close to the 3d baseline cost and far better than mHC's 34d. For a model with d = 4096, this means each layer moves roughly 22.5K values per token, vs. 12.3K (baseline) and 139K (mHC); the actual byte counts depend on the datatype.
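The per-layer traffic figures can be reproduced with back-of-envelope arithmetic (counting values moved per token; multiply by the element size, e.g. 2 bytes for bf16, to get bytes):

```python
# Back-of-envelope check of the per-layer I/O costs quoted above.
d = 4096
costs = {"standard residual": 3.0, "Block AttnRes": 5.5, "mHC": 34.0}

# Values moved per layer per token (reads + writes), in elements.
values_moved = {name: c * d for name, c in costs.items()}
print(values_moved)

# In bf16 (2 bytes per element), the same traffic in bytes:
bytes_bf16 = {name: int(v * 2) for name, v in values_moved.items()}
print(bytes_bf16)
```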
Five model sizes (194M to 528M activated parameters) were trained with three variants each: PreNorm baseline, Full AttnRes, and Block AttnRes with ~8 blocks. All variants share identical hyperparameters and data within each size group, isolating the effect of the residual mechanism.
The fitted scaling curves show: the baseline follows L = 1.891 × C^(-0.057), Block AttnRes fits L = 1.870 × C^(-0.058), and Full AttnRes fits L = 1.865 × C^(-0.057). All three exhibit similar slopes, but AttnRes consistently achieves lower loss. At the largest scale, the gap between Full and Block AttnRes narrows to just 0.001.
A scaling law describes the predictable relationship between how much compute you spend training a model and how good the model gets. The equation L = a × C^b says the loss decreases as a power of compute. When AttnRes achieves a lower coefficient 'a', it means the model starts better at every compute level — equivalent to getting 25% more value from the same GPU budget.
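That compute-equivalent reading can be derived from the fitted constants themselves (a rough worked example, not the paper's analysis code): if two curves share the same exponent b, the compute multiplier the baseline needs to match the better curve's loss is (a_base / a_attn)^(1/|b|).

```python
# Worked example: compute-equivalent gain implied by the fitted curves
# L = a * C**(-b) with a shared exponent. Setting the losses equal:
#   a_base * C_base**(-b) = a_attn * C**(-b)
#   => C_base / C = (a_base / a_attn)**(1 / b)
a_base, a_attn, b = 1.891, 1.865, 0.057
compute_multiplier = (a_base / a_attn) ** (1.0 / b)
print(round(compute_multiplier, 3))   # roughly 1.25-1.3x the compute
```

The small gap in 'a' compounds through the 1/b exponent, which is why a ~1.4% coefficient improvement translates into a double-digit-percent compute saving.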
The full Kimi Linear 48B configuration uses 27 Transformer blocks (54 layers) with MoE, yielding 48B total and 3B activated parameters. Block AttnRes is applied with 6 layers per block, producing 9 blocks. The model is pre-trained on 1.4T tokens with a 4096-token context window.
Training dynamics analysis reveals three key benefits: (1) Lower validation loss throughout training, with the gap widening during the decay phase. (2) Uniform output magnitudes across depth, eliminating the PreNorm dilution where deeper layers must learn increasingly large outputs. (3) Stabilized gradient distribution, preventing disproportionately large gradients in early layers.
The training-dynamics curves reveal a fundamental problem with standard residuals called PreNorm dilution:
Think of it like a classroom where, with standard residuals, the front-row students shout louder and louder while back-row students whisper. AttnRes gives everyone a microphone calibrated to the same volume.
Block AttnRes matches or outperforms the baseline on all 15 benchmarks. Improvements are particularly pronounced on multi-step reasoning tasks such as GPQA-Diamond (+7.5) and Math (+3.6), as well as code generation like HumanEval (+3.1). Knowledge-oriented benchmarks like MMLU and HellaSwag also show modest gains.
Ablation studies on the 436M model validate key design choices. All variants share identical hyperparameters and compute budget, isolating each component's contribution.
A controlled architecture sweep under fixed compute (~6.5 × 10^19 FLOPs) explores how AttnRes reshapes the optimal depth-width trade-off. Both Baseline and AttnRes reach their optima at d_model/L_b ≈ 0.3, but AttnRes achieves lower loss in all 25 configurations tested, with improvements ranging from 0.019 to 0.063. AttnRes favors deeper, narrower models, suggesting it can exploit additional depth more effectively.
Notably, a lower d_model/L_b ratio corresponds to a deeper, narrower network. AttnRes's preference for depth aligns with its mechanism: deeper models generate more layer outputs for the attention to select from, increasing the expressiveness of depth-wise aggregation. However, deeper models generally incur higher inference latency.
Visualization of the learned attention weights reveals how AttnRes distributes attention over previous sources. Each heatmap shows how the l-th attention or MLP layer (rows) allocates its attention over previous sources (columns), with pre-attention and pre-MLP layers shown separately.
Residual connections propagate information over depth via a fixed recurrence, much as RNNs propagate over time. This duality extends to richer variants: data-dependent gates on the sequence side correspond to Highway networks on the depth side; the delta rule corresponds to DDL; MRLA mirrors gated linear attention. All these methods treat layers as if they were time steps, sharing the same algebraic structure.
AttnRes completes this analogy by bringing full softmax attention to the depth dimension, just as Transformers brought it to the sequence dimension. Block AttnRes corresponds to block-sparse attention, trading some expressiveness for computational efficiency.
This section reveals an elegant theoretical insight: every technique invented for processing sequences has a direct counterpart for processing depth. The mapping works like this:
This duality is not just a metaphor — the mathematical forms are identical. It suggests that future improvements in sequence modeling could be directly ported to the depth dimension.
All residual connection variants can be unified as a depth mixing matrix M ∈ R^{L×L}, where M_{l→j} is the weight that layer l assigns to the output of layer j. Standard residuals produce an all-ones lower-triangular M. Highway networks yield rank-1 factors. AttnRes produces an input-dependent, dense lower-triangular M with softmax normalization.
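The standard-residual case of this mixing-matrix view is easy to verify (an illustrative sketch): stacking the embedding and layer outputs as rows of V, an all-ones lower-triangular M applied to V reproduces the step-by-step residual recurrence.

```python
import numpy as np

# Depth mixing matrix view of standard residuals: H = M @ V, where M is
# all-ones lower-triangular, V stacks the embedding (row 0) and layer
# outputs (rows 1..), and H[l] is the hidden state at depth l.
rng = np.random.default_rng(5)
L, d = 6, 4
V = rng.normal(size=(L, d))
M = np.tril(np.ones((L, L)))       # fixed unit-weight accumulation

H = M @ V                          # H[l] = sum of rows 0..l of V

# Matches the step-by-step residual recurrence h_l = h_{l-1} + v_l.
h = V[0].copy()
for l in range(1, L):
    h = h + V[l]
print(np.allclose(H[-1], h))       # True
```

AttnRes simply replaces this fixed M with learned, softmax-normalized, input-dependent rows.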
This perspective reveals that existing residual variants are instances of linear attention over the depth axis. The unrolled (m)HC weight is mathematically equivalent to a gated linear attention transition. AttnRes goes further by using full softmax attention, which enables competitive normalization and sharper selection among source layers.
Inspired by the duality between sequence and depth, AttnRes replaces fixed, uniform residual accumulation with learned, input-dependent depth-wise attention. The method is validated through ablation studies, scaling law experiments, and integration into a production-scale 48B-parameter model pre-trained on 1.4T tokens. Block AttnRes emerges as the practical variant, achieving most of the gains with minimal overhead.