Replacing Fixed Accumulation with Learned, Input-Dependent Depth-wise Attention in LLMs
Standard residual connections in LLMs accumulate all layer outputs with fixed unit weights, causing uncontrolled growth and diluting each layer's contribution. Attention Residuals (AttnRes) replaces this fixed accumulation with softmax attention over preceding layer outputs, enabling each layer to selectively aggregate earlier representations with learned, input-dependent weights. Block AttnRes makes this practical at scale with minimal overhead.
Standard residual connections are the backbone of modern LLMs. The update rule h_l = h_{l-1} + f(h_{l-1}) provides a gradient highway that enables stable training. However, with PreNorm (the dominant paradigm), this fixed accumulation causes hidden-state magnitudes to grow as O(L) with depth, progressively diluting each layer's relative contribution.
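The dilution effect is easy to see in a toy simulation. Below is a minimal sketch (random vectors stand in for layer outputs; this is not the paper's code): with fixed unit-weight accumulation, the residual stream's magnitude keeps growing, so each new layer's output becomes a smaller and smaller fraction of the state it is added to.

```python
import numpy as np

# Toy demo of residual-stream growth under fixed unit-weight accumulation.
# Each "layer output" is a random vector with unit-variance entries; the
# PreNorm residual stream simply sums them, so its variance grows linearly
# with depth and each layer's relative contribution shrinks.
rng = np.random.default_rng(0)
d, L = 512, 48
h = rng.normal(size=d)          # token embedding
norms = []
for _ in range(L):
    f_out = rng.normal(size=d)  # stand-in for a layer's output f(h)
    h = h + f_out               # fixed unit-weight residual update
    norms.append(np.linalg.norm(h))

# The stream's norm grows with depth; a late layer's unit-scale output is
# added to a state several times larger than it was early in the network.
print(norms[0], norms[-1])
```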
AttnRes draws on a fundamental insight: depth-wise accumulation in residual networks is formally dual to sequential recurrence in RNNs. Just as Transformers improved upon RNNs by replacing fixed recurrence with attention over sequence positions, AttnRes replaces fixed depth-wise accumulation with attention over layer outputs.
Think of a deep neural network as a chain of processing steps (layers). In a residual connection, each layer's output is added to its input before passing to the next layer. It's like having a shortcut that lets information skip over a layer unchanged. This makes training much more stable, but the original design treats every layer's contribution equally — like averaging all employees' opinions with equal weight regardless of expertise. AttnRes instead lets the network learn which layers are most useful, like assigning weight to each employee's opinion based on relevance to the current question.
Fixed unit weights accumulate all layer outputs uniformly. Hidden states grow as O(L) with depth, diluting each layer's contribution. No mechanism to adapt mixing across depth.
h_l = h_{l-1} + f(h_{l-1})
Softmax attention over all preceding layer outputs. Input-dependent, learned weights via a pseudo-query. Optimal performance but O(Ld) memory.
α_{l→j} = softmax_j(φ(w_l, k_j))
Layers partitioned into blocks, attending over block-level representations. Reduces memory from O(L) to O(N). Practical drop-in replacement with minimal overhead.
O(N) memory, N << L
Standard residual connections are the de facto building block of modern LLMs. The update h_l = h_{l-1} + f_{l-1}(h_{l-1}) provides a gradient highway that lets gradients bypass transformations via identity mappings, enabling stable training at depth. Yet residuals also play a second, less-discussed role: they define how each layer's output is aggregated into a single, growing hidden state.
In practice, PreNorm has become the dominant paradigm, yet its unweighted accumulation causes hidden-state magnitudes to grow as O(L) with depth. This progressively dilutes each layer's relative contribution: early-layer information is buried and cannot be selectively retrieved. Empirically, the authors observe that the first and last layers often have outsized influence while middle layers contribute little.
The paper observes a formal duality between depth-wise accumulation and the sequential recurrence in RNNs. Building on this duality, they propose Attention Residuals (AttnRes), which replaces the fixed accumulation with h_l = Σ_j α_{l→j} · v_j, where the α_{l→j} are softmax attention weights computed from a single-head dot product between a learned per-layer query and the preceding layer outputs.
In standard training, Full AttnRes adds negligible overhead since the required layer outputs are already retained for backpropagation. At scale, however, activation recomputation and pipeline parallelism are routinely employed. Block AttnRes addresses this by partitioning layers into blocks, using cache-based P2P communication and a two-phase inference strategy.
Residual learning is critical for training deep networks. Each layer updates the hidden state as h_l = h_{l-1} + f_{l-1}(h_{l-1}). Expanding this recurrence, the hidden state at layer l equals the sum of the embedding and all preceding layer outputs: h_l = h_1 + Σ_{i=1}^{l-1} f_i(h_i). The identity mapping provides a direct gradient path from the loss to any layer.
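The unrolled form can be checked numerically. This sketch (with a tanh toy layer standing in for a Transformer block) runs the recurrence step by step and confirms it matches the closed-form sum of the embedding and all layer outputs:

```python
import numpy as np

# Sanity check of the unrolled residual recurrence (illustrative sketch).
# Running h_l = h_{l-1} + f_{l-1}(h_{l-1}) step by step should give the
# same state as the closed form h_l = h_1 + sum_i f_i(h_i).
rng = np.random.default_rng(1)
d, L = 8, 5
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(L)]
f = lambda i, x: np.tanh(W[i] @ x)   # stand-in layer transformation

h = rng.normal(size=d)               # h_1: the embedding
h1 = h.copy()
outputs = []
for i in range(L):
    out = f(i, h)
    outputs.append(out)
    h = h + out                      # recurrent form

unrolled = h1 + np.sum(outputs, axis=0)  # closed form
print(np.allclose(h, unrolled))          # True
```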
However, the fixed unit coefficients treat every layer's contribution uniformly. Highway networks relax this with learned element-wise gates, interpolating between transformation and identity. But both approaches share a fundamental constraint: each layer can only access its immediate input h_{l-1}, a single compressed state that conflates all earlier outputs.
This means there is (1) no selective retrieval of specific earlier-layer features, (2) no direct gradient pathway from deeper layers to individual earlier layers, and (3) a representational bottleneck, in which all prior computation is compressed into a single state vector.
These limitations mirror the well-known bottlenecks of RNNs in sequence modeling, where the fixed sequential recurrence was eventually replaced by attention. This parallel motivates the core proposal: replace fixed depth-wise accumulation with attention-based aggregation.
RNNs (Recurrent Neural Networks) process sequences one step at a time, compressing all previous information into a single hidden state. This bottleneck was a well-known limitation — old information gets "forgotten" as new information arrives. Transformers solved this for sequences by letting each position look back at all previous positions using attention.
The paper's key insight is that residual connections have exactly the same bottleneck, but across depth instead of time. Each layer can only see the compressed sum of all previous layers' outputs, just like an RNN can only see the compressed state. AttnRes applies the same fix: let each layer attend to all previous layers individually.
The key insight is a duality between time and depth. Like RNNs over time, residual connections compress all prior information into a single state over depth. For sequence modeling, the Transformer improved upon RNNs by replacing recurrence with attention, allowing each position to selectively access all previous positions. AttnRes applies the same principle to the depth dimension.
The general form replaces the fixed accumulation with h_l = Σ_j α_{l→j} · v_j, where the α_{l→j} are layer-specific attention weights satisfying Σ_j α_{l→j} = 1. Unlike sequence length (which can reach millions), network depth is typically modest (L < 1000), making O(L²) attention over depth computationally feasible.
The attention weights are computed as α_{l→j} = φ(q_l, k_j) / Σ_{j'} φ(q_l, k_{j'}) for a kernel function φ. The authors adopt φ(q, k) = exp(qᵀ RMSNorm(k)), i.e., softmax over the dot products. The query q_l = w_l is a layer-specific learnable pseudo-query (not input-dependent), which is a deliberate design choice enabling parallel computation.
The RMSNorm inside φ prevents layers with large-magnitude outputs from dominating the attention weights. For each token, Full AttnRes requires O(L²d) arithmetic and O(Ld) memory. Since depth is far smaller than sequence length, the cost is modest.
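A minimal numpy sketch of this aggregation step (shapes and the helper name `attnres_aggregate` are assumptions, not the paper's implementation): a learned pseudo-query scores the RMS-normalized outputs of all earlier sources, and a softmax over those scores mixes the raw outputs.

```python
import numpy as np

# Sketch of the Full AttnRes aggregation described above:
#   alpha_{l->j} = softmax_j( w_l^T RMSNorm(v_j) ),  h_l = sum_j alpha * v_j

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attnres_aggregate(w_l, values):
    # values: (j, d) outputs of the embedding and preceding layers
    scores = rms_norm(values) @ w_l              # dot-product kernel, (j,)
    scores -= scores.max()                       # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ values, alpha

rng = np.random.default_rng(2)
d = 16
values = rng.normal(size=(5, d))                 # 5 earlier sources
values[3] *= 100.0                               # one huge-magnitude layer
w_l = rng.normal(size=d) * 0.1                   # learned pseudo-query
h_l, alpha = attnres_aggregate(w_l, values)
# Because keys are RMS-normalized inside the kernel, the large-magnitude
# layer does not automatically dominate the attention weights.
print(alpha.round(3), alpha.sum())
```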
Zero overhead in standard training: The O(Ld) memory overlaps entirely with activations already retained for backpropagation. The pseudo-query independence also means attention weights for any group of layers can be computed in parallel without waiting for sequential layer execution.
In standard attention (like in a Transformer), the query comes from the current input data. In Full AttnRes, the query wl is a learned parameter — a fixed vector that the model learns during training, not derived from the input. This is a deliberate choice: it means attention weights for different layers can be computed in parallel, since they don't depend on each other's results. The trade-off is slightly less expressiveness (the query doesn't adapt to the specific input), but the ablation study shows this cost is small.
Block AttnRes partitions the L layers into N blocks of S = L/N layers each. Within each block, layer outputs are reduced to a single representation via summation. Across blocks, full attention is applied over only N block-level representations plus the token embedding. This reduces memory from O(L) to O(N) and computation from O(L²) to O(N²).
The block count N interpolates between two extremes: N = L recovers Full AttnRes, while N = 1 reduces to standard residual connections. In practice, S = 4 (i.e., 4 layers per block) captures most of the benefit while keeping overhead minimal.
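The block partitioning can be sketched in a few lines (an illustrative sketch; the reduction-by-summation follows the description above, everything else is assumed):

```python
import numpy as np

# Block AttnRes sketch: L layer outputs are partitioned into blocks of
# S layers, each block is reduced to one representation by summation, and
# depth-wise attention then runs over only the N block representations
# plus the token embedding.

def block_reduce(layer_outputs, S):
    # layer_outputs: (L, d) -> (L // S, d) block representations
    L, d = layer_outputs.shape
    return layer_outputs.reshape(L // S, S, d).sum(axis=1)

rng = np.random.default_rng(3)
L, S, d = 12, 4, 8
layer_outputs = rng.normal(size=(L, d))
embedding = rng.normal(size=(1, d))

blocks = block_reduce(layer_outputs, S)          # (3, d): N = 3 blocks
sources = np.concatenate([embedding, blocks])    # attention sources
print(sources.shape)                             # (4, 8): N + 1, not L + 1
```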
The two-phase computation strategy enables efficient inference: Phase 1 computes inter-block attention for all S layers simultaneously via a batched query against cached block representations. Phase 2 computes intra-block attention sequentially, then merges with Phase 1 results through online softmax. This amortizes memory access costs across the block.
Imagine you manage a company with 54 employees (layers) organized into 9 departments (blocks), and every employee needs a briefing that draws on the other departments. Phase 1 compiles one summary per department and hands the same packet to every employee in a department at once; Phase 2 lets each employee additionally consult the colleagues in their own department. The beauty is that Phase 1's cost is amortized across all layers in a block, so each individual layer pays only a fraction of the inter-block attention cost.
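The merge step in Phase 2 uses standard online-softmax algebra. The sketch below (generic online softmax; the phase split and names are assumptions) shows that combining two partial attention results via their running max, normalizer, and weighted sum reproduces full softmax attention over all sources:

```python
import numpy as np

# Online-softmax merge: two partial attention passes are combined into the
# exact full-softmax result, so Phase 1 (inter-block) and Phase 2
# (intra-block) outputs can be merged without a second full pass.

def partial(scores, values):
    m = scores.max()
    e = np.exp(scores - m)
    return m, e.sum(), e @ values           # running max, denom, numerator

def merge(m1, s1, o1, m2, s2, o2):
    m = max(m1, m2)
    s = s1 * np.exp(m1 - m) + s2 * np.exp(m2 - m)
    o = o1 * np.exp(m1 - m) + o2 * np.exp(m2 - m)
    return o / s                            # merged attention output

rng = np.random.default_rng(4)
d = 6
scores = rng.normal(size=10)
values = rng.normal(size=(10, d))

# Full softmax attention in one pass...
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ values
# ...equals the two partial passes merged with online softmax.
out = merge(*partial(scores[:7], values[:7]), *partial(scores[7:], values[7:]))
print(np.allclose(ref, out))                # True
```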
For small-scale training, AttnRes adds negligible computation overhead and no extra memory usage. Under large-scale distributed training, pipeline parallelism poses the primary infrastructure challenge: Full AttnRes requires every pipeline stage to access all preceding stages' layer outputs, which are not locally available under pipeline parallelism.
Cross-stage caching solves this: since each physical stage processes multiple virtual stages in succession, blocks received during earlier virtual stages are cached locally and need not be re-transmitted. This reduces peak per-transition cost from O(C) to O(P), a V× improvement that enables full overlap with computation. The measured end-to-end overhead is less than 4%.
Pipeline parallelism splits a model across multiple GPUs so that each GPU handles a subset of layers. Data flows through them like a factory assembly line. The challenge for AttnRes is that each "station" (GPU) needs to know the outputs from previous stations, which requires extra communication. Cross-stage caching reduces this by remembering what was already sent, so only new information needs to be transmitted.
Memory overhead is negligible: with cross-stage caching, each block is stored exactly once across all virtual stages, which is tiny relative to the standard per-layer activation cache.
The two-phase computation strategy applies to both Full and Block AttnRes. A naive implementation would compute attention at every layer, requiring a full pass over all block representations each time. Instead, Phase 1 batches all S queries within a block into a single pass, then Phase 2 handles the sequential intra-block lookback with online softmax merge.
With this design, the total per-layer I/O cost for Block AttnRes is only 5.5d (reads + writes), compared to 3d for standard residuals and a massive 34d for mHC. Phase 1 can also partially overlap with computation, further hiding its cost.
In modern GPUs, the bottleneck is often not computation but memory bandwidth — how fast data can be read from and written to memory. "I/O cost" measures the total amount of data each layer needs to read and write. Block AttnRes achieves 5.5d per layer (where d is the model dimension, typically ~1024-4096), which is remarkably close to the 3d baseline cost and far better than mHC's 34d. For a model with d = 4096, this means each layer moves roughly 22.5K values per token, vs. 12.3K (baseline) and 139K (mHC); the actual byte counts depend on the datatype.
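The per-layer traffic figures can be reproduced with back-of-envelope arithmetic (counting values moved per token; multiply by the element size, e.g. 2 bytes for bf16, to get bytes):

```python
# Back-of-envelope check of the per-layer I/O costs quoted above.
d = 4096
costs = {"standard residual": 3.0, "Block AttnRes": 5.5, "mHC": 34.0}

# Values moved per layer per token (reads + writes), in elements.
values_moved = {name: c * d for name, c in costs.items()}
print(values_moved)

# In bf16 (2 bytes per element), the same traffic in bytes:
bytes_bf16 = {name: int(v * 2) for name, v in values_moved.items()}
print(bytes_bf16)
```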
Five model sizes (194M to 528M activated parameters) were trained with three variants each: PreNorm baseline, Full AttnRes, and Block AttnRes with ~8 blocks. All variants share identical hyperparameters and data within each size group, isolating the effect of the residual mechanism.
The fitted scaling curves show: the baseline follows L = 1.891 × C^(-0.057), Block AttnRes fits L = 1.870 × C^(-0.058), and Full AttnRes fits L = 1.865 × C^(-0.057). All three exhibit similar slopes, but AttnRes consistently achieves lower loss. At the largest scale, the gap between Full and Block AttnRes narrows to just 0.001.
A scaling law describes the predictable relationship between how much compute you spend training a model and how good the model gets. The equation L = a × C^b says the loss decreases as a power of compute. When AttnRes achieves a lower coefficient 'a', it means the model starts better at every compute level — equivalent to getting 25% more value from the same GPU budget.
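That compute-equivalent reading can be derived from the fitted constants themselves (a rough worked example, not the paper's analysis code): if two curves share the same exponent b, the compute multiplier the baseline needs to match the better curve's loss is (a_base / a_attn)^(1/|b|).

```python
# Worked example: compute-equivalent gain implied by the fitted curves
# L = a * C**(-b) with a shared exponent. Setting the losses equal:
#   a_base * C_base**(-b) = a_attn * C**(-b)
#   => C_base / C = (a_base / a_attn)**(1 / b)
a_base, a_attn, b = 1.891, 1.865, 0.057
compute_multiplier = (a_base / a_attn) ** (1.0 / b)
print(round(compute_multiplier, 3))   # roughly 1.25-1.3x the compute
```

The small gap in 'a' compounds through the 1/b exponent, which is why a ~1.4% coefficient improvement translates into a double-digit-percent compute saving.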
The full Kimi Linear 48B configuration uses 27 Transformer blocks (54 layers) with MoE, yielding 48B total and 3B activated parameters. Block AttnRes is applied with 6 layers per block, producing 9 blocks. The model is pre-trained on 1.4T tokens with a 4096-token context window.
Training dynamics analysis reveals three key benefits: (1) Lower validation loss throughout training, with the gap widening during the decay phase. (2) Uniform output magnitudes across depth, eliminating the PreNorm dilution where deeper layers must learn increasingly large outputs. (3) Stabilized gradient distribution, preventing disproportionately large gradients in early layers.
The training-dynamics curves reveal a fundamental problem with standard residuals called PreNorm dilution:
Think of it like a classroom where, with standard residuals, the front-row students shout louder and louder while back-row students whisper. AttnRes gives everyone a microphone calibrated to the same volume.
Block AttnRes matches or outperforms the baseline on all 15 benchmarks. Improvements are particularly pronounced on multi-step reasoning tasks such as GPQA-Diamond (+7.5) and Math (+3.6), as well as code generation like HumanEval (+3.1). Knowledge-oriented benchmarks like MMLU and HellaSwag also show modest gains.
Ablation studies on the 436M model validate key design choices. All variants share identical hyperparameters and compute budget, isolating each component's contribution.
A controlled architecture sweep under fixed compute (~6.5 × 10^19 FLOPs) explores how AttnRes reshapes the optimal depth-width trade-off. Both Baseline and AttnRes reach their optima at d_model/L_b ≈ 0.3, but AttnRes achieves lower loss in all 25 configurations tested, with improvements ranging from 0.019 to 0.063. AttnRes favors deeper, narrower models, suggesting it can exploit additional depth more effectively.
Notably, a lower d_model/L_b ratio corresponds to a deeper, narrower network. AttnRes's preference for depth aligns with its mechanism: deeper models generate more layer outputs for the attention to select from, increasing the expressiveness of depth-wise aggregation. However, deeper models generally incur higher inference latency.
Visualization of the learned attention weights reveals how AttnRes distributes attention over previous sources. Each heatmap shows how the l-th attention or MLP layer (rows) allocates its attention over previous sources (columns), with pre-attention and pre-MLP layers shown separately.
Residual connections propagate information over depth via a fixed recurrence, much as RNNs propagate over time. This duality extends to richer variants: data-dependent gates on the sequence side correspond to Highway networks on the depth side; the delta rule corresponds to DDL; MRLA mirrors gated linear attention. All these methods treat layers as if they were time steps, sharing the same algebraic structure.
AttnRes completes this analogy by bringing full softmax attention to the depth dimension, just as Transformers brought it to the sequence dimension. Block AttnRes corresponds to block-sparse attention, trading some expressiveness for computational efficiency.
This section reveals an elegant theoretical insight: every technique invented for processing sequences has a direct counterpart for processing depth. The mapping works like this:
This duality is not just a metaphor — the mathematical forms are identical. It suggests that future improvements in sequence modeling could be directly ported to the depth dimension.
All residual connection variants can be unified as a depth mixing matrix M ∈ R^{L×L}, where M_{l→j} is the weight that layer l assigns to the output of layer j. Standard residuals produce an all-ones lower-triangular M. Highway networks yield rank-1 factors. AttnRes produces an input-dependent, dense lower-triangular M with softmax normalization.
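The standard-residual case of this mixing-matrix view is easy to verify (an illustrative sketch): stacking the embedding and layer outputs as rows of V, an all-ones lower-triangular M applied to V reproduces the step-by-step residual recurrence.

```python
import numpy as np

# Depth mixing matrix view of standard residuals: H = M @ V, where M is
# all-ones lower-triangular, V stacks the embedding (row 0) and layer
# outputs (rows 1..), and H[l] is the hidden state at depth l.
rng = np.random.default_rng(5)
L, d = 6, 4
V = rng.normal(size=(L, d))
M = np.tril(np.ones((L, L)))       # fixed unit-weight accumulation

H = M @ V                          # H[l] = sum of rows 0..l of V

# Matches the step-by-step residual recurrence h_l = h_{l-1} + v_l.
h = V[0].copy()
for l in range(1, L):
    h = h + V[l]
print(np.allclose(H[-1], h))       # True
```

AttnRes simply replaces this fixed M with learned, softmax-normalized, input-dependent rows.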
This perspective reveals that existing residual variants are instances of linear attention over the depth axis. The unrolled (m)HC weight is mathematically equivalent to a gated linear attention transition. AttnRes goes further by using full softmax attention, which enables competitive normalization and sharper selection among source layers.
Inspired by the duality between sequence and depth, AttnRes replaces fixed, uniform residual accumulation with learned, input-dependent depth-wise attention. The method is validated through ablation studies, scaling law experiments, and integration into a production-scale 48B-parameter model pre-trained on 1.4T tokens. Block AttnRes emerges as the practical variant, achieving most of the gains with minimal overhead.