Adam's Law: Textual Frequency Law on LLMs

Key Results at a Glance

High-frequency prompting improves math reasoning accuracy across all tested models (GSM8K benchmark).

DeepSeek-V3

Math Reasoning (GSM8K)

63.55% → 71.54%

+7.99 pts

High-frequency rephrasing boosts math accuracy with no model changes.

GPT-4o-mini

Math Reasoning (GSM8K)

60.70% → 68.70%

+8.00 pts

Consistent gains across closed-source models.

LLaMA3.3-70B

Math Reasoning (GSM8K)

80.49% → 88.75%

+8.26 pts

The open-source model shows the largest gain, though only by a small margin (+8.26 pts vs. +7.99 and +8.00 pts).

Abstract

While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction investigating the effect of textual frequency on LLMs — the Textual Frequency Law (TFL), which suggests that when the meanings are kept the same, data with higher sentence-level frequency should be preferred to ones with lower frequency for LLMs, both for prompting and fine-tuning. We validate TFL on four tasks: math reasoning, machine translation, commonsense reasoning, and tool calling. We also propose Textual Frequency Distillation (TFD) to further enhance the frequency estimation, and Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in increasing order of sentence-level frequency.

Introduction — Why Frequency Matters

Large language models have transformed NLP — excelling at reasoning, translation, and coding. Training data quality is known to matter, but which dimension of quality? This paper asks a new question: when two paraphrases carry the same meaning, does the more commonly-used phrasing lead to better LLM performance?

Inspired by human cognitive research — where high-frequency words are processed faster in the brain — the authors propose that the same phenomenon applies to LLMs. Higher-frequency expressions consistently lead to better model performance, even when the semantic content is identical.

This connection between human language processing and LLM behavior suggests a fundamental property of how these models internalize language during training. The implication is practical: by choosing more common phrasings, users and developers can improve LLM outputs without retraining or otherwise changing the model.

Three Contributions

Textual Frequency Law (TFL): When meanings are equivalent, prefer higher-frequency expressions for both prompting and fine-tuning.
Textual Frequency Distillation (TFD): Enhance offline frequency estimates using LLM-generated story completions, bridging the gap between web-corpus frequency and LLM-internal frequency.
Curriculum Textual Frequency Training (CTFT): Fine-tune LLMs in ascending frequency order — diverse low-frequency examples first, then high-frequency reinforcement.

Figure 1: Framework overview. Top: High-frequency rephraser selects common paraphrases. Middle: Accuracy-Frequency Curve illustrates the TFL principle. Bottom: Translation quality comparison between low- and high-frequency inputs.

Proposed Approach

Three components form the Textual Frequency Framework: TFL, TFD, and CTFT.

01

Textual Frequency Law (TFL)

Given a set of paraphrases P (all carrying the same meaning), TFL selects the one with the highest sentence-level frequency. This is computed as the geometric mean of word-level frequencies — a simple formula that requires no access to the LLM's actual training data.

Selection objective:

$$\text{argmax}_{x \in P} \; s_{\text{freq}}(x, D)$$

What does “argmax over paraphrases” mean in plain language?

The formula picks the paraphrase that maximises the textual frequency score s_freq. In practice this means: out of all the candidate rephrasings of the input, TFL selects the version whose words are most frequent in a reference corpus D. The corpus serves as a proxy for the text the LLM saw during training, which is usually not available.

Why does this help? LLMs are statistically better calibrated on high-frequency text — they have seen more examples of it and have stronger internal representations for it. Feeding the model the highest-frequency paraphrase is like speaking the language the model knows best.

Sentence-level frequency as geometric mean of word frequencies:

$$s_{\text{freq}}(x, D) = \left(\prod_{k=1}^{K} w_{\text{freq}}(x_k, D)\right)^{1/K}$$

Geometric mean of word frequencies — a simple intuition

s_freq(x) is the geometric mean of the individual word frequencies in the text. Why geometric rather than arithmetic? The geometric mean prevents a single very-common word (e.g. “the”) from dominating the score. Every word contributes multiplicatively, so the score reflects how uniformly familiar the whole phrase is to the model — not just whether it contains one popular word.

Word frequencies are estimated with the wordfreq library (Speer, 2022), which reports frequencies on the Zipf scale. The library is freely available and needs no access to LLM training data, which makes TFL practical for any user or application.
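The selection rule and the geometric-mean score can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `TOY_WORD_FREQ` table, its values, and the `FLOOR` constant for unseen words are assumptions made for the example, standing in for a real lookup against the reference corpus D.

```python
import math

# Hypothetical per-word relative frequencies standing in for a lookup against
# a reference corpus D. The values are illustrative, not measured.
TOY_WORD_FREQ = {
    "how": 1e-3, "many": 8e-4, "apples": 2e-5, "are": 5e-3, "left": 3e-4,
    "what": 2e-3, "quantity": 1e-5, "of": 2e-2, "remains": 2e-5,
}
FLOOR = 1e-9  # assumed floor so that an unseen word does not zero the product


def sentence_frequency(sentence, word_freq=TOY_WORD_FREQ):
    """Geometric mean of word frequencies, computed in log space for stability."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    log_sum = sum(math.log(max(word_freq.get(w, 0.0), FLOOR)) for w in words)
    return math.exp(log_sum / len(words))


def select_paraphrase(paraphrases, word_freq=TOY_WORD_FREQ):
    """Return the paraphrase with the highest sentence-level frequency."""
    return max(paraphrases, key=lambda p: sentence_frequency(p, word_freq))


candidates = ["how many apples are left", "what quantity of apples remains"]
best = select_paraphrase(candidates)
print(best)
```

Working in log space avoids numerical underflow when many small frequencies are multiplied; the result is the same geometric mean as in the formula.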

02

Textual Frequency Distillation (TFD)

Offline frequency from web corpora doesn't perfectly reflect what an LLM has internalized during training. TFD bridges this gap: ask the LLM itself to complete stories based on each training sentence, creating a "distilled" frequency estimate F₂ that reflects LLM-internal frequency more accurately.

Why offline frequency ≠ LLM-internal frequency

Offline frequency is simply how often a word appears in a static corpus (e.g. Wikipedia). But when an LLM trains, it compresses and re-weights those statistics — the final model may treat a moderately common word as very salient, or a technically frequent word as noise.

TFD supplements the raw corpus with text the model itself generates: frequency estimates derived from the LLM's story completions indicate what the model itself tends to produce. This brings the frequency signal closer to how the LLM represents language internally.

$$F(x) = \alpha F_1(x) + (1 + \xi \cdot \mathbb{1}[F_1(x)=0]) \cdot \beta F_2(x)$$

Breaking down the F(x) combination formula

The final TFD score F(x) blends two complementary signals:

F₁(x), the offline score: word frequencies from a large static corpus (data-level evidence)
F₂(x), the LLM-distilled score: frequencies estimated from the model's own story completions (model-level evidence)

The weights α and β control the balance: β = 0 gives pure offline frequency, and α = 0 gives pure LLM-distilled frequency. The indicator 1[F₁(x) = 0] equals 1 only when the offline estimate is zero; in that case the distilled term is scaled up by a factor of (1 + ξ). In experiments, a mix outperforms either signal alone, since the two capture different facets of what the model knows well.

The combined frequency score F(x) blends offline (F₁) and distilled (F₂) estimates. The strengthening factor ξ amplifies F₂ when F₁ is exactly zero, recovering estimates for words the offline corpus does not cover. TFD improves MT win rates from ~13% (without TFD) to 86.7–100% (with TFD).
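The combination formula for F(x) translates directly into code. The sketch below is illustrative only: the default values for `alpha`, `beta`, and `xi` are assumptions chosen for the example, not the settings used in the paper.

```python
def combined_frequency(f1, f2, alpha=0.5, beta=0.5, xi=1.0):
    """F(x) = alpha * F1(x) + (1 + xi * 1[F1(x) == 0]) * beta * F2(x).

    f1: offline frequency estimate, f2: LLM-distilled estimate.
    alpha, beta, xi: assumed example weights, not values from the paper.
    """
    boost = 1.0 + (xi if f1 == 0 else 0.0)  # indicator fires only when F1 is zero
    return alpha * f1 + boost * beta * f2


# Offline corpus covers the text: no boost is applied to the distilled term.
seen = combined_frequency(f1=0.4, f2=0.2)

# Offline estimate is zero: the distilled term is scaled by (1 + xi).
unseen = combined_frequency(f1=0.0, f2=0.2)

print(seen, unseen)
```

With these example weights, a text that the offline corpus misses entirely still receives a non-zero score from the distilled term alone.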

03

Curriculum Textual Frequency Training (CTFT)

For fine-tuning, the order of training data matters. Low-frequency expressions are more linguistically diverse — presenting them first gives the model broader exposure. The curriculum then builds toward high-frequency examples for reinforcement and consolidation.

Why train on low-frequency text first? The curriculum rationale

Curriculum learning trains a model on easier examples before harder ones — but in CTFT the ordering is counter-intuitive: low-frequency (harder) examples come first.

The reasoning: if you show the model only high-frequency text at the start, it quickly over-specialises on common patterns and forgets rare ones. By exposing it to rare patterns early, before the model's weights are strongly biased, it builds durable representations for low-frequency content. High-frequency examples in the later stages then reinforce and stabilise these representations rather than overwriting them.

Think of it like language learning: practising irregular verbs (rare patterns) before drilling regular ones (common patterns) prevents the regular forms from drowning out your memory of the exceptions.

$$\text{sort}_{I_n \in T}\; F(I_n) \quad \text{(ascending)}$$

Training instances are sorted by ascending frequency score F(Iₙ). This low→high ordering consistently outperforms random ordering, easy-to-hard curriculum, and high-to-low reverse curriculum across diverse language pairs.
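The low-to-high ordering amounts to a sort on the frequency score, followed by batching without shuffling. The following is a minimal sketch under that reading; the instance names, the `toy_scores` values, and the batch size are hypothetical.

```python
def curriculum_order(instances, score):
    """Sort training instances by ascending frequency score (low -> high)."""
    return sorted(instances, key=score)


def batches(ordered, batch_size):
    """Yield consecutive batches, preserving the curriculum order (no shuffling)."""
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Hypothetical precomputed scores F(I_n) for four training instances.
toy_scores = {"ex_a": 0.42, "ex_b": 0.05, "ex_c": 0.91, "ex_d": 0.17}

ordered = curriculum_order(list(toy_scores), toy_scores.get)
schedule = list(batches(ordered, batch_size=2))
print(schedule)
```

The important design point is that the data loader must not reshuffle between epochs, otherwise the curriculum ordering is lost.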

Dataset — TFPD

The Textual Frequency Paired Dataset (TFPD) was created specifically for this research. Starting from three existing benchmarks — GSM8K (math), FLORES-200 (translation), and CommonsenseQA (reasoning) — GPT-4o-mini generates 20 paraphrases per sentence: 10 using rare/complex words, 10 using common/simple words.

Human annotators verify semantic equivalence. Only sentence pairs where all annotators agree on "same meaning" are retained. This rigorous filtering ensures the frequency differences are not confounded by semantic changes.

Math Reasoning (GSM8K): 738 pairs

Machine Translation (FLORES-200): 526 pairs

Commonsense Reasoning (CommonsenseQA): 575 pairs

Tool Calling (TC): 114 pairs

Table 1: TFPD statistics. High-frequency and low-frequency partitions have similar sentence counts but differ in linguistic complexity.

Results — Math Reasoning

Figure 2: GSM8K solve rates (%) for three LLMs. High-frequency partition (grey) consistently outperforms low-frequency (yellow) across all models.

High-frequency prompting improves GSM8K math accuracy across all three tested models — two closed-source (DeepSeek-V3, GPT-4o-mini) and one open-source (LLaMA3.3-70B-Instruct). The gains are consistent and substantial, averaging ~8 percentage points:

DeepSeek-V3 63.55% → 71.54% (+7.99 pts)

GPT-4o-mini 60.70% → 68.70% (+8.00 pts)

LLaMA3.3-70B 80.49% → 88.75% (+8.26 pts)
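The per-model gains and the roughly 8-point average can be re-derived from the reported solve rates with a few lines of Python. Only the accuracy figures come from the results above; the variable names are illustrative.

```python
# Reported GSM8K solve rates (%): (low-frequency, high-frequency).
results = {
    "DeepSeek-V3": (63.55, 71.54),
    "GPT-4o-mini": (60.70, 68.70),
    "LLaMA3.3-70B": (80.49, 88.75),
}

# Absolute gain in percentage points, rounded to the reported precision.
gains = {model: round(high - low, 2) for model, (low, high) in results.items()}
mean_gain = round(sum(gains.values()) / len(gains), 2)

print(gains)
print(mean_gain)
```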

Importantly, questions already answered correctly on low-frequency inputs remain correct with high-frequency inputs — the gain is one-directional improvement with no regressions. Analysis of chain-of-thought traces shows that higher-frequency inputs also improve the quality of reasoning steps, not just the final answer.

Results — Machine Translation

Machine translation experiments span 100 languages from FLORES-200, measuring BLEU, chrF, and COMET scores. The Textual Frequency Law generalizes broadly: high-frequency source sentences improve translation quality across diverse language families.

What are BLEU, chrF, and COMET?

Machine translation quality is measured with three complementary metrics:

BLEU: counts how many n-grams (contiguous word sequences) in the generated translation match the reference. High BLEU = high word-level overlap. Fast and widely used, but insensitive to paraphrases.
chrF: character-level F-score. Measures similarity at the character n-gram level, making it more sensitive to morphological variants (crucial for morphologically rich languages like Turkish or Finnish).
COMET: a neural metric trained on human judgement data. It predicts the quality score a human evaluator would assign to the translation, and it typically correlates better with human judgements than overlap-based metrics such as BLEU and chrF. Higher COMET scores generally mean the translation reads more naturally.

When all three improve together (as TFL achieves), the gain is robust across different aspects of translation quality.
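To make the character-level idea behind chrF concrete, here is a simplified sketch in pure Python. It is an approximation for illustration: it averages character n-gram precision and recall over n = 1..6 and combines them with an F-score that weights recall more heavily (β = 2), but it will not reproduce the numbers of standard evaluation toolkits, which differ in details such as whitespace handling and smoothing. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams, with spaces removed."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


exact = simple_chrf("the cat sat", "the cat sat")
partial = simple_chrf("the cats sat", "the cat sat")
disjoint = simple_chrf("abc", "xyz")
print(exact, partial, disjoint)
```

Because matching happens on characters rather than whole words, the near-miss "cats" vs. "cat" still earns substantial credit, which is why chrF is more forgiving of inflectional variants than BLEU.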

Figure 3: Radar charts showing BLEU, chrF, and COMET across 100 language pairs for ChatGPT and DeepSeek. High-frequency (orange/red) consistently covers larger areas than low-frequency (blue).

Figure 4: TFD ablation. High-frequency with TFD achieves 86.7–100% win rate. Without TFD, performance is inconsistent (0–16.7%), demonstrating TFD's critical role.

The critical finding is the role of Textual Frequency Distillation. Without TFD, using raw offline frequency is unreliable — it only wins ~13% of the time on MT. With TFD, the win rate jumps to 86.7–100% across all model and metric combinations.
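A win rate of this kind can be computed as the share of paired comparisons in which the high-frequency input scores higher than its low-frequency counterpart. The sketch below assumes that definition, counts ties as non-wins, and uses made-up metric scores purely for illustration; the paper's exact counting protocol may differ.

```python
def win_rate(high_scores, low_scores):
    """Percentage of paired comparisons where the high-frequency input wins.

    Assumes a 'win' means a strictly higher metric score; ties count as non-wins.
    """
    if len(high_scores) != len(low_scores) or not high_scores:
        raise ValueError("need two non-empty score lists of equal length")
    wins = sum(h > l for h, l in zip(high_scores, low_scores))
    return 100.0 * wins / len(high_scores)


# Illustrative metric scores for six paired settings (not data from the paper).
high = [31.2, 28.4, 40.1, 22.0, 35.5, 19.8]
low = [30.0, 28.9, 38.7, 21.5, 33.0, 19.1]

rate = win_rate(high, low)
print(round(rate, 1))
```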

Key insight: Offline word frequency (from web corpora) is a rough proxy for what LLMs have learned. TFD corrects this mismatch by asking the LLM itself to "distill" frequency estimates — making the selection oracle much more accurate.

Table 2: Commonsense Reasoning accuracy. High-frequency partition achieves better accuracy on all baseline models.

Curriculum Training Results (CTFT)

CTFT (low→high frequency curriculum) outperforms all baselines on machine translation fine-tuning:

Standard fine-tuning on low-frequency data (no curriculum)
Easy-to-hard curriculum learning (traditional approach)
High-to-low frequency order (reverse curriculum)
Original model with no fine-tuning

The rationale: low-frequency expressions are more linguistically diverse, so training on them first gives the model broader coverage. High-frequency examples then consolidate and reinforce the most common patterns. This ordering mirrors how humans learn language — exposure to varied forms before settling on common usage.

Figure 5: CTFT improvement percentage vs. data percentage across 5 low-resource languages (BLEU). Most reach ~100% improvement at full data; Chinese (zho_Hans) reaches 100% improvement already at 60% data.

Analysis — Why Does Frequency Help?

Linguistic analysis reveals the structural differences that make high-frequency text easier for LLMs to process. The pattern mirrors what's known about human reading: simpler syntactic structures are processed faster and more accurately.

📐

Lower Syntactic Complexity

High-frequency sentences have a slightly lower Max Dependency Tree Depth (5.02 vs 5.18 for math), indicating somewhat simpler grammatical structures that are easier to parse.

📖

Simpler Reading Level

Lower Flesch-Kincaid Grade Level (4.36 vs 6.35 for math tasks) confirms that high-frequency text is more accessible — even for language models.

📊

Positive Performance Correlation

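Pearson and Spearman correlations between textual frequency and model accuracy are consistently positive (0.03–0.28). The association is weak to modest in size, and correlation alone does not establish causation, but its consistent sign agrees with the paired-paraphrase results.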

Table 3: Linguistic statistics comparing high-frequency vs low-frequency data. Max Dependency Tree Depth, Mean Dependency Tree Depth, Flesch-Kincaid Grade Level, and correlation coefficients shown.
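For readers who want to reproduce this kind of correlation analysis, the two coefficients can be computed without external libraries. The sketch below is a generic implementation (Spearman is Pearson applied to average ranks); the frequency and accuracy values are toy numbers, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient. Assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data: a monotonic but non-linear relationship.
freq = [1.0, 2.0, 3.0, 4.0, 5.0]
acc = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(freq, acc), spearman(freq, acc))
```

On the toy data the rank correlation is perfect while the linear correlation is slightly below 1, which shows why reporting both coefficients is informative.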

Conclusion

This paper establishes Adam's Law: the Textual Frequency Law for LLMs. When two expressions have the same meaning, the higher-frequency variant reliably produces better LLM performance — across prompting and fine-tuning, validated on four diverse NLP tasks.

The framework's three components — TFL, TFD, and CTFT — together provide a practical pipeline for improving LLM performance without changing model architecture or adding training data. The key insight: LLMs, like humans, process common language more effectively.

Future work could investigate frequency effects for other LLM capabilities, develop more efficient frequency estimation methods, and explore cross-lingual frequency interactions in multilingual models.

References

Cobbe et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168.
DeepSeek-AI et al. (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437.
Grattafiori et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
NLLB-Team (2022). No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv:2207.04672.
Speer, R. (2022). rspeer/wordfreq. Zenodo.
Talmor et al. (2019). CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. NAACL.
Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
Lu, H. & Lam, W. (2023). Curriculum Learning for Language Modeling. EMNLP.