Textual Frequency Law on Large Language Models
When two sentences have the same meaning, LLMs consistently perform better with the more common phrasing. This paper formalizes this as the Textual Frequency Law and validates it across math reasoning, machine translation, commonsense reasoning, and tool calling.
High-frequency prompting improves math reasoning accuracy across all tested models (GSM8K benchmark).
High-frequency rephrasing boosts math accuracy with no model changes.
Consistent gains across closed-source models.
Open-source models benefit even more from frequency-aware prompting.
While textual frequency has been validated as relevant to human cognition in reading speed, its relevance to Large Language Models (LLMs) is seldom studied. We propose a novel research direction investigating the effect of textual frequency on LLMs: the Textual Frequency Law (TFL), which suggests that when the meaning is kept the same, data with higher sentence-level frequency should be preferred over data with lower frequency, both for prompting and for fine-tuning. We validate TFL on four tasks: math reasoning, machine translation, commonsense reasoning, and tool calling. We also propose Textual Frequency Distillation (TFD) to further improve frequency estimation, and Curriculum Textual Frequency Training (CTFT), which fine-tunes LLMs in increasing order of sentence-level frequency.
Large language models have transformed NLP — excelling at reasoning, translation, and coding. Training data quality is known to matter, but which dimension of quality? This paper asks a new question: when two paraphrases carry the same meaning, does the more commonly-used phrasing lead to better LLM performance?
Inspired by human cognitive research — where high-frequency words are processed faster in the brain — the authors propose that the same phenomenon applies to LLMs. Higher-frequency expressions consistently lead to better model performance, even when the semantic content is identical.
This connection between human language processing and LLM behavior suggests a fundamental property of how these models internalize language during training. The implication is practical: by simply choosing more common phrasings, users and developers can improve LLM outputs at zero additional cost.
Three components form the Textual Frequency Framework: TFL, TFD, and CTFT.
Given a set of paraphrases P (all carrying the same meaning), TFL selects the one with the highest sentence-level frequency. This is computed as the geometric mean of word-level frequencies — a simple formula that requires no access to the LLM's actual training data.
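Selection objective, where F(x) is the sentence-level frequency score of paraphrase x:
\[ x^{*} = \arg\max_{x \in P} F(x) \]
Sentence-level frequency as geometric mean of word frequencies, for a sentence x with words w_1, ..., w_n and word-level frequency f(w_i) (this offline estimate is the F₁ used below):
\[ F_1(x) = \Big( \prod_{i=1}^{n} f(w_i) \Big)^{1/n} \]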
Word frequencies are estimated using the Zipf-scale WordFreq library — freely available, no LLM training data needed. This makes TFL practical for any user or application.
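As a rough illustration of the selection step, the sketch below computes a geometric-mean sentence frequency and picks the highest-scoring paraphrase. It is a minimal sketch, not the authors' implementation: the toy frequency table, the floor value for unseen words, the regex tokenizer, and the function names are all assumptions made for this example. A real setup would plug a frequency resource such as WordFreq into the `word_freq` parameter.

```python
import math
import re
from typing import Callable, Dict, List

# Hypothetical per-word relative frequencies, for illustration only.
# A real pipeline would look these up in a frequency resource instead.
TOY_WORD_FREQ: Dict[str, float] = {
    "the": 5e-2,
    "cat": 1e-4,
    "sat": 5e-5,
    "feline": 2e-6,
    "reposed": 1e-7,
}

# Assumed floor so that one unseen word does not zero out the whole product.
FLOOR = 1e-9


def toy_word_freq(word: str) -> float:
    """Look up a word in the toy table, falling back to the floor."""
    return TOY_WORD_FREQ.get(word, FLOOR)


def sentence_frequency(
    sentence: str, word_freq: Callable[[str], float] = toy_word_freq
) -> float:
    """Geometric mean of word frequencies, computed in log space."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    log_sum = sum(math.log(max(word_freq(w), FLOOR)) for w in words)
    return math.exp(log_sum / len(words))


def select_paraphrase(
    paraphrases: List[str], word_freq: Callable[[str], float] = toy_word_freq
) -> str:
    """Return the paraphrase with the highest sentence-level frequency."""
    return max(paraphrases, key=lambda s: sentence_frequency(s, word_freq))
```

Working in log space avoids underflow on long sentences. Because a Zipf value is the base-10 logarithm of a word's frequency per billion words, ranking sentences by the arithmetic mean of their Zipf values should give the same order as ranking by the geometric mean of raw frequencies, up to rounding and the floor applied here.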
Offline frequency from web corpora doesn't perfectly reflect what an LLM has internalized during training. TFD bridges this gap: ask the LLM itself to complete stories based on each training sentence, creating a "distilled" frequency estimate F₂ that reflects LLM-internal frequency more accurately.
The combined frequency score F(x) blends offline (F₁) and distilled (F₂) estimates. The strengthening factor ξ amplifies F₂ when F₁ is near zero, recovering estimates for rare words. TFD improves MT win-rates from ~13% (without TFD) to 86.7–100% (with TFD).
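The text above does not give the formulas for F₂ or for the blend, so the following is only one possible shape that reproduces the described behavior. Everything in it is an assumption for illustration: counting words in model-generated text as the distilled estimate, the additive blend, the form of the strengthening factor, and the `eps` constant.

```python
import re
from collections import Counter
from typing import Dict, List


def distilled_word_freq(generated_texts: List[str]) -> Dict[str, float]:
    """Relative word frequencies counted over model-generated text.

    ASSUMPTION: the generated texts stand in for the story completions
    described above; how the paper turns them into F2 is not stated here.
    """
    counts: Counter = Counter()
    for text in generated_texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def strengthening_factor(f1: float, eps: float = 1e-6) -> float:
    """ASSUMED form of xi: close to 1 for large f1, rising to 2 as f1 -> 0."""
    return 1.0 + eps / (f1 + eps)


def combined_score(f1: float, f2: float, eps: float = 1e-6) -> float:
    """ASSUMED blend: offline estimate plus an amplified distilled estimate."""
    return f1 + strengthening_factor(f1, eps) * f2
```

With this form, a word whose offline estimate is zero still receives a nonzero score from the distilled estimate, which is the qualitative effect attributed to ξ. The exact functional form in the paper may differ.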
For fine-tuning, the order of training data matters. Low-frequency expressions are more linguistically diverse — presenting them first gives the model broader exposure. The curriculum then builds toward high-frequency examples for reinforcement and consolidation.
Training instances are sorted by ascending frequency score F(Iₙ). This low→high ordering consistently outperforms random ordering, easy-to-hard curriculum, and high-to-low reverse curriculum across diverse language pairs.
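The ordering step itself is simple to sketch. The example below sorts training instances by ascending frequency score and yields them in batches without shuffling. The instance layout, the toy scores, and the function names are assumptions for illustration; the fine-tuning loop is left out.

```python
from typing import Iterator, List, Sequence, Tuple

# (training text, frequency score F). The scores below are made up.
Instance = Tuple[str, float]

TOY_DATA: List[Instance] = [
    ("Buy milk.", 0.9),
    ("Procure dairy.", 0.1),
    ("Get some milk.", 0.6),
]


def ctft_order(instances: Sequence[Instance]) -> List[Instance]:
    """Sort instances from lowest to highest frequency score (stable sort)."""
    return sorted(instances, key=lambda inst: inst[1])


def batches(ordered: Sequence[Instance], batch_size: int) -> Iterator[List[Instance]]:
    """Yield consecutive batches, preserving the curriculum order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ordered), batch_size):
        yield list(ordered[start:start + batch_size])
```

The one thing to watch in practice is that the data loader does not reshuffle the examples, since that would silently turn the curriculum back into random ordering.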
The Textual Frequency Paired Dataset (TFPD) was created specifically for this research. Starting from three existing benchmarks — GSM8K (math), FLORES-200 (translation), and CommonsenseQA (reasoning) — GPT-4o-mini generates 20 paraphrases per sentence: 10 using rare/complex words, 10 using common/simple words.
Human annotators verify semantic equivalence. Only sentence pairs where all annotators agree on "same meaning" are retained. This rigorous filtering ensures the frequency differences are not confounded by semantic changes.
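The unanimous-agreement filter can be sketched in a few lines. The record layout and field names below are hypothetical and do not reflect the actual TFPD schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairedExample:
    """One low/high-frequency paraphrase pair with annotator votes (hypothetical schema)."""
    low_freq: str
    high_freq: str
    same_meaning_votes: List[bool]  # one vote per annotator


def keep_unanimous(pairs: List[PairedExample]) -> List[PairedExample]:
    """Keep only pairs that every annotator judged to have the same meaning.

    Pairs with no votes are dropped, since agreement cannot be confirmed.
    """
    return [
        pair for pair in pairs
        if pair.same_meaning_votes and all(pair.same_meaning_votes)
    ]
```

Requiring unanimity rather than a majority trades dataset size for a cleaner comparison, which matches the stated goal of keeping meaning fixed while frequency varies.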
High-frequency prompting improves GSM8K math accuracy across all three tested models: two closed-source (DeepSeek-V3, GPT-4o-mini) and one open-source (LLaMA3.3-70B-Instruct). The gains are consistent and substantial, averaging about 8 percentage points.
Importantly, questions already answered correctly on low-frequency inputs remain correct with high-frequency inputs — the gain is one-directional improvement with no regressions. Analysis of chain-of-thought traces shows that higher-frequency inputs also improve the quality of reasoning steps, not just the final answer.
Machine translation experiments span 100 languages from FLORES-200, measuring BLEU, chrF, and COMET scores. The Textual Frequency Law generalizes broadly: high-frequency source sentences improve translation quality across diverse language families.
The critical finding is the role of Textual Frequency Distillation. Without TFD, using raw offline frequency is unreliable — it only wins ~13% of the time on MT. With TFD, the win rate jumps to 86.7–100% across all model and metric combinations.
Key insight: Offline word frequency (from web corpora) is a rough proxy for what LLMs have learned. TFD corrects this mismatch by asking the LLM itself to "distill" frequency estimates — making the selection oracle much more accurate.
CTFT (low→high frequency curriculum) outperforms all baselines on machine translation fine-tuning.
The rationale: low-frequency expressions are more linguistically diverse, so training on them first gives the model broader coverage. High-frequency examples then consolidate and reinforce the most common patterns. This ordering mirrors how humans learn language — exposure to varied forms before settling on common usage.
Linguistic analysis reveals the structural differences that make high-frequency text easier for LLMs to process. The pattern mirrors what's known about human reading: simpler syntactic structures are processed faster and more accurately.
High-frequency sentences have lower Max Dependency Tree Depth (5.02 vs 5.18 for math), meaning simpler grammatical structures that are easier to parse.
Lower Flesch-Kincaid Grade Level (4.36 vs 6.35 for math tasks) confirms that high-frequency text is more accessible — even for language models.
Pearson and Spearman correlations between textual frequency and model accuracy are consistently positive but weak to modest (0.03–0.28). This is consistent with a frequency effect, though correlation alone does not establish causation.
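For readers who want to run the same kind of check on their own data, here is a dependency-free sketch of both coefficients. Treating the inputs as per-example frequency scores paired with 0/1 correctness is an assumption; the source does not say at what granularity the correlations were computed.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Average ranks matter here because correctness labels are heavily tied: with only two distinct values, a rank method that breaks ties arbitrarily would distort the coefficient.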
This paper establishes Adam's Law: the Textual Frequency Law for LLMs. When two expressions have the same meaning, the higher-frequency variant reliably produces better LLM performance — across prompting and fine-tuning, validated on four diverse NLP tasks.
The framework's three components — TFL, TFD, and CTFT — together provide a practical pipeline for improving LLM performance without changing model architecture or adding training data. The key insight: LLMs, like humans, process common language more effectively.
Future work could investigate frequency effects for other LLM capabilities, develop more efficient frequency estimation methods, and explore cross-lingual frequency interactions in multilingual models.