arXiv:2602.21061 · cs.AI · February 2026
February 25, 2026
Can LLMs achieve superintelligence through reasoning alone — or do they need tools?
The Diligent Learner framework proposes that LLMs can achieve superintelligence via test-time search, provided the stepwise success probability γ remains non-vanishingly positive. This paper introduces a benchmark to directly measure γ on logical out-of-distribution inference.
The benchmark is built around a class of GF(2) circuit reconstruction tasks that grow harder with each reasoning step and are information-theoretically impossible to shortcut: a model must simultaneously integrate its accumulated history and newly observed evidence at every step. Strategies that rely solely on pattern-matching or memorized history will fail by construction.
Analysis shows that while γ declines superlinearly with depth for small LLMs, frontier models exhibit partial robustness — but only when using tool calls. Successful reasoning at scale is contingent upon precise tool use, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.
- A GF(2) Boolean circuit reconstruction problem for testing γ in the Diligent Learner framework, where the correct next step is unique and shortcuts are information-theoretically blocked.
- Evaluation across Bayesian optimal estimators, small LLMs (the Qwen3-2507 family), and state-of-the-art frontier models (ChatGPT, Claude Opus, Gemini 3 Pro).
- Identification of trends in γ_g collapse, and an explanation for why certain models catastrophically fail as complexity increases while tool-using models perform far better.
- All code released on GitHub: github.com/Poggio-Lab/Tool-Building-as-a-Path-to-Superintelligence
The Diligent Learner formalizes reasoning as a depth-first search through a tree of candidate steps, guided by a validator that checks whether each step is logically consistent. A root-to-leaf path corresponds to a chain-of-thought; leaves are either DONE (a completed solution) or BACKTRACK (an incorrect path to abandon).
The critical parameter is γ (gamma): the probability that, at each depth g, the model proposes a “good” next step that keeps the solution on track. Formally, γ_g = Pr[step g is good | the first g−1 steps are good].
If γ_g stays non-vanishing with depth, each level needs on average only about 1/γ proposals, so search succeeds with only polynomial overhead.
But if γ collapses as depth grows, the search budget blows up exponentially and the framework’s guarantees evaporate. The central question this paper addresses: does γ always stay positive, or can it catastrophically degrade?
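As a toy illustration of this dichotomy (not taken from the paper): if a depth-first search needs on average 1/γ_g proposals to draw a good step at depth g (a geometric distribution), the total expected work is the sum of those per-level costs. A constant γ keeps the budget linear in depth; an exponentially decaying γ_g makes it explode.

```python
def expected_search_cost(gammas):
    """Expected proposals if depth g needs on average 1/gamma_g samples
    to draw a 'good' next step (geometric distribution)."""
    return sum(1.0 / g for g in gammas)

depth = 20
constant = [0.5] * depth                           # gamma stays non-vanishing
decaying = [0.5 ** (g + 1) for g in range(depth)]  # gamma collapses with depth

print(expected_search_cost(constant))   # 40.0: linear in depth
print(expected_search_cost(decaying))   # 2097150.0: exponential blow-up
```

The 1/γ-per-level model is a simplification, but it captures why a non-vanishing γ is the load-bearing assumption of the framework.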
The benchmark tests whether an LLM can reconstruct a Boolean function over GF(2) — a mathematical field where addition is XOR. The target circuit is expressed as a sum of monomials: f(a, v) = t₁ ⊕ t₂ ⊕ … ⊕ tₙ. At each step g, the model receives two inputs: the prefix P of monomials revealed so far, and fresh evidence S in the form of labeled examples from an oracle. The model must output the next monomial t_{g+1}. To succeed, it must fuse both inputs — history alone or evidence alone is provably insufficient (information-theoretically). The oracle statistically masks the answer unless the solver holds the full prefix, making shortcut strategies fail by design.
Algebraic normal form (ANF) is a way to write any Boolean function as a sum (XOR) of products (AND) of variables. For example: f(x₁, x₂, x₃) = x₁ ⊕ (x₂ AND x₃). Each product term is called a “monomial.” The benchmark task is essentially: given the first g monomials of a circuit, plus some labeled examples, predict the (g+1)-th monomial. This is like completing a polynomial — but over binary XOR arithmetic.
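The ANF representation can be sketched in a few lines. Here monomials are tuples of 0-indexed variable indices — an encoding chosen for this illustration, not necessarily the paper's:

```python
from itertools import product

def eval_anf(monomials, x):
    """Evaluate an ANF circuit: XOR over monomials, each the AND of the
    listed 0-indexed variables (an empty tuple would be the constant 1)."""
    out = 0
    for mono in monomials:
        term = 1
        for i in mono:
            term &= x[i]
        out ^= term
    return out

# f(x1, x2, x3) = x1 XOR (x2 AND x3), the example from the text
f = [(0,), (1, 2)]
for x in product([0, 1], repeat=3):
    print(x, eval_anf(f, x))
```

Addition over GF(2) maps to `^` and multiplication to `&`, which is all the arithmetic the benchmark's circuits require.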
On its own, the sequence of prior steps provides zero predictive power about the next step; the fresh evidence must also be taken into account.
Labels are balanced (≈50% 0/1) so statistical frequency cannot reveal the answer. A model cannot trivially guess without real reasoning.
The fresh evidence without the prefix provides negligible signal. The Bayes advantage decays exponentially with the number of active prefix bits.
Solvers are distinguished by their information access. The benchmark is designed so that min_g γ_A ≥ Q while γ_B, γ_C, γ_D ≈ 1/C^(d−1) (near random for large depth).
Solver A: full access to both Prefix P and Evidence S. The ideal agent that integrates all available information.
Solver B: Evidence S but not Prefix P. Cannot use the accumulated reasoning history.
Solver C: Prefix P but not Evidence S. Cannot use the new step-specific data.
Solver D: partial access to both P and S. Degrades with depth; intermediate between B and A.
Only history + data together maintain reliable next-step prediction.
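A small self-contained demonstration of this point, using a toy circuit of our own choosing rather than the paper's adversarial instances: the raw labels are perfectly balanced, so evidence alone looks like coin flips, but XOR-cancelling the known prefix exposes the next monomial exactly.

```python
from itertools import product

def eval_anf(circuit, x):
    """XOR of AND-monomials; each monomial lists 0-indexed variable indices."""
    out = 0
    for mono in circuit:
        term = 1
        for i in mono:
            term &= x[i]
        out ^= term
    return out

prefix, nxt = [(0,), (1, 2)], (3,)   # toy circuit: x1 XOR (x2 AND x3) XOR x4
xs = list(product([0, 1], repeat=4))
labels = [eval_anf(prefix + [nxt], x) for x in xs]

# evidence alone (solver B): labels are balanced, indistinguishable from noise
print(sum(labels) / len(labels))           # 0.5

# prefix + evidence (solver A): cancelling the prefix exposes the next monomial
residuals = [y ^ eval_anf(prefix, x) for x, y in zip(xs, labels)]
print(residuals == [x[3] for x in xs])     # True
```

The same cancellation trick fails for solvers without the full prefix, since an unknown XOR offset scrambles the residuals.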
We evaluated four Bayesian estimator classes across depths g ∈ {1, 3, 7, 15, 31, 63, 127} on 2,000 generated circuits (adversarial sampling, p=12, d=4). Results are unambiguous:
As both reasoning depth g and problem size p increase, partial-information estimators converge to zero — confirming the information-theoretic design of the benchmark.
Small LLMs degrade just like partial-information estimators.
We evaluated four models from the Qwen3-2507 family (4B-Instruct, 4B-Thinking, 30B-A3B-Instruct, and 30B-A3B-Thinking) on 3,000 instances across depths g ∈ {1, 3, 7, 15, 31} with adversarial sampling (p=12, d=4).
All models exhibit a systematic γ_g decline with depth — even though a polynomial-time decoder provably exists at every step (Theorem B.1). “Thinking” variants perform better at shallow depths but still collapse sharply around g = 15.
The paper proves in Appendix B (Theorem B.1) that a polynomial-time decoder always exists at every step — meaning, in principle, an algorithm can always find the correct next monomial efficiently. Yet even 30B-parameter models fail as depth increases. This disconnect reveals something important: having the right algorithm isn’t enough. The LLM must implement that algorithm through its attention weights and forward pass, which apparently breaks down as the number of active prefix bits grows. It’s like knowing that a proof exists for a theorem but being unable to find it within a reasonable time. The information is there; the access mechanism fails.
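The paper's Appendix B decoder is not reproduced here, but one natural polynomial-time approach can be sketched under the same residual idea: cancel the known prefix, then enumerate the O(p^d) candidate monomials of degree at most d (polynomial for fixed d) and keep the one consistent with every example. The function names and encoding below are assumptions of this illustration.

```python
from itertools import combinations, product

def eval_monomial(mono, x):
    term = 1
    for i in mono:
        term &= x[i]
    return term

def eval_circuit(circuit, x):
    out = 0
    for mono in circuit:
        out ^= eval_monomial(mono, x)
    return out

def decode_next_monomial(prefix, examples, p, d):
    """Cancel the prefix, then scan the O(p^d) candidate monomials of
    degree <= d for one matching every residual."""
    residuals = [y ^ eval_circuit(prefix, x) for x, y in examples]
    for deg in range(1, d + 1):
        for cand in combinations(range(p), deg):
            if all(eval_monomial(cand, x) == r
                   for (x, _), r in zip(examples, residuals)):
                return cand
    return None

# toy check: plant t3 = x4 AND x5 (0-indexed (3, 4)) and recover it
p, d = 6, 2
prefix, target = [(0,), (1, 2)], (3, 4)
examples = [(x, eval_circuit(prefix + [target], x))
            for x in product([0, 1], repeat=p)]
print(decode_next_monomial(prefix, examples, p, d))
```

The point of the sketch is the disconnect the text describes: a few dozen lines of explicit code solve every step, yet the same computation apparently cannot be carried reliably inside a transformer's forward pass as depth grows.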
Effective prefix analysis: Fitting each model’s accuracy curve to an effective-prefix model reveals that Qwen3-30B-A3B-Thinking uses ≈47% of the revealed prefix (proportional scaling, ΔAIC=228), while 30B-A3B-Instruct uses only ≈15%. As depth grows, the oracle mask expands and limited prefix utilization pushes models toward the partial-information regime.
Table 1: Likelihood-based fit of LLM accuracy to an effective-prefix model
| Model | u (prop. scale) | v (capacity) | ΔAIC | Better fit |
|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | 0.15 | 0.00 | 2.21 | u (marginal) |
| Qwen3-30B-A3B-Thinking-2507 | 0.47 | 0.00 | 228.08 | u (strong) |
| Qwen3-4B-Instruct-2507 | 0.08 | 0.00 | 2.32 | u (marginal) |
| Qwen3-4B-Thinking-2507 | 0.05 | 0.00 | 0.00 | — |
u = effective fraction of prefix used (higher = better prefix integration). ΔAIC > 2 favors proportional scaling over constant capacity. Only Thinking-30B substantially uses its context window (u=0.47).
The authors fit two models to each LLM’s accuracy curve: (1) proportional scaling, k = u·g, in which the model uses a fraction u of the revealed prefix, and (2) constant capacity, k = v, in which the model uses at most v terms regardless of how many are revealed. ΔAIC (the Akaike Information Criterion difference) measures which model fits better; ΔAIC > 2 means the proportional model is meaningfully better. For Qwen3-30B-Thinking (u = 0.47, ΔAIC = 228), this strongly indicates that it scales its prefix usage with depth — but using only ≈47% of the revealed prefix, not the full context. In contrast, 30B-Instruct (u = 0.15, ΔAIC = 2.21) barely scales at all.
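The ΔAIC arithmetic itself is simple: AIC = 2k − 2·ln L̂, and with one free parameter in each model the comparison reduces to twice the log-likelihood gap. The log-likelihood values below are hypothetical, chosen only to reproduce a gap of 228; they are not taken from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

# hypothetical fitted log-likelihoods for the two one-parameter models
ll_proportional = -120.0   # k = u*g  (prefix use scales with depth)
ll_capacity = -234.0       # k = v    (fixed prefix capacity)

delta_aic = aic(ll_capacity, 1) - aic(ll_proportional, 1)
print(delta_aic)  # 228.0: strong evidence for proportional scaling
```

A common rule of thumb treats ΔAIC > 2 as meaningful and ΔAIC > 10 as strong, which is why 2.21 is labeled “marginal” and 228 “strong” in Table 1.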
Frontier LLMs are dramatically better — and tools are the key.
We evaluated ChatGPT (extended Thinking), Claude Opus 4.5 (max Thinking), and Gemini 3 Pro (Jan 2026) on 60 queries per model across g ∈ {31, 63, 127} with p=12, d=4. Half the prompts disallowed tool use (N.T.); the other half allowed it (T.).
Under the hardest conditions, where all small LLMs perform at chance, frontier models remain robust — but only when tool calls are allowed; without tools, their accuracy degrades as well.
Why tools help: Tool use externalizes computation. Instead of simultaneously discovering constraints and executing the implied computation in its internal weights, the model only specifies the constraints and delegates execution to an external program. This separation dramatically reduces the burden on the transformer’s weights, enabling robust generalization and stabilizing γ over long horizons.
Imagine asking an LLM to compute: “Given prefix [x₁, x₁ AND x₃, x₂ AND x₄], and 32 labeled examples, find the next monomial.” Without tools, the model must simultaneously (1) parse the prefix, (2) apply the XOR cancellation mask in its own attention heads, (3) search through payload variables, and (4) verify the candidate — all in one forward pass. With tools, the model can write code: compute_residuals(prefix, examples) and receive the cancellation result, then search for payload variables in a much simpler reasoning step. The computational load on the transformer’s weights drops dramatically, which is why tool-using models maintain high γ even at g=127.
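A minimal sketch of the delegation pattern just described, with compute_residuals playing the external tool. The function name comes from the text; its signature and data encoding are assumptions of this illustration.

```python
def compute_residuals(prefix, examples):
    """Hypothetical external tool: cancel the known prefix over GF(2) so the
    model only reasons about what remains (y XOR f_prefix(x))."""
    out = []
    for x, y in examples:
        acc = 0
        for mono in prefix:
            term = 1
            for i in mono:
                term &= x[i]
            acc ^= term
        out.append(y ^ acc)
    return out

# the model merely specifies the constraints; the tool does the arithmetic
prefix = [(0,), (0, 2)]   # x1, x1 AND x3 (0-indexed)
examples = [((1, 1, 0, 1), 0), ((1, 0, 1, 1), 1), ((0, 1, 1, 0), 0)]
print(compute_residuals(prefix, examples))  # [1, 1, 0]
```

After the tool call, what is left for the model is a much simpler step: find variables whose AND reproduces the residuals on every example, rather than carrying the full XOR cancellation inside one forward pass.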
This work provides a rigorous empirical test of the Diligent Learner hypothesis by introducing a GF(2) circuit reconstruction benchmark that is adversarial to common shortcut strategies. The task forces a model to maintain state and repeatedly fuse accumulated historical context with newly observed evidence at every step, rather than relying on shallow pattern matching.
Smaller language models exhibit a superlinear decline in γ as problem depth increases, effectively acting as partial-information estimators. They cannot preserve the prefix-conditioned cancellation required for continued progress — a fundamental limitation of their architecture.
Frontier models that leverage tool calls maintain high γ over long sequences by delegating state tracking and verification to external mechanisms. This suggests progress toward “superintelligence” depends less on scaling test-time compute and more on architectures that can build and use tools.