arXiv:2602.21061 · cs.AI · February 2026
February 25, 2026
Can LLMs achieve superintelligence through reasoning alone — or do they need tools?
The Diligent Learner framework proposes that LLMs can achieve superintelligence via test-time search, provided the stepwise success probability γ remains non-vanishingly positive. This paper introduces a benchmark to directly measure γ on logical out-of-distribution inference.
The benchmark is built around a class of GF(2) circuit reconstruction tasks that grow harder with each reasoning step and are information-theoretically impossible to shortcut: a model must simultaneously integrate its accumulated history and newly observed evidence at every step. Strategies that rely solely on pattern-matching or memorized history will fail by construction.
Analysis shows that while γ declines superlinearly with depth for small LLMs, frontier models exhibit partial robustness — but only when using tool calls. Successful reasoning at scale is contingent upon precise tool use, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.
- A GF(2) Boolean circuit reconstruction problem for testing γ in the Diligent Learner framework, where the correct next step is unique and shortcuts are information-theoretically blocked.
- Evaluation across Bayesian optimal estimators, small LLMs (the Qwen3-2507 family), and state-of-the-art frontier models (ChatGPT, Claude Opus, Gemini 3 Pro).
- Identification of trends in γ_g collapse, and an explanation for why certain models catastrophically fail as complexity increases while tool-using models perform far better.
- All code released on GitHub: github.com/Poggio-Lab/Tool-Building-as-a-Path-to-Superintelligence
The Diligent Learner formalizes reasoning as a depth-first search through a tree of candidate steps, guided by a validator that checks whether each step is logically consistent. A root-to-leaf path corresponds to a chain-of-thought; leaves are either DONE (a completed solution) or BACKTRACK (an incorrect path to abandon).
The critical parameter is γ (gamma): the probability that, at each depth g, the model proposes a “good” next step that keeps the solution on track. Formally, γ_g = Pr[step g is good | the first g−1 steps are good].
If γ_g stays non-vanishing with depth, each level needs on average only about 1/γ proposals, so search succeeds with only polynomial overhead.
But if γ collapses as depth grows, the search budget blows up exponentially and the framework’s guarantees evaporate. The central question this paper addresses: does γ always stay positive, or can it catastrophically degrade?
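As a toy illustration of this dichotomy (not taken from the paper): if a depth-first search needs on average 1/γ_g proposals to draw a good step at depth g (a geometric distribution), the total expected work is the sum of those per-level costs. A constant γ keeps the budget linear in depth; an exponentially decaying γ_g makes it explode.

```python
def expected_search_cost(gammas):
    """Expected proposals if depth g needs on average 1/gamma_g samples
    to draw a 'good' next step (geometric distribution)."""
    return sum(1.0 / g for g in gammas)

depth = 20
constant = [0.5] * depth                           # gamma stays non-vanishing
decaying = [0.5 ** (g + 1) for g in range(depth)]  # gamma collapses with depth

print(expected_search_cost(constant))   # 40.0: linear in depth
print(expected_search_cost(decaying))   # 2097150.0: exponential blow-up
```

The 1/γ-per-level model is a simplification, but it captures why a non-vanishing γ is the load-bearing assumption of the framework.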
The benchmark tests whether an LLM can reconstruct a Boolean function over GF(2) — a mathematical field where addition is XOR. The target circuit is expressed as a sum of monomials: f(a, v) = t₁ ⊕ t₂ ⊕ … ⊕ tₙ. At each step g, the model receives two inputs: the prefix P of monomials revealed so far, and fresh evidence S in the form of labeled examples from an oracle. The model must output the next monomial t_{g+1}. To succeed, it must fuse both inputs — history alone or evidence alone is provably insufficient (information-theoretically). The oracle statistically masks the answer unless the solver holds the full prefix, making shortcut strategies fail by design.
Algebraic normal form (ANF) is a way to write any Boolean function as a sum (XOR) of products (AND) of variables. For example: f(x₁, x₂, x₃) = x₁ ⊕ (x₂ AND x₃). Each product term is called a “monomial.” The benchmark task is essentially: given the first g monomials of a circuit, plus some labeled examples, predict the (g+1)-th monomial. This is like completing a polynomial — but over binary XOR arithmetic.
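The ANF representation can be sketched in a few lines. Here monomials are tuples of 0-indexed variable indices — an encoding chosen for this illustration, not necessarily the paper's:

```python
from itertools import product

def eval_anf(monomials, x):
    """Evaluate an ANF circuit: XOR over monomials, each the AND of the
    listed 0-indexed variables (an empty tuple would be the constant 1)."""
    out = 0
    for mono in monomials:
        term = 1
        for i in mono:
            term &= x[i]
        out ^= term
    return out

# f(x1, x2, x3) = x1 XOR (x2 AND x3), the example from the text
f = [(0,), (1, 2)]
for x in product([0, 1], repeat=3):
    print(x, eval_anf(f, x))
```

Addition over GF(2) maps to `^` and multiplication to `&`, which is all the arithmetic the benchmark's circuits require.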
On its own, the sequence of prior steps provides zero predictive power about the next step; the fresh evidence must also be taken into account.
Labels are balanced (≈50% 0/1) so statistical frequency cannot reveal the answer. A model cannot trivially guess without real reasoning.
The fresh evidence without the prefix provides negligible signal. The Bayes advantage decays exponentially with the number of active prefix bits.
Solvers are distinguished by their information access. The benchmark is designed so that min_g γ_A ≥ Q while γ_B, γ_C, γ_D ≈ 1/C^(d−1) (near random for large depth).
Solver A: full access to both Prefix P and Evidence S. The ideal agent that integrates all available information.
Solver B: Evidence S but not Prefix P. Cannot use the accumulated reasoning history.
Solver C: Prefix P but not Evidence S. Cannot use the new step-specific data.
Solver D: partial access to both P and S. Degrades with depth; intermediate between B and A.
Only history + data together maintain reliable next-step prediction.
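A small self-contained demonstration of this point, using a toy circuit of our own choosing rather than the paper's adversarial instances: the raw labels are perfectly balanced, so evidence alone looks like coin flips, but XOR-cancelling the known prefix exposes the next monomial exactly.

```python
from itertools import product

def eval_anf(circuit, x):
    """XOR of AND-monomials; each monomial lists 0-indexed variable indices."""
    out = 0
    for mono in circuit:
        term = 1
        for i in mono:
            term &= x[i]
        out ^= term
    return out

prefix, nxt = [(0,), (1, 2)], (3,)   # toy circuit: x1 XOR (x2 AND x3) XOR x4
xs = list(product([0, 1], repeat=4))
labels = [eval_anf(prefix + [nxt], x) for x in xs]

# evidence alone (solver B): labels are balanced, indistinguishable from noise
print(sum(labels) / len(labels))           # 0.5

# prefix + evidence (solver A): cancelling the prefix exposes the next monomial
residuals = [y ^ eval_anf(prefix, x) for x, y in zip(xs, labels)]
print(residuals == [x[3] for x in xs])     # True
```

The same cancellation trick fails for solvers without the full prefix, since an unknown XOR offset scrambles the residuals.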
We evaluated four Bayesian estimator classes across depths g ∈ {1, 3, 7, 15, 31, 63, 127} on 2,000 generated circuits (adversarial sampling, p=12, d=4). Results are unambiguous:
As both reasoning depth g and problem size p increase, partial-information estimators converge to zero — confirming the information-theoretic design of the benchmark.
Small LLMs degrade just like partial-information estimators.
We evaluated four models from the Qwen3-2507 family (4B-Instruct, 4B-Thinking, 30B-A3B-Instruct, and 30B-A3B-Thinking) on 3,000 instances across depths g ∈ {1, 3, 7, 15, 31} with adversarial sampling (p=12, d=4).
All models exhibit a systematic γ_g decline with depth — even though a polynomial-time decoder provably exists at every step (Theorem B.1). “Thinking” variants perform better at shallow depths but still collapse sharply around g = 15.
The paper proves in Appendix B (Theorem B.1) that a polynomial-time decoder always exists at every step — meaning, in principle, an algorithm can always find the correct next monomial efficiently. Yet even 30B-parameter models fail as depth increases. This disconnect reveals something important: having the right algorithm isn’t enough. The LLM must implement that algorithm through its attention weights and forward pass, which apparently breaks down as the number of active prefix bits grows. It’s like knowing that a proof exists for a theorem but being unable to find it within a reasonable time. The information is there; the access mechanism fails.
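The paper's Appendix B decoder is not reproduced here, but one natural polynomial-time approach can be sketched under the same residual idea: cancel the known prefix, then enumerate the O(p^d) candidate monomials of degree at most d (polynomial for fixed d) and keep the one consistent with every example. The function names and encoding below are assumptions of this illustration.

```python
from itertools import combinations, product

def eval_monomial(mono, x):
    term = 1
    for i in mono:
        term &= x[i]
    return term

def eval_circuit(circuit, x):
    out = 0
    for mono in circuit:
        out ^= eval_monomial(mono, x)
    return out

def decode_next_monomial(prefix, examples, p, d):
    """Cancel the prefix, then scan the O(p^d) candidate monomials of
    degree <= d for one matching every residual."""
    residuals = [y ^ eval_circuit(prefix, x) for x, y in examples]
    for deg in range(1, d + 1):
        for cand in combinations(range(p), deg):
            if all(eval_monomial(cand, x) == r
                   for (x, _), r in zip(examples, residuals)):
                return cand
    return None

# toy check: plant t3 = x4 AND x5 (0-indexed (3, 4)) and recover it
p, d = 6, 2
prefix, target = [(0,), (1, 2)], (3, 4)
examples = [(x, eval_circuit(prefix + [target], x))
            for x in product([0, 1], repeat=p)]
print(decode_next_monomial(prefix, examples, p, d))
```

The point of the sketch is the disconnect the text describes: a few dozen lines of explicit code solve every step, yet the same computation apparently cannot be carried reliably inside a transformer's forward pass as depth grows.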
Effective prefix analysis: Fitting each model’s accuracy curve to an effective-prefix model reveals that Qwen3-30B-A3B-Thinking uses ≈47% of the revealed prefix (proportional scaling, ΔAIC=228), while 30B-A3B-Instruct uses only ≈15%. As depth grows, the oracle mask expands and limited prefix utilization pushes models toward the partial-information regime.
Table 1: Likelihood-based fit of LLM accuracy to an effective-prefix model
| Model | u (prop. scale) | v (capacity) | ΔAIC | Better fit |
|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | 0.15 | 0.00 | 2.21 | u (marginal) |
| Qwen3-30B-A3B-Thinking-2507 | 0.47 | 0.00 | 228.08 | u (strong) |
| Qwen3-4B-Instruct-2507 | 0.08 | 0.00 | 2.32 | u (marginal) |
| Qwen3-4B-Thinking-2507 | 0.05 | 0.00 | 0.00 | — |
u = effective fraction of prefix used (higher = better prefix integration). ΔAIC > 2 favors proportional scaling over constant capacity. Only Thinking-30B substantially uses its context window (u=0.47).
The authors fit two models to each LLM’s accuracy curve: (1) proportional scaling, k = u·g, in which the model uses a fraction u of the revealed prefix, and (2) constant capacity, k = v, in which the model uses at most v terms regardless of how many are revealed. ΔAIC (the Akaike Information Criterion difference) measures which model fits better; ΔAIC > 2 means the proportional model is meaningfully better. For Qwen3-30B-Thinking (u = 0.47, ΔAIC = 228), this strongly indicates that it scales its prefix usage with depth — but using only ≈47% of the revealed prefix, not the full context. In contrast, 30B-Instruct (u = 0.15, ΔAIC = 2.21) barely scales at all.
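The ΔAIC arithmetic itself is simple: AIC = 2k − 2·ln L̂, and with one free parameter in each model the comparison reduces to twice the log-likelihood gap. The log-likelihood values below are hypothetical, chosen only to reproduce a gap of 228; they are not taken from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

# hypothetical fitted log-likelihoods for the two one-parameter models
ll_proportional = -120.0   # k = u*g  (prefix use scales with depth)
ll_capacity = -234.0       # k = v    (fixed prefix capacity)

delta_aic = aic(ll_capacity, 1) - aic(ll_proportional, 1)
print(delta_aic)  # 228.0: strong evidence for proportional scaling
```

A common rule of thumb treats ΔAIC > 2 as meaningful and ΔAIC > 10 as strong, which is why 2.21 is labeled “marginal” and 228 “strong” in Table 1.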
Frontier LLMs are dramatically better — and tools are the key.
We evaluated ChatGPT (extended Thinking), Claude Opus 4.5 (max Thinking), and Gemini 3 Pro (Jan 2026) on 60 queries per model across g ∈ {31, 63, 127} with p=12, d=4. Half the prompts disallowed tool use (N.T.); the other half allowed it (T.).
Under the hardest conditions, where all small LLMs perform at chance, frontier models remain robust — but only when tool calls are allowed; without tools, their accuracy degrades as well.
Why tools help: Tool use externalizes computation. Instead of simultaneously discovering constraints and executing the implied computation in its internal weights, the model only specifies the constraints and delegates execution to an external program. This separation dramatically reduces the burden on the transformer’s weights, enabling robust generalization and stabilizing γ over long horizons.
Imagine asking an LLM to compute: “Given prefix [x₁, x₁ AND x₃, x₂ AND x₄], and 32 labeled examples, find the next monomial.” Without tools, the model must simultaneously (1) parse the prefix, (2) apply the XOR cancellation mask in its own attention heads, (3) search through payload variables, and (4) verify the candidate — all in one forward pass. With tools, the model can write code: compute_residuals(prefix, examples) and receive the cancellation result, then search for payload variables in a much simpler reasoning step. The computational load on the transformer’s weights drops dramatically, which is why tool-using models maintain high γ even at g=127.
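A minimal sketch of the delegation pattern just described, with compute_residuals playing the external tool. The function name comes from the text; its signature and data encoding are assumptions of this illustration.

```python
def compute_residuals(prefix, examples):
    """Hypothetical external tool: cancel the known prefix over GF(2) so the
    model only reasons about what remains (y XOR f_prefix(x))."""
    out = []
    for x, y in examples:
        acc = 0
        for mono in prefix:
            term = 1
            for i in mono:
                term &= x[i]
            acc ^= term
        out.append(y ^ acc)
    return out

# the model merely specifies the constraints; the tool does the arithmetic
prefix = [(0,), (0, 2)]   # x1, x1 AND x3 (0-indexed)
examples = [((1, 1, 0, 1), 0), ((1, 0, 1, 1), 1), ((0, 1, 1, 0), 0)]
print(compute_residuals(prefix, examples))  # [1, 1, 0]
```

After the tool call, what is left for the model is a much simpler step: find variables whose AND reproduces the residuals on every example, rather than carrying the full XOR cancellation inside one forward pass.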
This work provides a rigorous empirical test of the Diligent Learner hypothesis by introducing a GF(2) circuit reconstruction benchmark that is adversarial to common shortcut strategies. The task forces a model to maintain state and repeatedly fuse accumulated historical context with newly observed evidence at every step, rather than relying on shallow pattern matching.
Smaller language models exhibit a superlinear decline in γ as problem depth increases, effectively acting as partial-information estimators. They cannot preserve the prefix-conditioned cancellation required for continued progress — a fundamental limitation of their architecture.
Frontier models that leverage tool calls maintain high γ over long sequences by delegating state tracking and verification to external mechanisms. This suggests progress toward “superintelligence” depends less on scaling test-time compute and more on architectures that can build and use tools.