---
arxiv_id: 2602.21061
title: 'Tool Building as a Path to "Superintelligence"'
authors:
  - David Koplow
  - Tomer Galanti
  - Tomaso Poggio
difficulty: Advanced
tags:
  - LLM
  - Reasoning
  - Superintelligence
published_at: 2026-02-25
flecto_url: https://flecto.zer0ai.dev/papers/2602.21061/
lang: en
---

arXiv:2602.21061 · cs.AI · February 2026

David Koplow · Tomer Galanti · Tomaso Poggio

February 25, 2026

Can LLMs achieve superintelligence through reasoning alone — or do they need tools?

## Overview

The Diligent Learner framework proposes that LLMs can achieve superintelligence via test-time search, provided the stepwise success probability γ stays bounded away from zero. This paper introduces a benchmark that directly measures γ on out-of-distribution logical inference.

The benchmark is built around a class of GF(2) circuit reconstruction tasks that grow harder with each reasoning step and are information-theoretically impossible to shortcut: a model must simultaneously integrate its accumulated history and newly observed evidence at every step. Strategies that rely solely on pattern-matching or memorized history will fail by construction.

Analysis shows that while γ declines superlinearly with depth for small LLMs, frontier models exhibit partial robustness — but only when using tool calls. Successful reasoning at scale is contingent upon precise tool use, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.

## What This Paper Delivers

A GF(2) Boolean circuit reconstruction problem for testing γ in the Diligent Learner framework, where the correct next step is unique and shortcuts are information-theoretically blocked.

Evaluation across Bayesian optimal estimators, small LLMs (Qwen3-2507 family), and state-of-the-art frontier models (ChatGPT, Claude Opus, Gemini 3 Pro).

Identification of trends in γ_g collapse and an explanation for why certain models catastrophically fail as complexity increases, while tool-using models perform far better.

All code released on GitHub: github.com/Poggio-Lab/Tool-Building-as-a-Path-to-Superintelligence

## The Diligent Learner Framework

The Diligent Learner formalizes reasoning as a depth-first search through a tree of candidate steps, guided by a validator that checks whether each step is logically consistent. A root-to-leaf path corresponds to a chain-of-thought; leaves are either DONE (a completed solution) or BACKTRACK (an incorrect path to abandon).

The critical parameter is γ (gamma): the probability that at each depth g, the model proposes a “good” next step that keeps the solution on track. Formally, γ_g = Pr[the step proposed at depth g is good | the first g−1 steps are good], and γ = min_g γ_g.

If γ stays non-vanishing with depth, search succeeds with only polynomial overhead: with a reliable validator, each level needs on the order of 1/γ proposals in expectation, so a depth-d solution costs roughly d/γ node expansions.

But if γ collapses as depth grows, the search budget blows up exponentially and the framework’s guarantees evaporate. The central question this paper addresses: does γ always stay positive, or can it catastrophically degrade?
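The search dynamics above can be sketched with a toy simulation. This is a minimal sketch under simplifying assumptions (a perfect validator, a fixed per-candidate success probability γ, and a fixed branching factor), not the paper's exact algorithm:

```python
import random

def diligent_search(depth, gamma, branching=4, budget=10_000, rng=None):
    """Toy depth-first search for the Diligent Learner setting.

    At each level the proposer emits `branching` candidate steps; each
    candidate is independently 'good' with probability `gamma`, and a
    perfect validator accepts good steps and rejects bad ones.
    Returns (reached_DONE, nodes_expanded).
    """
    rng = rng or random.Random(0)
    expanded = 0

    def descend(level):
        nonlocal expanded
        if level == depth:
            return True                      # reached a DONE leaf
        for _ in range(branching):
            expanded += 1
            if expanded > budget:
                return False                 # search budget exhausted
            good = rng.random() < gamma      # proposer quality at this level
            if good and descend(level + 1):  # validator accepts; recurse
                return True
            # validator rejects, or the subtree failed: BACKTRACK
        return False

    return descend(0), expanded
```

With γ well above zero the search expands only a handful of nodes per level, so total cost stays roughly linear in depth; as γ falls toward zero, backtracking dominates and any polynomial budget is quickly exhausted.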

## The Benchmark — GF(2) Circuit Reconstruction

### The Task: Reconstructing a Boolean Circuit Step by Step

The benchmark tests whether an LLM can reconstruct a Boolean function over GF(2) — a mathematical field where addition is XOR. The target circuit is expressed as a sum of monomials: f(a,v) = t₁ ⊕ t₂ ⊕ … ⊕ tₙ. At each step g, the model receives two inputs:

- Prefix (P_g): the g monomials discovered so far — the accumulated “history” of the circuit reconstruction.

- Evidence (S_g): 32 fresh labeled examples generated by a step-specific adversarial oracle.

The model must output the next monomial t_{g+1}. To succeed, it must fuse both inputs: history alone or data alone is provably insufficient, information-theoretically. The oracle statistically masks the answer unless the solver has the full prefix, making shortcut strategies fail by design.

#### Understanding ANF (Algebraic Normal Form)

ANF is a way to write any Boolean function as a sum (XOR) of products (AND) of variables. For example: f(x₁,x₂,x₃) = x₁ ⊕ (x₂ AND x₃). Each product term is called a “monomial.” The benchmark task is essentially: given the first g monomials of a circuit, plus some labeled examples, predict the (g+1)-th monomial. This is like completing a polynomial — but over binary XOR arithmetic.
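To make the notation concrete, here is a minimal ANF evaluator. The monomial encoding (tuples of 0-based variable indices) is an illustrative choice, not the paper's representation:

```python
def eval_anf(monomials, x):
    """Evaluate an ANF circuit (an XOR of AND-monomials) at a 0/1 point.

    `monomials` is a list of tuples of variable indices; the empty
    tuple () would encode the constant-1 term.
    """
    total = 0
    for mono in monomials:
        term = 1
        for i in mono:
            term &= x[i]   # AND together the monomial's variables
        total ^= term      # XOR-sum over GF(2)
    return total

# f(x1, x2, x3) = x1 XOR (x2 AND x3), with 0-based variable indices
f = [(0,), (1, 2)]
```

For instance `eval_anf(f, [1, 1, 1])` returns 0, because the two terms cancel under XOR; completing a circuit means predicting which tuple comes next in the list.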

### Three Theoretical Guarantees

- History alone: knowing the sequence of prior steps provides zero predictive power about the next step when not accounting for the fresh evidence.

- Frequency alone: labels are balanced (≈50% 0/1), so statistical frequency cannot reveal the answer. A model cannot trivially guess without real reasoning.

- Evidence alone: the fresh evidence without the prefix provides negligible signal. The Bayes advantage decays exponentially with the number of active prefix bits.

### Four Estimator Classes

Solvers are distinguished by their information access. The benchmark is designed so that min_g γ_A ≥ Q, while γ_B, γ_C, γ_D ≈ 1/C^(d−1) (near random for large depth).

- Estimator A (Diligent): full access to both Prefix P and Evidence S. The ideal agent that integrates all available information.

- Estimator B (Data-only): has Evidence S but not Prefix P. Cannot use the accumulated reasoning history.

- Estimator C (History-only): has Prefix P but not Evidence S. Cannot use the new step-specific data.

- Estimator D (Partial): partial access to both P and S. Degrades with depth, intermediate between B and A.

## Bayesian Estimator Simulations

Only history + data together maintain reliable next-step prediction.

We evaluated four Bayesian estimator classes across depths g ∈ {1, 3, 7, 15, 31, 63, 127} on 2,000 generated circuits (adversarial sampling, p=12, d=4). Results are unambiguous:

- Estimator A (Diligent): Maintains γ ≈ 1.0 across all depths — near-perfect performance.

- Estimator B (Data-only): Rapidly collapses toward random-guess baseline as depth grows.

- Estimator C (History-only): Performs at random-chance from the start — no useful signal.

- Estimator D (Partial): Degrades with depth, slightly better than B but still collapses.

As both reasoning depth g and problem size p increase, partial-information estimators converge to zero — confirming the information-theoretic design of the benchmark.
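For scale, the random-guess floor these estimators collapse toward can be computed from the size of the candidate space. The sketch below assumes the candidates are all monomials of degree at most d over p variables; the paper's exact candidate set may differ:

```python
from math import comb

def candidate_count(p, d):
    """Number of monomials of degree <= d over p variables, including
    the constant term.  Assumes the candidate set is all such monomials;
    the benchmark's exact candidate space may be restricted further."""
    return sum(comb(p, k) for k in range(d + 1))

C = candidate_count(12, 4)   # p=12, d=4, as in the experiments
chance = 1 / C               # uniform random-guess success probability
```

At p=12, d=4 this gives 794 candidates, so a guesser without usable information succeeds on roughly 0.1% of steps — the regime the partial-information estimators converge to.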

## Small LLMs Show Depth-Induced Collapse

Small LLMs degrade just like partial-information estimators.

We evaluated four models from the Qwen3-2507 family on 3,000 instances across depths g ∈ {1, 3, 7, 15, 31} with adversarial sampling (p=12, d=4): 4B-Instruct, 4B-Thinking, 30B-A3B-Thinking, and 30B-A3B-Instruct.

All models exhibit systematic γ_g decline with depth — even though a polynomial-time decoder provably exists at every step (Theorem B.1). “Thinking” variants perform better at shallow depths but still collapse sharply around g=15.

#### Why is this result counter-intuitive?

The paper proves in Appendix B (Theorem B.1) that a polynomial-time decoder always exists at every step — meaning, in principle, an algorithm can always find the correct next monomial efficiently. Yet even 30B-parameter models fail as depth increases. This disconnect reveals something important: having the right algorithm isn’t enough. The LLM must implement that algorithm through its attention weights and forward pass, which apparently breaks down as the number of active prefix bits grows. It’s like knowing that a proof exists for a theorem but being unable to find it within a reasonable time. The information is there; the access mechanism fails.

Effective prefix analysis: Fitting each model’s accuracy curve to an effective-prefix model reveals that Qwen3-30B-A3B-Thinking uses ≈47% of the revealed prefix (proportional scaling, ΔAIC=228), while 30B-A3B-Instruct uses only ≈15%. As depth grows, the oracle mask expands and limited prefix utilization pushes models toward the partial-information regime.

Table 1: Likelihood-based fit of LLM accuracy to an effective-prefix model

u = effective fraction of prefix used (higher = better prefix integration). ΔAIC > 2 favors proportional scaling over constant capacity. Only Thinking-30B substantially uses its context window (u=0.47).

#### Understanding the effective-prefix model and ΔAIC

The authors fit two models to each LLM’s accuracy curve: (1) proportional scaling, k = ug, where the model uses a fraction u of the revealed prefix, and (2) constant capacity, k = v, where the model uses at most v terms regardless of how many are revealed. ΔAIC (Akaike Information Criterion difference) measures which model fits better; ΔAIC > 2 means the proportional model is meaningfully better. For Qwen3-30B-Thinking (u=0.47, ΔAIC=228), this strongly indicates that prefix usage scales with depth — but only at ≈47% of the full context. In contrast, 30B-Instruct (u=0.15, ΔAIC=2.21) barely scales at all.

## Frontier LLMs — Tools Stabilize γ

Frontier LLMs are dramatically better — and tools are the key.

We evaluated ChatGPT (extended Thinking), Claude Opus 4.5 (max Thinking), and Gemini 3 Pro (Jan 2026) on 60 queries per model across g ∈ {31, 63, 127} with p=12, d=4. Half the prompts disallowed tool use (N.T.); the other half allowed it (T.).

Under the hardest conditions where all small LLMs fail at random:

- Frontier models with tools maintain γ ≈ 1.0 even at depth 127.

- Without tools, γ drops substantially as problem size grows.

Why tools help: Tool use externalizes computation. Instead of simultaneously discovering constraints and executing the implied computation in its internal weights, the model only specifies the constraints and delegates execution to an external program. This separation dramatically reduces the burden on the transformer’s weights, enabling robust generalization and stabilizing γ over long horizons.

#### A concrete example of what “externalizing computation” means

Imagine asking an LLM to compute: “Given prefix [x₁, x₁ AND x₃, x₂ AND x₄], and 32 labeled examples, find the next monomial.” Without tools, the model must simultaneously (1) parse the prefix, (2) apply the XOR cancellation mask in its own attention heads, (3) search through payload variables, and (4) verify the candidate — all in one forward pass. With tools, the model can write code: compute_residuals(prefix, examples) and receive the cancellation result, then search for payload variables in a much simpler reasoning step. The computational load on the transformer’s weights drops dramatically, which is why tool-using models maintain high γ even at g=127.
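A sketch of what such a helper could compute, using an illustrative encoding of monomials as tuples of 0-based variable indices (the name `compute_residuals` comes from the hypothetical example above; this is not the paper's code):

```python
def eval_monomial(mono, x):
    """AND of the variables (0-based indices) named in `mono`."""
    out = 1
    for i in mono:
        out &= x[i]
    return out

def compute_residuals(prefix, examples):
    """XOR the prefix's prediction out of each label.

    `prefix` is a list of monomials; `examples` is a list of (x, y)
    pairs with x a 0/1 list and y in {0, 1}.  If the true circuit is
    prefix XOR t_{g+1}, each residual equals t_{g+1}(x), so the next
    monomial can be read off the residuals.
    """
    residuals = []
    for x, y in examples:
        pred = 0
        for mono in prefix:
            pred ^= eval_monomial(mono, x)   # prefix's contribution
        residuals.append((x, y ^ pred))       # XOR it out of the label
    return residuals

def find_next_monomial(prefix, examples, candidates):
    """Return the first candidate consistent with every residual."""
    residuals = compute_residuals(prefix, examples)
    for cand in candidates:
        if all(eval_monomial(cand, x) == r for x, r in residuals):
            return cand
    return None
```

Because XOR is its own inverse over GF(2), removing the prefix's contribution is just another XOR; once the residuals are in hand, the remaining search is a simple per-candidate consistency check — exactly the kind of step an LLM can delegate rather than execute in its weights.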

## Conclusions and Implications

This work provides a rigorous empirical test of the Diligent Learner hypothesis by introducing a GF(2) circuit reconstruction benchmark that is adversarial to common shortcut strategies. The task forces a model to maintain state and repeatedly fuse accumulated historical context with newly observed evidence at every step, rather than relying on shallow pattern matching.

### Small LLMs: Structural Failure

Smaller language models exhibit a superlinear decline in γ as problem depth increases, effectively acting as partial-information estimators. They cannot preserve the prefix-conditioned cancellation required for continued progress — a fundamental limitation of their architecture.

### Frontier LLMs with Tools: A New Capability

Frontier models that leverage tool calls maintain high γ over long sequences by delegating state tracking and verification to external mechanisms. This suggests progress toward “superintelligence” depends less on scaling test-time compute and more on architectures that can build and use tools.

## References

- Karl Cobbe et al. Training verifiers to solve math word problems, 2021. arXiv:2110.14168

- Simon Frieder et al. Data for mathematical copilots: Better ways of presenting proofs for machine learning, 2025. arXiv:2412.15184

- Yao Fu et al. Specializing smaller language models towards multi-step reasoning. ICML 2023.

- John Garrett et al. garrettj403/scienceplots: 2.1.1, 2023.

- Shibo Hao et al. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. NeurIPS 2023.

- Dan Hendrycks et al. Measuring massive multitask language understanding. ICLR 2021.

- Carlos Jimenez et al. SWE-bench: Can language models resolve real-world GitHub issues? ICLR 2024.

- Nirmit Joshi et al. A theory of learning with autoregressive chain of thought, 2025. arXiv:2503.07932

- Takeshi Kojima et al. Large language models are zero-shot reasoners. NeurIPS 2022.

- Woosuk Kwon et al. Efficient memory management for large language model serving with pagedattention. SOSP 2023.

- Xiao Liu et al. AgentBench: Evaluating LLMs as agents, 2025. arXiv:2308.03688

- Eran Malach. Auto-regressive next-token predictors are universal learners. ICML 2024.

- Maxwell Nye et al. Show your work: Scratchpads for intermediate computation with language models, 2021.

- Long Ouyang et al. Training language models to follow instructions with human feedback. NeurIPS 2022.

- Aaron Parisi et al. TALM: Tool augmented language models, 2022. arXiv:2205.12255

- Yujia Qin et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. ICLR 2024.

- Timo Schick et al. Toolformer: Language models can teach themselves to use tools. NeurIPS 2023.

- Shai Shalev-Shwartz and Amnon Shashua. From reasoning to super-intelligence: A search-theoretic perspective, 2025. arXiv:2507.15865

- Zhihong Shao et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.

- Zhengliang Shi et al. Tool learning in the wild: Empowering language models as automatic tool agents. WWW 2025.

- Noah Shinn et al. Reflexion: Language agents with verbal reinforcement learning, 2023. arXiv:2303.11366

- Mohit Shridhar et al. ALFWorld: Aligning text and embodied environments for interactive learning. ICLR 2021.

- Shivam Singhal et al. LLM-ERM: A probabilistic framework for in-context learning and text generation, 2025.

- Mirac Suzgun et al. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022.

- Qwen Team. Qwen3 technical report, 2025. arXiv:2505.09388

- Xuezhi Wang et al. Self-consistency improves chain of thought reasoning in language models, 2022.

- Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models, 2022.

- Chenxiao Yang et al. Chain-of-thought provably enables learning the (otherwise) unlearnable. ICLR 2025a.

- John Yang et al. SWE-bench multimodal: Do AI systems generalize to visual software domains? ICLR 2025b.

- Shunyu Yao et al. Tree of thoughts: Deliberate problem solving with large language models, 2023.

- Shuyan Zhou et al. WebArena: A realistic web environment for building autonomous agents, 2024.
