arXiv 2026 cs.AR Industrial AI

InCoder-32B-Thinking

Industrial Code World Model for Thinking

Jian Yang, Wei Zhang, Jiajun Wu et al. β€” Beihang University, IQuest Research, Shanghai Jiao Tong University

Industrial software development — spanning chip design, GPU kernel optimization, and embedded systems — has long lacked the kind of expert reasoning traces that show how engineers think through hardware constraints and timing semantics. Unlike general coding, where test cases and linters give quick feedback, industrial code requires understanding physical execution environments: does this Triton kernel exceed the GPU's shared memory limit? Will this Verilog module synthesize correctly under timing constraints?

InCoder-32B-Thinking addresses this gap through two synergistic components: Error-driven Chain-of-Thought (ECoT) synthesis that generates reasoning traces by learning from execution errors, and an Industrial Code World Model (ICWM) that predicts hardware execution outcomes without invoking real backends — enabling scalable generation of high-quality training data. The result is a 32B parameter model that achieves top-tier open-source performance across both general and industrial code benchmarks.

81.3% LiveCodeBench V5
84.0% CAD-Coder Compile Pass
38.0% KernelBench L2
96.7% ICWM Prediction Accuracy

1. Introduction

The Industrial Code Gap

Large language models have made remarkable progress in general software engineering — writing functions, debugging scripts, and passing competitive programming benchmarks. Yet industrial code domains tell a very different story. Leading models achieve limited success on Triton kernel generation and Verilog equivalence checking, despite strong general performance.

The root cause is a data gap: industrial domains lack the expert reasoning traces that show how engineers reason through hardware constraints, timing semantics, and domain-specific execution feedback. A Verilog fix isn't just about syntax — it requires understanding how the RTL maps to gate-level logic and timing paths. A Triton kernel requires reasoning about GPU memory hierarchy, warp scheduling, and numerical precision.

Our Approach

InCoder-32B-Thinking combines thinking model capabilities (deliberate, error-correcting multi-turn reasoning) with industrial code world model knowledge (causal dynamics of how code affects hardware behavior). The model is trained on data generated by the ECoT synthesis framework and validated by the ICWM, creating a virtuous cycle of execution-grounded reasoning.

Key Contributions

⚙️

ECoT Synthesis

Error-driven Chain-of-Thought generates reasoning traces through multi-turn dialogue with real execution environments. Errors from GPU compilers, RTL simulators, and CAD engines become the training signal.

🔬

Industrial Code World Model

ICWM learns the causal dynamics of code→hardware behavior from execution traces. It replaces expensive real backends during large-scale data generation, achieving 96.7% outcome prediction accuracy.

πŸ†

Top-tier Performance

Achieves leading open-source results on 14 general benchmarks (81.3% LiveCodeBench V5) and 9 industrial benchmarks (84.0% CAD-Coder, 38.0% KernelBench L2), outperforming Claude-Sonnet-4.6 and Kimi-K2.5 on industrial tasks.

Benchmark comparison bar chart
Figure 2: Performance of InCoder-32B-Thinking compared to Claude-Sonnet-4.6, Kimi-K2.5, and Qwen3.5-397B-A17B across 10 benchmarks spanning general and industrial code domains.

2. ECoT Synthesis & Industrial Code World Model

The core methodology operates in two phases: grounded collection — where a frontier LLM generates reasoning traces validated by real execution backends — and ICWM-driven amplification — where a trained world model replaces expensive toolchain invocations to scale up data generation efficiently.

What is ECoT?

Traditional fine-tuning teaches a model the correct answer. ECoT teaches the reasoning process — making mistakes, reading error feedback, and correcting course. This mirrors how expert engineers work: write code, run it, read the error, understand why, fix it, repeat.

The key insight: error messages from compilers and simulators are rich training signals. A CUDA kernel memory fault tells you exactly which memory access pattern was wrong. ECoT harvests these domain-specific error signals as supervision data.

ECoT data synthesis pipeline
Figure 4: The ECoT data synthesis pipeline. Domain tasks are seeded with execution environments, reasoning traces are elicited via a frontier LLM, real backends (GPU compilers, RTL simulators, CAD engines) validate each code revision, and the resulting multi-turn trajectories train the ICWM.

Error-driven Chain-of-Thought (ECoT) Pipeline

01

Task Seeding & Environment Bundling

Domain tasks from chip design, GPU optimization, 3D modeling, and embedded systems are paired with their required execution environments: Verilog modules bundled with testbenches and synthesis constraints, CUDA kernels with memory profiles, CadQuery scripts with geometry validation tests.

02

Execution-Grounded Trajectory Synthesis

A frontier LLM generates a reasoning trace and candidate code. The code is sent to domain-specific real backends: Triton/CUDA for GPU kernels, Renode for microcontroller firmware, CadQuery for 3D geometry, Yosys/Icarus for RTL. Each backend returns an outcome label (PASS, COMPILATION_ERROR, MEMORY_FAULT) plus diagnostic logs. Errors become the next-turn observation, driving iterative refinement.

03

Multi-turn Trajectory Formation

Both successful and failed intermediate turns are retained, so training data captures common failure modes alongside the reasoning steps that resolve them. The trajectory T = [(S_init, r⁽⁰⁾, c⁽⁰⁾) → (r⁽¹⁾, c⁽¹⁾) → ... → (r⁽ᵏ⁾, c⁽ᵏ⁾)] forms the core of the ECoT dataset.

Understanding the Trajectory Formula

The trajectory T = [(S_init, r⁽⁰⁾, c⁽⁰⁾) → (r⁽¹⁾, c⁽¹⁾) → ... → (r⁽ᵏ⁾, c⁽ᵏ⁾)] represents a conversation-like sequence where: S_init is the task + environment, r⁽ⁿ⁾ is the reasoning trace at turn n, and c⁽ⁿ⁾ is the candidate code. Each arrow represents one round of execution feedback: the code runs, something fails, and the failure becomes the next input.

Crucially, both failed and successful intermediate turns are kept as training data. This teaches the model that errors are part of the process — not noise to be minimized.
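The loop that produces such a trajectory can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the helper names `propose` and `execute` are hypothetical stand-ins for the frontier LLM and a real execution backend.

```python
# Minimal sketch of the ECoT synthesis loop. `propose` stands in for the
# frontier LLM, `execute` for a real backend (compiler/simulator); both
# names and the dict shapes are illustrative assumptions.
def synthesize_trajectory(task, env, propose, execute, max_turns=4):
    """Collect a multi-turn trajectory T = [(S_init, r0, c0), (r1, c1), ...].

    Failed turns are kept in the trajectory: each error observation
    becomes the input for the next reasoning turn.
    """
    trajectory = []
    observation = {"task": task, "env": env}      # S_init
    for _ in range(max_turns):
        reasoning, code = propose(observation)    # r(n), c(n)
        outcome = execute(env, code)              # e.g. {"label": "PASS", "log": "..."}
        trajectory.append((observation, reasoning, code, outcome))
        if outcome["label"] == "PASS":
            break
        observation = {"task": task, "env": env, "error": outcome["log"]}
    return trajectory
```

Note that the failed first attempt is retained alongside the fix, mirroring the paper's choice to keep unsuccessful intermediate turns as training signal.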

Industrial Code World Model (ICWM)

Real execution backends provide reliable supervision but are expensive — each interaction requires invoking domain-specific toolchains. To scale up data generation, we train an Industrial Code World Model (ICWM): a language model that predicts what a real backend would return, given the execution environment and candidate code:

ICWM : (S_env, c⁽ᵏ⁾) → ô⁽ᵏ⁾

Where the predicted output ô⁽ᵏ⁾ includes an outcome label (PASS, COMPILATION_ERROR, etc.), a diagnostic message, and numerical outputs or diff summaries. The ICWM is trained on all collected execution trajectories. Once trained, it replaces real backends in the feedback loop — each prediction is a single forward pass rather than a real compilation or simulation, enabling 100× faster trajectory generation.

⚡ 100× faster than real backend invocation — a single forward pass per prediction step
✅ All ICWM trajectories verified against real execution — final corpus D = D_real ∪ D_icwm
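The shape of this interface can be sketched as follows. This is an assumption-laden illustration: the `PredictedOutcome` fields mirror the three output components described above, but the prompt format, the "|"-delimited parsing, and all names are ours, not the paper's.

```python
# Hypothetical shape of the ICWM interface: given the environment and a
# candidate code revision, predict what a real backend would return.
from dataclasses import dataclass

@dataclass
class PredictedOutcome:
    label: str        # "PASS", "COMPILATION_ERROR", "MEMORY_FAULT", ...
    diagnostic: str   # predicted error/log message
    summary: str      # predicted numerical outputs or diff summary

def icwm_predict(model, s_env: str, code: str) -> PredictedOutcome:
    """One forward pass replaces a real compile/simulate cycle.

    `model` is any callable mapping a prompt string to structured text;
    the "|" separator is an illustrative output convention.
    """
    raw = model(f"[ENV]\n{s_env}\n[CODE]\n{code}")
    label, diagnostic, summary = raw.split("|", 2)
    return PredictedOutcome(label, diagnostic, summary)
```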

Why does ICWM exist?

Running real industrial toolchains is expensive. Invoking a Verilog synthesis tool (Yosys) takes minutes. Running a GPU kernel and profiling memory access requires actual hardware. If you need millions of training trajectories, you cannot invoke real backends for each step.

The ICWM solves this by predicting what the real backend would return — as a language model call, it is just a forward pass. This is analogous to how AlphaZero uses a neural network to evaluate chess positions rather than playing each one out on a real board. ICWM achieves 96.7% accuracy at predicting real execution outcomes.

Pipeline concept illustration
Conceptual illustration: ECoT synthesis pipeline connecting GPU chips, RTL schematics, and embedded systems through a reasoning LLM to generate verified training trajectories.

Thinking in Action: CUDA Kernel Example

What Went Wrong in the CUDA Example?

The task: compute Hinge Loss where predictions p has shape (32768, 32768) — a 2D matrix — and targets t has shape (32768,) — a 1D vector. The base model used a flat 1D index predictions[idx], treating the 2D matrix as if it were 1D. This reads the wrong memory across row boundaries.

The thinking model's fix: use a grid-stride loop with proper 2D decomposition — compute batch_idx = i / input_size to determine which row, then index targets[batch_idx] instead of targets[idx]. This correctly broadcasts the 1D targets across all columns of the 2D predictions matrix.
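The index decomposition at the heart of the fix can be emulated in NumPy: walk the 2D `predictions` as a flat array (as a CUDA thread's grid-stride loop would) and recover the row with integer division, so the 1D `targets` broadcasts correctly. This is a sketch of the indexing logic only, using small shapes in place of 32768 and the common hinge form max(0, 1 − p·t); the actual kernel is CUDA.

```python
import numpy as np

def hinge_loss_flat(predictions, targets):
    """loss[i, j] = max(0, 1 - predictions[i, j] * targets[i])."""
    batch, input_size = predictions.shape
    flat = predictions.ravel()
    out = np.empty_like(flat)
    for i in range(flat.size):           # the kernel's grid-stride loop, serialized
        batch_idx = i // input_size      # which row this flat index belongs to
        out[i] = max(0.0, 1.0 - flat[i] * targets[batch_idx])
    return out.reshape(batch, input_size)
```

The buggy version would have indexed `targets[i]` directly, reading past the end of the 1D vector for every flat index beyond the first row.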

CUDA kernel comparison: InCoder-32B vs InCoder-32B-Thinking
Figure 3: Implementing a CUDA Hinge Loss kernel. Without thinking, InCoder-32B produces incorrect code (shape mismatch: 2D predictions indexed as 1D). With ECoT training, InCoder-32B-Thinking systematically identifies the mismatch, infers broadcasting semantics, maps flat indices to row indices, and generates correct grid-stride loop code.

3. Evaluation

We evaluate InCoder-32B-Thinking on a comprehensive suite covering both general-purpose coding and specialized industrial domains, comparing against contemporary models including Claude-Sonnet-4.6, Kimi-K2.5, Qwen3.5-397B-A17B, and DeepSeek-V3.

General Code Benchmarks

  • LiveCodeBench V5/V6
  • CruxEval (Input/Output CoT)
  • Mercury (code efficiency)
  • Bird / Spider (Text2SQL)
  • Terminal-Bench v1/v2
  • SWE-bench Verified
  • Mind2Web, BFCL V3

Industrial Code Benchmarks

  • RealBench (Verilog module synthesis)
  • ArchXBench (RTL architecture)
  • CAD-Coder (chip CAD)
  • VeriScope / VeriRepair
  • EmbedICGen (embedded code)
  • SuperCoder Ace
  • TritonBench / KernelBench (GPU)

General Code Results

81.3% LiveCodeBench V5
83.5% LiveCodeBench V6
53.3 CruxEval Input-CoT
74.8% SWE-bench Verified

The most striking finding is the leap in code reasoning. InCoder-32B-Thinking scores 81.3 on LiveCodeBench V5 — comparable to proprietary frontier models — and 53.3 on CruxEval Input-CoT, demonstrating that thinking augmentation dramatically improves multi-step reasoning over code execution traces.

Industrial Code Results

78.6% RealBench Syn@1
84.0% CAD-Coder
29.8% TritonBench G-call
38.0% KernelBench L2

On industrial benchmarks, InCoder-32B-Thinking consistently outperforms competing models. It achieves 84.0% on CAD-Coder Compile Pass (vs. Claude-Sonnet-4.6's 22.2%), 82.0% on VeriScope Score, 78.6% on RealBench Module Syn@1 (vs. Kimi-K2.5's 50.1%), and 38.0% on KernelBench L2. These results validate that ECoT + ICWM training transfers effectively across all industrial coding sub-domains.

Chip Design Benchmark Results

| Model | Size | VeriScope Score | VeriRepair Fix% | Sys Syn@1 | Sys Syn@5 | Module Syn@1 | Module Syn@5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-Coder | 3.6/21B | 73.9 | 86.7 | 3.8 | 17.4 | 22.9 | 47.9 |
| Qwen3.5 | 17/397B | 62.5 | 86.7 | 11.2 | 38.1 | 35.2 | 59.5 |
| Kimi-K2.5 | 32B/1T | 82.4 | 76.7 | 6.2 | 26.2 | 50.1 | 70.1 |
| Claude-Sonnet-4.6 | — | 83.2 | 90.0 | 2.5 | 11.2 | 22.2 | 43.4 |
| InCoder-32B-Thinking | 32B | 82.0 | 53.5 | 35.2 | 91.0 | 82.0 | — |

GPU Optimization Benchmark Results

What is TritonBench and KernelBench?

Triton is OpenAI's Python-based GPU programming language for writing high-performance GPU kernels. TritonBench measures whether LLMs can write correct Triton code that: (G-call) generates runnable kernels, and (G-exe) produces kernels that execute correctly on GPU hardware.

KernelBench measures whether models can replace PyTorch operators with custom CUDA/Triton kernels that run at comparable or better speed (L1: 10% speedup, L2: 50% speedup, L3: 100% speedup over the PyTorch baseline). InCoder-32B-Thinking scores 38.0% on KernelBench L2 — meaning its kernels achieve a 1.5× PyTorch speedup 38% of the time.
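Using the level thresholds as stated above, the pass criterion reduces to a one-line speedup check. This is a scoring sketch under those thresholds only: the constant and function names are ours, and the real harness additionally verifies numerical correctness of the kernel output, which is elided here.

```python
# Level thresholds as described in the text: L1 = 10%, L2 = 50%, L3 = 100%
# speedup over the PyTorch baseline (expressed as speedup ratios).
THRESHOLDS = {"L1": 1.10, "L2": 1.50, "L3": 2.00}

def passes_level(baseline_ms: float, kernel_ms: float, level: str) -> bool:
    """A kernel passes a level when baseline_time / kernel_time meets the ratio."""
    speedup = baseline_ms / kernel_ms
    return speedup >= THRESHOLDS[level]
```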

| Model | Size | TritonBench G-call% | TritonBench G-exe% | KernelBench L1 | KernelBench L2 |
| --- | --- | --- | --- | --- | --- |
| Qwen3.5 | 17/397B | 7.6 | 100.0 | 4.0 | 10.0 |
| Kimi-K2.5 | 32B/1T | 17.4 | 100.0 | 9.1 | 16.0 |
| Claude-Sonnet-4.6 | — | 1.6 | 100.0 | 16.2 | 23.0 |
| InCoder-32B-Thinking | 32B | 18.5 | 100.0 | 22.2 | 38.0 |
InCoder-32B-Thinking overview: Reflective Depth Reasoning and Domain Reasoning
Figure 1: Overview of InCoder-32B-Thinking capabilities. Left: Reflective Depth Reasoning — iterative error-correction through multi-turn GPU kernel debugging (FAIL → think → PASS). Right: Domain Reasoning — hardware-aware reasoning chains for industrial code tasks. Center: the model bridges general and industrial code intelligence.

4. Analysis

4.1 ICWM Fidelity Analysis

A core assumption of the pipeline is that ICWM can faithfully replace real execution backends during large-scale trajectory synthesis. We validate this by holding out 2,000 execution turns from each industrial domain and measuring how closely ICWM predictions match real backend results — across both per-turn outcome labels and end-to-end trajectory verdicts.

What Do the ICWM Fidelity Numbers Mean?

Outcome Prediction Accuracy (96.7%): On held-out execution turns, ICWM correctly predicts the per-step outcome label (PASS / COMPILATION_ERROR / MEMORY_FAULT) 96.7% of the time. This measures single-step accuracy.

Trajectory Agreement (94.4%): Whether the entire multi-turn trajectory ends with the same final verdict as real execution. Always lower than per-step accuracy because errors can compound over multiple turns.
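The relationship between the two numbers is easy to see on toy data: per-turn accuracy scores every step, while trajectory agreement scores only each trajectory's final verdict. The following sketch computes both; the data layout (a list of trajectories, each a list of (predicted, real) label pairs) is our own illustrative convention.

```python
def fidelity(trajectories):
    """Return (outcome prediction accuracy, trajectory agreement).

    `trajectories` is a list of trajectories; each trajectory is a list of
    (predicted_label, real_label) pairs, one per execution turn.
    """
    turns = [(p, r) for traj in trajectories for p, r in traj]
    outcome_acc = sum(p == r for p, r in turns) / len(turns)
    # A trajectory "agrees" when its final predicted verdict matches reality.
    traj_agree = sum(t[-1][0] == t[-1][1] for t in trajectories) / len(trajectories)
    return outcome_acc, traj_agree
```

One wrong step late in a trajectory flips the whole-trajectory verdict while costing only a single turn of per-step accuracy, which is why trajectory agreement sits below outcome accuracy in every domain.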

ICWM fidelity across industrial domains
Figure 5: ICWM fidelity across five industrial domains. Purple bars show per-turn outcome prediction accuracy; orange bars show end-to-end trajectory agreement with real execution. All domains exceed 93% on both metrics.
GPU Kernels: 96.8% / 94.3%
Chip Design: 97.4% / 95.8%
3D Modeling: 95.9% / 93.1%
Code Optim.: 97.1% / 95.2%
Embedded Sys.: 96.2% / 93.7%
Mean: 96.7% / 94.4%

Format: Outcome Prediction Accuracy / Trajectory Agreement. Chip design achieves highest fidelity (97.4%/95.8%); 3D modeling has the widest gap due to floating-point tolerance complexity in CadQuery geometry checks.

4.2 Adaptive Thinking Depth

A key property of InCoder-32B-Thinking is that it allocates reasoning compute proportional to task complexity. Analysis of the training corpus reveals a 209× range in thinking token lengths across task categories, driven by the real execution feedback — not by a fixed prompt template.

Thinking depth distribution by task category
Figure 6: Distribution of thinking block lengths (median and interquartile range P25–P75) per task category, sorted by thinking depth. Industrial domains (highlighted) consistently require deeper reasoning than general coding tasks.
19K chars — GPU Optimization: median thinking length, requiring multiple hardware constraint analyses per correction
209× — range of thinking depth across task types, from agentic coding (shortest) to GPU optimization (longest)
91 chars — Agentic Coding: the shortest thinking chains, reflecting the clear state-machine structure of tool-use tasks

Why Does GPU Optimization Need 19K Character Thinking Chains?

GPU kernel optimization requires reasoning about multiple interacting hardware constraints simultaneously: L1 cache limits (32KB per SM), warp scheduling (all 32 threads in a warp execute together), memory access coalescing (128-byte cache lines), shared memory bank conflicts, and register pressure vs. thread-block size tradeoffs. Each correction requires re-evaluating all of these constraints. The 19K character median reflects genuine multi-step hardware reasoning — not verbosity.
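One of these constraints can be checked in isolation with simple arithmetic: would a tiled kernel's shared-memory footprint exceed the per-block limit? The sketch below assumes a tiled-matmul staging pattern, fp32 elements, and a 48 KB per-block limit — common defaults on many NVIDIA GPUs, not figures from the paper.

```python
# Back-of-envelope shared-memory check for a tiled kernel (illustrative;
# the 48 KB limit and 4-byte fp32 element size are assumed defaults).
def shared_mem_ok(tile_m, tile_n, tile_k, bytes_per_elem=4, limit=48 * 1024):
    # A tiled matmul stages an (M x K) tile of A and a (K x N) tile of B
    # in shared memory per thread block.
    footprint = (tile_m * tile_k + tile_k * tile_n) * bytes_per_elem
    return footprint <= limit
```

This is exactly the kind of quick feasibility reasoning a long thinking chain re-runs after every proposed tile-size change.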

By contrast, agentic coding tasks (tool use, file operations) have clear state machines: observe, decide, act. The path to the answer is short and well-defined.

4.3 Effects of Thinking Training Data

To understand how training data scale affects performance, we trained checkpoints at 180M, 360M, and 540M tokens of thinking data. Across 9 industrial benchmarks, performance improves consistently as data scales — with TritonBench GPU execution correctness holding at a perfect 100% across all stages, indicating that some capabilities emerge early and remain stable. The thinking mechanism consistently adds value beyond the base InCoder-32B model.

Performance vs thinking training data scale
Figure 7: Performance across 9 industrial benchmarks as thinking training data scales from 180M to 540M tokens. Most metrics improve monotonically; TritonBench GPU execution correctness plateaus at 100% across all stages.

5. Conclusion

InCoder-32B-Thinking demonstrates that the gap between general code intelligence and industrial software development can be bridged through execution-grounded thinking data. By combining Error-driven Chain-of-Thought synthesis with an Industrial Code World Model, the framework creates training data that captures the real reasoning depth required for industrial code tasks:

  • ECoT synthesis generates high-quality reasoning traces by learning from multi-turn execution errors — no human annotation required.
  • ICWM achieves 96.7% outcome prediction accuracy, enabling scalable trajectory generation without expensive real-backend invocations.
  • Adaptive thinking depth (209× range) reflects real task complexity — GPU optimization demands 19K character reasoning chains vs. 91 chars for agentic coding.
  • Top-tier results on 14 general + 9 industrial benchmarks, consistently outperforming Claude-Sonnet-4.6, Kimi-K2.5, and Qwen3.5-397B-A17B on industrial tasks.
References (Selected)
  1. Agarwal et al. GPT-OSS-120B model card. arXiv, 2026.
  2. Ahmad et al. OpenCoderReasoning. 2025.
  3. Anthropic. Introducing Claude Sonnet 4.6, 2026.
  4. Chen et al. Codex. arXiv, 2021.
  5. DeepSeek-AI. DeepSeek-V3.2. 2026.
  6. Yang et al. InCoder-32B. 2025.
  7. Zheng et al. CodeGeeX. 2023.
  8. ... (full reference list in arXiv paper)
