---
arxiv_id: 2604.03144
title: "InCoder-32B-Thinking: Industrial Code World Model for Thinking"
authors:
  - Jian Yang
  - Wei Zhang
  - Jiajun Wu
  - Junhang Cheng
  - Tuney Zheng
difficulty: 
tags:
  []
published_at: 2026-04-07
flecto_url: https://flecto.zer0ai.dev/papers/2604.03144/
lang: en
---

## Page Title

### InCoder-32B-Thinking: Industrial Code World Model for Thinking

## Meta Description

InCoder-32B-Thinking achieves top-tier performance on 14 general and 9 industrial code benchmarks using Error-driven Chain-of-Thought synthesis and an Industrial Code World Model.

## Hero, Element=H1

### InCoder-32B-Thinking

## Hero, Element=Subtitle

### Industrial Code World Model for Thinking

## Hero, Element=Authors

Jian Yang, Wei Zhang, Jiajun Wu et al. — Beihang University, IQuest Research, Shanghai Jiao Tong University

## Hero, Element=Arxiv Btn

### Read on arXiv ↗

## Abstract, Element=Section Label

### Abstract

## Abstract, Element=P1

Industrial software development — spanning chip design, GPU kernel optimization, and embedded systems — has long lacked the kind of expert reasoning traces that show how engineers think through hardware constraints and timing semantics. Unlike general coding, where test cases and linters give quick feedback, industrial code requires understanding physical execution environments: does this Triton kernel exceed the GPU's shared memory limit? Will this Verilog module synthesize correctly under timing constraints?

## Abstract, Element=P2

InCoder-32B-Thinking addresses this gap through two synergistic components: Error-driven Chain-of-Thought (ECoT) synthesis that generates reasoning traces by learning from execution errors, and an Industrial Code World Model (ICWM) that predicts hardware execution outcomes without invoking real backends — enabling scalable generation of high-quality training data. The result is a 32B parameter model that achieves top-tier open-source performance across both general and industrial code benchmarks.

## Abstract, Metric Label

### LiveCodeBench V5

### CAD-Coder Compile Pass

### KernelBench L2

### ICWM Prediction Accuracy

## Introduction, Element=H2

### 1. Introduction

## Introduction, Element=Problem H3

### The Industrial Code Gap

## Introduction, Element=Problem P1

Large language models have made remarkable progress in general software engineering — writing functions, debugging scripts, and passing competitive programming benchmarks. Yet industrial code domains tell a very different story. Leading models achieve limited success on Triton kernel generation and Verilog equivalence checking, despite strong general performance.

## Introduction, Element=Problem P2

The root cause is a data gap: industrial domains lack the expert reasoning traces that show how engineers reason through hardware constraints, timing semantics, and domain-specific execution feedback. A Verilog fix isn't just about syntax — it requires understanding how the RTL maps to gate-level logic and timing paths. A Triton kernel requires reasoning about GPU memory hierarchy, warp scheduling, and numerical precision.

## Introduction, Element=Solution H3

### Our Approach

## Introduction, Element=Solution P

InCoder-32B-Thinking combines thinking model capabilities (deliberate, error-correcting multi-turn reasoning) with industrial code world model knowledge (causal dynamics of how code affects hardware behavior). The model is trained on data generated by the ECoT synthesis framework and validated by the ICWM, creating a virtuous cycle of execution-grounded reasoning.

## Introduction, Element=Contributions H3

### Key Contributions

## Introduction, Feature Card=Ecot Title

### ECoT Synthesis

## Introduction, Feature Card=Ecot Desc

Error-driven Chain-of-Thought generates reasoning traces through multi-turn dialogue with real execution environments. Errors from GPU compilers, RTL simulators, and CAD engines become the training signal.

## Introduction, Feature Card=Icwm Title

### Industrial Code World Model

## Introduction, Feature Card=Icwm Desc

ICWM learns the causal dynamics of code→hardware behavior from execution traces. It replaces expensive real backends during large-scale data generation, achieving 96.7% outcome prediction accuracy.

## Introduction, Feature Card=Results Title

### Top-tier Performance

## Introduction, Feature Card=Results Desc

Achieves leading open-source results on 14 general benchmarks (81.3% LiveCodeBench V5) and 9 industrial benchmarks (84.0% CAD-Coder, 38.0% KernelBench L2), outperforming Claude-Sonnet-4.6 and Kimi-K2.5 on industrial tasks.

## Introduction, Element=Fig2 Caption

Figure 2: Performance of InCoder-32B-Thinking compared to Claude-Sonnet-4.6, Kimi-K2.5, and Qwen3.5-397B-A17B across 10 benchmarks spanning general and industrial code domains.

## Methodology, Element=H2

### 2. ECoT Synthesis & Industrial Code World Model

## Methodology, Element=Intro P

The core methodology operates in two phases: grounded collection — where a frontier LLM generates reasoning traces validated by real execution backends — and ICWM-driven amplification — where a trained world model replaces expensive toolchain invocations to scale up data generation efficiently.

## Methodology, Element=Fig4 Caption

Figure 4: The ECoT data synthesis pipeline. Domain tasks are seeded with execution environments, reasoning traces are elicited via a frontier LLM, real backends (GPU compilers, RTL simulators, CAD engines) validate each code revision, and the resulting multi-turn trajectories train the ICWM.

## Methodology, Element=Steps H3

### Error-driven Chain-of-Thought (ECoT) Pipeline

## Methodology, Step1 Title

### Task Seeding & Environment Bundling

## Methodology, Step1 Desc

Domain tasks from chip design, GPU optimization, 3D modeling, and embedded systems are paired with their required execution environments: Verilog modules bundled with testbenches and synthesis constraints, CUDA kernels with memory profiles, CadQuery scripts with geometry validation tests.

## Methodology, Step2 Title

### Execution-Grounded Trajectory Synthesis

## Methodology, Step2 Desc

A frontier LLM generates a reasoning trace and candidate code. The code is sent to domain-specific real backends: Triton/CUDA for GPU kernels, Renode for microcontroller firmware, CadQuery for 3D geometry, Yosys/Icarus for RTL. Each backend returns an outcome label (PASS, COMPILATION_ERROR, MEMORY_FAULT) plus diagnostic logs. Errors become the next-turn observation, driving iterative refinement.
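
## Methodology, Element=Ecot Loop Sketch

The loop this step describes can be sketched as a minimal Python driver. Everything here (`Turn`, `synthesize_trajectory`, the stub callables) is hypothetical scaffolding, not the paper's implementation; in the real pipeline, actual toolchains (Triton/CUDA, Renode, CadQuery, Yosys/Icarus) sit where the `backend` stub does.

```python
from dataclasses import dataclass

# Hypothetical outcome labels mirroring the ones the paper names.
PASS, COMPILATION_ERROR, MEMORY_FAULT = "PASS", "COMPILATION_ERROR", "MEMORY_FAULT"

@dataclass
class Turn:
    reasoning: str    # r^(k): the chain of thought produced this turn
    code: str         # c^(k): candidate code sent to the backend
    outcome: str      # backend label: PASS, COMPILATION_ERROR, ...
    diagnostics: str  # compiler/simulator logs, fed back as the next observation

def synthesize_trajectory(task, llm, backend, max_turns=4):
    """Iterate LLM -> backend until PASS or the turn budget runs out.

    Failed intermediate turns are retained in the trajectory, not discarded.
    """
    observation = task  # S_init: task description plus environment bundle
    trajectory = []
    for _ in range(max_turns):
        reasoning, code = llm(observation)
        outcome, logs = backend(code)
        trajectory.append(Turn(reasoning, code, outcome, logs))
        if outcome == PASS:
            break
        observation = logs  # errors become the next-turn observation
    return trajectory
```

Because failed turns stay in the returned list, the downstream dataset captures the error-to-fix transitions rather than only final solutions.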

## Methodology, Step3 Title

### Multi-turn Trajectory Formation

## Methodology, Step3 Desc

Both successful and failed intermediate turns are retained, so training data captures common failure modes alongside the reasoning steps that resolve them. The trajectory T = [(S_init, r⁽⁰⁾, c⁽⁰⁾) → (r⁽¹⁾, c⁽¹⁾) → ... → (r⁽ᵏ⁾, c⁽ᵏ⁾)] forms the core of the ECoT dataset.

## Methodology, Element=Icwm H3

### Industrial Code World Model (ICWM)

## Methodology, Element=Icwm P1

Real execution backends provide reliable supervision but are expensive — each interaction requires invoking domain-specific toolchains. To scale up data generation, we train an Industrial Code World Model (ICWM): a language model that predicts what a real backend would return, given the execution environment E and candidate code c⁽ᵏ⁾:

ô⁽ᵏ⁾ = ICWM(E, c⁽ᵏ⁾)

## Methodology, Element=Icwm P2

The predicted output ô⁽ᵏ⁾ includes an outcome label (PASS, COMPILATION_ERROR, etc.), a diagnostic message, and numerical outputs or diff summaries. The ICWM is trained on all collected execution trajectories. Once trained, it replaces real backends in the feedback loop — each prediction is a single forward pass rather than a real compilation or simulation, enabling 100× faster trajectory generation.

## Methodology, Benefit1

### 100× faster than real backend invocation — single forward pass per prediction step

## Methodology, Benefit2

### All ICWM trajectories verified against real execution — final corpus D = D_real ∪ D_icwm

## Methodology, Element=Pipeline Illust Caption

Conceptual illustration: ECoT synthesis pipeline connecting GPU chips, RTL schematics, and embedded systems through a reasoning LLM to generate verified training trajectories.

## Methodology, Element=Cuda H3

### Thinking in Action: CUDA Kernel Example

## Methodology, Element=Cuda Caption

Figure 3: Implementing a CUDA Hinge Loss kernel. Without thinking, InCoder-32B produces incorrect code (shape mismatch: 2D predictions indexed as 1D). With ECoT training, InCoder-32B-Thinking systematically identifies the mismatch, infers broadcasting semantics, maps flat indices to row indices, and generates correct grid-stride loop code.
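
## Methodology, Element=Index Mapping Sketch

The index-mapping fix Figure 3 describes can be illustrated on the CPU. This NumPy sketch assumes the standard hinge form max(0, 1 - y·p) with 2D predictions and per-row targets (the exact loss in the figure is not reproduced here); `hinge_loss_flat` mirrors the flat-index arithmetic a grid-stride CUDA thread would perform, and `hinge_loss_ref` is the broadcasting reference it must match.

```python
import numpy as np

def hinge_loss_flat(pred, target):
    """CPU emulation of the kernel's indexing scheme.

    pred:   2D array of shape (N, D)
    target: 1D array of shape (N,) in {-1, +1}, broadcast across columns.
    Flat element i of pred pairs with target[i // D] -- the row-index
    mapping each thread of a grid-stride loop computes.
    """
    n, d = pred.shape
    flat = pred.ravel()
    out = np.empty_like(flat)
    for i in range(flat.size):  # stands in for the grid-stride loop
        row = i // d            # flat index -> row index
        out[i] = max(0.0, 1.0 - target[row] * flat[i])
    return out.mean()

def hinge_loss_ref(pred, target):
    """Broadcasting reference: same pairing in one expression."""
    return np.maximum(0.0, 1.0 - target[:, None] * pred).mean()
```

Indexing `target[i]` instead of `target[i // d]` is exactly the 2D-as-1D shape mismatch the non-thinking model makes in the figure.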

## Evaluation, Element=H2

### 3. Evaluation

## Evaluation, Element=Intro P

We evaluate InCoder-32B-Thinking on a comprehensive suite covering both general-purpose coding and specialized industrial domains, comparing against contemporary models including Claude-Sonnet-4.6, Kimi-K2.5, Qwen3.5-397B-A17B, and DeepSeek-V3.

## Evaluation, Element=General H3

### General Code Benchmarks

## Evaluation, Element=Industrial H3

### Industrial Code Benchmarks

## Evaluation, Element=General Results H3

### General Code Results

## Evaluation, Element=General Results P

The most striking finding is the leap in code reasoning. InCoder-32B-Thinking scores 81.3 on LiveCodeBench V5 — comparable to proprietary frontier models — and 53.3 on CruxEval Input-COT, demonstrating that thinking augmentation dramatically improves multi-step reasoning over code execution traces.

## Evaluation, Element=Industrial Results H3

### Industrial Code Results

## Evaluation, Element=Industrial Results P

On industrial benchmarks, InCoder-32B-Thinking consistently outperforms competing models. It achieves 84.0% on CAD-Coder Compile Pass (vs. Claude-Sonnet-4.6's 22.2%), 82.0% on VeriScope Score, 78.6% on RealBench Module Syn@1 (vs. Kimi-K2.5's 50.1%), and 38.0% on KernelBench L2. These results validate that ECoT + ICWM training transfers effectively across all industrial coding sub-domains.

## Evaluation, Element=Chip Table H4

### Chip Design Benchmark Results

## Evaluation, Element=Gpu Table H4

### GPU Optimization Benchmark Results

## Evaluation, Element=Overview Fig Caption

Figure 1: Overview of InCoder-32B-Thinking capabilities. Left: Reflective Depth Reasoning — iterative error-correction through multi-turn GPU kernel debugging (FAIL → think → PASS). Right: Domain Reasoning — hardware-aware reasoning chains for industrial code tasks. Center: the model bridges general and industrial code intelligence.

## Analysis, Element=H2

### 4. Analysis

## Analysis, Element=Fidelity H3

### 4.1 ICWM Fidelity Analysis

## Analysis, Element=Fidelity P

A core assumption of the pipeline is that ICWM can faithfully replace real execution backends during large-scale trajectory synthesis. We validate this by holding out 2,000 execution turns from each industrial domain and measuring how closely ICWM predictions match real backend results — across both per-turn outcome labels and end-to-end trajectory verdicts.
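
## Analysis, Element=Fidelity Metric Sketch

The two held-out metrics can be computed in a few lines. This sketch assumes each held-out turn is a (predicted_label, real_label) pair and that "trajectory agreement" means the final pass/fail verdict matches; the paper's exact agreement criterion may differ.

```python
def fidelity_metrics(trajectories):
    """Compute (per-turn outcome accuracy, trajectory agreement).

    trajectories: list of trajectories, each a list of
    (predicted_label, real_label) pairs, one pair per execution turn.
    """
    turns = [pair for traj in trajectories for pair in traj]
    turn_acc = sum(p == r for p, r in turns) / len(turns)

    def passed(labels):
        # Final verdict of a trajectory: did its last turn PASS?
        return labels[-1] == "PASS"

    traj_agree = sum(
        passed([p for p, _ in t]) == passed([r for _, r in t])
        for t in trajectories
    ) / len(trajectories)
    return turn_acc, traj_agree
```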

## Analysis, Element=Fidelity Fig Caption

Figure 5: ICWM fidelity across five industrial domains. Purple bars show per-turn outcome prediction accuracy; orange bars show end-to-end trajectory agreement with real execution. All domains exceed 93% on both metrics.

## Analysis, Element=Fidelity Label

Format: Outcome Prediction Accuracy / Trajectory Agreement. Chip design achieves highest fidelity (97.4%/95.8%); 3D modeling has the widest gap due to floating-point tolerance complexity in CadQuery geometry checks.

## Analysis, Element=Thinking H3

### 4.2 Adaptive Thinking Depth

## Analysis, Element=Thinking P

A key property of InCoder-32B-Thinking is that it allocates reasoning compute proportional to task complexity. Analysis of the training corpus reveals a 209× range in thinking token lengths across task categories, driven by the real execution feedback — not by a fixed prompt template.

## Analysis, Element=Thinking Fig Caption

Figure 6: Distribution of thinking block lengths (median and interquartile range P25–P75) per task category, sorted by thinking depth. Industrial domains (highlighted) consistently require deeper reasoning than general coding tasks.

## Analysis, Thinking Insight1

GPU Optimization — longest median thinking length, requiring multiple hardware constraint analyses per correction

## Analysis, Thinking Insight2

Range of thinking depth across task types — from agentic coding (shortest) to GPU optimization (longest)

## Analysis, Thinking Insight3

Agentic Coding — shortest thinking chains, reflecting clear state machine structure of tool-use tasks

## Analysis, Element=Scaling H3

### 4.3 Effects of Thinking Training Data

## Analysis, Element=Scaling P

To understand how training data scale affects performance, we trained checkpoints at 180M, 360M, and 540M tokens of thinking data. Across 9 industrial benchmarks, performance improves consistently as the data scales, while TritonBench GPU execution correctness holds at a perfect 100% at every stage, indicating that some capabilities emerge early and remain stable. At every scale, the thinking mechanism adds value beyond the base InCoder-32B model.

## Analysis, Element=Scaling Fig Caption

Figure 7: Performance across 9 industrial benchmarks as thinking training data scales from 180M to 540M tokens. Most metrics improve monotonically; TritonBench GPU execution correctness plateaus at 100% across all stages.

## Conclusion, Element=H2

### 5. Conclusion

## Conclusion, Element=P

InCoder-32B-Thinking demonstrates that the gap between general code intelligence and industrial software development can be bridged through execution-grounded thinking data. By combining Error-driven Chain-of-Thought synthesis with an Industrial Code World Model, the framework creates training data that captures the real reasoning depth required for industrial code tasks:

## Conclusion, List Item1

ECoT synthesis generates high-quality reasoning traces by learning from multi-turn execution errors — no human annotation required.

## Conclusion, List Item2

ICWM achieves 96.7% outcome prediction accuracy, enabling scalable trajectory generation without expensive real-backend invocations.

## Conclusion, List Item3

Adaptive thinking depth (209× range) reflects real task complexity — GPU optimization demands 19K character reasoning chains vs. 91 chars for agentic coding.

## Conclusion, List Item4

Top-tier results on 14 general + 9 industrial benchmarks, consistently outperforming Claude-Sonnet-4.6, Kimi-K2.5, and Qwen3.5-397B-A17B on industrial tasks.

## Conclusion, Resource Link

### Read Paper on arXiv ↗

## Related Work, Element=Summary

### Related Work

## Related Work, Element=Industrial H4

### Industrial Code Intelligence

## Related Work, Element=Industrial P

Prior work has addressed individual industrial sub-domains in isolation: Verilog generation and repair (RTLCoder, VeriGen), GPU kernel optimization (KernelBench, TritonBench), embedded systems coding, and 3D modeling code. InCoder-32B represents a step toward unification across sub-domains, but without thinking capabilities. InCoder-32B-Thinking extends this foundation with execution-grounded reasoning data.

## Related Work, Element=Thinking H4

### Thinking in Large Language Models

## Related Work, Element=Thinking P

OpenAI o1 demonstrated that long chains of thought learned via RL dramatically improve complex reasoning. DeepSeek-R1 and related work showed that structured thinking can emerge from GRPO-based training. For code-specific reasoning, o1-Coder and rStar-Coder adapted thinking techniques to programming tasks. InCoder-32B-Thinking extends these approaches specifically to industrial domains where execution environments provide objective feedback signals.

## References, Element=Summary

### References (Selected)

## Footer, Element=P

InCoder-32B-Thinking · Beihang University, IQuest Research · arXiv:2604.03144 · Published via Flecto
