Industrial Code World Model for Thinking
Industrial software development (spanning chip design, GPU kernel optimization, and embedded systems) has long lacked the kind of expert reasoning traces that show how engineers think through hardware constraints and timing semantics. Unlike general coding, where test cases and linters give quick feedback, industrial code requires understanding physical execution environments: does this Triton kernel exceed the GPU's shared memory limit? Will this Verilog module synthesize correctly under timing constraints?
InCoder-32B-Thinking addresses this gap through two synergistic components: Error-driven Chain-of-Thought (ECoT) synthesis, which generates reasoning traces by learning from execution errors, and an Industrial Code World Model (ICWM), which predicts hardware execution outcomes without invoking real backends, enabling scalable generation of high-quality training data. The result is a 32B-parameter model that achieves top-tier open-source performance across both general and industrial code benchmarks.
Large language models have made remarkable progress in general software engineering: writing functions, debugging scripts, and passing competitive programming benchmarks. Yet industrial code domains tell a very different story. Leading models achieve limited success on Triton kernel generation and Verilog equivalence checking, despite strong general performance.
The root cause is a data gap: industrial domains lack the expert reasoning traces that show how engineers reason through hardware constraints, timing semantics, and domain-specific execution feedback. A Verilog fix isn't just about syntax; it requires understanding how the RTL maps to gate-level logic and timing paths. A Triton kernel requires reasoning about GPU memory hierarchy, warp scheduling, and numerical precision.
InCoder-32B-Thinking combines thinking model capabilities (deliberate, error-correcting multi-turn reasoning) with industrial code world model knowledge (causal dynamics of how code affects hardware behavior). The model is trained on data generated by the ECoT synthesis framework and validated by the ICWM, creating a virtuous cycle of execution-grounded reasoning.
Error-driven Chain-of-Thought generates reasoning traces through multi-turn dialogue with real execution environments. Errors from GPU compilers, RTL simulators, and CAD engines become the training signal.
ICWM learns the causal dynamics of code-to-hardware behavior from execution traces. It replaces expensive real backends during large-scale data generation, achieving 96.7% outcome prediction accuracy.
Achieves leading open-source results on 14 general benchmarks (81.3% LiveCodeBench V5) and 9 industrial benchmarks (84.0% CAD-Coder, 38.0% KernelBench L2), outperforming Claude-Sonnet-4.6 and Kimi-K2.5 on industrial tasks.
The core methodology operates in two phases: grounded collection, where a frontier LLM generates reasoning traces validated by real execution backends, and ICWM-driven amplification, where a trained world model replaces expensive toolchain invocations to scale up data generation efficiently.
Traditional fine-tuning teaches a model the correct answer. ECoT teaches the reasoning process: making mistakes, reading error feedback, and correcting course. This mirrors how expert engineers work: write code, run it, read the error, understand why, fix it, repeat.
The key insight: error messages from compilers and simulators are rich training signals. A CUDA kernel memory fault tells you exactly which memory access pattern was wrong. ECoT harvests these domain-specific error signals as supervision data.
Domain tasks from chip design, GPU optimization, 3D modeling, and embedded systems are paired with their required execution environments: Verilog modules bundled with testbenches and synthesis constraints, CUDA kernels with memory profiles, CadQuery scripts with geometry validation tests.
A frontier LLM generates a reasoning trace and candidate code. The code is sent to domain-specific real backends: Triton/CUDA for GPU kernels, Renode for microcontroller firmware, CadQuery for 3D geometry, Yosys/Icarus for RTL. Each backend returns an outcome label (PASS, COMPILATION_ERROR, MEMORY_FAULT) plus diagnostic logs. Errors become the next-turn observation, driving iterative refinement.
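The collection loop described above can be sketched in a few lines of Python. Here `propose` stands in for the frontier LLM and `run_backend` for a real toolchain (Yosys, Triton, Renode, ...); both are illustrative stubs, not the actual pipeline:

```python
# Sketch of the ECoT grounded-collection loop (stubs, not the real pipeline).

def run_backend(code):
    """Stub backend: return (outcome_label, diagnostic_log)."""
    if "tl.load" in code and "mask=" not in code:
        return "MEMORY_FAULT", "out-of-bounds load: add a bounds mask"
    return "PASS", ""

def propose(task, feedback):
    """Stub LLM: produces a fix once it sees the mask diagnostic."""
    if feedback and "mask" in feedback:
        return "add a bounds mask to the load", "x = tl.load(ptr + offs, mask=offs < n)"
    return "first attempt, unmasked load", "x = tl.load(ptr + offs)"

def collect_trajectory(task, max_turns=4):
    trajectory, feedback = [], None
    for _ in range(max_turns):
        reasoning, code = propose(task, feedback)
        outcome, log = run_backend(code)
        trajectory.append((reasoning, code, outcome))  # failed turns are kept too
        if outcome == "PASS":
            break
        feedback = log  # the error log becomes the next-turn observation
    return trajectory

traj = collect_trajectory("elementwise vector kernel")
```

Both turns here, the memory fault and the subsequent fix, would be retained as training data.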
Both successful and failed intermediate turns are retained, so the training data captures common failure modes alongside the reasoning steps that resolve them. The trajectory T = [(S_init, r(0), c(0)) → (r(1), c(1)) → ... → (r(k), c(k))] forms the core of the ECoT dataset: S_init is the task plus execution environment, r(n) is the reasoning trace at turn n, and c(n) is the candidate code. Each arrow represents one round of execution feedback: the code runs, something fails, and the failure becomes the next input. Keeping failed turns teaches the model that errors are part of the process, not noise to be minimized.
Real execution backends provide reliable supervision but are expensive: each interaction requires invoking domain-specific toolchains. To scale up data generation, we train an Industrial Code World Model (ICWM), a language model that predicts what a real backend would return, given the execution environment and candidate code:

ô(k) = ICWM(S_init, c(k))

where the predicted output ô(k) includes an outcome label (PASS, COMPILATION_ERROR, etc.), a diagnostic message, and numerical outputs or diff summaries. The ICWM is trained on all collected execution trajectories. Once trained, it replaces real backends in the feedback loop: each prediction is a single forward pass rather than a real compilation or simulation, enabling 100× faster trajectory generation.
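Because the ICWM is a drop-in replacement for a real backend, it only needs to expose the same interface. A minimal sketch, with the model itself stubbed and all names hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ExecutionOutcome:
    label: str       # "PASS", "COMPILATION_ERROR", "MEMORY_FAULT", ...
    diagnostic: str  # compiler / simulator message
    payload: str     # numerical outputs or a diff summary

def icwm_predict(environment: str, code: str) -> ExecutionOutcome:
    """One forward pass of the world model, stubbed for illustration.
    A trained ICWM would condition on (environment, code) and decode
    all three fields as text."""
    return ExecutionOutcome("PASS", "", "output matches reference")

outcome = icwm_predict("triton:A100", "def kernel(...): ...")
```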
Running real industrial toolchains is expensive. Invoking a Verilog synthesis tool (Yosys) takes minutes. Running a GPU kernel and profiling memory access requires actual hardware. If you need millions of training trajectories, you cannot invoke real backends for each step.
The ICWM solves this by predicting what the real backend would return; as a language model, each prediction is just a forward pass. The idea is analogous to how AlphaZero's neural network evaluates chess positions rather than playing every line out on a real board. ICWM achieves 96.7% accuracy at predicting real execution outcomes.
The task: compute Hinge Loss where predictions p has shape (32768, 32768), a 2D matrix, and targets t has shape (32768,), a 1D vector. The base model used a flat 1D index, predictions[idx], treating the 2D matrix as if it were 1D; this reads the wrong memory across row boundaries.
The thinking model's fix: use a grid-stride loop with proper 2D decomposition. Compute batch_idx = i / input_size to determine which row, then index targets[batch_idx] instead of targets[idx]. This correctly broadcasts the 1D targets across all columns of the 2D predictions matrix.
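The bug and the fix can be reproduced outside the GPU with plain NumPy, using tiny shapes as a stand-in for the Triton kernel (the real kernel's out-of-bounds read is simulated here with a wrapped index):

```python
import numpy as np

# NumPy stand-in for the indexing bug: flat 1D indexing into targets vs.
# row-wise 2D decomposition. The real task is (32768, 32768) x (32768,).
batch, input_size = 4, 3
preds = np.arange(batch * input_size, dtype=float).reshape(batch, input_size)
targets = np.array([1.0, -1.0, 1.0, -1.0])  # shape (batch,)

flat = preds.ravel()
out_buggy = np.empty(batch * input_size)
out_fixed = np.empty(batch * input_size)
for i in range(batch * input_size):
    # Buggy: one target per element. On the GPU, targets[i] for i >= batch
    # reads past the buffer; we simulate that wrong read with a wrapped index.
    out_buggy[i] = max(0.0, 1.0 - flat[i] * targets[i % batch])
    # Fixed: recover the row index, broadcasting targets across columns.
    batch_idx = i // input_size
    out_fixed[i] = max(0.0, 1.0 - flat[i] * targets[batch_idx])

# Reference hinge-loss terms, computed with explicit NumPy broadcasting.
reference = np.maximum(0.0, 1.0 - preds * targets[:, None]).ravel()
```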
We evaluate InCoder-32B-Thinking on a comprehensive suite covering both general-purpose coding and specialized industrial domains, comparing against contemporary models including Claude-Sonnet-4.6, Kimi-K2.5, Qwen3.5-397B-A17B, and DeepSeek-V3.
The most striking finding is the leap in code reasoning. InCoder-32B-Thinking scores 81.3 on LiveCodeBench V5, comparable to proprietary frontier models, and 53.3 on CruxEval Input-COT, demonstrating that thinking augmentation dramatically improves multi-step reasoning over code execution traces.
On industrial benchmarks, InCoder-32B-Thinking consistently outperforms competing models. It achieves 84.0% on CAD-Coder Compile Pass (vs. Claude-Sonnet-4.6's 22.2%), 82.0% on VeriScope Score, 78.6% on RealBench Module Syn@1 (vs. Kimi-K2.5's 50.1%), and 38.0% on KernelBench L2. These results validate that ECoT + ICWM training transfers effectively across all industrial coding sub-domains.
| Model | Size | VeriScope Score | VeriRepair Fix% | Sys Syn@1 | Sys Syn@5 | Module Syn@1 | Module Syn@5 |
|---|---|---|---|---|---|---|---|
| Qwen3-Coder | 3.6B/21B | 73.9 | 86.7 | 3.8 | 17.4 | 22.9 | 47.9 |
| Qwen3.5 | 17B/397B | 62.5 | 86.7 | 11.2 | 38.1 | 35.2 | 59.5 |
| Kimi-K2.5 | 32B/1T | 82.4 | 76.7 | 6.2 | 26.2 | 50.1 | 70.1 |
| Claude-Sonnet-4.6 | n/a | 83.2 | 90.0 | 2.5 | 11.2 | 22.2 | 43.4 |
| InCoder-32B-Thinking | 32B | 82.0 | 53.5 | 35.2 | 91.0 | 82.0 | n/a |
Triton is OpenAI's Python-based GPU programming language for writing high-performance GPU kernels. TritonBench measures whether LLMs can write correct Triton code that: (G-call) generates runnable kernels, and (G-exe) produces kernels that execute correctly on GPU hardware.
KernelBench measures whether models can replace PyTorch operators with custom CUDA/Triton kernels that run at comparable or better speed (L1: 10% speedup, L2: 50% speedup, L3: 100% speedup over the PyTorch baseline). InCoder-32B-Thinking scores 38.0% on KernelBench L2, meaning its kernels achieve at least a 1.5× PyTorch speedup 38% of the time.
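Read as code, the level criteria reduce to a threshold check on measured speedup. This is a sketch of how the percentages above translate into thresholds, not KernelBench's actual harness:

```python
# KernelBench-style pass criterion: a kernel clears a level only if it is
# correct AND beats the PyTorch baseline by the level's margin.
# Thresholds restate the percentages above (10% / 50% / 100% speedup).
LEVEL_SPEEDUP = {"L1": 1.10, "L2": 1.50, "L3": 2.00}

def passes_level(baseline_ms: float, kernel_ms: float,
                 correct: bool, level: str) -> bool:
    speedup = baseline_ms / kernel_ms
    return correct and speedup >= LEVEL_SPEEDUP[level]

# A correct kernel at 6 ms vs a 10 ms baseline (1.67x) clears L2 but not L3.
```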
| Model | Size | TritonBench G-call% | TritonBench G-exe% | KernelBench L1 | KernelBench L2 |
|---|---|---|---|---|---|
| Qwen3.5 | 17B/397B | 7.6 | 100.0 | 4.0 | 10.0 |
| Kimi-K2.5 | 32B/1T | 17.4 | 100.0 | 9.1 | 16.0 |
| Claude-Sonnet-4.6 | n/a | 1.6 | 100.0 | 16.2 | 23.0 |
| InCoder-32B-Thinking | 32B | 18.5 | 100.0 | 22.2 | 38.0 |
A core assumption of the pipeline is that ICWM can faithfully replace real execution backends during large-scale trajectory synthesis. We validate this by holding out 2,000 execution turns from each industrial domain and measuring how closely ICWM predictions match real backend results, across both per-turn outcome labels and end-to-end trajectory verdicts.
Outcome Prediction Accuracy (96.7%): On held-out execution turns, ICWM correctly predicts the per-step outcome label (PASS / COMPILATION_ERROR / MEMORY_FAULT) 96.7% of the time. This measures single-step accuracy.
Trajectory Agreement (94.4%): whether the entire multi-turn trajectory ends with the same final verdict as real execution. This sits below per-step accuracy because per-turn prediction errors can compound over multiple turns.
Format: Outcome Prediction Accuracy / Trajectory Agreement. Chip design achieves highest fidelity (97.4%/95.8%); 3D modeling has the widest gap due to floating-point tolerance complexity in CadQuery geometry checks.
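A toy example with made-up labels shows how the two metrics relate: a single mispredicted turn can flip a trajectory's final verdict, so trajectory agreement tends to trail per-turn accuracy.

```python
# Toy data (made up): per-turn outcome labels from real backends vs.
# ICWM predictions, for three trajectories of different lengths.
real = [["PASS"],
        ["MEMORY_FAULT", "PASS"],
        ["COMPILATION_ERROR", "MEMORY_FAULT", "PASS"]]
pred = [["PASS"],
        ["MEMORY_FAULT", "MEMORY_FAULT"],  # one turn mispredicted
        ["COMPILATION_ERROR", "MEMORY_FAULT", "PASS"]]

# Per-turn outcome accuracy: fraction of individual turns predicted right.
pairs = [(r, p) for rt, pt in zip(real, pred) for r, p in zip(rt, pt)]
outcome_acc = sum(r == p for r, p in pairs) / len(pairs)  # 5/6

# Trajectory agreement: fraction of trajectories whose FINAL verdict matches.
traj_agree = sum(rt[-1] == pt[-1] for rt, pt in zip(real, pred)) / len(real)  # 2/3
```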
A key property of InCoder-32B-Thinking is that it allocates reasoning compute in proportion to task complexity. Analysis of the training corpus reveals a 209× range in thinking token lengths across task categories, driven by the real execution feedback rather than by a fixed prompt template.
GPU kernel optimization requires reasoning about multiple interacting hardware constraints simultaneously: L1 cache limits (32KB per SM), warp scheduling (all 32 threads in a warp execute together), memory access coalescing (128-byte cache lines), shared memory bank conflicts, and register pressure vs. thread-block size tradeoffs. Each correction requires re-evaluating all of these constraints. The 19K-character median reflects genuine multi-step hardware reasoning, not verbosity.
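The kind of constraint re-checking described above can be approximated with back-of-envelope arithmetic. The limits below are illustrative defaults, not tied to any specific GPU:

```python
# Back-of-envelope checks of the kind each correction turn re-runs.
SHARED_MEM_PER_BLOCK = 48 * 1024  # bytes of shared memory per block (illustrative)
CACHE_LINE = 128                  # bytes per cache line
WARP_SIZE = 32                    # threads per warp

def tile_fits(tile_m: int, tile_n: int, dtype_bytes: int = 4,
              buffers: int = 2) -> bool:
    """Does a double-buffered tile fit in shared memory?"""
    return buffers * tile_m * tile_n * dtype_bytes <= SHARED_MEM_PER_BLOCK

def is_coalesced(stride_bytes: int, dtype_bytes: int = 4) -> bool:
    """Unit-stride access lets a warp's loads collapse into whole cache lines."""
    return (stride_bytes == dtype_bytes
            and (WARP_SIZE * dtype_bytes) % CACHE_LINE == 0)
```

For example, a 64×64 fp32 tile double-buffered (32KB) fits the 48KB budget, while a 128×128 tile (128KB) does not, which is exactly the kind of tradeoff a correction turn has to re-derive.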
By contrast, agentic coding tasks (tool use, file operations) have clear state machines: observe, decide, act. The path to the answer is short and well-defined.
To understand how training data scale affects performance, we trained checkpoints at 180M, 360M, and 540M tokens of thinking data. Across 9 industrial benchmarks, performance improves consistently as data scales; TritonBench GPU execution correctness holds at a perfect 100% across all stages, indicating that some capabilities emerge early and remain stable. The thinking mechanism consistently adds value beyond the base InCoder-32B model.
InCoder-32B-Thinking demonstrates that the gap between general code intelligence and industrial software development can be bridged through execution-grounded thinking data. By combining Error-driven Chain-of-Thought synthesis with an Industrial Code World Model, the framework creates training data that captures the real reasoning depth required for industrial code tasks.