Think in Strokes, Not Pixels

The Generation Loop

The model builds images stroke by stroke, decision by decision

01 Plan: Generate an <ins> instruction (what to add or modify) and a <des> description (global scene state)

02 Sketch: Synthesize the visual intermediate based on the instruction, conditioned on the current visual state

03 Inspect: Detect conflicts between the generated image and the prompt, such as spatial misalignments, missing objects, and attribute errors

04 Refine: Issue a <refine> correction command to fix detected issues and regenerate the visual state

Abstract

Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect it, and refine the details. Most importantly, each step is grounded in the evolving visual state. In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of four stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can a model evaluate each partially complete image? We address this through dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce spatial and semantic consistency; for the textual intermediate states, we preserve the prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate our method, we conduct experiments on several text-to-image generation benchmarks.

What is 'Interleaved Reasoning'?

In standard image generation, the model receives a text prompt and outputs the final image in one step — like writing an essay without drafting or revising. Interleaved reasoning means alternating between text thoughts and image outputs: the model plans in text, produces a partial image, looks at it, thinks again, and refines. This is analogous to a human artist sketching, stepping back, and making corrections — except here both the planning and the painting are done by the same AI model.
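To make the idea of an interleaved sequence concrete, the sketch below splits a flat token stream into alternating text and image spans. It assumes image spans are delimited by the `<vision_start>` and `<vision_end>` markers described in the Method section; the function name `split_interleaved` and the placeholder tokens such as `img0` are illustrative and not taken from the paper.

```python
from typing import List, Tuple

VISION_START = "<vision_start>"
VISION_END = "<vision_end>"


def split_interleaved(tokens: List[str]) -> List[Tuple[str, List[str]]]:
    """Group a flat token stream into ("text", [...]) and ("image", [...]) spans."""
    segments: List[Tuple[str, List[str]]] = []
    current: List[str] = []
    in_image = False
    for tok in tokens:
        if tok == VISION_START:
            if in_image:
                raise ValueError("nested <vision_start>")
            if current:
                segments.append(("text", current))
            current, in_image = [], True
        elif tok == VISION_END:
            if not in_image:
                raise ValueError("<vision_end> without <vision_start>")
            segments.append(("image", current))
            current, in_image = [], False
        else:
            current.append(tok)
    if in_image:
        raise ValueError("unterminated image span")
    if current:
        segments.append(("text", current))
    return segments


# Hypothetical stream: a text thought, a partial image, then another thought.
stream = ["add", "a", "bear", VISION_START, "img0", "img1", VISION_END, "looks", "fine"]
segments = split_interleaved(stream)
print([kind for kind, _ in segments])
```

Each text span plays the role of a thought and each image span the role of a visual intermediate, so the alternation of span kinds is the interleaving the section describes.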

Key Contributions

Process-Driven Paradigm — Plan → Sketch → Inspect → Refine cycle for image generation
Dense Step-Wise Supervision — Spatial/semantic consistency for visual states + prior knowledge for textual states
Unified Model Training — BAGEL-7B trained end-to-end for interleaved text+image token generation
GenEval: 0.79 → 0.83 (+5% absolute gain)
WISE: 0.70 → 0.76 (+6% absolute gain)

Introduction

The Problem with One-Shot Generation

Despite impressive progress in image generation, today's models remain brittle on elementary visual logic and produce plausible but incorrect images. For example, the prompt "a bear hovering above a spoon" may yield a bear standing beside the spoon. Such one-shot black-box generation forces the model to commit to an entire scene within a single forward pass, resolving precise spatial layouts, object relations, and fine-grained attributes all at once.

Our Solution: Process-Driven Generation

We challenge this outcome-driven paradigm with process-driven image generation via interleaved reasoning anchored in both vision and text. We reformulate image generation as a co-evolving trajectory of textual plan and visual states, orchestrated through a recurring four-stage process: Plan → Sketch → Inspect → Refine. The model does not hallucinate a final image; it constructs the image stroke by stroke, decision by decision.

**Figure 1:** Comparison between single-pass generation (left) and process-driven generation (right). The iterative Plan→Sketch→Inspect→Refine cycle enables the model to detect and correct spatial misalignments and instruction conflicts during generation.

Method

3.1 Framework

Most existing image generation models generate images in a single forward pass, sometimes augmented with chain-of-thought reasoning applied exclusively to the textual prompt. However, complex spatial relationships and fine-grained visual details are inherently difficult to encode through this one-shot paradigm, as the model must resolve the entire scene before any visual feedback is available.

Why can't text-only reasoning solve this?

Text chain-of-thought (CoT) asks the model to "think step by step" before generating. But this thinking is visually blind — the model can plan the scene in text, but it cannot see whether its spatial instructions were actually followed until the image is generated. If "place the cat to the left of the bench" gets misinterpreted, text CoT has no way to detect or correct this without looking at the output. Process-driven generation solves this by incorporating visual feedback at each step.

The general framework of our model performs image generation as a sequential, interleaved textual-visual reasoning process. Given a unified multimodal model and an input text prompt, the model generates a trajectory of interleaved text and image tokens. Each iteration consists of: (1) Plan — generate textual instruction specifying the incremental update; (2) Sketch — synthesize a new visual intermediate; (3) Inspect — verify the current visual state against the overall prompt; (4) Refine — issue correction if needed.
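The four stages above can be sketched as a simple control loop. This is a minimal illustration and not the authors' implementation: `ToyModel`, `generate`, and the representation of an "image" as a list of object names are all assumptions made so the loop can run without a real multimodal model. The toy model deliberately draws one spurious object so that the Inspect and Refine stages have something to fix.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

State = List[str]  # toy "image": just the list of objects drawn so far


@dataclass
class Step:
    kind: str     # "plan", "sketch", "inspect", or "refine"
    content: str


class ToyModel:
    """Hypothetical stand-in for a unified model; makes one drawing mistake."""

    def __init__(self) -> None:
        self.made_error = False

    def plan(self, prompt: str, state: Optional[State]) -> Optional[str]:
        # Plan: name the next object that the prompt asks for but is missing.
        missing = [o for o in prompt.split(",") if o not in (state or [])]
        return f"add {missing[0]}" if missing else None

    def sketch(self, instruction: str, state: Optional[State]) -> State:
        # Sketch: apply the instruction to a copy of the current visual state.
        verb, obj = instruction.split(" ", 1)
        objects = list(state or [])
        if verb == "add":
            objects.append(obj)
            if not self.made_error:       # inject a single spurious object
                objects.append("fork")
                self.made_error = True
        elif verb == "remove":
            objects.remove(obj)
        return objects

    def inspect(self, prompt: str, state: State) -> Optional[str]:
        # Inspect: compare the visual state with the prompt, return a correction.
        wanted = prompt.split(",")
        extra = [o for o in state if o not in wanted]
        return f"remove {extra[0]}" if extra else None


def generate(model: ToyModel, prompt: str, max_iters: int = 4) -> Tuple[State, List[Step]]:
    state: Optional[State] = None
    trajectory: List[Step] = []
    for _ in range(max_iters):
        instruction = model.plan(prompt, state)
        if instruction is None:           # nothing left to add
            break
        trajectory.append(Step("plan", instruction))
        state = model.sketch(instruction, state)
        trajectory.append(Step("sketch", ",".join(state)))
        correction = model.inspect(prompt, state)
        trajectory.append(Step("inspect", correction or "no conflict"))
        if correction is not None:        # Refine only when a conflict was found
            state = model.sketch(correction, state)
            trajectory.append(Step("refine", ",".join(state)))
    return state or [], trajectory


final_state, trajectory = generate(ToyModel(), "bear,spoon")
for step in trajectory:
    print(f"{step.kind:8s} {step.content}")
```

In this toy run the first iteration includes a Refine step because the sketch contains an object the prompt never mentioned, while the second iteration passes inspection and skips refinement.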

**Figure 2:** Overview of the Unified Multimodal Reasoning Model. The model processes interleaved text and vision tokens through the Plan→Sketch→Inspect→Refine loop, operating on both instruction tokens (`<ins>`) and scene description tokens (`<des>`).

3.2 Intermediate Reasoning Collection

To train the process-driven generation model, we construct a large-scale dataset of intermediate reasoning trajectories. The data collection pipeline consists of two components: (1) Intermediate Visual State Data: step-by-step instructions derived from scene graphs, with intermediate images generated using Flux-Kontext and filtered by LLMs; (2) Intermediate Textual Critique Data: self-sampled critique traces in which the model learns from its own errors through correct/wrong sample pairs.
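One way to picture how step-by-step instructions could be derived from a scene graph is sketched below. It is an assumption-laden illustration, not the paper's pipeline: the function `incremental_instructions`, the `(subject, predicate, object)` relation format, and the wording of the generated strings are all hypothetical. The only idea it demonstrates is that a relation is mentioned at the step where both of its endpoints exist, so no instruction refers to an object that has not been drawn yet.

```python
from typing import Dict, List, Tuple

Relation = Tuple[str, str, str]  # (subject, predicate, object)


def incremental_instructions(
    objects: List[str], relations: List[Relation]
) -> List[Dict[str, str]]:
    """Emit one (ins, des) pair per object, in the order the objects are given."""
    placed: List[str] = []
    steps: List[Dict[str, str]] = []
    for obj in objects:
        placed.append(obj)
        # Relations that involve the new object and whose endpoints both exist.
        active = [
            r for r in relations
            if obj in (r[0], r[2]) and r[0] in placed and r[2] in placed
        ]
        ins = f"add {obj}"
        if active:
            ins += " so that " + " and ".join(f"{s} is {p} {o}" for s, p, o in active)
        des = "scene contains " + ", ".join(placed)
        steps.append({"ins": ins, "des": des})
    return steps


# Hypothetical scene graph for the prompt "a bear hovering above a spoon".
steps = incremental_instructions(
    objects=["spoon", "bear"],
    relations=[("bear", "above", "spoon")],
)
for s in steps:
    print(s["ins"], "|", s["des"])
```

Here the `ins` string stands in for the incremental instruction and the `des` string for the global scene description that accompanies it.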

**Figure 3:** Training data construction pipeline. Left: Graph-based prompt construction from SubGraph and Structure components. Right: Intermediate visual state data generation using Flux-Kontext, and critique data generation for model training.

3.3 Model (BAGEL-7B)

We train our model to generate text tokens autoregressively, optimizing with Cross-Entropy Loss applied only to textual token positions. To natively generate interleaved sequences, we add a loss term on the <vision_start> and <vision_end> tokens, enabling seamless switching between textual and visual tokens. The model — BAGEL-7B — is trained end-to-end to handle the full interleaved generation loop.
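The loss masking described above can be illustrated with a small, dependency-free sketch. It assumes each position carries a type label (`"text"`, `"image"`, `"vision_start"`, or `"vision_end"`) and that cross-entropy is averaged over the kept positions; the helper names and the label scheme are hypothetical, and how image positions themselves are trained is outside what this sketch covers.

```python
import math
from typing import List, Sequence

# Position types that contribute to the cross-entropy term, per the text above.
SUPERVISED_TYPES = {"text", "vision_start", "vision_end"}


def build_loss_mask(token_types: Sequence[str]) -> List[bool]:
    """True where cross-entropy is applied, False for image-token positions."""
    return [t in SUPERVISED_TYPES for t in token_types]


def masked_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    loss_mask: Sequence[bool],
) -> float:
    """Mean cross-entropy over the positions selected by loss_mask."""
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


token_types = ["text", "vision_start", "image", "image", "vision_end", "text"]
mask = build_loss_mask(token_types)

# Uniform logits over a toy vocabulary of 4 give log(4) at every kept position.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_types]
targets = [0, 1, 2, 3, 1, 0]
loss = masked_cross_entropy(logits, targets, mask)
print(mask, round(loss, 4))
```

Because the two image positions are masked out, changing their logits has no effect on the returned value, which is the behavior the masking is meant to guarantee.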

What makes BAGEL-7B a 'Unified' Model?

Traditional AI systems separate understanding and generation: one model reads images, another generates them. A unified multimodal model like BAGEL-7B does both within a single network — it can read image tokens, generate image tokens, and process text tokens all in the same autoregressive framework. This unification is what makes the Plan→Sketch→Inspect→Refine loop possible: the model can examine its own visual output (Inspect) because it can understand images, and then generate corrected images (Refine) because it can generate them too.

Related Work

2.1 Unified Multimodal Models

Unified multimodal models aim to unify visual understanding and generation within a single framework. Early autoregressive approaches (Chameleon, Emu3, Show-o) rely on discrete visual tokenizers such as VQ-VAE to model images as discrete token sequences. More recent approaches (BAGEL, Janus) employ continuous-token generation methods and achieve stronger performance on both understanding and generation tasks.

2.2 Reasoning in Image Generation

Recent studies explore interleaved reasoning in image generation, extending chain-of-thought from text domains to multimodal settings. Early works adopt verification-based or prompt-refinement strategies. Our work fundamentally differs: rather than applying reasoning only to the textual prompt, we integrate visual feedback into the reasoning loop — making the generation process genuinely multimodal.

Experiments

4.2 Quantitative Evaluation

We evaluate our method on two benchmarks: GenEval (compositional text-to-image evaluation) and WISE (world knowledge reasoning in text-to-image generation). Our process-driven approach achieves state-of-the-art results among unified multimodal models.

What do GenEval and WISE measure?

GenEval tests compositional text-to-image generation: can the model correctly place "a red ball to the left of a blue cube"? It checks whether objects, attributes, and spatial relationships in the prompt are faithfully rendered. WISE (World Knowledge-Informed Semantic Evaluation) tests whether the model can use real-world knowledge to generate accurate images. For example, generating "a kangaroo with a joey in its pouch" requires knowing that kangaroos are marsupials. These benchmarks complement each other: GenEval tests spatial logic, WISE tests factual knowledge.
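As a rough illustration of what a spatial-relationship check involves, the sketch below compares the centers of two bounding boxes. It is a simplified stand-in and not GenEval's actual scoring procedure: the box format, the `relation_holds` function, and the center-comparison rule are assumptions, and the boxes are hand-written rather than produced by a detector.

```python
from typing import Tuple

# (x_min, y_min, x_max, y_max) in pixels, origin at the top-left corner.
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def relation_holds(subject: Box, relation: str, reference: Box) -> bool:
    """Check a spatial relation by comparing box centers."""
    sx, sy = center(subject)
    rx, ry = center(reference)
    if relation == "left of":
        return sx < rx
    if relation == "right of":
        return sx > rx
    if relation == "above":
        return sy < ry   # smaller y is higher in image coordinates
    if relation == "below":
        return sy > ry
    raise ValueError(f"unsupported relation: {relation}")


# Hypothetical boxes for "a red ball to the left of a blue cube".
red_ball: Box = (10, 40, 30, 60)
blue_cube: Box = (50, 40, 70, 60)
print(relation_holds(red_ball, "left of", blue_cube))
```

A full evaluation would also need to confirm that both objects are present and that their attributes, such as color, match the prompt.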

GenEval Score: 0.83 (↑ +5% from 0.79, vs. BAGEL-7B baseline)

WISE Score: 0.76 (↑ +6% from 0.70, vs. BAGEL-7B baseline)

The WISE benchmark assesses world knowledge reasoning in text-to-image generation. Generation-only models achieve moderate performance (0.32–0.50) due to limited multimodal understanding. Our process-driven approach, which leverages interleaved reasoning, scores 0.76, up from the BAGEL-7B baseline of 0.70.

**Table 1:** GenEval benchmark results comparing our method against baseline models. Our approach achieves 0.83 (+5% over BAGEL-7B baseline 0.79).

**Table 2:** WISE benchmark results. Our method achieves 0.76 (+6% over BAGEL-7B baseline 0.70).

4.3 Analysis of Process-Driven Reasoning

We evaluate our approach against two distinct categories of baselines: (1) models without the inspect-and-refine mechanism, and (2) models with only textual chain-of-thought. The results show that the visual feedback loop is essential — without Inspect, the model cannot detect spatial misalignments, and without Refine, detected errors cannot be corrected. The co-evolving textual+visual trajectory is what drives the performance gains.

4.4 Ablation Study

Ablation studies confirm that each component of the process-driven pipeline contributes meaningfully to the final performance. Removing scene-graph subsampling leads to contradictory instructions; removing self-sampled critique data reduces the model's ability to detect and fix errors; and reducing the number of refinement iterations decreases both GenEval and WISE scores.

4.5 Qualitative Evaluation

Figure 4 illustrates the reasoning trajectories produced by our process-driven generation paradigm. The model demonstrates the ability to detect instruction-intermediate conflicts (where the current visual state contradicts the overall prompt) and image-instruction alignment issues (where the scene diverges from the specified layout). In both cases, the Inspect phase correctly identifies the problem and the Refine phase issues targeted corrections.

**Figure 4:** Qualitative examples of process-driven generation with conflict reasoning. Top: standard generation. Middle: instruction-intermediate conflict detection and correction. Bottom: image-instruction alignment correction.

Generation Gallery

Process-driven generation produces diverse, high-quality images across a wide range of subjects and styles — from photorealistic portraits to complex multi-object scenes.

Conclusion

We introduce a novel process-driven interleaved reasoning paradigm that teaches a unified multimodal model to build images stroke by stroke, decision by decision, via a co-evolving loop of textual planning, visual sketching, self-inspection, and refinement. Our method hinges on three breakthroughs: scene-graph subsampling for contradiction-free incremental instructions, self-sampled critique traces to learn from the model's errors, and end-to-end training of BAGEL-7B to autoregressively emit interleaved text and image tokens.

GenEval: 0.79 → 0.83 (+5%)

WISE: 0.70 → 0.76 (+6%)

We lift the public BAGEL-7B from 0.79 to 0.83 (+5% absolute gain) on GenEval and from 0.70 to 0.76 (+6% absolute gain) on WISE.

Looking forward, we will extend unified multimodal reasoning to videos and 3D space, and enable real-time human-in-the-loop control — unlocking controllable, truthful, and interpretable image synthesis.

Cite This Work

@article{zhang2026think,
  title={Think in Strokes, Not Pixels: Process-Driven Image Generation
         via Interleaved Reasoning},
  author={Zhang, Lei and Tian, Junjiao and Fan, Zhipeng and Li, Kunpeng
          and Wang, Jialiang and Chen, Weifeng and Georgopoulos, Markos
          and Juefei-Xu, Felix and Bao, Yuxiang and McAuley, Julian
          and Li, Manling and He, Zecheng},
  journal={arXiv preprint arXiv:2604.04746},
  year={2026}
}