arXiv:2604.07413 · cs.CV · Apr 2026

FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Xiangru Jian*, Hao Xu*, Wei Pang* · University of Waterloo · University of Sydney · CUHK-Shenzhen
*Equal contribution.

12,972 Annotated Samples

18 MLLMs Evaluated

+25.6% SFT Improvement

Manufacturing AI needs more than object recognition — it needs to identify exact model numbers, detect microscopic surface defects, and verify complex assemblies. FORGE is the first benchmark to rigorously test all three, revealing that domain knowledge, not visual grounding, is where today's best models fall short.

Abstract

The manufacturing sector is rapidly adopting multimodal large language models (MLLMs) to move beyond simple perception toward autonomous execution. Yet current evaluation benchmarks fail to reflect the rigorous demands of real-world manufacturing — they lack fine-grained domain semantics, support only 2D images, and don't test the nuanced reasoning that factory automation actually requires.

FORGE addresses this gap with a high-quality multimodal dataset of 12,972 annotated samples that combines 2D images and 3D point clouds. The benchmark covers three cognitively demanding manufacturing tasks: workpiece verification, structural surface inspection, and assembly verification — each requiring fine-grained annotations down to exact model numbers.

Through systematic evaluation of 18 state-of-the-art MLLMs across multiple settings, FORGE reveals a clear finding: task-domain knowledge and morphology understanding, not visual grounding, are the primary bottleneck for current models. Furthermore, fine-tuning a 3B-parameter model on FORGE's training split yields a gain of 25.6 percentage points on workpiece verification, bringing it close to a model 78× larger.

Introduction

Modern manufacturing increasingly relies on AI vision systems to automate quality control and assembly verification tasks that were previously handled by human experts. Multimodal large language models offer a promising path because they can process images and text together to perform complex reasoning. But there's a problem: existing benchmarks are not designed for the demands of real manufacturing environments.

Prior benchmarks like MMAD, MME-Industry, and DesignQA each cover part of the picture — but none combines real-world data, 3D point clouds, fine-grained model-number annotations, and multiple cognitive task types in a single rigorous framework. FORGE fills this gap with four core contributions:

High-Quality Multimodal Dataset

First fine-grained manufacturing dataset with synchronized 2D images and 3D point clouds, annotated with exact model numbers — not just coarse categories.

Three Real-World Cognitive Tasks

WORKVERI, SURFINSP, and ASSYVERI capture three demanding inspection scenarios in real manufacturing: workpiece verification, structural surface inspection, and assembly verification.

Extensive MLLM Benchmarking

Systematic evaluation of 18 MLLMs — both open-source and closed-source — across 4 evaluation settings with bottleneck analysis that identifies where models actually fail.

SFT Training Resource

FORGE's training split enables domain-specific fine-tuning. A 3B-parameter model fine-tuned on FORGE gains 25.6 percentage points on workpiece verification and comes within one point of a 235B model that is 78× larger.

Table 1: Comparison of FORGE with existing manufacturing and industrial benchmarks. FORGE is the only benchmark combining 2D images, 3D point clouds, real-world data, and fine-grained model-number annotations.
Benchmark	Image	3D	Data Source	Scenario	Workpiece	Model No.	Samples
MMAD	✓	✗	Real	✓	✓	✗	39,672
MME-Industry	✓	✗	Real	✗	✗	✗	1,050
DesignQA	✗	✗	Synthetic	✓	✓	✗	1,451
FailureSensorIQ	✗	✗	Real	✓	✗	✗	8,296
EngDesign	✓	✗	Synthetic	✓	✗	✗	1,717
FORGE (Ours)	✓	✓	Real	✓	✓	✓	12,972
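
As a sanity check on the caption's claim, the rows of Table 1 can be encoded as records and filtered. This is a minimal sketch: the field names (`image`, `three_d`, `real`, `model_no`) are illustrative labels introduced here, not identifiers from the FORGE release, and `real` is derived from the Real/Synthetic column.

```python
# Rows transcribed from Table 1. Field names are illustrative assumptions.
BENCHMARKS = [
    {"name": "MMAD", "image": True, "three_d": False, "real": True,
     "model_no": False, "samples": 39672},
    {"name": "MME-Industry", "image": True, "three_d": False, "real": True,
     "model_no": False, "samples": 1050},
    {"name": "DesignQA", "image": False, "three_d": False, "real": False,
     "model_no": False, "samples": 1451},
    {"name": "FailureSensorIQ", "image": False, "three_d": False, "real": True,
     "model_no": False, "samples": 8296},
    {"name": "EngDesign", "image": True, "three_d": False, "real": False,
     "model_no": False, "samples": 1717},
    {"name": "FORGE", "image": True, "three_d": True, "real": True,
     "model_no": True, "samples": 12972},
]


def combines_all(row):
    """True if a benchmark has 2D images, 3D data, real data, and model numbers."""
    return row["image"] and row["three_d"] and row["real"] and row["model_no"]


complete = [row["name"] for row in BENCHMARKS if combines_all(row)]
print(complete)  # ['FORGE']
```

Under this encoding, FORGE is the only row that passes all four checks, which agrees with the caption.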

The FORGE Benchmark

Dataset Overview

FORGE contains 12,972 samples spanning a diverse range of manufacturing workpieces — bolts, screws, brackets, gears, and assemblies. Each sample combines a 2D image with a 3D point cloud captured from the same physical part, giving models complementary visual information at different scales and modalities.

Crucially, annotations go beyond coarse categories. Each sample is labeled with the exact model number of the workpiece — the fine-grained identifier that manufacturing quality control actually depends on. This is what distinguishes FORGE from all prior benchmarks, and it's the property that reveals the domain knowledge gap in current MLLMs.

Figure 3: Sample data from the FORGE dataset. Left: 3D multi-view images used in workpiece verification tasks. Right: 2D surface images used in structural surface inspection. Annotations include workpiece type, exact model number, defect type, and defect location.

Three Manufacturing Tasks

FORGE defines three tasks that mirror the core cognitive challenges in manufacturing quality control. Each task requires a different combination of visual and linguistic reasoning capabilities:

WORKVERI

Workpiece Verification

Given three-view images of a workpiece, determine if it is defective and identify the defect type. Tests fine-grained morphology understanding and knowledge of what defects look like on specific part types.

SURFINSP

Structural Surface Inspection

Detect surface defects (cracks, dents, corrosion) in 2D images at microscopic scale. This is the hardest task: it requires detecting subtle visual anomalies that challenge even expert human inspectors.

ASSYVERI

Assembly Verification

Determine whether multiple components are correctly assembled. Requires spatial reasoning about component relationships and understanding of how parts should fit together.

Illustration: three manufacturing quality control scenarios. Left: workpiece defect verification with 3D point cloud overlay. Center: microscopic surface crack detection. Right: multi-part assembly alignment checking.

Evaluation Settings

FORGE evaluates MLLMs under four settings: Zero-Shot (a standard single image), Reference-Conditioned, or Ref-Cond (the model is also given a reference image of the correct part), In-Context Demonstration, or ICD (few-shot examples), and Three-View, or 3V (multi-angle images from a 3D scanner).

Two input granularity levels are tested: model-level (identify exact model number) and workpiece-level (identify part category only). The gap between these two levels quantifies the fine-grained recognition challenge.
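
The gap between the two levels can be made concrete as workpiece-level accuracy minus model-level accuracy over the same samples. The sketch below uses hand-written toy records. The record layout, the model-number strings, and the function names are assumptions for illustration, not part of FORGE's evaluation code.

```python
def accuracy(records, pred_key, gold_key):
    """Fraction of records whose prediction exactly matches the gold label."""
    if not records:
        raise ValueError("no records to score")
    hits = sum(1 for r in records if r[pred_key] == r[gold_key])
    return hits / len(records)


def granularity_gap(records):
    """Return (workpiece-level accuracy, model-level accuracy, their difference)."""
    workpiece = accuracy(records, "pred_workpiece", "gold_workpiece")
    model = accuracy(records, "pred_model", "gold_model")
    return workpiece, model, workpiece - model


# Toy records with made-up labels (not FORGE data).
records = [
    {"gold_workpiece": "bolt", "pred_workpiece": "bolt",
     "gold_model": "M8x30", "pred_model": "M8x30"},
    {"gold_workpiece": "bolt", "pred_workpiece": "bolt",
     "gold_model": "M8x40", "pred_model": "M8x30"},
    {"gold_workpiece": "washer", "pred_workpiece": "washer",
     "gold_model": "W-10", "pred_model": "W-12"},
    {"gold_workpiece": "gear", "pred_workpiece": "bracket",
     "gold_model": "G-20", "pred_model": "B-05"},
]

wp_acc, model_acc, gap = granularity_gap(records)
print(wp_acc, model_acc, gap)  # 0.75 0.25 0.5
```

In this toy case the model gets the part category right three times out of four but the exact model number only once, so the gap is 0.5.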

Experiments & Results

18 MLLMs were evaluated across all FORGE tasks and settings — including major open-source models (Gemma-3-27B, InternVL3-78B, Llama-4-MAV, Qwen2.5-VL series) and closed-source frontier models (GPT-4o, Claude Opus 4.5, Gemini-2.5-Flash, o3). The results reveal systematic weaknesses that cut across model families and scales.

Figure 4: Mean accuracy across all 18 MLLMs on the FORGE benchmark. Even the best closed-source models achieve only ~80% in the easiest setting (workpiece-level zero-shot), with performance dropping sharply on fine-grained model-number tasks.

Key Findings

SURFINSP Is Hardest

Surface inspection remains the most challenging task across all models. Microscopic crack detection requires visual sensitivity that current MLLMs consistently lack, regardless of model size.

Domain Knowledge Bottleneck

The Reference-Conditioned strategy is inconsistent — providing a reference image doesn't reliably help. This confirms that the problem is upstream: models lack the manufacturing domain knowledge to interpret what they're seeing.

3D Context Can Hinder

Surprisingly, three-view zero-shot often outperforms Ref-Cond and ICD settings. MLLMs struggle to integrate 3D contextual information effectively — adding more context can hurt when models lack the framework to use it.

Model-Number Gap

Model-level tasks (exact model number identification) are substantially harder than workpiece-level tasks. This fine-grained recognition gap — the core novelty of FORGE — is where all models struggle most.

Error Analysis

Analysis of failure cases reveals recurring patterns: models over-rely on material properties ("this looks like plastic") instead of morphological features, and fail on exact model identification even when they show emergent understanding of related service conditions. For example, Gemini-2.5-Flash incorrectly identifies a metal Flat Washer as "plastic/nylon" based on color alone.

Bottleneck Analysis

To understand why models fail, the authors conducted targeted ablation experiments that isolate visual grounding from domain knowledge. The results are clear: the bottleneck is not in how models see, but in what they know.

Visual Grounding Is NOT the Bottleneck

When parts are labeled with letters (Set-of-Mark prompting) and models are asked to identify them by coordinate, performance is acceptable — models can localize and reference parts correctly. Single-image and cross-image visual grounding work adequately. The problem lies elsewhere.
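
Set-of-Mark prompting attaches a visible mark to each candidate region so that a question can refer to regions by mark. The sketch below covers only the bookkeeping side, assuming part bounding boxes are already available. The box format, the prompt wording, and all function names are hypothetical, and drawing the marks onto the image is left out.

```python
import string


def assign_marks(boxes):
    """Map letters A, B, C, ... to boxes given as (x_min, y_min, x_max, y_max)."""
    if len(boxes) > len(string.ascii_uppercase):
        raise ValueError("more boxes than single-letter marks")
    marks = {}
    for letter, (x0, y0, x1, y1) in zip(string.ascii_uppercase, boxes):
        marks[letter] = {
            "box": (x0, y0, x1, y1),
            "center": ((x0 + x1) / 2, (y0 + y1) / 2),  # where a mark could be drawn
        }
    return marks


def mark_at(marks, point):
    """Return the letter of the first box containing `point`, or None."""
    px, py = point
    for letter, info in marks.items():
        x0, y0, x1, y1 = info["box"]
        if x0 <= px <= x1 and y0 <= py <= y1:
            return letter
    return None


def grounding_question(point):
    """Build a coordinate-based grounding question (wording is illustrative)."""
    return (
        "Each part in the image carries a letter mark. "
        f"Which letter marks the part at pixel ({point[0]}, {point[1]})? "
        "Answer with one letter."
    )


boxes = [(10, 10, 50, 50), (60, 10, 120, 40), (30, 70, 90, 130)]
marks = assign_marks(boxes)
query = (80, 25)
expected = mark_at(marks, query)  # gold answer used to score the model's reply
print(grounding_question(query))
print(expected)  # B
```

Scoring then reduces to comparing the model's single-letter reply with `expected`, which isolates localization from any knowledge of what the part is.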

Domain Knowledge Is the Bottleneck

Missing-part detection experiments show that MLLMs fail to recognize specific model numbers even when visual grounding succeeds. The gap between knowing where a part is and knowing what it is (its exact type, model, defect pattern) is entirely a domain knowledge problem.

3D Point Clouds Need Visual Rendering

Feeding raw 3D point cloud coordinates as serialized text tokens achieves near-random accuracy. MLLMs cannot process 3D data in text form — they require 2D visual projections of the point cloud. This confirms that visual rendering of 3D data (not raw coordinate arrays) is the correct input modality.
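
One simple way to turn a point cloud into 2D visual input is an orthographic projection along each axis. The sketch below is a pure-Python illustration under stated assumptions: the mapping of axes to "front", "side", and "top" is arbitrary, the depth-image format is invented for this example, and FORGE's actual rendering pipeline is not described here.

```python
def project_view(points, drop_axis, size=64):
    """Orthographically project 3D points onto a size x size depth image.

    drop_axis is the axis viewed along (0 = x, 1 = y, 2 = z). Each pixel
    stores the largest coordinate along that axis among the points that
    land in it, or None where no point lands. Row 0 holds the smallest
    coordinate of the second kept axis; flip rows if a viewer expects
    the opposite convention.
    """
    if not points:
        raise ValueError("empty point cloud")
    keep = [axis for axis in range(3) if axis != drop_axis]
    lows = [min(p[axis] for p in points) for axis in keep]
    highs = [max(p[axis] for p in points) for axis in keep]
    # One shared scale preserves the aspect ratio; guard against zero extent.
    extent = max(h - l for h, l in zip(highs, lows)) or 1.0
    image = [[None] * size for _ in range(size)]
    for p in points:
        col = min(size - 1, int((p[keep[0]] - lows[0]) / extent * (size - 1)))
        row = min(size - 1, int((p[keep[1]] - lows[1]) / extent * (size - 1)))
        depth = p[drop_axis]
        if image[row][col] is None or depth > image[row][col]:
            image[row][col] = depth
    return image


def three_views(points, size=64):
    """Return three axis-aligned depth images (axis naming is an assumption)."""
    return {
        "front": project_view(points, 1, size),
        "side": project_view(points, 0, size),
        "top": project_view(points, 2, size),
    }


# Toy input: the eight corners of a unit cube.
cube = [(float(x), float(y), float(z))
        for x in (0, 1) for y in (0, 1) for z in (0, 1)]
views = three_views(cube, size=4)
filled_top = sum(1 for row in views["top"] for px in row if px is not None)
print(filled_top)  # 4
```

Each depth image can then be written out as a grayscale picture and passed to the model in place of serialized coordinates.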

Figure 7: Bottleneck analysis experiments. Visual grounding tasks show acceptable performance, confirming that the primary limitation is domain knowledge rather than visual perception capability.

SFT Training Resource

+25.6 percentage points on WorkVeri 3V after fine-tuning Qwen2.5-VL-3B on FORGE (a 90.8% relative gain) · +6.5 percentage points on AssyVeri Image

Beyond evaluation, FORGE functions as an actionable training resource. Fine-tuning Qwen2.5-VL-3B on FORGE's training split produces dramatic performance gains across the most challenging tasks — without any architectural changes or additional data sources.

Figure 6: Performance comparison of Qwen2.5-VL-3B before (base) and after fine-tuning on FORGE. WorkVeri 3V improves from 28.2% to 53.8% (+25.6 points). AssyVeri Image improves from 24.0% to 30.5% (+6.5 points).

The 90.8% relative improvement on WorkVeri 3V brings the 3B-parameter model to 53.8%, within 0.6 points of Qwen3-VL-235B at 54.4%, a model roughly 78× larger. Crucially, these gains generalize to held-out product categories not seen during fine-tuning, which indicates domain adaptation rather than simple memorization.
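
The headline numbers follow from the reported accuracies by simple arithmetic. The short check below uses only the figures quoted in this section and assumes nominal parameter counts of 3B and 235B taken from the model names.

```python
base_acc = 28.2    # WorkVeri 3V, Qwen2.5-VL-3B before fine-tuning (%)
sft_acc = 53.8     # WorkVeri 3V, after fine-tuning on FORGE (%)
large_acc = 54.4   # WorkVeri 3V, Qwen3-VL-235B (%)

absolute_gain = sft_acc - base_acc               # percentage points
relative_gain = absolute_gain / base_acc * 100   # percent of the base accuracy
size_ratio = 235 / 3                             # nominal parameter counts (assumed)
remaining_gap = large_acc - sft_acc              # percentage points

print(f"absolute gain: {absolute_gain:.1f} points")  # absolute gain: 25.6 points
print(f"relative gain: {relative_gain:.1f}%")        # relative gain: 90.8%
print(f"size ratio: {size_ratio:.0f}x")              # size ratio: 78x
print(f"remaining gap: {remaining_gap:.1f} points")  # remaining gap: 0.6 points
```

The distinction matters when reading the results: "+25.6" is an absolute gain in percentage points, while "90.8%" is that gain expressed relative to the 28.2% baseline.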

Conclusion

FORGE introduces a fine-grained multimodal benchmark built from real-world manufacturing data — 12,972 samples combining 2D images and 3D point clouds across three cognitively demanding tasks. It is the first benchmark to provide exact model-number annotations and to evaluate MLLMs on the kind of precision reasoning that manufacturing automation actually requires.

Evaluation of 18 state-of-the-art MLLMs reveals a clear finding: current models can handle macroscopic part recognition but consistently fail at fine-grained reasoning and microscopic surface analysis. Visual grounding is not the limiting factor — the bottleneck is manufacturing domain knowledge and morphology understanding. This insight should guide where the community invests in future model development.

FORGE is also demonstrated to be a valuable training resource: a 3B-parameter model fine-tuned on FORGE achieves performance matching a model 78× larger. As manufacturing AI matures, benchmarks like FORGE that demand genuine domain expertise will be essential for measuring and driving real progress.
