Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
🔗 Model on HuggingFace

Abstract
We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences.
Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion-parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist: it sits in the top tier of open-source models for general capabilities while outperforming proprietary models in the depth of specialized scientific tasks.
What does "1 trillion parameters" actually mean?
Model parameters are the learned weights that encode all the knowledge and reasoning ability of a neural network. GPT-3 had 175 billion parameters; current frontier models are in the hundreds of billions. 1 trillion (10¹²) parameters is roughly 5–10× the size of typical large models. However, Intern-S1-Pro uses a Mixture-of-Experts (MoE) architecture where only a fraction of parameters are "active" during any single inference — labeled "1T-A22B", meaning 1T total but only ~22B active per forward pass. This gives the representational capacity of a trillion-parameter dense model at a fraction of the compute cost.
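As a back-of-envelope check on the "1T-A22B" label above, the active-compute fraction is simple arithmetic (parameter counts are the approximate figures from the label, not exact model sizes):

```python
# Rough arithmetic for the "1T-A22B" label: total vs. active parameters.
total_params = 1_000e9   # ~1 trillion total parameters in the MoE
active_params = 22e9     # ~22 billion parameters active per forward pass

active_fraction = active_params / total_params
# Only ~2.2% of the weights participate in any single token's forward pass,
# which is why per-token compute is far below that of a dense 1T model.
```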
1. Introduction
The advent of Large Language Models (LLMs) and Visual Language Models (VLMs) has fundamentally transformed the landscape of artificial intelligence, offering unprecedented capabilities in reasoning, generation, and multimodal understanding. In the domain of AI for Science (AI4S), these foundation models have emerged as critical tools for accelerating scientific discovery, enabling researchers to tackle complex problems ranging from protein structure prediction to materials design.
To build an effective scientific foundation model, scaling model size is imperative due to the immense diversity inherent in scientific domains. Compared to natural language, science encompasses much more specialized fields — each with its own unique "language", including domain-specific notations, knowledge, and reasoning patterns. A scientific foundation model should possess sufficient capacity to master a wide array of scientific tasks while retaining general text and vision capabilities.
In this work, we introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Following the three-layer SAGE (Synergistic Architecture for Generalizable Experts) framework, we demonstrate that a sufficiently large generalist model, when trained jointly on general and specific tasks, can outperform specialized models in several scientific tasks — contrary to the common belief that specialized models are superior for niche tasks.
At the engineering level, we achieve deep optimization between the XTuner training framework and the LMDeploy inference engine, allowing Intern-S1-Pro to scale to 4× the size of its predecessor (Intern-S1) while incurring only a ~20% reduction in training efficiency.
2. Architecture
Intern-S1-Pro is derived from Intern-S1 through expert expansion, incorporating a Grouped Routing design to ensure stable trillion-scale MoE training.
2.1 Grouped Routing
For the training of ultra-large-scale MoE models, Expert Parallelism (EP) serves as the core technical approach to mitigate GPU memory and communication overheads. However, the expert load imbalance caused by the traditional Top-k routing strategy leads to cross-device load imbalance during expert parallel training.
Mixture-of-Experts (MoE) and the load balancing problem
In a standard Transformer, every token passes through the same FFN weights. In a MoE model, each layer has multiple parallel "expert" FFNs, and a learned router sends each token to only the top-k experts. This lets you increase model capacity without proportionally increasing compute. The problem: if the router always sends tokens to the same few popular experts, those GPU devices get overloaded while others sit idle. Traditional Top-K routing causes this imbalance. Grouped Router fixes it by partitioning experts into groups and selecting exactly one expert per group — guaranteeing each device gets an equal workload regardless of input distribution.
We propose to replace the traditional Top-K Router with a Grouped Router to achieve absolute load balancing across devices under the 8-way expert parallelism (EP8) training strategy. In the Grouped Router architecture, all experts are uniformly partitioned into G mutually disjoint groups; within each group g, only the top-(K/G) experts with the highest scores are selected.
Combined with the Intern-S1-Pro 1T configuration (K = 8) and the EP8 training strategy, we divide all experts into 8 groups and select the Top-1 expert within each group, achieving absolute load balancing across devices. This approach not only significantly improves training efficiency but also eliminates the out-of-memory (OOM) risk that expert overload would otherwise pose during training.
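The per-group Top-1 selection can be sketched in a few lines of numpy. This is an illustrative sketch of the routing rule, not the actual training kernel:

```python
import numpy as np

def grouped_route(logits: np.ndarray, n_groups: int) -> np.ndarray:
    """Pick the top-1 expert within each of `n_groups` disjoint, equal-sized
    groups. Because exactly one expert per group is always chosen, each
    expert-parallel device (one group per device under EP8) receives an
    identical token load regardless of the input distribution."""
    n_experts = logits.shape[-1]
    assert n_experts % n_groups == 0
    group_size = n_experts // n_groups
    grouped = logits.reshape(n_groups, group_size)
    local_best = grouped.argmax(axis=-1)            # winner index within each group
    offsets = np.arange(n_groups) * group_size      # map back to global expert ids
    return local_best + offsets                     # exactly one expert per group

# Example: 16 experts in 8 groups of 2 -- one expert is chosen from each pair.
logits = np.random.randn(16)
experts = grouped_route(logits, n_groups=8)
```

Contrast with plain Top-K over all 16 experts, which can select several winners from the same group and overload that group's device.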
2.2 Straight-Through Estimator for Sparse Expert Routing
MoE architectures scale model capacity by routing each input token to a small subset of K out of N experts via Top-K selection. The layer output is:

y = Σ_{i ∈ TopK(z)} w_i E_i(x)

where E_i denotes the i-th expert network, z the router logits, and w_i the routing weight assigned to each selected expert.
Why is the Straight-Through Estimator needed here?
The Top-K selection operation (picking the highest-scoring experts) is not differentiable — it's a discrete argmax, so gradients cannot flow back through it to update the router's weights. This means the router could get "frozen" and fail to learn. The Straight-Through Estimator (STE) is a trick: during the forward pass, use the hard discrete selection; during the backward pass, pretend it was a soft continuous function (a temperature-scaled softmax) and compute gradients through that instead. This allows all router parameters to receive gradient updates every step, keeping the router trainable at scale.
We introduce the Straight-Through Estimator (STE) to decouple the forward and backward passes of the routing operation. The STE routing weight is constructed as:

ŵ_i = sg(w_i − p_i^τ) + p_i^τ

with w_i the hard Top-K routing weight.
where p_i^τ = softmax(z/τ)_i is a temperature-scaled routing probability and sg(·) is the stop-gradient operator. Since gradients flow only through p^τ, the gradient of any loss L with respect to logit z_j is:

∂L/∂z_j = Σ_i (∂L/∂ŵ_i) · ∂p_i^τ/∂z_j = (1/τ) Σ_i (∂L/∂ŵ_i) p_i^τ (δ_ij − p_j^τ)

so every logit, selected or not, receives a nonzero update signal.
Through STE, the router receives consistent data-driven feedback throughout training, enabling all router embeddings to be updated in every pass.
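A minimal numpy sketch of the STE identity ŵ_i = sg(w_i − p_i^τ) + p_i^τ. Here sg(·) is modeled as a plain copy (its forward behavior); the renormalized hard gate is an assumed choice, and a real implementation would rely on an autograd framework to route gradients through the soft p^τ term:

```python
import numpy as np

def softmax(z, tau=1.0):
    e = np.exp((z - z.max()) / tau)
    return e / e.sum()

def ste_routing_weights(z, k=2, tau=1.0):
    """Sketch of STE routing: the forward value equals the hard Top-K
    weights, while in an autograd framework gradients would flow through
    the temperature-scaled softmax p_tau. sg(.) is identity in forward."""
    p_tau = softmax(z, tau)
    topk = np.argsort(z)[-k:]
    hard = np.zeros_like(p_tau)
    hard[topk] = p_tau[topk] / p_tau[topk].sum()  # renormalized hard gate (assumed choice)
    sg = lambda x: x.copy()                       # stop-gradient: identity in forward
    return sg(hard - p_tau) + p_tau               # == hard, exactly, in the forward pass

z = np.array([2.0, 0.5, -1.0, 0.1])
w = ste_routing_weights(z, k=2)
# Only the two top-scoring experts receive mass in the forward pass,
# yet all four logits would receive gradient through p_tau.
```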
2.3 Vision Encoder
Intern-S1-Pro employs a Native Vision Transformer (ViT) as the vision encoder. The encoder processes images at native resolution, where the visual token count depends on the original input resolution rather than a fixed image size. Visual tokens extracted from the ViT pass through a multilayer perceptron (MLP) projector that maps visual features into the embedding space of the language model.
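Under native-resolution processing, the visual token count scales with the input area. A sketch of that relationship, where the patch size and merge factor are illustrative assumptions rather than the model's published configuration:

```python
def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Illustrative token count for a native-resolution ViT: tokens scale
    with the input area. `patch` (ViT patch size) and `merge` (spatial
    downsampling before the MLP projector) are assumed values."""
    h_patches, w_patches = height // patch, width // patch
    return (h_patches // merge) * (w_patches // merge)

# A 448x448 input produces 4x the visual tokens of a 224x224 input,
# rather than being resized to a fixed token budget.
```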
The encoder is trained with contrastive learning on approximately 300 million image–text pairs, drawn from CC12M, LAION-COCO, SBU Caption, LAION-2B-Multi, and Wukong.
2.4 FoPE — Fourier Position Encoding
Traditional positional encoding methods such as RoPE (Rotary Position Embedding) impose a particle-like representation on all modalities, treating information as localized, discrete units. This creates a representational gap for physical signals (images, audio, video) that inherently have wave-interference and spectral properties.
FoPE (Fourier Position Encoding) addresses this limitation by reimagining how transformer models encode position and structure — treating each dimension as a Fourier series of different frequency components, separating information more effectively and mitigating spectral damage. Inadequately trained frequency components are clipped to remove their harmful influence.
RoPE vs FoPE: positional encoding for multimodal signals
Rotary Position Embedding (RoPE) is the dominant positional encoding in modern LLMs (LLaMA, GPT-4, etc.). It encodes token position by rotating embedding vectors, and handles text sequences well. But images, audio, and scientific signals have wave-like, spectral structure — a pixel's meaning depends on its 2D neighborhood in ways that 1D sequential rotation doesn't model well. FoPE treats each embedding dimension as a sinusoidal frequency component (like a Fourier series), which is a more natural representation for spatially or temporally structured data. The "clipping" step removes frequency components that weren't trained enough and would add noise.
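The frequency-bank-plus-clipping idea can be sketched roughly. This is a heavily simplified stand-in under assumed choices (RoPE-style inverse frequencies, a fixed clip threshold), not the published FoPE formulation:

```python
import numpy as np

def fope_frequencies(dim: int, base: float = 10000.0, clip_floor: float = 1e-3):
    """Sketch of FoPE's frequency treatment (assumed simplification): start
    from RoPE-style inverse frequencies, then zero out components below
    `clip_floor` -- standing in for "inadequately trained" frequencies --
    so they contribute no noisy rotation."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.where(inv_freq < clip_floor, 0.0, inv_freq)

def fope_encode(pos: int, dim: int = 16):
    """Each pair of dimensions is a sinusoid at its own frequency,
    i.e. the position code is a bank of Fourier components."""
    freqs = fope_frequencies(dim)
    angles = pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])
```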
2.5 Time-series Encoder
Time series is a core scientific data modality, capturing temporal evolution of complex processes. The time series module of Intern-S1-Pro features an adaptive subsampling module that partitions continuous signals into local segments (patches), captures local dynamics within each patch, and models long-range dependencies across segments. The number of temporal frames is kept within a controllable range by adaptively determining patch size and stride based on the signal and its sampling rate.
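The frame-budget idea can be sketched with a length-only heuristic. The real module also conditions on the sampling rate; this is an assumed simplification, not the released algorithm:

```python
def adaptive_patching(n_samples: int, max_frames: int = 1024):
    """Sketch of adaptive subsampling (assumed heuristic): choose the
    smallest non-overlapping patch size that keeps the number of temporal
    frames within `max_frames`, regardless of raw signal length."""
    patch = max(1, -(-n_samples // max_frames))   # ceil(n_samples / max_frames)
    stride = patch                                # non-overlapping patches
    n_frames = -(-n_samples // patch)             # ceil(n_samples / patch)
    return patch, stride, n_frames

# Both a short recording and a million-sample recording fit the same budget.
```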
The enhanced module expands coverage to: physiological signal analysis (EEG-based depression detection), marmoset vocalization recognition, and electrocardiography abnormality monitoring, handling sequences from 100 to 10⁶ time steps.
3. Pre-training
Intern-S1-Pro employs a total of 6T tokens of image-text and text data for continued pre-training, with a key upgrade in caption data tailored for scientific images.
3.1 Caption Pipeline
Scientific images from web sources suffer from brief, low-alignment captions. In contrast, PDFs represent the primary carrier of scientific visual content, containing high-information-density figures including experimental results, statistical plots, structural diagrams, and formula derivations.
Why scientific caption quality matters so much
Multimodal models learn to connect visual features to text by training on image–text pairs. If the caption for a plot says "Figure 3" but doesn't explain what the axes mean or what trend to observe, the model learns nothing from that figure. Web-scraped scientific images are especially bad — captions are often just figure numbers or short titles. This paper builds a specialized pipeline to extract sub-figures from PDFs and generate dense, 1000-word captions using domain-expert models (InternVL3.5-241B for science figures, CapRL-32B for others), then filters with a text quality discriminator. The result: 270B tokens of rich scientific image–text data that would otherwise not exist.
We independently constructed a large-scale PDF data production pipeline: extracting sub-figures from massive PDF corpora using MinerU 2.5 for layout analysis, precise deduplication via perceptual hashing (pHash), topic classification and model routing (scientific images → InternVL3.5-241B; non-scientific → CapRL-32B), and a 0.5B-parameter text quality discriminator for filtering. The result: approximately 270B tokens of high-quality scientific image–text caption data.
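The deduplication step can be illustrated with a dependency-free perceptual hash. The pipeline uses pHash; the average-hash below is a simpler stand-in that only demonstrates the hash-and-compare idea:

```python
import numpy as np

def average_hash(img: np.ndarray, hash_size: int = 8) -> int:
    """Dependency-free stand-in for perceptual hashing (the pipeline uses
    pHash; this average-hash only illustrates dedup). `img` is a 2D
    grayscale array; the hash is a 64-bit code for hash_size=8."""
    h, w = img.shape
    # Crude downsample to hash_size x hash_size by block averaging.
    small = img[:h - h % hash_size, :w - w % hash_size]
    bh, bw = small.shape[0] // hash_size, small.shape[1] // hash_size
    small = small.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    """Bit distance between two hashes; near-duplicates score low."""
    return bin(a ^ b).count("1")

# Dedup keeps one sub-figure per cluster of hashes within a small
# Hamming distance of each other.
```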
3.2 Resolving Conflicts Between Scientific and Textual Data
Directly mixing scientific data (structured, high logical determinism) with general data (semantic depth, linguistic diversity) can lead to distribution shift and negative transfer. Intern-S1-Pro adopts three strategies:
Structured Scientific Data Transformation
Heterogeneous scientific input-output pairs from databases like PubChem are converted to grammatically correct, narrative text via Template Construction and Task Form Transformation, aligning scientific data with the representation style of general data.
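Template Construction can be illustrated as follows. The record fields and template wording are hypothetical, not the actual PubChem schema or the paper's templates:

```python
def to_narrative(record: dict) -> str:
    """Convert a structured property record into fluent narrative text so it
    matches the register of general pre-training data (hypothetical
    template; field names are illustrative)."""
    return (f"The compound {record['name']} (SMILES: {record['smiles']}) "
            f"has a molecular weight of {record['mol_weight']} g/mol and "
            f"a water solubility of about {record['solubility']} mg/mL.")

sample = {"name": "aspirin", "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
          "mol_weight": 180.16, "solubility": 3.0}
text = to_narrative(sample)
# The key-value record becomes a grammatical sentence instead of a raw tuple.
```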
Scientific Data Diversification
Prompt Diversification and the Rollout mechanism prevent overfitting on repetitive scientific sequences (e.g., protein sequences). By combining scientific prior knowledge with a strong base model to generate complete reasoning chains, knowledge recall is transformed into logical deduction.
System Prompt Isolation
Mutually exclusive system-level prefixes are injected for scientific and general data during the training cycle, creating independent contextual processing environments for the model. This reduces data conflicts and improves model stability.
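System Prompt Isolation amounts to routing each training sample through one of two mutually exclusive prefixes. The prefix strings below are illustrative placeholders, not the ones used in training:

```python
# Illustrative, mutually exclusive system-level prefixes (placeholder text):
SCI_PREFIX = "<|system|> Scientific-data mode."
GEN_PREFIX = "<|system|> General-data mode."

def build_training_sample(text: str, is_scientific: bool) -> str:
    """Inject exactly one of the two prefixes, giving scientific and general
    data independent contextual processing environments."""
    prefix = SCI_PREFIX if is_scientific else GEN_PREFIX
    return f"{prefix}\n{text}"
```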
4. Post-Training
4.1 Stable Mixed-Precision Reinforcement Learning for Sparse MoE Models
FP8 vs BF16: why mixed precision is tricky for RL
Neural networks normally train in BF16 (16-bit bfloat) to save memory while preserving enough numeric precision. FP8 (8-bit float) halves memory again but is much more numerically sensitive. This matters for RL because the training engine (XTuner) and the rollout/inference engine (LMDeploy) must produce identical probabilities for the same inputs — any mismatch accumulates as policy divergence that can cause training instability. The paper's key contribution here is a suite of fixes: aligning numerically sensitive ops (RMSNorm, softmax) between the two engines, replaying expert routing decisions so the same experts are selected in training as in rollout, and using importance sampling corrections for any remaining policy mismatch.
Scaling RL to trillion-parameter MoE models presents formidable memory challenges. With Intern-S1-Pro featuring 4× the expert count of Intern-S1, we adopt FP8 quantization for the RL phase but implement a comprehensive stabilization framework to preserve performance:
- Operator-level precision alignment: Reduced precision gaps between LMDeploy rollout engine and XTuner training engine in numerically sensitive components (RMSNorm, router softmax, positional embedding).
- Rollout router replay: Expert routing indices are recorded per layer during rollout and replayed during policy updates to enforce expert selection consistency.
- Targeted mixed-precision: Expert linear layers quantized to FP8; non-expert components kept in BF16; FP32 LM head for log-probability fidelity.
- Dual importance sampling: Modified REINFORCE objective with training-inference mismatch calibration and off-policy bias correction.
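The targeted mixed-precision bullet can be illustrated with a simulated e4m3 cast. This is an assumption-laden sketch (per-tensor dynamic scaling, simplified mantissa rounding, no subnormals), not the XTuner/LMDeploy kernels:

```python
import numpy as np

E4M3_MAX = 448.0      # largest finite value representable in FP8 e4m3
MANTISSA_BITS = 3     # e4m3 stores 3 explicit mantissa bits

def fake_fp8_cast(x: np.ndarray) -> np.ndarray:
    """Simulate an FP8 e4m3 cast in float: clamp to the e4m3 range and
    round the mantissa to 3 explicit bits. A sketch of the numeric
    behavior only; it ignores subnormals and the exact exponent bias."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    m, e = np.frexp(x)                                        # x = m * 2**e, |m| in [0.5, 1)
    m = np.round(m * 2 ** (MANTISSA_BITS + 1)) / 2 ** (MANTISSA_BITS + 1)
    return np.ldexp(m, e)

def quantize_expert_weights(w: np.ndarray):
    """Per-tensor scaling into the e4m3 dynamic range (a common FP8
    training recipe, assumed here), returning the scale for dequant."""
    scale = np.abs(w).max() / E4M3_MAX
    return fake_fp8_cast(w / scale), scale
```

Only the expert linear layers would pass through such a cast; per the recipe above, norms, routers, and the LM head stay in BF16/FP32.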
RL Objective
The modified REINFORCE loss with dual importance sampling:
Masking function: M(ρ_{i,t}; α, β) = ρ_{i,t} if α < ρ_{i,t} < β, else 0
Advantage estimate: Â_{i,t} = R_i − b_i, where b_i = (1/(G−1)) Σ_{j≠i} R_j
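Under the definitions above, the group-relative advantage and ratio masking can be sketched as follows (variable names are mine, not the paper's):

```python
import numpy as np

def leave_one_out_advantage(rewards: np.ndarray) -> np.ndarray:
    """A_i = R_i - b_i, where b_i is the mean reward of the other G-1
    rollouts in the group (the leave-one-out baseline above)."""
    G = len(rewards)
    baselines = (rewards.sum() - rewards) / (G - 1)
    return rewards - baselines

def mask_ratio(rho: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """M(rho; alpha, beta): keep the importance ratio only inside
    (alpha, beta), zeroing tokens whose training/inference policy
    mismatch is too large to trust."""
    return np.where((rho > alpha) & (rho < beta), rho, 0.0)

rewards = np.array([1.0, 0.0, 0.5, 0.5])
adv = leave_one_out_advantage(rewards)
# Rollouts better than their group peers get positive advantage, worse
# get negative, and the group's advantages sum to zero.
```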
5. Evaluation
5.1 Evaluation Configuration
| Parameter | Thinking | Non-Thinking |
|---|---|---|
| max tokens | 65,536 | 32,768 |
| temperature | 0.8 | 0 |
| top_p | 0.95 | 1.0 |
| top_k | 50 | 1 |
5.2 Benchmarks
5.3 Main Results
Intern-S1-Pro demonstrates capabilities firmly in the first tier of open-source models. In scientific evaluations, it significantly outperforms proprietary models such as Gemini-3-Pro and GPT-5.2 on multiple benchmarks.
| Benchmark | Task | Intern-S1-Pro 1T-A22B | Qwen3-VL-235B 235B-A22B | Kimi-K2.5 1T-A32B | GPT-5.2 | Gemini-3-Pro |
|---|---|---|---|---|---|---|
| **Scientific Tasks** | | | | | | |
| SciReasoner | Scientific Reasoning | 55.5 | 11.9 | 15.3 | 13.6 | 14.7 |
| SFE | Scientific Multimodal Tasks | 52.7 | 41.4 | 53.7 | 47.5 | 58.9 |
| SmolInstruct | Small Molecule | 74.8 | 36.6 | 53.5 | 48.2 | 58.3 |
| MatBench | Materials Property Prediction | 72.8 | 49.7 | 60.0 | 53.6 | 64.9 |
| Mol-Instructions | Bio-molecular Instruction | 48.8 | 8.9 | 20.0 | 12.3 | 34.6 |
| MicroVQA | Biological Microscopy | 63.3 | 53.8 | 55.4 | 60.4 | 69.0 |
| Biology-Instruction | Multi-Omics Sequence | 52.5 | 6.2 | 10.7 | 10.2 | 12.0 |
| XLRS-Bench | Remote Sensing | 52.8 | 51.2 | 46.4 | 50.4 | 51.8 |
| MSEarth-MCQ | Earth Science | 65.2 | 52.7 | 61.9 | 62.6 | 65.8 |
| **General Tasks** | | | | | | |
| MMMU-Pro | Knowledge & Reasoning | 72.8 | 69.9 | 78.5 | 79.5 | 81.0 |
| MMLU-Pro | Knowledge & Reasoning | 86.6 | 83.4 | 87.1 | 85.9 | 89.3 |
| AIME-2025 | Math Reasoning | 93.1 | 90.0 | 96.1 | 100.0 | 95.0 |
| IMO-Answer-Bench | Math Reasoning | 77.3 | 72.3 | 81.8 | 86.3 | 81.3 |
| RefCOCO-avg | Visual Grounding | 91.9 | 91.1 | 87.8 | 54.9 | 76.2 |
| IFBench | Instruction Following | 71.2 | 58.7 | 69.7 | 75.4 | 70.4 |
| SArena (Icon) | SVG Generation | 83.5 | 76.3 | 78.5 | — | 82.6 |
| LCB V6 | Code | 74.3 | 72.0 | 85.0 | 87.7 | 86.9 |
| GAIA (Text-Only) | Agent | 77.4 | 47.8 | 79.9 | 71.1 | 75.5 |
| τ²-Bench | Agent | 80.9 | 57.4 | 76.8 | 76.6 | 85.4 |
| ScreenSpot V2 | Agent & Grounding | 93.6 | 92.8 | 92.4 | 49.4 | 94.7 |
5.4 Time Series Results
Intern-S1-Pro significantly outperforms both text LLMs and vision-language LLMs across diverse scientific time series tasks on the SciTS benchmark, validating the effectiveness of the dedicated time series encoder and its adaptive subsampling process.
| Model | ASU01 | ASU03 | BIU01 | BIU03 | EAU01 | MEU01 | NEU06 | PHU01 | PHU04 |
|---|---|---|---|---|---|---|---|---|---|
| **Text LLM** | | | | | | | | | |
| GPT-4.1-mini | 67.2 | 15.6 | 0.2 | 12.7 | 67.0 | 44.0 | 16.1 | 24.0 | 52.7 |
| Gemini2.5-Flash | 64.1 | 16.3 | 1.5 | 12.4 | 67.6 | 60.9 | 5.8 | 20.7 | 64.8 |
| DeepSeek-V3 | 1.1 | 12.3 | 0.0 | 5.8 | 40.2 | 59.3 | 13.6 | 28.9 | 50.7 |
| **VL LLM** | | | | | | | | | |
| GPT-5-mini | 65.7 | 18.9 | 0.8 | 17.9 | 67.6 | 30.4 | 13.3 | 21.4 | 47.8 |
| Gemini2.5-Flash | 61.6 | 15.2 | 0.9 | 8.3 | 72.5 | 64.1 | 11.6 | 22.7 | 59.0 |
| Intern-S1-Pro | 98.0 | 75.9 | 20.8 | 88.3 | 99.5 | 65.6 | 71.3 | 36.8 | 93.2 |
5.5 Specializable Generalist Could Be Better: Biology Case Study
Both the specialized Biology-Instruction model and Intern-S1-Pro were trained on the same underlying dataset, with Intern-S1-Pro's version upgraded to feature more fluent text expression. The results reveal that a larger, more general foundation model extracts and utilizes the same specialized data more effectively.
| Dataset | Biology-Instruction | Intern-S1-Pro |
|---|---|---|
| DNA-cpd | 44.54 | 54.60 |
| DNA-emp | 8.10 | 14.02 |
| DNA-pd | 58.18 | 82.65 |
| DNA-tf-h | 24.45 | 54.11 |
| DNA-tf-m | 39.91 | 60.80 |
| Multi_sequence-antibody_antigen | 10.26 | 44.76 |
| Multi_sequence-promoter_enhancer | 4.77 | -1.30 |
| Multi_sequence-rna_protein_interaction | 74.26 | 58.51 |
| DNA-enhancer_activity | 53.28 | 55.16 |
| RNA-CRISPROnTarget | 3.77 | 15.69 |
| Protein-Fluorescence | 2.57 | 78.14 |
| Protein-Stability | 60.25 | 60.82 |
| Protein-Thermostability | 45.07 | 59.56 |
| RNA-Isoform | 59.01 | 82.95 |
| RNA-MeanRibosomeLoading | 47.64 | 52.41 |
| RNA-ProgrammableRNASwitches | 26.65 | 33.97 |
| RNA-Modification | 59.06 | 57.77 |
| Protein-Solubility | 63.02 | 67.60 |
| RNA-NoncodingRNAFamily | 63.09 | 34.50 |
| Protein-FunctionEC | 19.79 | 72.70 |
| Multi_sequence-sirnaEfficiency | 56.31 | 62.05 |
| AVG score | 39.24 | 52.45 |
Why does a bigger generalist beat a smaller specialist?
Conventional wisdom says specialized models should outperform generalists in niche domains — they overfit less on irrelevant data. But this result challenges that: Intern-S1-Pro, trained on both biology and general data, significantly outperforms the dedicated Biology-Instruction model (52.45 vs 39.24 average) on the same biological tasks using the same underlying data. The likely reason: at trillion scale, the model has enough capacity to master the full diversity of patterns in a domain without the smaller model's "forgetting" tradeoffs. Cross-domain transfer (physics → protein folding, chemistry → molecular biology) also helps. The paper calls this a "Specializable Generalist" — general enough to transfer, large enough to specialize.
6. Conclusion
In this report, we introduced Intern-S1-Pro, a trillion-parameter scientific multimodal foundation model designed to advance the frontiers of AI in scientific discovery. Building upon the strong foundation of Intern-S1, we scaled the model through a novel expert expansion strategy combined with Grouped Routing. This architectural innovation not only ensures efficient load balancing across devices but also significantly enhances training stability, mitigating the risks of expert homogenization and training instability often observed in large-scale MoE models.
To further bolster the model's scientific understanding, we conducted continued pre-training on 6T tokens of high-quality multimodal data. A critical component of this process was the development of a specialized caption pipeline tailored for scientific imagery, generating precise, alignment-focused captions for scientific figures that substantially improved the model's ability to interpret complex scientific visual content.
Our extensive evaluations demonstrate that Intern-S1-Pro achieves state-of-the-art performance across a wide range of scientific benchmarks, exhibiting robust reasoning capabilities and deep domain knowledge. Moving forward, we aim to further expand the model's capabilities into more specialized scientific domains for the acceleration of scientific discovery.
1T MoE Architecture
First trillion-parameter scientific multimodal model with Grouped Routing for absolute load balance and stable large-scale training.
Scientific Caption Pipeline
270B tokens of high-quality scientific image–text data from PDF corpora with domain-aware captioning via InternVL3.5 and CapRL.
FP8 RL at Trillion Scale
Stable mixed-precision RL matching BF16 quality, enabling efficient reinforcement learning at trillion-parameter MoE scale.
Specializable Generalist
Outperforms specialized domain models while maintaining top-tier general capabilities — scale enables superior mastery of specialized tasks.