Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
🔗 Model on HuggingFace

Abstract
We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences.
Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion-parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist: it sits in the top tier of open-source models for general capabilities while outperforming proprietary models in the depth of specialized scientific tasks.
What does "1 trillion parameters" actually mean?
Model parameters are the learned weights that encode all the knowledge and reasoning ability of a neural network. GPT-3 had 175 billion parameters; current frontier models are in the hundreds of billions. 1 trillion (10¹²) parameters is roughly 5–10× the size of typical large models. However, Intern-S1-Pro uses a Mixture-of-Experts (MoE) architecture where only a fraction of parameters are "active" during any single inference — labeled "1T-A22B", meaning 1T total but only ~22B active per forward pass. This gives the representational capacity of a trillion-parameter dense model at a fraction of the compute cost.
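As a back-of-envelope check on the "1T-A22B" label above, the active-compute fraction is simple arithmetic (parameter counts are the approximate figures from the label, not exact model sizes):

```python
# Rough arithmetic for the "1T-A22B" label: total vs. active parameters.
total_params = 1_000e9   # ~1 trillion total parameters in the MoE
active_params = 22e9     # ~22 billion parameters active per forward pass

active_fraction = active_params / total_params
# Only ~2.2% of the weights participate in any single token's forward pass,
# which is why per-token compute is far below that of a dense 1T model.
```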
1. Introduction
The advent of Large Language Models (LLMs) and Visual Language Models (VLMs) has fundamentally transformed the landscape of artificial intelligence, offering unprecedented capabilities in reasoning, generation, and multimodal understanding. In the domain of AI for Science (AI4S), these foundation models have emerged as critical tools for accelerating scientific discovery, enabling researchers to tackle complex problems ranging from protein structure prediction to materials design.
To build an effective scientific foundation model, scaling model size is imperative due to the immense diversity inherent in scientific domains. Compared to natural language, science encompasses much more specialized fields — each with its own unique "language", including domain-specific notations, knowledge, and reasoning patterns. A scientific foundation model should possess sufficient capacity to master a wide array of scientific tasks while retaining general text and vision capabilities.
In this work, we introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Following the three-layer SAGE (Synergistic Architecture for Generalizable Experts) framework, we demonstrate that a sufficiently large generalist model, when trained jointly on general and specific tasks, can outperform specialized models in several scientific tasks — contrary to the common belief that specialized models are superior for niche tasks.
At the engineering level, we achieve deep optimization between the XTuner training framework and the LMDeploy inference engine, allowing Intern-S1-Pro to scale to 4× the size of its predecessor (Intern-S1) while incurring only a ~20% reduction in training efficiency.
2. Architecture
Intern-S1-Pro is derived from Intern-S1 through expert expansion, incorporating a Grouped Routing design to ensure stable trillion-scale MoE training.
2.1 Grouped Routing
For the training of ultra-large-scale MoE models, Expert Parallelism (EP) serves as the core technical approach to mitigate GPU memory and communication overheads. However, the expert load imbalance caused by the traditional Top-k routing strategy leads to cross-device load imbalance during expert parallel training.
Mixture-of-Experts (MoE) and the load balancing problem
In a standard Transformer, every token passes through the same FFN weights. In a MoE model, each layer has multiple parallel "expert" FFNs, and a learned router sends each token to only the top-k experts. This lets you increase model capacity without proportionally increasing compute. The problem: if the router always sends tokens to the same few popular experts, those GPU devices get overloaded while others sit idle. Traditional Top-K routing causes this imbalance. Grouped Router fixes it by partitioning experts into groups and selecting exactly one expert per group — guaranteeing each device gets an equal workload regardless of input distribution.
We propose to replace the traditional Top-K Router with a Grouped Router to achieve absolute load balancing across devices under the 8-way expert parallelism (EP8) training strategy. In the Grouped Router architecture, all experts are uniformly partitioned into G mutually disjoint groups; within each group g, only the top-(K/G) experts with the highest scores are selected.
Combined with the Intern-S1-Pro 1T configuration (K = 8) and the EP8 training strategy, we divide all experts into 8 groups and select the Top-1 expert within each group, achieving absolute load balancing across devices. This approach not only significantly improves training efficiency but also eliminates the out-of-memory (OOM) risk that expert overload would otherwise pose during training.
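The per-group Top-1 selection can be sketched in a few lines of numpy. This is an illustrative sketch of the routing rule, not the actual training kernel:

```python
import numpy as np

def grouped_route(logits: np.ndarray, n_groups: int) -> np.ndarray:
    """Pick the top-1 expert within each of `n_groups` disjoint, equal-sized
    groups. Because exactly one expert per group is always chosen, each
    expert-parallel device (one group per device under EP8) receives an
    identical token load regardless of the input distribution."""
    n_experts = logits.shape[-1]
    assert n_experts % n_groups == 0
    group_size = n_experts // n_groups
    grouped = logits.reshape(n_groups, group_size)
    local_best = grouped.argmax(axis=-1)            # winner index within each group
    offsets = np.arange(n_groups) * group_size      # map back to global expert ids
    return local_best + offsets                     # exactly one expert per group

# Example: 16 experts in 8 groups of 2 -- one expert is chosen from each pair.
logits = np.random.randn(16)
experts = grouped_route(logits, n_groups=8)
```

Contrast with plain Top-K over all 16 experts, which can select several winners from the same group and overload that group's device.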
2.2 Straight-Through Estimator for Sparse Expert Routing
MoE architectures scale model capacity by routing each input token to a small subset of K out of N experts via Top-K selection. The layer output is:

y = Σ_{i ∈ TopK(z)} w_i E_i(x)

where E_i denotes the i-th expert network, z the router logits, and w_i the routing weight assigned to each selected expert.
Why is the Straight-Through Estimator needed here?
The Top-K selection operation (picking the highest-scoring experts) is not differentiable — it's a discrete argmax, so gradients cannot flow back through it to update the router's weights. This means the router could get "frozen" and fail to learn. The Straight-Through Estimator (STE) is a trick: during the forward pass, use the hard discrete selection; during the backward pass, pretend it was a soft continuous function (a temperature-scaled softmax) and compute gradients through that instead. This allows all router parameters to receive gradient updates every step, keeping the router trainable at scale.
We introduce the Straight-Through Estimator (STE) to decouple the forward and backward passes of the routing operation. The STE routing weight is constructed as:

ŵ_i = sg(w_i − p_i^τ) + p_i^τ

with w_i the hard Top-K routing weight.
where p_i^τ = softmax(z/τ)_i is a temperature-scaled routing probability and sg(·) is the stop-gradient operator. Since gradients flow only through p^τ, the gradient of any loss L with respect to logit z_j is:

∂L/∂z_j = Σ_i (∂L/∂ŵ_i) · ∂p_i^τ/∂z_j = (1/τ) Σ_i (∂L/∂ŵ_i) p_i^τ (δ_ij − p_j^τ)

so every logit, selected or not, receives a nonzero update signal.
Through STE, the router receives consistent data-driven feedback throughout training, enabling all router embeddings to be updated in every pass.
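A minimal numpy sketch of the STE identity ŵ_i = sg(w_i − p_i^τ) + p_i^τ. Here sg(·) is modeled as a plain copy (its forward behavior); the renormalized hard gate is an assumed choice, and a real implementation would rely on an autograd framework to route gradients through the soft p^τ term:

```python
import numpy as np

def softmax(z, tau=1.0):
    e = np.exp((z - z.max()) / tau)
    return e / e.sum()

def ste_routing_weights(z, k=2, tau=1.0):
    """Sketch of STE routing: the forward value equals the hard Top-K
    weights, while in an autograd framework gradients would flow through
    the temperature-scaled softmax p_tau. sg(.) is identity in forward."""
    p_tau = softmax(z, tau)
    topk = np.argsort(z)[-k:]
    hard = np.zeros_like(p_tau)
    hard[topk] = p_tau[topk] / p_tau[topk].sum()  # renormalized hard gate (assumed choice)
    sg = lambda x: x.copy()                       # stop-gradient: identity in forward
    return sg(hard - p_tau) + p_tau               # == hard, exactly, in the forward pass

z = np.array([2.0, 0.5, -1.0, 0.1])
w = ste_routing_weights(z, k=2)
# Only the two top-scoring experts receive mass in the forward pass,
# yet all four logits would receive gradient through p_tau.
```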
2.3 Vision Encoder
Intern-S1-Pro employs a Native Vision Transformer (ViT) as the vision encoder. The encoder processes images at native resolution, where the visual token count depends on the original input resolution rather than a fixed image size. Visual tokens extracted from the ViT pass through a multilayer perceptron (MLP) projector that maps visual features into the embedding space of the language model.
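Under native-resolution processing, the visual token count scales with the input area. A sketch of that relationship, where the patch size and merge factor are illustrative assumptions rather than the model's published configuration:

```python
def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Illustrative token count for a native-resolution ViT: tokens scale
    with the input area. `patch` (ViT patch size) and `merge` (spatial
    downsampling before the MLP projector) are assumed values."""
    h_patches, w_patches = height // patch, width // patch
    return (h_patches // merge) * (w_patches // merge)

# A 448x448 input produces 4x the visual tokens of a 224x224 input,
# rather than being resized to a fixed token budget.
```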
The encoder is trained with contrastive learning on approximately 300 million image–text pairs, drawn from CC12M, LAION-COCO, SBU Caption, LAION-2B-Multi, and Wukong.
2.4 FoPE — Fourier Position Encoding
Traditional positional encoding methods such as RoPE (Rotary Position Embedding) impose a particle-like representation on all modalities, treating information as localized, discrete units. This creates a representational gap for physical signals (images, audio, video) that inherently have wave-interference and spectral properties.
FoPE (Fourier Position Encoding) addresses this limitation by reimagining how transformer models encode position and structure — treating each dimension as a Fourier series of different frequency components, separating information more effectively and mitigating spectral damage. Inadequately trained frequency components are clipped to remove their harmful influence.
RoPE vs FoPE: positional encoding for multimodal signals
Rotary Position Embedding (RoPE) is the dominant positional encoding in modern LLMs (LLaMA, GPT-4, etc.). It encodes token position by rotating embedding vectors, and handles text sequences well. But images, audio, and scientific signals have wave-like, spectral structure — a pixel's meaning depends on its 2D neighborhood in ways that 1D sequential rotation doesn't model well. FoPE treats each embedding dimension as a sinusoidal frequency component (like a Fourier series), which is a more natural representation for spatially or temporally structured data. The "clipping" step removes frequency components that weren't trained enough and would add noise.
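The frequency-bank-plus-clipping idea can be sketched roughly. This is a heavily simplified stand-in under assumed choices (RoPE-style inverse frequencies, a fixed clip threshold), not the published FoPE formulation:

```python
import numpy as np

def fope_frequencies(dim: int, base: float = 10000.0, clip_floor: float = 1e-3):
    """Sketch of FoPE's frequency treatment (assumed simplification): start
    from RoPE-style inverse frequencies, then zero out components below
    `clip_floor` -- standing in for "inadequately trained" frequencies --
    so they contribute no noisy rotation."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.where(inv_freq < clip_floor, 0.0, inv_freq)

def fope_encode(pos: int, dim: int = 16):
    """Each pair of dimensions is a sinusoid at its own frequency,
    i.e. the position code is a bank of Fourier components."""
    freqs = fope_frequencies(dim)
    angles = pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])
```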
2.5 Time-series Encoder
Time series is a core scientific data modality, capturing temporal evolution of complex processes. The time series module of Intern-S1-Pro features an adaptive subsampling module that partitions continuous signals into local segments (patches), captures local dynamics within each patch, and models long-range dependencies across segments. The number of temporal frames is kept within a controllable range by adaptively determining patch size and stride based on the signal and its sampling rate.
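The frame-budget idea can be sketched with a length-only heuristic. The real module also conditions on the sampling rate; this is an assumed simplification, not the released algorithm:

```python
def adaptive_patching(n_samples: int, max_frames: int = 1024):
    """Sketch of adaptive subsampling (assumed heuristic): choose the
    smallest non-overlapping patch size that keeps the number of temporal
    frames within `max_frames`, regardless of raw signal length."""
    patch = max(1, -(-n_samples // max_frames))   # ceil(n_samples / max_frames)
    stride = patch                                # non-overlapping patches
    n_frames = -(-n_samples // patch)             # ceil(n_samples / patch)
    return patch, stride, n_frames

# Both a short recording and a million-sample recording fit the same budget.
```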
The enhanced module expands coverage to: physiological signal analysis (EEG-based depression detection), marmoset vocalization recognition, and electrocardiography abnormality monitoring, handling sequences from 100 to 10⁶ time steps.
3. Pre-training
Intern-S1-Pro employs a total of 6T tokens of image-text and text data for continued pre-training, with a key upgrade in caption data tailored for scientific images.
3.1 Caption Pipeline
Scientific images from web sources suffer from brief, low-alignment captions. In contrast, PDFs represent the primary carrier of scientific visual content, containing high-information-density figures including experimental results, statistical plots, structural diagrams, and formula derivations.
Why scientific caption quality matters so much
Multimodal models learn to connect visual features to text by training on image–text pairs. If the caption for a plot says "Figure 3" but doesn't explain what the axes mean or what trend to observe, the model learns nothing from that figure. Web-scraped scientific images are especially bad — captions are often just figure numbers or short titles. This paper builds a specialized pipeline to extract sub-figures from PDFs and generate dense, 1000-word captions using domain-expert models (InternVL3.5-241B for science figures, CapRL-32B for others), then filters with a text quality discriminator. The result: 270B tokens of rich scientific image–text data that would otherwise not exist.
We independently constructed a large-scale PDF data production pipeline: extracting sub-figures from massive PDF corpora using MinerU 2.5 for layout analysis, precise deduplication via perceptual hashing (pHash), topic classification and model routing (scientific images → InternVL3.5-241B; non-scientific → CapRL-32B), and a 0.5B-parameter text quality discriminator for filtering. The result: approximately 270B tokens of high-quality scientific image–text caption data.
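The deduplication step can be illustrated with a dependency-free perceptual hash. The pipeline uses pHash; the average-hash below is a simpler stand-in that only demonstrates the hash-and-compare idea:

```python
import numpy as np

def average_hash(img: np.ndarray, hash_size: int = 8) -> int:
    """Dependency-free stand-in for perceptual hashing (the pipeline uses
    pHash; this average-hash only illustrates dedup). `img` is a 2D
    grayscale array; the hash is a 64-bit code for hash_size=8."""
    h, w = img.shape
    # Crude downsample to hash_size x hash_size by block averaging.
    small = img[:h - h % hash_size, :w - w % hash_size]
    bh, bw = small.shape[0] // hash_size, small.shape[1] // hash_size
    small = small.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    """Bit distance between two hashes; near-duplicates score low."""
    return bin(a ^ b).count("1")

# Dedup keeps one sub-figure per cluster of hashes within a small
# Hamming distance of each other.
```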
3.2 Resolving Conflicts Between Scientific and Textual Data
Directly mixing scientific data (structured, high logical determinism) with general data (semantic depth, linguistic diversity) can lead to distribution shift and negative transfer. Intern-S1-Pro adopts three strategies:
Structured Scientific Data Transformation
Heterogeneous scientific input-output pairs from databases like PubChem are converted to grammatically correct, narrative text via Template Construction and Task Form Transformation, aligning scientific data with the representation style of general data.
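Template Construction can be illustrated as follows. The record fields and template wording are hypothetical, not the actual PubChem schema or the paper's templates:

```python
def to_narrative(record: dict) -> str:
    """Convert a structured property record into fluent narrative text so it
    matches the register of general pre-training data (hypothetical
    template; field names are illustrative)."""
    return (f"The compound {record['name']} (SMILES: {record['smiles']}) "
            f"has a molecular weight of {record['mol_weight']} g/mol and "
            f"a water solubility of about {record['solubility']} mg/mL.")

sample = {"name": "aspirin", "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
          "mol_weight": 180.16, "solubility": 3.0}
text = to_narrative(sample)
# The key-value record becomes a grammatical sentence instead of a raw tuple.
```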
Scientific Data Diversification
Prompt Diversification and the Rollout mechanism prevent overfitting on repetitive scientific sequences (e.g., protein sequences). By combining scientific prior knowledge with a strong base model to generate complete reasoning chains, knowledge recall is transformed into logical deduction.
System Prompt Isolation
Mutually exclusive system-level prefixes are injected for scientific and general data during the training cycle, creating independent contextual processing environments for the model. This reduces data conflicts and improves model stability.
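System Prompt Isolation amounts to routing each training sample through one of two mutually exclusive prefixes. The prefix strings below are illustrative placeholders, not the ones used in training:

```python
# Illustrative, mutually exclusive system-level prefixes (placeholder text):
SCI_PREFIX = "<|system|> Scientific-data mode."
GEN_PREFIX = "<|system|> General-data mode."

def build_training_sample(text: str, is_scientific: bool) -> str:
    """Inject exactly one of the two prefixes, giving scientific and general
    data independent contextual processing environments."""
    prefix = SCI_PREFIX if is_scientific else GEN_PREFIX
    return f"{prefix}\n{text}"
```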
4. Post-Training
4.1 Stable Mixed-Precision Reinforcement Learning for Sparse MoE Models
FP8 vs BF16: why mixed precision is tricky for RL
Neural networks normally train in BF16 (16-bit bfloat) to save memory while preserving enough numeric precision. FP8 (8-bit float) halves memory again but is much more numerically sensitive. This matters for RL because the training engine (XTuner) and the rollout/inference engine (LMDeploy) must produce identical probabilities for the same inputs — any mismatch accumulates as policy divergence that can cause training instability. The paper's key contribution here is a suite of fixes: aligning numerically sensitive ops (RMSNorm, softmax) between the two engines, replaying expert routing decisions so the same experts are selected in training as in rollout, and using importance sampling corrections for any remaining policy mismatch.
Scaling RL to trillion-parameter MoE models presents formidable memory challenges. With Intern-S1-Pro featuring 4× the expert count of Intern-S1, we adopt FP8 quantization for the RL phase but implement a comprehensive stabilization framework to preserve performance:
- Operator-level precision alignment: Reduced precision gaps between LMDeploy rollout engine and XTuner training engine in numerically sensitive components (RMSNorm, router softmax, positional embedding).
- Rollout router replay: Expert routing indices are recorded per layer during rollout and replayed during policy updates to enforce expert selection consistency.
- Targeted mixed-precision: Expert linear layers quantized to FP8; non-expert components kept in BF16; FP32 LM head for log-probability fidelity.
- Dual importance sampling: Modified REINFORCE objective with training-inference mismatch calibration and off-policy bias correction.
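The targeted mixed-precision bullet can be illustrated with a simulated e4m3 cast. This is an assumption-laden sketch (per-tensor dynamic scaling, simplified mantissa rounding, no subnormals), not the XTuner/LMDeploy kernels:

```python
import numpy as np

E4M3_MAX = 448.0      # largest finite value representable in FP8 e4m3
MANTISSA_BITS = 3     # e4m3 stores 3 explicit mantissa bits

def fake_fp8_cast(x: np.ndarray) -> np.ndarray:
    """Simulate an FP8 e4m3 cast in float: clamp to the e4m3 range and
    round the mantissa to 3 explicit bits. A sketch of the numeric
    behavior only; it ignores subnormals and the exact exponent bias."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    m, e = np.frexp(x)                                        # x = m * 2**e, |m| in [0.5, 1)
    m = np.round(m * 2 ** (MANTISSA_BITS + 1)) / 2 ** (MANTISSA_BITS + 1)
    return np.ldexp(m, e)

def quantize_expert_weights(w: np.ndarray):
    """Per-tensor scaling into the e4m3 dynamic range (a common FP8
    training recipe, assumed here), returning the scale for dequant."""
    scale = np.abs(w).max() / E4M3_MAX
    return fake_fp8_cast(w / scale), scale
```

Only the expert linear layers would pass through such a cast; per the recipe above, norms, routers, and the LM head stay in BF16/FP32.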
RL Objective
The modified REINFORCE loss with dual importance sampling:
Masking function: M(ρ_{i,t}; α, β) = ρ_{i,t} if α < ρ_{i,t} < β, else 0
Advantage estimate: Â_{i,t} = R_i − b_i, where b_i = (1/(G−1)) Σ_{j≠i} R_j
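Under the definitions above, the group-relative advantage and ratio masking can be sketched as follows (variable names are mine, not the paper's):

```python
import numpy as np

def leave_one_out_advantage(rewards: np.ndarray) -> np.ndarray:
    """A_i = R_i - b_i, where b_i is the mean reward of the other G-1
    rollouts in the group (the leave-one-out baseline above)."""
    G = len(rewards)
    baselines = (rewards.sum() - rewards) / (G - 1)
    return rewards - baselines

def mask_ratio(rho: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """M(rho; alpha, beta): keep the importance ratio only inside
    (alpha, beta), zeroing tokens whose training/inference policy
    mismatch is too large to trust."""
    return np.where((rho > alpha) & (rho < beta), rho, 0.0)

rewards = np.array([1.0, 0.0, 0.5, 0.5])
adv = leave_one_out_advantage(rewards)
# Rollouts better than their group peers get positive advantage, worse
# get negative, and the group's advantages sum to zero.
```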
5. Evaluation
5.1 Evaluation Configuration
| Parameter | Thinking | Non-Thinking |
|---|---|---|
| max tokens | 65,536 | 32,768 |
| temperature | 0.8 | 0 |
| top_p | 0.95 | 1.0 |
| top_k | 50 | 1 |
5.2 Benchmarks
5.3 Main Results
Intern-S1-Pro demonstrates capabilities firmly in the first tier of open-source models. In scientific evaluations, it significantly outperforms proprietary models such as Gemini-3-Pro and GPT-5.2 on multiple benchmarks.
| Benchmark | Task | Intern-S1-Pro 1T-A22B | Qwen3-VL-235B 235B-A22B | Kimi-K2.5 1T-A32B | GPT-5.2 | Gemini-3-Pro |
|---|---|---|---|---|---|---|
| **Scientific Tasks** | | | | | | |
| SciReasoner | Scientific Reasoning | 55.5 | 11.9 | 15.3 | 13.6 | 14.7 |
| SFE | Scientific Multimodal Tasks | 52.7 | 41.4 | 53.7 | 47.5 | 58.9 |
| SmolInstruct | Small Molecule | 74.8 | 36.6 | 53.5 | 48.2 | 58.3 |
| MatBench | Materials Property Prediction | 72.8 | 49.7 | 60.0 | 53.6 | 64.9 |
| Mol-Instructions | Bio-molecular Instruction | 48.8 | 8.9 | 20.0 | 12.3 | 34.6 |
| MicroVQA | Biological Microscopy | 63.3 | 53.8 | 55.4 | 60.4 | 69.0 |
| Biology-Instruction | Multi-Omics Sequence | 52.5 | 6.2 | 10.7 | 10.2 | 12.0 |
| XLRS-Bench | Remote Sensing | 52.8 | 51.2 | 46.4 | 50.4 | 51.8 |
| MSEarth-MCQ | Earth Science | 65.2 | 52.7 | 61.9 | 62.6 | 65.8 |
| **General Tasks** | | | | | | |
| MMMU-Pro | Knowledge & Reasoning | 72.8 | 69.9 | 78.5 | 79.5 | 81.0 |
| MMLU-Pro | Knowledge & Reasoning | 86.6 | 83.4 | 87.1 | 85.9 | 89.3 |
| AIME-2025 | Math Reasoning | 93.1 | 90.0 | 96.1 | 100.0 | 95.0 |
| IMO-Answer-Bench | Math Reasoning | 77.3 | 72.3 | 81.8 | 86.3 | 81.3 |
| RefCOCO-avg | Visual Grounding | 91.9 | 91.1 | 87.8 | 54.9 | 76.2 |
| IFBench | Instruction Following | 71.2 | 58.7 | 69.7 | 75.4 | 70.4 |
| SArena (Icon) | SVG Generation | 83.5 | 76.3 | 78.5 | — | 82.6 |
| LCB V6 | Code | 74.3 | 72.0 | 85.0 | 87.7 | 86.9 |
| GAIA (Text-Only) | Agent | 77.4 | 47.8 | 79.9 | 71.1 | 75.5 |
| τ²-Bench | Agent | 80.9 | 57.4 | 76.8 | 76.6 | 85.4 |
| ScreenSpot V2 | Agent & Grounding | 93.6 | 92.8 | 92.4 | 49.4 | 94.7 |
5.4 Time Series Results
Intern-S1-Pro significantly outperforms both text LLMs and vision-language LLMs across diverse scientific time series tasks on the SciTS benchmark, validating the effectiveness of the dedicated time series encoder and its adaptive subsampling process.
| Model | ASU01 | ASU03 | BIU01 | BIU03 | EAU01 | MEU01 | NEU06 | PHU01 | PHU04 |
|---|---|---|---|---|---|---|---|---|---|
| **Text LLM** | | | | | | | | | |
| GPT-4.1-mini | 67.2 | 15.6 | 0.2 | 12.7 | 67.0 | 44.0 | 16.1 | 24.0 | 52.7 |
| Gemini2.5-Flash | 64.1 | 16.3 | 1.5 | 12.4 | 67.6 | 60.9 | 5.8 | 20.7 | 64.8 |
| DeepSeek-V3 | 1.1 | 12.3 | 0.0 | 5.8 | 40.2 | 59.3 | 13.6 | 28.9 | 50.7 |
| **VL LLM** | | | | | | | | | |
| GPT-5-mini | 65.7 | 18.9 | 0.8 | 17.9 | 67.6 | 30.4 | 13.3 | 21.4 | 47.8 |
| Gemini2.5-Flash | 61.6 | 15.2 | 0.9 | 8.3 | 72.5 | 64.1 | 11.6 | 22.7 | 59.0 |
| Intern-S1-Pro | 98.0 | 75.9 | 20.8 | 88.3 | 99.5 | 65.6 | 71.3 | 36.8 | 93.2 |
5.5 Specializable Generalist Could Be Better: Biology Case Study
Both the specialized Biology-Instruction model and Intern-S1-Pro were trained on the same underlying dataset, with Intern-S1-Pro's version upgraded to feature more fluent text expression. The results reveal that a larger, more general foundation model extracts and utilizes the same specialized data more effectively.
| Dataset | Biology-Instruction | Intern-S1-Pro |
|---|---|---|
| DNA-cpd | 44.54 | 54.60 |
| DNA-emp | 8.10 | 14.02 |
| DNA-pd | 58.18 | 82.65 |
| DNA-tf-h | 24.45 | 54.11 |
| DNA-tf-m | 39.91 | 60.80 |
| Multi_sequence-antibody_antigen | 10.26 | 44.76 |
| Multi_sequence-promoter_enhancer | 4.77 | -1.30 |
| Multi_sequence-rna_protein_interaction | 74.26 | 58.51 |
| DNA-enhancer_activity | 53.28 | 55.16 |
| RNA-CRISPROnTarget | 3.77 | 15.69 |
| Protein-Fluorescence | 2.57 | 78.14 |
| Protein-Stability | 60.25 | 60.82 |
| Protein-Thermostability | 45.07 | 59.56 |
| RNA-Isoform | 59.01 | 82.95 |
| RNA-MeanRibosomeLoading | 47.64 | 52.41 |
| RNA-ProgrammableRNASwitches | 26.65 | 33.97 |
| RNA-Modification | 59.06 | 57.77 |
| Protein-Solubility | 63.02 | 67.60 |
| RNA-NoncodingRNAFamily | 63.09 | 34.50 |
| Protein-FunctionEC | 19.79 | 72.70 |
| Multi_sequence-sirnaEfficiency | 56.31 | 62.05 |
| AVG score | 39.24 | 52.45 |
Why does a bigger generalist beat a smaller specialist?
Conventional wisdom says specialized models should outperform generalists in niche domains — they overfit less on irrelevant data. But this result challenges that: Intern-S1-Pro, trained on both biology and general data, significantly outperforms the dedicated Biology-Instruction model (52.45 vs 39.24 average) on the same biological tasks using the same underlying data. The likely reason: at trillion scale, the model has enough capacity to master the full diversity of patterns in a domain without the smaller model's "forgetting" tradeoffs. Cross-domain transfer (physics → protein folding, chemistry → molecular biology) also helps. The paper calls this a "Specializable Generalist" — general enough to transfer, large enough to specialize.
6. Conclusion
In this report, we introduced Intern-S1-Pro, a trillion-parameter scientific multimodal foundation model designed to advance the frontiers of AI in scientific discovery. Building upon the strong foundation of Intern-S1, we scaled the model through a novel expert expansion strategy combined with Grouped Routing. This architectural innovation not only ensures efficient load balancing across devices but also significantly enhances training stability, mitigating the risks of expert homogenization and training instability often observed in large-scale MoE models.
To further bolster the model's scientific understanding, we conducted continued pre-training on 6T tokens of high-quality multimodal data. A critical component of this process was the development of a specialized caption pipeline tailored for scientific imagery, generating precise, alignment-focused captions for scientific figures that substantially improved the model's ability to interpret complex scientific visual content.
Our extensive evaluations demonstrate that Intern-S1-Pro achieves state-of-the-art performance across a wide range of scientific benchmarks, exhibiting robust reasoning capabilities and deep domain knowledge. Moving forward, we aim to further expand the model's capabilities into more specialized scientific domains for the acceleration of scientific discovery.
1T MoE Architecture
First trillion-parameter scientific multimodal model with Grouped Routing for absolute load balance and stable large-scale training.
Scientific Caption Pipeline
270B tokens of high-quality scientific image–text data from PDF corpora with domain-aware captioning via InternVL3.5 and CapRL.
FP8 RL at Trillion Scale
Stable mixed-precision RL matching BF16 quality, enabling efficient reinforcement learning at trillion-parameter MoE scale.
Specializable Generalist
Outperforms specialized domain models while maintaining top-tier general capabilities — scale enables superior mastery of specialized tasks.