This is a Plain English Papers summary of a research paper called InternVL3: Unified Multimodal AI Training Outperforms Open-Source Rivals. If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.
Breaking New Ground in Multimodal AI Training
InternVL3 marks a significant advancement in the InternVL model series, implementing a native multimodal pre-training approach that fundamentally transforms how vision-language models learn. Unlike most leading multimodal large language models (MLLMs) that adapt text-only models to handle visual inputs through complex post-hoc alignment, InternVL3 jointly acquires multimodal and linguistic capabilities in a single unified pre-training stage.
This innovative approach addresses the alignment challenges typically encountered when retrofitting text-only LLMs with visual processing capabilities. Traditional methods require resource-intensive strategies with specialized domain data and intricate parameter-freezing schedules, whereas InternVL3's unified training paradigm offers a more efficient and integrated approach to multimodal learning.
Native Multimodal Pre-Training: A Unified Approach
The core innovation of InternVL3 lies in its native multimodal pre-training strategy. Rather than first training a text-only LLM and then adapting it for visual inputs, InternVL3 learns from both text corpora and multimodal datasets simultaneously during pre-training. This approach enables the model to develop linguistic and visual capabilities in tandem, creating a more seamlessly integrated multimodal architecture.
The pre-training process uses a multimodal autoregressive formulation where visual tokens serve as conditioning context for text prediction. While gradients naturally propagate through all modalities, the loss computation focuses exclusively on text tokens. This training approach helps the model embed visual information in ways that benefit downstream language tasks.
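To make that objective concrete, here is a minimal PyTorch-style sketch of a next-token loss masked to text positions only. The function name and the `is_text_token` mask are illustrative; the paper describes the objective, not this exact code.

```python
import torch.nn.functional as F

def multimodal_lm_loss(logits, labels, is_text_token):
    """Next-token prediction loss computed only at text positions.

    logits:        (batch, seq_len, vocab) outputs over the interleaved
                   visual + text token sequence
    labels:        (batch, seq_len) target token ids
    is_text_token: (batch, seq_len) bool mask, True where the target is text
    """
    # Standard autoregressive shift: position t predicts token t + 1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:].clone()
    shift_mask = is_text_token[:, 1:]

    # Drop visual targets from the loss; visual tokens still shape the
    # predictions (and receive gradients) as conditioning context.
    shift_labels[~shift_mask] = -100  # ignore_index for cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```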
InternVL3 also incorporates advanced position encoding strategies like Variable Visual Position Encoding (V2PE), which assigns different position increments to textual and visual tokens. This innovation helps accommodate longer multimodal contexts without excessively extending the position window.
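The sketch below illustrates the core idea of V2PE's position assignment, assuming a boolean mask that marks visual tokens. The `delta` value is a placeholder, and how fractional positions feed into the model's rotary embeddings is omitted here.

```python
import torch

def v2pe_position_ids(is_visual_token, delta=0.25):
    """Assign position ids where visual tokens advance the position by a
    smaller increment than text tokens.

    is_visual_token: (seq_len,) bool tensor, True at visual-token positions
    delta:           per-visual-token increment (text tokens use 1.0);
                     the value here is illustrative, not the paper's setting
    """
    increments = torch.ones(is_visual_token.shape[0])
    increments[is_visual_token] = delta
    # Cumulative sum gives monotonically increasing, possibly fractional,
    # positions, so long image sequences consume less of the position window.
    return torch.cumsum(increments, dim=0)
```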
Architecture and Model Variants
InternVL3 follows the "ViT-MLP-LLM" architecture paradigm used in previous models but introduces several refinements. The model consists of a vision encoder (InternViT), a two-layer MLP for alignment, and a language model (primarily from the Qwen2.5 series).
Model Name | #Param | Vision Encoder | Language Model | OpenCompass Academic |
---|---|---|---|---|
InternVL3-1B | 0.9B | InternViT-300M-448px-V2.5 | Qwen2.5-0.5B | 57.4 |
InternVL3-2B | 1.9B | InternViT-300M-448px-V2.5 | Qwen2.5-1.5B | 63.9 |
InternVL3-8B | 8.1B | InternViT-300M-448px-V2.5 | Qwen2.5-7B | 73.3 |
InternVL3-9B | 9.2B | InternViT-300M-448px-V2.5 | InternLM3-8B | 72.4 |
InternVL3-14B | 15.1B | InternViT-300M-448px-V2.5 | Qwen2.5-14B | 75.5 |
InternVL3-38B | 38.4B | InternViT-6B-448px-V2.5 | Qwen2.5-32B | 77.3 |
InternVL3-78B | 78.4B | InternViT-6B-448px-V2.5 | Qwen2.5-72B | 79.5 |
Pre-trained models used in the InternVL3 series. The OpenCompass scores for the InternVL3 series were obtained through local testing.
The LLM components are initialized solely from pre-trained base models without employing instruction-tuned variants. For efficiency in processing high-resolution images, InternVL3 incorporates a pixel unshuffle operation, which reduces visual token count to one-fourth of the original value.
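As a rough illustration of how pixel unshuffle trims the visual token count, the PyTorch-style sketch below folds each 2x2 block of patch features into the channel dimension; the function name and tensor layout are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def downsample_visual_tokens(vit_features, h, w, factor=2):
    """Reduce the number of visual tokens by factor**2 (4x here) by folding
    each 2x2 neighborhood of patch features into the channel dimension.

    vit_features: (batch, h * w, channels) patch features from the ViT
    h, w:         height and width of the patch-feature grid
    """
    b, n, c = vit_features.shape
    assert n == h * w, "token count must match the spatial grid"
    x = vit_features.view(b, h, w, c).permute(0, 3, 1, 2)  # (b, c, h, w)
    x = F.pixel_unshuffle(x, factor)                       # (b, c*4, h/2, w/2)
    b, c4, h2, w2 = x.shape
    # Back to a token sequence with 4x fewer, higher-dimensional tokens,
    # which the MLP projector then maps into the LLM embedding space.
    return x.permute(0, 2, 3, 1).reshape(b, h2 * w2, c4)
```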
Advanced Training and Optimization Strategies
InternVL3 employs a comprehensive three-phase training approach:
1. Native Multimodal Pre-training: The model learns from interleaved multimodal data and pure-text corpora during initial pre-training.
2. Supervised Fine-Tuning (SFT): Building on pre-training, SFT uses higher-quality and more diverse training data spanning tool usage, 3D scene understanding, GUI operations, long-context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.
3. Mixed Preference Optimization (MPO): This phase addresses the mismatch between training (conditioning on ground-truth tokens) and inference (conditioning on the model's own predictions) by introducing supervision from both positive and negative samples; a simplified sketch of such an objective follows below.
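The sketch below gives a simplified flavor of a mixed objective, combining a DPO-style preference term over positive and negative responses with a standard generation loss on the positive response. The published MPO also includes a quality term, and the weights and `beta` shown here are placeholders rather than the paper's settings.

```python
import torch.nn.functional as F

def mpo_style_loss(policy_logp_pos, policy_logp_neg,
                   ref_logp_pos, ref_logp_neg,
                   pos_token_nll,
                   beta=0.1, w_pref=0.8, w_gen=0.2):
    """Blend a DPO-style preference loss over positive/negative responses
    with a generation (language-modeling) loss on the positive response.

    *_logp_pos / *_logp_neg: summed log-probabilities of the positive and
        negative responses under the policy and a frozen reference model
    pos_token_nll: token-level negative log-likelihood of the positive
        response under the policy
    """
    # Preference term: rank the positive response above the negative one
    # relative to the reference model (exposure to negative samples).
    margin = beta * ((policy_logp_pos - ref_logp_pos)
                     - (policy_logp_neg - ref_logp_neg))
    preference_loss = -F.logsigmoid(margin).mean()

    # Generation term: keep maximizing likelihood of the positive response.
    generation_loss = pos_token_nll.mean()

    return w_pref * preference_loss + w_gen * generation_loss
```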
For inference, InternVL3 leverages test-time scaling with Best-of-N evaluation using VisualPRM-8B as a critic model to select the most effective responses for reasoning and mathematical tasks.
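A Best-of-N loop is conceptually simple. The sketch below assumes generic `policy.generate` and `critic.score` interfaces standing in for InternVL3 and VisualPRM-8B; in the actual setup the process reward model scores intermediate reasoning steps rather than returning a single scalar per answer.

```python
def best_of_n_answer(image, question, policy, critic, n=8):
    """Sample n candidate responses and return the one the critic ranks
    highest. `policy.generate` and `critic.score` are assumed interfaces,
    not the released API.
    """
    candidates = [policy.generate(image, question) for _ in range(n)]
    scores = [critic.score(image, question, answer) for answer in candidates]
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```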
To enhance training efficiency, the team extended the InternEVO framework to support flexible sharding strategies for different model components, resulting in a 50-200% training speedup compared to previous models.
Comprehensive Performance Evaluation
InternVL3 demonstrates exceptional performance across diverse multimodal tasks. On the multidisciplinary reasoning benchmark MMMU, InternVL3-78B achieves a score of 72.2%, setting a new state-of-the-art among open-source MLLMs.
For mathematical reasoning, InternVL3 shows strong results across various benchmarks, with InternVL3-78B recording 79.0% on MathVista, 43.1% on MathVision, and 51.0% on MathVerse.
Model | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall |
---|---|---|---|---|---|---|---|---|
InternVL3-1B | 43.4 | 45.8 | 18.8 | 18.7 | 5.8 | 13.4 | 29.8 | 25.1 |
w/ VisualPRM-Bo8 [124] | 55.4 | 62.1 | 21.7 | 28.9 | 13.4 | 28.5 | 34.9 | 35.0 |
InternVL3-2B | 48.6 | 57.0 | 21.7 | 25.3 | 14.6 | 22.4 | 36.9 | 32.4 |
w/ VisualPRM-Bo8 [124] | 57.8 | 70.5 | 26.6 | 36.7 | 21.4 | 38.5 | 40.5 | 41.7 |
InternVL3-8B | 62.7 | 71.6 | 29.3 | 39.8 | 25.5 | 37.1 | 44.1 | 44.3 |
w/ VisualPRM-Bo8 [124] | 66.0 | 75.2 | 37.5 | 46.3 | 28.5 | 48.1 | 49.7 | 50.2 |
InternVL3-78B | 72.2 | 79.0 | 43.1 | 51.0 | 35.1 | 46.1 | 55.9 | 54.6 |
w/ VisualPRM-Bo8 [124] | 72.2 | 80.5 | 40.8 | 54.2 | 37.3 | 52.4 | 57.9 | 56.5 |
GPT-4o-20241120 [96] | 70.7 | 60.0 | 31.2 | 40.6 | 34.5 | 45.8 | 52.8 | 47.9 |
Claude-3.7-Sonnet [3] | 75.0 | 66.8 | 41.9 | 46.7 | 39.7 | 49.3 | 58.2 | 53.9 |
Gemini-2.0-Pro [29] | 69.9 | 71.3 | 48.1 | 67.3 | 43.3 | 56.5 | 53.2 | 58.5 |
Comparison of multimodal reasoning and mathematical performance. The overall score is the average score of the benchmarks. "w/ VisualPRM-Bo8" denotes that the model is evaluated with Best-of-8 settings, where VisualPRM serves as the critic model.
In OCR, chart, and document understanding tasks, InternVL3 achieves top-tier results, with InternVL3-78B reaching 89.7/96.0% on AI2D, 89.7% on ChartQA, and 95.4% on DocVQA.
Model Name | AI2D (w / wo M) | ChartQA (test avg) | TextVQA (val) | DocVQA (test) | InfoVQA (test) | OCRBench | SEED-2-Plus | CharXiv (RQ / DQ) | VCR-EN-Easy (EM / Jaccard) | Overall |
---|---|---|---|---|---|---|---|---|---|---|
InternVL3-1B | 69.4 / 78.3 | 75.3 | 74.1 | 81.9 | 53.7 | 790 | 58.2 | 21.0 / 47.1 | 89.3 / 96.2 | 68.6 |
InternVL3-2B | 78.7 / 87.4 | 80.2 | 77.0 | 88.3 | 66.1 | 835 | 64.6 | 28.3 / 54.7 | 91.2 / 96.9 | 74.7 |
InternVL3-8B | 85.2 / 92.6 | 86.6 | 80.2 | 92.7 | 76.8 | 880 | 69.7 | 37.6 / 73.6 | 94.5 / 98.1 | 81.3 |
InternVL3-38B | 88.9 / 95.5 | 89.2 | 83.9 | 95.4 | 85.0 | 886 | 71.6 | 46.4 / 87.2 | 96.1 / 98.7 | 85.5 |
InternVL3-78B | 89.7 / 96.0 | 89.7 | 84.3 | 95.4 | 86.5 | 906 | 71.9 | 46.0 / 85.1 | 96.0 / 98.6 | 85.8 |
GPT-4o-20240513 [96] | 84.6 / 94.2 | 85.7 | 77.4 | 92.8 | 79.2 | 736 | 72.0 | 47.1 / 84.5 | 91.6 / 96.4 | 81.6 |
Claude-3.5-Sonnet [3] | 81.2 / 94.7 | 90.8 | 74.1 | 95.2 | 74.3 | 788 | 71.7 | 60.2 / 84.3 | 63.9 / 74.7 | 78.7 |
Comparison of OCR, chart, and document understanding performance. The table evaluates OCR-related capabilities across 9 benchmarks, including AI2D, ChartQA, TextVQA, DocVQA, InfoVQA, OCRBench, SEED-2-Plus, CharXiv, and VCR.
For multi-image understanding, InternVL3-78B demonstrates strong cross-image reasoning capabilities with scores of 66.3% on BLINK, 79.3% on Mantis-Eval, and 73.2% on MMT-Bench.
The model also excels in video understanding, with InternVL3-78B achieving 72.7/75.7% on Video-MME and 78.7% on MVBench, outperforming many other open-source and closed-source models.
Preserving Strong Language Capabilities
A key achievement of InternVL3 is maintaining robust language capabilities despite its joint multimodal training approach. In fact, InternVL3 models consistently outperform their Qwen2.5 chat counterparts across most language evaluation benchmarks, including MMLU, CMMLU, C-Eval, and GAOKAO.
This enhancement in language capabilities stems from several factors: the integration of approximately 25% pure-language data during multimodal pre-training, joint parameter optimization, and extensive use of high-quality textual corpora during post-training.
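For a sense of what a roughly 25% language-only mix could look like in practice, here is an illustrative sampling sketch; the function and corpus names are hypothetical and do not reflect the authors' actual data pipeline.

```python
import random

def sample_pretraining_batch(multimodal_corpus, text_corpus,
                             text_fraction=0.25, batch_size=32):
    """Draw a pre-training batch in which roughly `text_fraction` of the
    examples are pure text. The 25% figure follows the ratio reported
    above; the sampling scheme itself is only illustrative.
    """
    batch = []
    for _ in range(batch_size):
        corpus = text_corpus if random.random() < text_fraction else multimodal_corpus
        batch.append(random.choice(corpus))
    return batch
```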
Ablation Studies and Analysis
Ablation studies reveal the effectiveness of InternVL3's key components:
- Native Multimodal Pre-training: Models using native multimodal pre-training showed performance comparable to fully multi-stage trained baselines, even without further optimization.
- Variable Visual Position Encoding: Different position increment values (δ) were tested, showing that even for tasks with short contexts, relatively small δ values can achieve optimal performance.
- Mixed Preference Optimization: Models fine-tuned with MPO demonstrated superior reasoning performance compared to counterparts without MPO, with improvements of 4.1 and 4.5 points for the 78B and 38B variants, respectively.
A Step Forward for Open-Source MLLMs
InternVL3 represents a significant advancement in the development of open-source multimodal models. By implementing a native multimodal pre-training approach, incorporating innovations like V2PE, and applying effective post-training techniques, InternVL3 achieves state-of-the-art performance among open-source MLLMs while narrowing the gap with leading proprietary models.
In line with open science principles, the researchers are releasing both the training data and model weights to foster further research and development in next-generation MLLMs. This commitment to transparency and accessibility helps advance the entire field of multimodal AI.
The success of InternVL3 demonstrates the effectiveness of unified training paradigms that jointly optimize for multiple modalities, potentially setting a new direction for the development of future multimodal AI systems.