This is a Plain English Papers summary of a research paper called Single Transformer Beats Modular Vision-Language Models in New Study. If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.

The Simplicity Revolution in Vision-Language Models

Multimodal Large Language Models (MLLMs) have seen remarkable progress, but most follow a modular design: a pre-trained vision encoder extracts image features, a language model processes text, and a lightweight projector aligns these modalities. While effective, this approach fragments multimodal processing and limits deployment flexibility.

A promising alternative has emerged - eliminating the visual encoder entirely and processing raw image patches and text tokens within a single Transformer. This approach, exemplified by SAIL (Single trAnsformer model for vIsion and Language), removes modality-specific modules, enabling parameter sharing and end-to-end learning of vision-language interactions.

Data scaling curves for a modular MLLM and SAIL (left): as pretraining data increases, SAIL shows a sharper performance gain, demonstrating superior data scalability. Comparison with existing Single Transformer-based MLLMs (right): SAIL pushes the performance boundary on both vision and vision-language tasks.

The authors' analysis of SAIL uncovers striking advantages of Single Transformer architectures:

  1. Superior Data Scaling: SAIL exhibits steeper performance gains as pretraining data scales. While modular MLLMs initially perform better, the gap narrows substantially with more data.

  2. Vision-Centric Information Flow: Analysis of attention distributions shows Single Transformers assign significantly higher attention scores to image tokens during prediction, indicating a more direct visual information pathway.

  3. Strong Visual Representation: The pretrained Single Transformer inherently serves as a powerful vision encoder, performing impressively on vision-centric tasks like classification and segmentation.

Current Landscape of Vision-Language Models

Paradigms in Vision-Language Model Design

The two primary approaches to MLLMs have distinct advantages and limitations:

Modular MLLMs with Visual Encoders rely on pretrained vision encoders like CLIP-ViT or InternViT to process visual inputs. While this enables effective transfer of visual knowledge, it introduces several drawbacks:

  • Separate vision encoders slow down training and inference
  • Integration of visual features into LLMs remains challenging
  • Balancing interactions between components becomes increasingly complex at scale

Single Transformer-based MLLMs Without Visual Encoders process raw image patches and text tokens through a unified architecture. These models fall into two categories:

  • Continuous tokenization (Fuyu-8B, SOLO): Directly maps patches to LLM embeddings via linear projections
  • Discrete tokenization (Chameleon, Emu3): Uses VQ-VAE tokenizers to compress images into discrete tokens

Despite innovations from models like EVE and Mono-InternVL, critical gaps remain in understanding the scaling laws and fundamental properties of purely end-to-end trained models.

Vision Representation Learning

Training effective visual representations with language data typically follows three paradigms:

  1. Text as Classification Labels: Uses textual descriptions to extract categorical supervision (Tag2Text, RAM, CatLIP)

  2. Image-Text Contrastive Learning: Aligns global image-text embeddings within a shared latent space (CLIP, ALIGN)

  3. Text as Autoregressive Targets: Treats caption generation as a pretext task for visual representation learning (SimVLM, CapPa)

SAIL aligns with the third category but removes architectural fragmentation by jointly modeling image patches and text tokens in a single Transformer. The approach shows that the resulting model learns transferable vision representations that work well on both multimodal tasks and pure vision tasks.

SAIL: A Single Transformer for Vision and Language

Model Architecture

SAIL uses a unified Transformer architecture that processes both images and text without separate encoders. Raw text is tokenized and embedded via the language model's embedding module, while each image is partitioned into patches that are projected into embeddings by a single linear layer. Special boundary tokens mark the start and end of each image patch span.
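To make the patchification concrete, here is a minimal sketch of the patch-to-embedding projection described above. The patch size of 14 comes from the experimental settings later in this summary; the hidden size of 4096 (Mistral-7B's width) and the absence of any extra normalization are assumptions, so SAIL's actual implementation may differ in such details.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Map raw image patches to LLM token embeddings with one linear projection."""

    def __init__(self, patch_size: int = 14, hidden_size: int = 4096):
        super().__init__()
        self.patch_size = patch_size
        # Each patch is a flattened (3 * patch_size * patch_size) pixel vector.
        self.proj = nn.Linear(3 * patch_size * patch_size, hidden_size)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W) with H and W divisible by patch_size
        b, c, h, w = image.shape
        p = self.patch_size
        patches = image.unfold(2, p, p).unfold(3, p, p)               # (b, c, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        return self.proj(patches)                                     # (b, num_patches, hidden_size)
```

The resulting patch embeddings are then placed in the token sequence between the boundary tokens, alongside the ordinary text embeddings, and fed to the shared Transformer.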

SAIL's unified architecture processes both images and text without extra modules (A), uses a mixed attention mechanism (B) that allows bidirectional attention within image patches, and applies multimodal rotary position embeddings (C) for improved spatial modeling.

Two key innovations distinguish SAIL:

  1. Bidirectional attention within image patches: Unlike previous Single Transformer approaches that use only causal attention, SAIL implements a mixed attention scheme:

    • Causal attention for text tokens (preserving autoregressive generation)
    • Full bidirectional attention for image tokens (enabling holistic spatial relationships)
  2. Multimodal Rotary Position Embeddings (M-RoPE): Decomposes positional encoding into height and width axes, with text tokens sharing uniform position IDs and image tokens adaptively mapping to spatial coordinates (both micro-designs are sketched in the code below).
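The following is a minimal sketch of these two micro-designs. The function names are hypothetical and the exact M-RoPE indexing used by SAIL may differ, so treat this as an illustration of the idea rather than the paper's implementation.

```python
import torch

def build_mixed_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Mixed attention: causal for text tokens, fully bidirectional inside each
    contiguous span of image patches.

    is_image: (seq_len,) bool tensor, True at image-patch positions
    returns:  (seq_len, seq_len) bool tensor, True where attention is allowed
    """
    seq_len = is_image.shape[0]
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()   # causal baseline
    span_start = None
    for i in range(seq_len + 1):
        inside = i < seq_len and bool(is_image[i])
        if inside and span_start is None:
            span_start = i                                    # an image span begins
        elif not inside and span_start is not None:
            mask[span_start:i, span_start:i] = True           # patches attend to each other freely
            span_start = None
    return mask

def mixed_position_ids(is_image: list, image_grids: list):
    """Two-axis (height, width) position ids: text tokens advance both axes in
    lockstep (behaving like ordinary 1-D RoPE), while the patches of each image
    take row/column offsets from the current position.

    is_image:    list of bools, True at image-patch positions
    image_grids: list of (rows, cols) for each image span, in order of appearance
    """
    h_ids, w_ids, pos, img_idx, i = [], [], 0, 0, 0
    while i < len(is_image):
        if not is_image[i]:
            h_ids.append(pos); w_ids.append(pos)
            pos += 1; i += 1
        else:
            rows, cols = image_grids[img_idx]
            for r in range(rows):
                for c in range(cols):
                    h_ids.append(pos + r); w_ids.append(pos + c)
            i += rows * cols
            pos += max(rows, cols)                            # next text token starts past the grid
            img_idx += 1
    return h_ids, w_ids
```

For a sequence of three text tokens followed by a 2×2 image and one trailing text token, `build_mixed_attention_mask(torch.tensor([0, 0, 0, 1, 1, 1, 1, 0], dtype=torch.bool))` lets the four patch positions attend to one another in both directions while every text position still sees only its past.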

Pretraining Strategy

SAIL applies a two-stage curriculum:

  1. Accelerated Visual Knowledge Acquisition: Pretraining on large-scale image-text pairs with lower resolution (224×224) to maximize data throughput. Pure text corpora are interleaved to prevent losing language capabilities.

  2. Enhancing Any-Resolution Image Understanding: Extended pretraining with an any-resolution strategy where images retain their native resolutions and positional embeddings adapt dynamically (a rough patch-grid sketch follows this list).
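As a rough illustration of the any-resolution idea, the sketch below computes a patch grid for a native-resolution image. The 3,600-token budget is taken from the #Vtoken column of the results table later in this summary; the snapping and downscaling policy is an assumption, not SAIL's documented procedure.

```python
import math

def anyres_patch_grid(height: int, width: int, patch: int = 14, max_tokens: int = 3600):
    """Return (rows, cols, num_tokens) for an image kept near its native resolution."""
    rows, cols = math.ceil(height / patch), math.ceil(width / patch)
    if rows * cols > max_tokens:                       # respect the visual-token budget
        scale = math.sqrt(max_tokens / (rows * cols))
        rows, cols = max(1, int(rows * scale)), max(1, int(cols * scale))
    # The image would then be resized to (rows * patch, cols * patch) before patchification.
    return rows, cols, rows * cols
```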

| Stage | Dataset | Image Res. | #Samples |
|---|---|---|---|
| Pretraining S1 | Recap-DataComp-1B [43], SlimPajama [66] | 224×224 | 512M |
| Pretraining S2 | Capfusion [84] | AnyRes | 60M |
| | OCR from LAION-COCO [63] | AnyRes | 7M |
| | InfinityMM Stage 2 subset [29] | AnyRes | 19M |
| | SlimPajama [66] | AnyRes | – |
| SFT | InfinityMM Stage 3 [29] | AnyRes | 6M |

Details of training datasets used across all stages, showing data sources, image resolution settings, and sample counts.

The pretraining objective uses standard language modeling loss only on text tokens, excluding image patches and special visual tokens from loss computation.
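In practice this masking is usually implemented by setting the labels of non-text positions to the ignore index of the cross-entropy loss. The sketch below is a generic next-token loss under that assumption, not SAIL's exact training code.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's cross-entropy

def text_only_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor, is_text: torch.Tensor):
    """Next-token prediction loss computed over text positions only.

    logits:    (batch, seq_len, vocab) model outputs
    input_ids: (batch, seq_len) token ids
    is_text:   (batch, seq_len) bool, False at image patches and special visual tokens
    """
    labels = input_ids.clone()
    labels[~is_text] = IGNORE_INDEX                    # exclude non-text tokens from the loss
    shift_logits = logits[:, :-1].contiguous()         # position t predicts token t + 1
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```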

Supervised Fine-tuning

During Supervised Fine-tuning (SFT), SAIL trains on publicly available instruction datasets to enhance understanding of complex linguistic instructions and diverse dialogue patterns. This phase fine-tunes the entire network, focusing on aligning the model's responses with human intent through exposure to varied instructional formats and multimodal interactions.

Experimental Results: Simplicity Meets Performance

Experimental Settings

SAIL evaluates across a broad range of benchmarks:

  • Vision-language: MMBench-EN, SEEDBench-IMG, MMVet, MME, HallusionBench, MathVista, OCRBench
  • Visual question answering: TextVQA, ScienceQA-IMG, AI2D, MMStar, RealWorldQA
  • Vision representation: ImageNet-1K, ADE20K, ARO

For pretraining, SAIL is initialized from Mistral-7B-v0.1, uses a patch size of 14, and is trained on 128 NVIDIA A100 GPUs. Supervised fine-tuning runs for one epoch with a global batch size of 512.

Vision-Language Task Performance

SAIL consistently outperforms other Single Transformer-based models across diverse vision-language tasks:

| Method | #Param | #Data | #Vtoken | MMS* | MMB-EN | SEED-I | MMV | MME | RWQA | POPE | Hallu | SQA-I | MathV | TQA | AI2D | OCRB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Modular MLLMs:* | | | | | | | | | | | | | | | | |
| InternVL-1.5 [13] | 2.2B | – / – | 3328 | 46.7 | 70.9 | 69.8 | 39.3 | 1902 | 57.9 | 88.3 | 37.3 | 84.9 | 41.3 | 70.5 | 69.8 | |
| QwenVL-Chat [3] | 7B | 7.2B / 50M | 256 | 34.5 | 60.6 | 58.2 | – | 1848 | 49.3 | – | 36.8 | 68.2 | 35.3 | 61.5 | 45.9 | |
| LLaVA-1.5 [47] | 7B | 0.4B+ / 665K | 576 | 33.1 | 64.3 | 64.3 | 30.5 | 1859 | 54.8 | 85.9 | 27.6 | 66.8 | 25.5 | 46.1 | 54.8 | |
| LLaVA-1.6 [48] | 7B | 0.4B+ / 760K | 2880 | 37.6 | 67.4 | 64.7 | 43.9 | 1842 | 57.8 | 86.4 | 27.6 | 70.2 | 32.5 | 64.9 | 66.6 | |
| Cambrian-1 [69] | 8B | 10B+ / 7M | 576 | 50.7 | 75.9 | 74.7 | – | – | 64.2 | – | 30.6 | 80.4 | 48.1 | 71.7 | 73.0 | |
| LLaVA-OneVision [41] | 7B | 10B+ / 3.2M | 7290 | 60.9 | 81.7 | 74.8 | 58.8 | 1998 | 65.5 | – | – | 96.6 | 56.1 | – | 81.6 | |
| *Single Transformer-based MLLMs:* | | | | | | | | | | | | | | | | |
| Fuyu [5] | 8B | – / – | – | 34.4 | 10.7 | 59.3 | 21.4 | – | 43.7 | 84 | 29.8 | 56.8 | 30.2 | – | 46.8 | |
| Chameleon [67] | 7B | 1.4B+ / 1.8M | 1024 | 31.1 | 31.1 | 30.6 | 8.3 | 170 | 39 | 19.4 | 17.1 | 47.2 | 22.5 | 4.8 | 46.0 | |
| EVE [21] | 7B | 33M / 1.8M | 2304 | – | 52.3 | 64.6 | 25.7 | 1628 | – | 85.0 | – | 64.9 | – | 56.8 | 61.0 | |
| SOLO [11] | 8B | 43.7M / 2M | 1024 | 35.8 | 67.7 | 64.4 | 30.4 | 1260 | 44.7 | 78.6 | 40.4 | 73.3 | 32.9 | 25.0 | 61.4 | |
| Mono-InternVL [54] | 3B | 1.3B / 7M | 6400 | – | 65.5 | 67.4 | 40.1 | 1875 | – | – | 45.7 | 93.6 | 45.7 | 72.6 | 68.6 | |
| Emu3 [79] | 8B | – / – | 16K | 46.6 | 58.5 | 68.2 | 37.2 | – | 57.4 | 85.2 | 31.7 | 89.2 | 31.3 | 64.7 | 70.0 | |
| EVE2 [22] | 7B | 92M / 7.3M | 2500 | – | 66.3 | 71.4 | 45.0 | 1709 | 62.4 | 87.6 | – | 96.2 | – | 71.1 | 74.8 | |
| SAIL | 7B | 600M / 6M | 3600 | 53.1 | 70.1 | 72.9 | 46.3 | 1719 | 63.9 | 85.8 | 54.2 | 93.3 | 57.0 | 77.1 | 76.7 | |

Comparison with existing vision-language models on benchmarks including MMStar, MMBench-EN, SEEDBench-IMG, MMVet, MME, RealWorldQA, POPE, HallusionBench, ScienceQA-IMG, MathVista-MINI, TextVQA, AI2D, and OCRBench, which the paper groups into general, VQA, hallucination, math & knowledge, and OCR categories. #Param denotes the number of activated parameters; #Data gives the pretraining / fine-tuning data volume; #Vtoken is the maximum number of image patch tokens.

Compared to the state-of-the-art modular MLLM LLaVA-OneVision, SAIL achieves comparable performance on several benchmarks including MMStar, SEEDBench-IMG, and RealWorldQA. This demonstrates that scaling up single-transformer pretraining effectively enhances cross-modal alignment between images and text.

Vision Representation Learning Performance

SAIL demonstrates exceptional performance on vision tasks compared to other single transformer models:

| Method | Classification Top-1 | Classification Top-5 | Segmentation mIoU | Segmentation mAcc | Segmentation aAcc |
|---|---|---|---|---|---|
| EVE [21] | 42.03 | 65.77 | 27.12 | 35.89 | 72.91 |
| EVE2 [22] | 44.86 | 69.41 | 40.85 | 53.53 | 79.31 |
| SOLO [11] | 59.10 | 80.89 | 35.11 | 44.81 | 76.02 |
| SAIL | 84.95 | 97.59 | 55.30 | 67.24 | 84.87 |

Comparison on image classification and semantic segmentation with other encoder-free approaches. SAIL outperforms the alternatives by a large margin.

Even when compared with specialized vision backbones, SAIL achieves competitive performance with significantly less training data:

| Method | #Data | #Param | ImageNet-1K (Top-1) | ADE20K (mIoU) |
|---|---|---|---|---|
| OpenCLIP-H [15] | 2B | 0.6B | 84.4 | – |
| OpenCLIP-G [15] | 2B | 1.8B | 86.2 | 39.3 |
| ViT-22B [19] | 3B | 22B | **89.5** | 55.3 |
| InternViT [14] | 6B | 6B | 88.2 | **58.7** |
| SAIL | **0.5B** | 7B | 85.0 | 55.3 |

Comparison on image classification and semantic segmentation with other vision backbones. Despite using significantly less training data, SAIL achieves competitive performance.

For additional vision-related tasks, Pixel-SAIL explores SAIL's downstream capabilities in pixel-grounded understanding.

Properties of Single Transformer Models

Model Scaling: As model size increases from 0.5B to 7B parameters, performance improves consistently across vision-language tasks:

Model scaling of SAIL: as model size increases from 0.5B to 7B parameters, the training language modeling loss decreases (left) and performance on downstream vision-language tasks improves (right).

Data Scaling: Single Transformer models show superior data scaling properties compared to modular MLLMs. While modular MLLMs perform better with limited data, SAIL exhibits a steeper performance curve as data scales, eventually achieving comparable performance with 512M image-text pairs.

Information Flow Pattern: Single Transformer models allocate significantly more attention to image tokens during prediction:

Proportion of attention scores allocated to image tokens across transformer layers when predicting tokens: Single Transformer-based MLLMs such as SAIL allocate substantially higher attention to image tokens than modular MLLMs, indicating a more direct flow of visual information.

Single Transformer-based MLLMs allocate between 60% and 80% of attention scores to image tokens across all layers when predicting tokens, compared to just 10% to 30% in modular MLLMs. This indicates a more vision-centric approach to multimodal reasoning.
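A measurement along these lines can be reproduced with the attention maps that most Transformer implementations can return (for example, Hugging Face models called with output_attentions=True). The helper below is a hypothetical sketch of one way to compute the per-layer share of attention placed on image tokens by the position being predicted; the paper's exact protocol may differ.

```python
import torch

def image_attention_share(attentions, is_image: torch.Tensor):
    """Per-layer fraction of attention mass that the last (predicting) position
    places on image tokens.

    attentions: list of (batch, heads, seq_len, seq_len) attention tensors, one per layer
    is_image:   (batch, seq_len) bool mask marking image-token positions
    """
    shares = []
    for layer_attn in attentions:
        last_query = layer_attn[:, :, -1, :]                        # attention from the final position
        image_mass = (last_query * is_image[:, None, :]).sum(dim=-1)
        shares.append((image_mass / last_query.sum(dim=-1)).mean().item())
    return shares  # one value in [0, 1] per layer
```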

Empirical Observations on Basic Factors

Ablation studies identify key design choices for effective single-transformer MLLMs:

| Setting | VQAv2 | GQA | SQA | TQA | SEED-I |
|---|---|---|---|---|---|
| Default | 59.1 | 46.9 | 59.6 | 20.1 | 35.1 |
| #1: No image full attention | 57.8 | 45.2 | 58.7 | 16.2 | 33.8 |
| #2: No pure text in pretraining | 56.3 | 42.1 | 48.6 | 18.3 | 32.4 |

Ablation study on basic factors for SAIL: impact of different ablation settings on performance across benchmarks. The default setting includes full attention over image patches and the inclusion of pure text data in pretraining.

  1. Bidirectional Attention for Image Patches with Multimodal Position Encoding: Enabling full bidirectional attention for image patches, paired with multimodal rotary position embeddings, significantly improves performance, with a notable gain on TextVQA (20.1 vs. 16.2 in the ablation above).

  2. Interleaving Pure Text Data During Pretraining: Incorporating text data consistently improves performance across benchmarks, underscoring the importance of preserving language capabilities when training Single Transformer models.

The Future of Single Transformer MLLMs

The extensive analysis of SAIL demonstrates that Single Transformer-based MLLMs offer promising advantages over modular designs. By scaling both training data and model size, this unified architecture achieves performance comparable to modular MLLMs while functioning effectively as a vision backbone.

Key advantages include superior data scalability, vision-centric information flow, and strong visual representation capabilities. Even with less pretraining data than modular approaches, the single-transformer design shows competitive performance on both vision-language and pure vision tasks.

These findings suggest a promising future for simpler, more integrated multimodal architectures. As research continues to refine and enhance Single Transformer designs, we may see these streamlined models become the standard approach for multimodal intelligence, offering better scaling properties and more direct cross-modal interactions than their modular counterparts.

The research on SAIL opens new avenues for exploring how unified architectures can effectively learn from multimodal data without relying on pretrained components, potentially leading to more flexible and powerful vision-language systems.
