This is a Plain English Papers summary of a research paper called LiveVQA: AI Challenges Answering Visual Questions with Real-Time Knowledge.

LiveVQA: Building a Benchmark for Live Visual Knowledge

Introduction

LiveVQA introduces a new benchmark dataset that tests AI systems on their ability to answer questions requiring the latest visual knowledge from the internet. Unlike text-based knowledge questions, which have seen significant advances through search-integrated language models, visual knowledge questions remain challenging for current AI systems.

The researchers created LiveVQA with three key design principles: strict temporal filtering to prevent dataset contamination, automated ground truth with human verification, and high-quality authentic image-question pairs. The dataset contains 1,233 news articles with 3,602 visual questions across 14 domains, sourced from six major global news platforms.

Each instance in LiveVQA contains a representative image paired with three types of questions: a basic visual understanding question (Level 1) and two multi-hop questions requiring deeper reasoning and contextual knowledge (Levels 2-3).
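
For illustration, a single LiveVQA instance might be represented along these lines; the field names below are hypothetical and do not reflect the dataset's released schema.

```python
# A hypothetical sketch of one LiveVQA instance (field names are illustrative only).
example_instance = {
    "source": "BBC",                      # one of the six news platforms
    "category": "Sports",                 # one of the 14 news categories
    "image_path": "images/example_001.jpg",
    "article_text": "...",                # the news text that grounds the answers
    "questions": [
        {"level": 1, "question": "Who is the athlete shown in the image?", "answer": "..."},
        {"level": 2, "question": "...", "answer": "..."},  # multi-hop, needs article context
        {"level": 3, "question": "...", "answer": "..."},  # multi-hop, deeper reasoning
    ],
}
```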

LiveVQA comprises 14 news categories, containing 1,233 news items and 3,602 question-answer pairs. Each instance includes a representative image, a QA pair for basic image understanding, and two multimodal multi-hop QA pairs for deeper reasoning.

The LiveVQA Dataset

The dataset provides comprehensive coverage across 14 news categories and 6 major global news sources. It features high-quality image-text coherence and authentic information, with each instance carefully designed to test different levels of visual reasoning ability.

Pipeline of the LiveVQA data engine. The pipeline consists of three modules: a news collector, a data filter, and a Q&A pairs builder. It collects illustrated news from mainstream media, performs multi-level data filtering, and generates foundational and detailed Q&A pairs for training multimodal question-answering models.

The distribution of news instances shows meaningful patterns across categories and sources:

| Category | Count | % | VRTY | BBC | CNN | APNWS | FORB | YHO |
|---|---|---|---|---|---|---|---|---|
| Sports | 305 | 24.8 | 1.0 | 48.8 | 20.3 | 7.5 | 15.5 | 0.0 |
| Other | 219 | 17.8 | 1.0 | 17.3 | 25.3 | 28.4 | 13.6 | 30.0 |
| Movies | 102 | 8.3 | 36.7 | 0.7 | 1.7 | 6.0 | 5.8 | 0.0 |
| TV | 89 | 7.2 | 31.0 | 1.8 | 2.1 | 2.5 | 4.9 | 5.0 |
| Science | 80 | 6.5 | 0.0 | 5.5 | 7.1 | 16.9 | 0.0 | 20.0 |
| Economy | 72 | 5.8 | 0.0 | 4.4 | 7.9 | 8.0 | 14.6 | 10.0 |
| Health | 67 | 5.4 | 1.0 | 6.6 | 3.3 | 12.4 | 1.0 | 5.0 |
| Media | 58 | 4.7 | 7.6 | 3.1 | 7.5 | 3.5 | 1.9 | 5.0 |
| Music | 47 | 3.8 | 11.9 | 2.0 | 0.8 | 3.0 | 4.9 | 0.0 |
| G.Business | 45 | 3.7 | 1.9 | 1.8 | 7.5 | 2.5 | 6.8 | 15.0 |
| Tech | 45 | 3.7 | 2.4 | 2.6 | 4.2 | 3.0 | 10.7 | 5.0 |
| Opinion | 45 | 3.7 | 1.0 | 2.4 | 8.3 | 2.5 | 5.8 | 5.0 |
| Art/Design | 43 | 3.5 | 0.0 | 2.4 | 4.2 | 4.0 | 13.6 | 0.0 |
| Theater | 15 | 1.2 | 4.8 | 0.9 | 0.0 | 0.0 | 1.0 | 0.0 |
| Total | 1,232 | 100 | 210 | 457 | 241 | 201 | 103 | 20 |
| Source % | | 100 | 17.1 | 37.1 | 19.6 | 16.3 | 8.4 | 1.6 |

Table 1. The distribution of 1,232 news instances across 14 categories and 6 major sources, containing 3,602 VQA pairs. The six rightmost columns give the per-source breakdown in percent.

Dataset Construction

The LiveVQA dataset construction involved three main steps:

  1. Data Collection: The researchers selected six global news platforms (CNN, BBC, Yahoo, Forbes, AP News, and Variety) for comprehensive geographic coverage and content diversity. The collection process included URL normalization to filter out advertisements and indexes, structured content extraction using site-specific CSS selectors, and image filtering to prioritize content-relevant images.

  2. Raw Data Filtering: To ensure dataset quality, a multi-level filtering mechanism was implemented, covering URL validation, image screening, and duplicate removal. This involved website-specific filters, quality prioritization for images, and hierarchical deduplication techniques.

  3. Question-Answer Generation: GPT-4o was employed to generate QA pairs from raw news documents. Each sample comprises an image reflecting the news topic, a basic question about image content, and two multi-hop detailed questions requiring cross-modality reasoning. The basic questions focus on substantive elements like people, objects, or locations, while avoiding queries solely reliant on visual properties like color. The multi-hop questions require deeper contextual reasoning and must be answerable only through the news text.

This careful construction process ensures high-quality visual question answering data that can properly test models' capabilities with the latest visual knowledge.
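
To make the construction steps above more concrete, here is a minimal sketch of how such a data engine could be wired together, assuming requests/BeautifulSoup for extraction and the OpenAI Python client for QA generation. The CSS selectors, image-size threshold, and prompt wording are hypothetical, not the authors' actual implementation.

```python
import json
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

# Hypothetical site-specific CSS selectors (step 1: data collection).
SITE_SELECTORS = {
    "bbc.com": {"title": "h1", "body": "article p", "images": "article img"},
    "cnn.com": {"title": "h1", "body": ".article__content p", "images": "article img"},
}
MIN_IMAGE_WIDTH = 300  # assumed threshold for dropping icons and ad banners (step 2)

def collect_article(url: str) -> dict:
    """Fetch a news page and extract its title, body text, and candidate images."""
    domain = next(d for d in SITE_SELECTORS if d in url)
    sel = SITE_SELECTORS[domain]
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    title = soup.select_one(sel["title"]).get_text(strip=True)
    body = "\n".join(p.get_text(strip=True) for p in soup.select(sel["body"]))

    def content_relevant(img) -> bool:
        # Keep images with no declared width or a width above the assumed threshold.
        width = img.get("width")
        return width is None or (width.isdigit() and int(width) >= MIN_IMAGE_WIDTH)

    images = [img["src"] for img in soup.select(sel["images"])
              if img.get("src") and content_relevant(img)]
    return {"url": url, "title": title, "body": body, "images": images}

def generate_qa(article: dict) -> list[dict]:
    """Ask GPT-4o for one basic and two multi-hop QA pairs grounded in the article (step 3)."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompt = (
        "Given the news article below, write three question-answer pairs as a JSON list:\n"
        "1) a Level-1 question about substantive image content (people, objects, locations),\n"
        "2) and 3) two multi-hop questions answerable only from the article text.\n\n"
        f"Title: {article['title']}\n\nArticle:\n{article['body']}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    # In practice the model output would need validation and human verification.
    return json.loads(response.choices[0].message.content)
```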

Data Statistics

The final dataset contains 1,232 news items spanning 14 categories and 6 global news platforms, amounting to 3,602 QA pairs. The dataset shows distinct domain specificity: sports news is the most prevalent category, with a significant portion sourced from BBC. News sources also exhibit clear domain preferences: Variety primarily covers film and music, Forbes focuses on business, and AP News emphasizes science and health.

This diverse coverage makes the dataset ideal for comprehensive evaluation of visual question answering capabilities across different domains.

Experiments and Analysis

Experiment Setups

The researchers conducted zero-shot testing on 15 state-of-the-art multimodal large language models (MLLMs), including Gemini 2.0 Flash, Qwen2.5-VL series (3B to 72B), Gemma-3 series (4B to 27B), QVQ-72B-Preview, QVQ-Max, GPT-4o-mini, and GPT-4o.

The evaluation also tested models with search capabilities, enabling built-in search functionality as well as MM-Search for Gemini-2.0-Flash, GPT-4o-mini, and GPT-4o. For scoring, GPT-4o-mini served as an impartial judge, providing binary (yes/no) correctness assessments.
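
A minimal sketch of such an LLM-as-judge scoring step, using the OpenAI Python client, might look like the following; the prompt wording is illustrative rather than the paper's exact instruction.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Given a question, a ground-truth answer, "
    "and a model's answer, reply with only 'yes' if the model's answer is "
    "correct and 'no' otherwise."
)

def judge_answer(question: str, ground_truth: str, model_answer: str) -> bool:
    """Ask GPT-4o-mini for a binary correctness verdict (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Ground-truth answer: {ground_truth}\n"
                f"Model answer: {model_answer}"
            )},
        ],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")
```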

Experiment Results

The evaluation revealed several important findings:

| Model | Avg. | L1 | L2 | L3 | Per. | Loc. | Tim. | Eve. | Org. | Obj. | Rea. | Oth. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| w.o. Search | | | | | | | | | | | | |
| Gemma-3-4b-it | 14.65 | 38.42 | 3.10 | 2.46 | 19.20 | 11.96 | 2.82 | 14.51 | 26.75 | 28.37 | 2.89 | 10.26 |
| Gemma-3-12b-it | 17.10 | 44.19 | 3.47 | 3.71 | 23.96 | 15.78 | 5.08 | 15.95 | 29.40 | 29.58 | 2.69 | 12.25 |
| Gemma-3-27b-it | 20.43 | 48.50 | 7.93 | 4.92 | 29.19 | 17.77 | 2.82 | 20.50 | 34.46 | 35.21 | 5.17 | 15.23 |
| Qwen2.5-VL-3B | 15.63 | 39.98 | 4.58 | 2.38 | 25.65 | 13.29 | 3.11 | 12.98 | 28.67 | 27.89 | 2.89 | 5.30 |
| Qwen2.5-VL-7B | 18.74 | 41.28 | 7.44 | 3.63 | 29.43 | 17.61 | 3.07 | 16.89 | 30.23 | 33.82 | 2.87 | 10.67 |
| Qwen2.5-VL-32B | 18.96 | 47.93 | 5.12 | 3.88 | 27.19 | 17.61 | 2.82 | 17.54 | 33.49 | 35.21 | 4.75 | 8.61 |
| Qwen2.5-VL-72B | 21.07 | 55.93 | 5.94 | 1.35 | 32.87 | 20.60 | 4.52 | 19.59 | 35.66 | 32.96 | 3.51 | 12.25 |
| GPT-4o | 16.38 | 41.02 | 4.54 | 3.62 | 2.61 | 21.43 | 5.08 | 18.68 | 28.67 | 41.97 | 6.20 | 15.23 |
| GPT-4o-mini | 17.30 | 43.71 | 4.95 | 3.19 | 5.84 | 21.93 | 3.67 | 20.05 | 32.53 | 41.13 | 6.20 | 13.58 |
| Gemini-2.0-Flash | 24.93 | 58.81 | 8.75 | 5.86 | 43.01 | 20.93 | 4.24 | 19.36 | 35.66 | 43.10 | 6.61 | 19.54 |
| QVQ-72B-Preview | 19.94 | 39.90 | 11.62 | 7.41 | 21.81 | 19.44 | 2.25 | 19.95 | 34.46 | 36.52 | 10.33 | 13.58 |
| QVQ-Max | 17.80 | 38.10 | 9.50 | 4.91 | 24.88 | 17.94 | 3.67 | 15.95 | 33.01 | 26.76 | 4.34 | 11.59 |
| w. Search | | | | | | | | | | | | |
| GPT-4o | 13.38 | 28.43 | 5.78 | 5.34 | 2.46 | 13.79 | 3.39 | 18.45 | 22.17 | 34.93 | 6.61 | 13.91 |
| GPT-4o-mini | 22.27 | 32.58 | 19.49 | 14.22 | 12.14 | 21.26 | 11.58 | 26.42 | 34.22 | 42.25 | 15.50 | 23.51 |
| Gemini-2.0-Flash | 29.46 | 59.63 | 16.43 | 11.03 | 44.85 | 25.91 | 11.58 | 24.60 | 44.10 | 45.63 | 9.92 | 23.51 |
| w. MM Search [12] | | | | | | | | | | | | |
| GPT-4o | 20.20 | 34.88 | 15.57 | 9.32 | 8.18 | 20.75 | 16.98 | 25.97 | 34.48 | 52.38 | 8.33 | 10.64 |
| GPT-4o-mini | 21.80 | 41.28 | 14.97 | 8.07 | 24.55 | 24.53 | 7.55 | 16.88 | 31.03 | 47.62 | 10.00 | 17.02 |
| Gemini-2.0-Flash | 29.00 | 49.42 | 22.75 | 13.66 | 44.55 | 26.42 | 20.75 | 20.78 | 29.31 | 42.86 | 11.67 | 27.66 |

Table 2. Overall performance on LiveVQA. See Table 5 for performance under an alternative categorization of live visual knowledge.

Key findings include:

  1. Larger models perform better: Within model families like Gemma or Qwen, increasing model size leads to consistently better accuracy across all question difficulty levels. For example, Gemma-3-4b-it achieves only 2.46% on L3-level questions, while Gemma-3-27b-it reaches 4.92%.

  2. Strong visual reasoning helps with complex questions: Models with stronger reasoning capabilities, such as QVQ-72B-Preview, outperform their base models on multi-hop questions, highlighting the importance of visual reasoning abilities in live knowledge tasks.

| Model | Avg. | L1 | L2 | L3 |
|---|---|---|---|---|
| Qwen2.5-VL-72B | 21.07 | 55.93 | 5.94 | 1.35 |
| QVQ-72B-Preview (short thinking) | 21.52 | 40.78 | 12.05 | 8.96 |
| QVQ-72B-Preview (medium thinking) | 19.94 | 39.90 | 11.62 | 7.41 |
| QVQ-72B-Preview (long thinking) | 21.71 | 41.75 | 12.47 | 10.09 |

Table 3. Performance comparison between Qwen2.5-VL-72B [35] and QVQ-72B-Preview [24] with different thinking patterns elicited via prompt engineering.

  3. Search engines significantly improve performance: Integrating MM-Search with Gemini-2.0-Flash boosts its average accuracy to 29.00%, with substantial gains on harder questions (22.75% on L2 and 13.66% on L3), demonstrating that retrieval-based evidence is particularly helpful for questions beyond models' internal knowledge.

  4. Entity-centric vs. abstract knowledge: Models perform better on concrete entity recognition tasks (Person, Organization, Object) but struggle with abstract knowledge like Time and Reason, indicating limitations in causal reasoning and temporal understanding.

  5. Domain-specific performance patterns: Models achieve higher performance in domains with rich visual and textual cues but perform poorly in ambiguous or subjective domains like Opinion.

The research also explored whether providing image captions could improve performance on more complex questions:

| Provided Info. | Avg. | L1 | L2 | L3 |
|---|---|---|---|---|
| Image | 17.30 | 43.71 | 4.95 | 3.19 |
| Image Caption | 14.83 | 34.77 | 5.28 | 3.62 |

Table 4. Performance of GPT-4o-mini on LiveVQA using the image versus an image description as input. The model performs better on L2 and L3 problems when provided with the image caption.

This demonstrates that for more complex reasoning tasks (L2 and L3), having the image caption helps models perform better, despite lower performance on basic visual understanding questions (L1).

These findings highlight the importance of multimodal inference capabilities in solving visual question answering tasks that require up-to-date knowledge.

Error Analysis

To understand the limitations of current models, the researchers conducted a comprehensive analysis of failure cases, categorizing errors into several types:

  1. Recognition Error: Failure to correctly identify visual elements like people, objects, locations, or text
  2. Reasoning Error: Inability to perform deeper reasoning, including inference, causality, or temporal understanding
  3. Ambiguous Answer: Responses too generic or indirect to address the question accurately
  4. Privacy Restriction: Refusal to answer due to privacy or safety constraints, despite the question being valid
  5. Judge Error: Correct answers mistakenly marked as incorrect during evaluation
  6. Others: Errors not covered by the above categories, including system failures

These error categories provide valuable insights into the specific challenges that models face when dealing with visual question answering tasks requiring the latest knowledge.

Related Works

The LiveVQA research builds on three key areas of previous work:

Live Knowledge Seeking: Recent advances in LLMs as search engines have shown significant progress, especially when integrated with retrieval-augmented generation (RAG) techniques. While tools like Search GPT have effectively combined RAG with online search for textual queries, multimodal search engines are still evolving to provide richer experiences integrating text, images, and other information types.

Live Visual Knowledge: Visual knowledge encompasses both factual information from external sources and commonsense understanding about objects and their interactions. Recent developments are expanding from static data to "live" visual content, addressing challenges in understanding current news, emerging events, and temporally relevant information.

News QA: News Question Answering has evolved from purely text-based approaches to NewsVQA, which integrates textual and visual information. This field faces unique challenges in multimodal fusion, temporal information processing, and scene text understanding.

Conclusion

LiveVQA provides a comprehensive benchmark for evaluating multimodal large language models on live visual knowledge tasks. The extensive experiments across 15 models reveal that while larger models generally perform better, significant challenges remain in addressing complex multi-hop visual questions that require up-to-date knowledge.

The research demonstrates that equipping models with online search tools or GUI-based image search leads to substantial performance improvements, particularly for challenging questions requiring contextual understanding and reasoning.

Despite these advances, even the best models achieve only modest accuracy (29.46% for Gemini-2.0-Flash with search), highlighting that live visual knowledge seeking remains an open challenge requiring further research.

Limitations

The researchers acknowledge several limitations of their study:

  1. The dataset primarily structures the latest information into visual question answering formats, while additional synthetic data approaches such as image captioning or Chain-of-Thought reasoning could further enhance models' capabilities.

  2. The research predominantly sources content from mainstream news websites like CNN and BBC, potentially leading to imbalance and incomplete representation of current internet content. Incorporating data from social media platforms could provide a more diverse dataset.

  3. More robust visual search tools need to be developed to enhance model performance on latest visual knowledge queries.

These limitations point to promising directions for future research in this rapidly evolving field.
