This is a Plain English Papers summary of a research paper called New Dataset: Boost Chinese Speech Recognition with Lip-Reading & Slides!

Enhancing Speech Recognition With Multiple Visual Cues

Automatic Speech Recognition (ASR) systems face numerous challenges in real-world scenarios, from background noise to specialized terminology. While incorporating visual modalities has proven beneficial, most existing approaches use either lip-reading information or contextual visuals, but rarely both. The new Chinese-LiPS dataset addresses this gap by integrating both lip-reading and presentation slides to create a comprehensive multimodal approach to speech recognition.

The research introduces both a high-quality dataset and a simple yet effective pipeline called LiPS-AVSR, which leverages lip movements and slide content to improve ASR performance. By combining these complementary visual sources, the system achieves a roughly 35% relative reduction in character error rate compared to an audio-only baseline.

| Dataset | Speakers | Duration (h) | Language | Lip-reading Video | Contextual Cues (Slides/Video) | Year | Available |
|---|---|---|---|---|---|---|---|
| LRW [12] | 1000+ | - | English | ✓ | - | 2018 | Y |
| LRS [13] | - | - | English | ✓ | - | 2017 | N |
| LRS2-BBC [3] | - | 200 | English | ✓ | - | 2018 | Y |
| LRS3-TED [14] | - | 400 | English | ✓ | - | 2018 | Y |
| LRW-1000 [15] | 2000+ | 57 | Chinese | ✓ | - | 2018 | Y |
| CMLR [16] | - | - | Chinese | ✓ | - | 2019 | Y |
| CN-Celeb-AV [17] | 1,136 | 669 | Chinese | ✓ | - | 2023 | Y |
| CN-CVS [18] | 2,557 | 300+ | Chinese | ✓ | - | 2023 | Y |
| How2 [19] | - | 2000 | Eng & Por | - | ✓ (instructional videos) | 2018 | Y |
| VisSpeech [7] | - | 0.6 | English | - | ✓ (YouTube videos) | 2022 | Y |
| SlideSpeech [9] | - | 1000+ | English | - | ✓ (presentation slides) | 2023 | Y |
| SlideAVSR [8] | 220 | 36 | English | - | ✓ (YouTube videos) | 2024 | Y |
| AVNS [20] | - | 30+ | English | - | ✓ (background scenes) | 2024 | Y |
| 3-Equations [21] | - | 25.2 | English | ✓ | ✓ (3-line math formulas) | 2024 | Y |
| Chinese-LiPS (ours) | 207 | 100 | Chinese | ✓ | ✓ (presentation slides) | 2024 | Y |

Comparison of Existing AVSR Datasets, Highlighting Lip-Reading and Contextual Information Availability

Unlike prior work such as XLAVS-R, which targets cross-lingual challenges, Chinese-LiPS is designed specifically to combine multiple visual sources for Chinese speech recognition.

The Power of Lip-Reading in AVSR

Lip-reading provides one of the most accessible visual modalities for Audio-Visual Speech Recognition (AVSR). It captures articulation details like mouth movements and naturally synchronizes with audio timing. This makes it particularly valuable for understanding speech in challenging conditions.

Several English datasets have pioneered this approach, including LRW from television shows and the LRS series from BBC broadcasts and TED talks. For Chinese, datasets like CMLR and CN-CVS have been gathered from news broadcasts and diverse real-world settings.

Researchers have developed multiple approaches to leverage these datasets. Auto-AVSR and Whisper-Flamingo use feature fusion and cross-attention to integrate lip-reading with speech data. AV-HuBERT employs self-supervised learning to unify audio and visual information, significantly enhancing recognition performance. Advanced techniques like those in Target Speaker Lipreading by Audio-Visual Self-Attention further demonstrate how visual attention mechanisms can improve recognition of specific speakers.

Beyond Lips: Semantic Visual Context for Speech Understanding

While lip-reading focuses on articulation, semantic visual contextual cues provide complementary information that can further enhance ASR. These cues include presentation slides, background scenes, and other visual context that help disambiguate spoken content.

Several datasets have explored this approach. SlideSpeech uses textual and graphical content from slides to provide semantic context. How2 and AVNS leverage background scenes from the environment. VisSpeech and SlideAVSR, collected from video-sharing platforms, have also proven effective for improving ASR performance.

Current methods primarily use OCR to detect text within contextual visuals and treat it as keywords to enhance recognition. Some approaches, such as VisPer, employ pre-trained multimodal models like CLIP and DALL-E to extract semantic information from visual cues. However, most existing work offers only limited integration of multiple visual modalities. For example, the 3-Equations dataset combines speech, lip-reading, and visual context, but restricts the visual component to simple three-line math formulas.

The Chinese-LiPS Dataset: A New Resource for Multimodal Speech Recognition

Building a High-Quality Multimodal Dataset

Chinese-LiPS represents the first high-quality multimodal Chinese AVSR dataset to integrate both lip-reading information and presentation slides. It contains approximately 100 hours of data comprising 36,208 clips from 207 speakers; each clip includes speech, a slide video, and a lip-reading video.

Unlike datasets built from web-scraped content, Chinese-LiPS features presentation slides created by domain experts to ensure content accuracy. The slides are carefully designed to avoid large blocks of text, with speakers going beyond merely reading the slides to deliver comprehensive explanations. This approach aligns with SynesLM research showing that integrated multimodal inputs yield superior recognition results.

Overview of slide styles and themes across topics in the Chinese-LiPS dataset: examples are shown for eight specific topics and the content included in each, while the 'Others' category covers diverse subjects such as dance, fashion, cuisine, and photography.

Distribution of total recording duration by topic.

Breaking Down the Data: Diversity and Balance

The dataset covers nine diverse topic categories: E-sports & Gaming, Automobile & Industry, Travel & Exploration, Sport & Competitions, Culture & History, Science & Technology, Movies & TV Series, Health & Wellness, and Others. These topics were selected based on their popularity on social media platforms and their inclusion of domain-specific terminology.

The Chinese-LiPS dataset maintains a balanced representation across several dimensions. The majority of speakers are professionals between 20 and 30 years old. Recording devices are primarily desktop computers (75%) and laptops (25%). Gender representation is balanced to minimize potential biases. Speech segments average 10 seconds in length, with none exceeding 30 seconds.

The dataset is split into roughly 85% training, 10% testing, and 5% validation sets (see the table below), with balanced gender and age distribution across subsets. Importantly, there is no overlap of speakers between the different sets.

| Split | Duration (h) | Segments | Speakers | M:F | Topics |
|---|---|---|---|---|---|
| Train | 85.37 | 30,341 | 175 | 1:1.16 | 9 |
| Test | 10.12 | 3,908 | 21 | 1:1.10 | 9 |
| Validation | 5.35 | 1,959 | 11 | 1:0.83 | 6 |
| All | 100.84 | 36,208 | 207 | 1:1.13 | 9 |

Split details of Chinese-LiPS Dataset, M:F represents the ratio of male to female speakers

Testing the Hypothesis: Experimental Framework

LiPS-AVSR: A Pipeline for Multimodal Speech Recognition

The LiPS-AVSR pipeline uses Whisper and Whisper-Flamingo as its backbones. Whisper is a Transformer-based end-to-end model with strong speech recognition capabilities, while Whisper-Flamingo extends it by incorporating lip-reading information through AV-HuBERT.

The pipeline extracts information from multiple modalities:

  • PaddleOCR extracts textual information from slides
  • InternVL2 captures semantic content from images and graphics
  • AV-HuBERT extracts lip-reading features

These features are then integrated into Whisper's prompt format, creating a comprehensive input that enhances transcription accuracy.
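
To make the integration step concrete, here is a minimal sketch of how the slide-derived prompt could be assembled and passed to Whisper. It assumes a plain concatenation of OCR text and an image caption; the `describe_slide` helper is a hypothetical stand-in for an InternVL2 query, and the lip-reading branch (Whisper-Flamingo with AV-HuBERT features) is omitted because its wiring is not spelled out here.

```python
import whisper                   # openai-whisper
from paddleocr import PaddleOCR  # OCR engine for slide text


def describe_slide(slide_image_path: str) -> str:
    """Hypothetical stand-in for an InternVL2 query that returns a short
    description of the slide's figures and graphics."""
    raise NotImplementedError


def build_slide_prompt(slide_image_path: str) -> str:
    # 1) OCR: collect textual keywords from the slide (Chinese model).
    ocr = PaddleOCR(lang="ch", use_angle_cls=True)
    lines = []
    for page in ocr.ocr(slide_image_path) or []:
        for _box, (text, _score) in page or []:
            lines.append(text)
    # 2) Vision-language model: summarize non-textual slide content.
    caption = describe_slide(slide_image_path)
    # 3) Concatenate both cues into a single prompt string.
    return "，".join(lines) + " " + caption


def transcribe_with_slides(audio_path: str, slide_image_path: str) -> str:
    model = whisper.load_model("large-v2")
    # initial_prompt biases decoding toward the in-domain terms and
    # context extracted from the slide.
    result = model.transcribe(
        audio_path,
        language="zh",
        initial_prompt=build_slide_prompt(slide_image_path),
    )
    return result["text"]
```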

Overview of the LiPS-AVSR pipeline.

Experiment Setup: Rigorous Testing Framework

The data preprocessing includes several specific steps for each modality:

  • Speech is processed at a 16 kHz sampling rate
  • Lip regions are detected and resized to 96×96 resolution at 25 frames per second
  • For slides, the first frame is extracted for OCR and semantic analysis
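
A minimal sketch of these preprocessing steps follows, assuming standard tooling (librosa for resampling, OpenCV for frame handling) as a stand-in for the authors' exact scripts; lip-region detection itself is treated as already done upstream.

```python
import cv2
import librosa
import soundfile as sf


def preprocess_audio(in_path: str, out_path: str) -> None:
    # Resample speech to 16 kHz mono, the rate Whisper expects.
    wav, _sr = librosa.load(in_path, sr=16000, mono=True)
    sf.write(out_path, wav, 16000)


def first_slide_frame(slide_video_path: str, out_path: str) -> None:
    # Slides are static within a clip, so the first frame is enough
    # for OCR and for the vision-language model.
    cap = cv2.VideoCapture(slide_video_path)
    ok, frame = cap.read()
    cap.release()
    if ok:
        cv2.imwrite(out_path, frame)


def resize_lip_video(lip_video_path: str, out_path: str) -> None:
    # Re-encode the detected lip-region video to 96x96 at 25 fps.
    cap = cv2.VideoCapture(lip_video_path)
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             25.0, (96, 96))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(cv2.resize(frame, (96, 96)))
    cap.release()
    writer.release()
```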

Experiments were conducted using the Whisper large-v2 model across two main settings:

  1. Speech-only: Using the original Whisper model fine-tuned with Chinese-LiPS training data
  2. Speech + Lip-reading: Fine-tuning the gated cross-attention layer of Whisper-Flamingo

For each setting, four prompt configurations were tested:

  1. No prompt
  2. OCR-extracted text only
  3. InternVL2-extracted semantic information only
  4. Combined OCR and InternVL2 features

The best-performing speech-only model was selected as the baseline for comparison.
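
Taken together, the two settings and four prompt configurations form a 2 × 4 grid, which corresponds to the eight rows (IDs 1–8) of the results table further below. Here is a small sketch of how this ablation grid might be enumerated, with illustrative configuration names and placeholder inputs:

```python
from itertools import product

SETTINGS = ("speech_only", "speech_plus_lip")  # fine-tuned Whisper vs. Whisper-Flamingo


def prompt_variants(ocr_text: str, internvl2_caption: str) -> dict:
    """The four visual-prompt configurations tested for each setting."""
    return {
        "no_prompt": None,
        "ocr_only": ocr_text,
        "internvl2_only": internvl2_caption,
        "ocr_plus_internvl2": f"{ocr_text} {internvl2_caption}",
    }


# Enumerate the full 2 x 4 grid; run_id mirrors the ID column of the results table.
variants = prompt_variants("<ocr text>", "<internvl2 caption>")
for run_id, (setting, (name, _prompt)) in enumerate(product(SETTINGS, variants.items()),
                                                    start=1):
    print(f"{run_id}: {setting} + {name}")
```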

Evaluation Method: Character Error Rate

The experiments used Character Error Rate (CER) as the evaluation metric, calculated as:

CER = (S + D + I) / N

Where S, D, and I represent substitution, deletion, and insertion errors, and N is the total number of characters in the reference. This metric is particularly appropriate for Chinese speech recognition, as it directly measures character-level accuracy.
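
For reference, here is a minimal, self-contained CER implementation based on character-level Levenshtein distance (a library such as jiwer gives the same result; this version is only for illustration):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (S + D + I) / N, computed via edit distance."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                dp[i - 1][j] + 1,                               # deletion
                dp[i][j - 1] + 1,                               # insertion
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


# One substituted character out of four reference characters -> CER = 0.25
print(cer("语音识别", "语音识认"))
```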

Results: Confirming the Multimodal Advantage

The experimental results demonstrate the significant benefits of integrating multiple modalities. The original Whisper model with only speech achieved a CER of 3.99%. Adding visual information from slides (OCR and InternVL2) improved performance substantially, with their combination reducing CER to 2.99%—a 25% improvement.

Lip-reading features contributed an 8% performance improvement on their own. When all modalities were combined (speech, lip-reading, OCR, and InternVL2), the CER dropped to 2.58%, representing a 35% overall improvement.

| ID | Speech | Lip-reading | OCR | InternVL2 | CER (%) ↓ |
|---|---|---|---|---|---|
| 1 (baseline) | ✓ | | | | 3.99 |
| 2 | ✓ | | ✓ | | 3.37 |
| 3 | ✓ | | | ✓ | 3.33 |
| 4 | ✓ | | ✓ | ✓ | 2.99 |
| 5 | ✓ | ✓ | | | 3.69 |
| 6 | ✓ | ✓ | ✓ | | 2.74 |
| 7 | ✓ | ✓ | | ✓ | 3.05 |
| 8 | ✓ | ✓ | ✓ | ✓ | 2.58 |

Performance Comparison of Whisper-Large-V2 on Chinese-LiPS
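
As a quick sanity check, the headline relative improvements follow directly from the CER values in the table, with the speech-only baseline at 3.99%:

(3.99 − 2.99) / 3.99 ≈ 25% (speech + slide prompts, ID 4)

(3.99 − 3.69) / 3.99 ≈ 7.5%, i.e., the roughly 8% lip-reading gain (ID 5)

(3.99 − 2.58) / 3.99 ≈ 35% (all modalities, ID 8)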

Error correction examples using slide and lip-reading information. (A) Lip-reading mitigates hesitation and filler errors, while slide data addresses domain-specific terms. (B) OCR fails to capture visual cues, but InternVL2 effectively extracts meaningful context.

Understanding the Benefits: Error Analysis

Detailed error analysis reveals that different modalities address specific types of errors:

  • Lip-reading primarily reduces deletion errors, decreasing them from 1697 to 509 (a roughly 70% reduction). This aligns with the hypothesis that lip-reading conveys articulation-related information, helping recover filler words, hesitation markers, and incomplete speech segments.

  • Presentation slides help reduce substitution errors (from 3851 to 3531) and sharply cut deletion errors (from 1697 to 447), especially for domain-specific terms. This demonstrates how slides provide the semantic and contextual information crucial for recognizing specialized vocabulary and proper nouns.

The SynesLM approach similarly found that multimodal information sources complement each other to provide more robust speech recognition across varying conditions.

| ID | Modality | Substitution ↓ | Deletion ↓ | Insertion ↓ |
|---|---|---|---|---|
| 1 | Speech only | 3851 | 1697 | 437 |
| 4 | Speech + Slides | 3531 | 447 | 510 |
| 5 | Speech + Lip | 4499 | 509 | 522 |
| 8 | Speech + Lip + Slides | 3047 | 335 | 484 |

Error Analysis Across Different Modalities in the Test Set: Total Number of Chinese Characters = 150,059

Conclusion: The Future of Multimodal Speech Recognition

The Chinese-LiPS dataset and LiPS-AVSR pipeline demonstrate that combining lip-reading and presentation slide information can significantly enhance speech recognition performance. This approach leverages the complementary strengths of different visual modalities: lip-reading captures articulation details, while slides provide semantic context and domain-specific terminology.

The research confirms that these modalities don't just provide redundant information—they address different types of errors. Lip-reading helps recover missing words and filler content, while slides improve accuracy for specialized terms that might otherwise be misinterpreted.

As speech recognition systems continue to evolve, this type of multimodal integration represents a promising direction for creating more robust and accurate solutions, particularly for challenging real-world applications like educational content, lectures, and presentations.

The Chinese-LiPS dataset will enable further research into multimodal speech recognition for Chinese language applications, filling an important gap in available resources for this widely-spoken language.
