This is a Plain English Papers summary of a research paper called "MedHal: New Dataset Flags Medical AI Lies - Can AI Detect False Health Info?".

MedHal: A New Dataset for Detecting Medical Hallucinations in AI Systems

Hallucinations in AI-generated content present significant challenges, especially in the medical domain where inaccurate information can have serious consequences. While current hallucination detection methods work reasonably well for general content, they fall short when applied to specialized domains like medicine.

Researchers from Polytechnique Montreal and Mila have introduced MedHal, a large-scale dataset specifically designed to evaluate whether models can detect hallucinations in medical texts. This resource addresses critical gaps in existing medical hallucination datasets, which are typically too small or too narrowly focused on single tasks.

The Critical Need for Better Medical Hallucination Detection

Current approaches to hallucination detection in medicine rely heavily on expert review—a process that is both costly and time-consuming. Existing evaluation datasets suffer from significant limitations: they either contain only a few hundred samples (insufficient for training models) or focus on narrow tasks like Question Answering.

MedHal addresses these gaps through three key innovations:

  1. Incorporating diverse medical text sources and tasks
  2. Providing a substantial volume of annotated samples suitable for training
  3. Including explanations for factual inconsistencies to guide model learning

The researchers demonstrate MedHal's utility by training and evaluating a baseline medical hallucination detection model, showing significant improvements over general-purpose hallucination detection approaches.

How MedHal Transforms Medical Datasets

The researchers developed a unified approach to transform existing medical datasets from various tasks—question answering, information extraction, natural language inference, and summarization—into a standardized hallucination detection framework.

Each sample in MedHal follows a consistent structure:

  • Statement: A proposition to be classified as factual or non-factual
  • Context (optional): Contextual information relevant to the statement
  • Factual label: A binary indicator (Yes/No)
  • Explanation: For non-factual statements, details about the inconsistency

| Task | Dataset(s) | Sample | Generated Statement |
| --- | --- | --- | --- |
| Information Extraction | Augmented Clinical Notes | A 10-year-old girl first noted a swollen left knee and underwent repeated arthrocentesis. She underwent arthroscopic surgery and was diagnosed with ... → age: 10 years old | The patient is 10 years old. |
| Summarization | SumPubMed | the large genotyping studies in the last decade have revolutionize genetic studies. our current ability to ... → genetic admixture is a common caveat for genetic association analysis. these results... | genetic admixture is a common caveat for genetic association analysis. these results... |
| NLI | MedNLI | Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4. → Patient has normal Cr | Patient has normal Cr |
| QA | MedQA, MedMCQA | Which of the following medications is most commonly used as first-line treatment for newly diagnosed type 2 diabetes mellitus in patients without contraindications? → Metformin | Metformin is most commonly used as first-line treatment for newly diagnosed type 2 diabetes mellitus in patients without contraindications. |

Examples of how statements are generated from samples across different task types
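
To make the unified schema concrete, here is a minimal Python sketch of how one MedHal sample could be represented. The class and field names are our own illustration, not the dataset's actual column names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MedHalSample:
    """One hallucination-detection example in the unified MedHal format."""
    statement: str                      # proposition to judge as factual or not
    is_factual: bool                    # the binary Yes/No factuality label
    context: Optional[str] = None       # source text, when the task provides one
    explanation: Optional[str] = None   # filled in only for non-factual samples

# A factual sample built from the information-extraction example above
sample = MedHalSample(
    statement="The patient is 10 years old.",
    is_factual=True,
    context="A 10-year-old girl first noted a swollen left knee ...",
)
```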

Creating Hallucination Examples from Medical Questions

For question-answering datasets, the researchers transformed questions and their correct answers into factual statements using a large language model. To create hallucinated samples, they paired questions with incorrect answers and converted these into statements.

| Sample Type | Question | Answer | Generated Statement |
| --- | --- | --- | --- |
| Factual | Which of the following medications is most commonly used as first-line treatment for newly diagnosed type 2 diabetes mellitus in patients without contraindications? | Metformin | Metformin is most commonly used as first-line treatment for newly diagnosed type 2 diabetes mellitus in patients without contraindications. |
| Non-Factual | (same question) | Insulin | Insulin is most commonly used as first-line treatment for newly diagnosed type 2 diabetes mellitus in patients without contraindications. |

Example of Question-Answering Dataset Transformation
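
The sketch below shows how this pairing could work in code. `rewrite_as_statement` stands in for the LLM call the authors describe (merging a question and an answer into one declarative sentence); its name, the explanation wording, and the random choice of distractor are our assumptions.

```python
import random

def make_qa_samples(question, correct_answer, wrong_answers, rewrite_as_statement,
                    rng=random.Random(0)):
    """Turn one multiple-choice QA item into a factual and a non-factual sample.

    `rewrite_as_statement(question, answer)` is a placeholder for the LLM call
    that merges a question and an answer into a single declarative sentence.
    """
    factual = {
        "statement": rewrite_as_statement(question, correct_answer),
        "is_factual": True,
    }
    wrong = rng.choice(wrong_answers)  # pair the question with an incorrect option
    non_factual = {
        "statement": rewrite_as_statement(question, wrong),
        "is_factual": False,
        # hypothetical explanation format; the paper's exact wording may differ
        "explanation": f"The correct answer is {correct_answer}, not {wrong}.",
    }
    return factual, non_factual
```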

Leveraging Clinical Information Extraction

For information extraction datasets (like clinical notes), the researchers used source documents as context and corresponding extractions as statements. To generate non-factual samples, they randomly interchanged extractions between different documents.

| Sample Type | Source Document | Extraction | Statement | Explanation |
| --- | --- | --- | --- | --- |
| Factual | A 10-year-old girl first noted a swollen left knee and underwent repeated arthrocentesis... | age: 10 years old | The patient is 10 years old | - |
| Non-Factual | A 10-year-old girl first noted a swollen left knee and underwent repeated arthrocentesis... | age: 16 years old | The patient is 16 years old | The patient is 10 years old |

Example of Information Extraction Dataset Transformation
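
In code, the cross-document swap might look like the following sketch. The paper only states that extractions are randomly interchanged between documents; the pairing logic here is our illustration.

```python
import random

def swap_extractions(docs_with_extractions, rng=random.Random(0)):
    """Create non-factual IE samples by pairing each document with an
    extraction drawn from a different document."""
    samples = []
    for i, (doc, _own_extraction) in enumerate(docs_with_extractions):
        # pick a donor document other than the current one
        j = rng.choice([k for k in range(len(docs_with_extractions)) if k != i])
        _, foreign_extraction = docs_with_extractions[j]
        samples.append({
            "context": doc,
            "statement": foreign_extraction,  # no longer grounded in `doc`
            "is_factual": False,
        })
    return samples
```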

Detecting Small Errors in Medical Summaries

For summarization datasets, the researchers used genuine summaries as factual samples. To create hallucinated samples, they extracted a sentence from the original summary and modified it to introduce contradictory information.

| Sample Type | Source Document | Summary | Statement | Explanation |
| --- | --- | --- | --- | --- |
| Factual | a central feature in the maturation of hearing is a transition in the electrical signature of cochlear hair cells from spontaneous calcium... | cochlear hair cells are high-frequency sensory receptors... | cochlear hair cells are high-frequency sensory receptors... | - |
| Non-Factual | (same document) | (same summary) | cochlear hair cells are low-frequency sensory receptors... | According to the source document, cochlear hair cells are high-frequency sensory receptors... |

Example showing how hallucinations are introduced into medical summaries
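
A minimal sketch of this corruption step follows. `contradict` stands in for the LLM edit the authors describe (e.g. flipping "high-frequency" to "low-frequency"); the sentence splitting and explanation format are our assumptions.

```python
import random

def make_summary_samples(source_document, summary, contradict,
                         rng=random.Random(0)):
    """Build one factual and one hallucinated sample from a genuine summary.

    `contradict(sentence)` is a placeholder for the LLM call that rewrites a
    sentence so it asserts contradictory information.
    """
    factual = {
        "context": source_document,
        "statement": summary,
        "is_factual": True,
    }
    # pick one sentence of the genuine summary and corrupt it
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    target = rng.choice(sentences)
    non_factual = {
        "context": source_document,
        "statement": contradict(target),
        "is_factual": False,
        "explanation": f"According to the source document, {target}.",
    }
    return factual, non_factual
```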

Dataset Composition and Scale

MedHal incorporates a diverse range of datasets spanning multiple tasks and content types. The resulting dataset includes over 348,000 samples, significantly larger than existing medical hallucination datasets.

| Dataset | Task | S | Content Type | Source | # Samples | # Generated |
| --- | --- | --- | --- | --- | --- | --- |
| MedMCQA | QA | ✗ | Medical Content | Pal et al. (2022) | 183,000 | 70,730 |
| MedNLI | NLI | ✗ | Clinical Notes | Herlihy & Rudinger (2021) | 11,232 | 7,488 |
| ACM | IE | ✓ | Clinical Notes | Bonnet & Boulenger (2024) | 22,000 | 73,040 |
| MedQA | QA | ✗ | Medical Content | Jin et al. (2020) | 12,723 | 18,906 |
| SumPubMed | Sum | ✗ | Clinical Trials | Gupta et al. (2021) | 33,772 | 178,657 |

Overview of the datasets used to generate MedHal (S indicates whether the source dataset is synthetic; # Generated is the number of MedHal samples derived from each source)

Evaluating Model Performance on MedHal

The researchers evaluated several models on MedHal, including general-purpose LLMs and medical-specific models. They also fine-tuned a small (1 billion parameter) Llama model on the MedHal dataset to create a baseline detector.
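
To illustrate what such a detector sees at fine-tuning or inference time, the sketch below serializes a MedHal sample into an instruction prompt. The template wording is our assumption; the exact prompt used for MedHal-Llama is not reproduced in this summary.

```python
def build_prompt(statement, context=None):
    """Serialize a MedHal sample into an illustrative instruction prompt."""
    parts = []
    if context:
        parts.append(f"Context:\n{context}\n")
    parts.append(f"Statement:\n{statement}\n")
    parts.append(
        "Is the statement factual? Answer Yes or No. "
        "If No, explain the factual inconsistency."
    )
    return "\n".join(parts)

print(build_prompt(
    statement="The patient is 10 years old.",
    context="A 10-year-old girl first noted a swollen left knee ...",
))
```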

Two types of metrics were used to evaluate performance:

  • Factuality metrics: precision, recall, and F1-score
  • Explanation metrics: BLEU and ROUGE scores to evaluate the quality of explanations
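
The authors' exact evaluation tooling is not stated in this summary, but both metric families can be computed with common libraries. Here is a sketch using scikit-learn, NLTK, and the rouge-score package, under the assumption that "factual" is the positive class.

```python
from sklearn.metrics import precision_recall_fscore_support
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def factuality_metrics(y_true, y_pred):
    # binary precision/recall/F1, treating "factual" (1) as the positive class
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {"precision": p, "recall": r, "f1": f1}

def explanation_metrics(reference, generated):
    # sentence-level BLEU with smoothing, plus ROUGE-1/ROUGE-2 F-measures
    bleu = sentence_bleu(
        [reference.split()], generated.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
    rouge = scorer.score(reference, generated)
    return {
        "bleu": bleu,
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
    }
```
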
| FT | Type | Model | P | R | F1 | BLEU | R1 | R2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✗ | General | Llama-1B-Instruct | 0.53 | 0.32 | 0.40 | 0.01 | 0.13 | 0.03 |
| ✗ | General | Llama-8B-Instruct | 0.59 | 0.62 | 0.60 | 0.05 | 0.24 | 0.14 |
| ✗ | Medical | BioMistral-7B | 0.56 | 0.43 | 0.49 | 0.03 | 0.22 | 0.11 |
| ✗ | Medical | MedLlama-8B | 0.52 | 0.59 | 0.55 | 0.03 | 0.21 | 0.08 |
| ✗ | Medical | OpenBioLLM-8B | 0.52 | 0.77 | 0.62 | 0.04 | 0.21 | 0.10 |
| ✓ | General | MedHal-Llama (Ours) | 0.75 | 0.77 | 0.76 | 0.45 | 0.70 | 0.59 |
Results of various models on the MedHal test set (FT: fine-tuned; factuality metrics P: precision, R: recall, F1: F1-score; explanation metrics BLEU, R1: ROUGE-1, R2: ROUGE-2)

Performance comparison across different source datasets in MedHal (figure: accuracy of models by dataset source)

Surprising Insights from Model Performance

The evaluation revealed several unexpected findings. General-purpose models demonstrated surprising aptitude for identifying medical inaccuracies, with Llama-8B-Instruct achieving the highest precision (0.59) among non-fine-tuned models.

In contrast, specialized medical models—despite extensive domain-specific training—often performed worse at hallucination detection. Only OpenBioLLM-8B demonstrated relatively strong performance with an F1-score of 0.62, though it showed higher recall than precision, suggesting a tendency to overclassify statements as factual.

Another unexpected finding was that general-purpose models outperformed medical models in generating explanations for hallucinations. Manual evaluation revealed this was partly because medical models often failed to follow instructions to explain their judgments.

Task-Specific Performance Variations

Performance varied significantly across different data sources and tasks. Information extraction proved challenging for all models—even the fine-tuned MedHal-Llama performed only slightly better than random chance. The researchers hypothesize this might be due to the synthetic nature of the information extraction dataset.

Interestingly, the researchers found no significant performance difference between tasks requiring contextual analysis (like summarization) and tasks operating on isolated statements (like question-answering). This suggests that the presence or absence of context doesn't substantially impact hallucination detection capability.

The Impressive Impact of Fine-Tuning

One of the most striking findings was the performance of the fine-tuned MedHal-Llama model. Despite having only 1 billion parameters, this model significantly outperformed much larger models (7-8B parameters), including those with extensive medical domain fine-tuning.

The MedHal-Llama model achieved a 0.76 F1-score on factuality detection—substantially higher than the best non-fine-tuned model (0.62). It also excelled at generating explanations for hallucinated content, with explanation metrics far above every other model (e.g., 0.45 vs. 0.05 BLEU and 0.70 vs. 0.24 ROUGE-1 against the best baseline).

This finding suggests that specialized medical fine-tuning alone doesn't guarantee strong performance in hallucination detection. A smaller model specifically trained on hallucination detection can outperform larger, more generally trained medical models.

Limitations and Future Directions

The researchers acknowledge several limitations of their approach. Most significantly, MedHal relies on generated content, with validity assessed primarily through visual inspection of samples. Ideally, comprehensive evaluation by medical professionals would further validate the dataset's accuracy and clinical relevance.

Additionally, the dataset is imbalanced across different tasks, with summarization samples over-represented. Future work should focus on balancing the dataset to ensure more equitable evaluation.

Advancing Medical AI Safety

MedHal represents a significant step forward in addressing the critical challenge of hallucination detection in medical AI. By providing a large-scale, diverse dataset that spans multiple medical text types and tasks, it enables more effective training and evaluation of hallucination detection models.

The impressive performance of the fine-tuned MedHal-Llama model demonstrates that even relatively small models can achieve strong performance when specifically trained on this task. This approach could significantly reduce the reliance on costly, time-consuming expert review in medical AI evaluation while improving safety.

As the field of medical hallucination detection continues to evolve alongside other domain-specific approaches like vision-based medical hallucination detection, resources like MedHal will play a crucial role in developing more reliable and trustworthy medical AI systems.
