This is a Plain English Papers summary of a research paper called "MedHal: New Dataset Flags Medical AI Lies - Can AI Detect False Health Info?".

MedHal: A New Dataset for Detecting Medical Hallucinations in AI Systems

Hallucinations in AI-generated content present significant challenges, especially in the medical domain where inaccurate information can have serious consequences. While current hallucination detection methods work reasonably well for general content, they fall short when applied to specialized domains like medicine.

Researchers from Polytechnique Montreal and Mila have introduced MedHal, a large-scale dataset specifically designed to evaluate whether models can detect hallucinations in medical texts. This resource addresses critical gaps in existing medical hallucination datasets, which are typically too small or too narrowly focused on single tasks.

The Critical Need for Better Medical Hallucination Detection

Current approaches to hallucination detection in medicine rely heavily on expert review—a process that is both costly and time-consuming. Existing evaluation datasets suffer from significant limitations: they either contain only a few hundred samples (insufficient for training models) or focus on narrow tasks like Question Answering.

MedHal addresses these gaps through three key innovations:

  1. Incorporating diverse medical text sources and tasks
  2. Providing a substantial volume of annotated samples suitable for training
  3. Including explanations for factual inconsistencies to guide model learning

The researchers demonstrate MedHal's utility by training and evaluating a baseline medical hallucination detection model, showing significant improvements over general-purpose hallucination detection approaches.

How MedHal Transforms Medical Datasets

The researchers developed a unified approach to transform existing medical datasets from various tasks—question answering, information extraction, natural language inference, and summarization—into a standardized hallucination detection framework.

Each sample in MedHal follows a consistent structure:

  • Statement: A proposition to be classified as factual or non-factual
  • Context (optional): Contextual information relevant to the statement
  • Factual label: A binary indicator (Yes/No)
  • Explanation: For non-factual statements, details about the inconsistency

| Task | Dataset(s) | Sample | Generated Statement |
| --- | --- | --- | --- |
| Information Extraction | Augmented Clinical Notes | A 10-year-old girl first noted a swollen left knee and underwent repeated arthrocentesis. She underwent arthroscopic surgery and was diagnosed with ... → age: 10 years old | The patient is 10 years old. |
| Summarization | SumPubMed | the large genotyping studies in the last decade have revolutionize genetic studies. our current ability to ... → genetic admixture is a common caveat for genetic association analysis. these results... | genetic admixture is a common caveat for genetic association analysis. these results... |
| NLI | MedNLI | Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4. → Patient has normal Cr | Patient has normal Cr |
| QA | MedQA, MedMCQA | Which of the following medications is most commonly used as first-line treatment for newly diagnosed type 2 diabetes mellitus in patients without contraindications? → Metformin | Metformin is most commonly used as first-line treatment for newly diagnosed type 2 diabetes mellitus in patients without contraindications. |

Examples of how statements are generated from samples across different task types
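
To make the unified schema concrete, here is a minimal Python sketch of how one MedHal sample could be represented. The class and field names are our own illustration, not the dataset's actual column names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MedHalSample:
    """One hallucination-detection example in the unified MedHal format."""
    statement: str                      # proposition to judge as factual or not
    is_factual: bool                    # the binary Yes/No factuality label
    context: Optional[str] = None       # source text, when the task provides one
    explanation: Optional[str] = None   # filled in only for non-factual samples

# A factual sample built from the information-extraction example above
sample = MedHalSample(
    statement="The patient is 10 years old.",
    is_factual=True,
    context="A 10-year-old girl first noted a swollen left knee ...",
)
```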

Creating Hallucination Examples from Medical Questions

For question-answering datasets, the researchers transformed questions and their correct answers into factual statements using a large language model. To create hallucinated samples, they paired questions with incorrect answers and converted these into statements.

| Sample Type | Question | Answer | Generated Statement |
| --- | --- | --- | --- |
| Factual | Which of the following medications is most commonly used as first-line treatment for newly diagnosed type 2 diabetes mellitus in patients without contraindications? | Metformin | Metformin is most commonly used as first-line treatment for newly diagnosed type 2 diabetes mellitus in patients without contraindications. |
| Non-Factual | (same question) | Insulin | Insulin is most commonly used as first-line treatment for newly diagnosed type 2 diabetes mellitus in patients without contraindications. |

Example of Question-Answering Dataset Transformation
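
The sketch below shows how this pairing could work in code. `rewrite_as_statement` stands in for the LLM call the authors describe (merging a question and an answer into one declarative sentence); its name, the explanation wording, and the random choice of distractor are our assumptions.

```python
import random

def make_qa_samples(question, correct_answer, wrong_answers, rewrite_as_statement,
                    rng=random.Random(0)):
    """Turn one multiple-choice QA item into a factual and a non-factual sample.

    `rewrite_as_statement(question, answer)` is a placeholder for the LLM call
    that merges a question and an answer into a single declarative sentence.
    """
    factual = {
        "statement": rewrite_as_statement(question, correct_answer),
        "is_factual": True,
    }
    wrong = rng.choice(wrong_answers)  # pair the question with an incorrect option
    non_factual = {
        "statement": rewrite_as_statement(question, wrong),
        "is_factual": False,
        # hypothetical explanation format; the paper's exact wording may differ
        "explanation": f"The correct answer is {correct_answer}, not {wrong}.",
    }
    return factual, non_factual
```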

Leveraging Clinical Information Extraction

For information extraction datasets (like clinical notes), the researchers used source documents as context and corresponding extractions as statements. To generate non-factual samples, they randomly interchanged extractions between different documents.

| Sample Type | Source Document | Extraction | Statement | Explanation |
| --- | --- | --- | --- | --- |
| Factual | A 10-year-old girl first noted a swollen left knee and underwent repeated arthrocentesis... | age: 10 years old | The patient is 10 years old | - |
| Non-Factual | A 10-year-old girl first noted a swollen left knee and underwent repeated arthrocentesis... | age: 16 years old | The patient is 16 years old | The patient is 10 years old |

Example of Information Extraction Dataset Transformation
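
In code, the cross-document swap might look like the following sketch. The paper only states that extractions are randomly interchanged between documents; the pairing logic here is our illustration.

```python
import random

def swap_extractions(docs_with_extractions, rng=random.Random(0)):
    """Create non-factual IE samples by pairing each document with an
    extraction drawn from a different document."""
    samples = []
    for i, (doc, _own_extraction) in enumerate(docs_with_extractions):
        # pick a donor document other than the current one
        j = rng.choice([k for k in range(len(docs_with_extractions)) if k != i])
        _, foreign_extraction = docs_with_extractions[j]
        samples.append({
            "context": doc,
            "statement": foreign_extraction,  # no longer grounded in `doc`
            "is_factual": False,
        })
    return samples
```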

Detecting Small Errors in Medical Summaries

For summarization datasets, the researchers used genuine summaries as factual samples. To create hallucinated samples, they extracted a sentence from the original summary and modified it to introduce contradictory information.

| Sample Type | Source Document | Summary | Statement | Explanation |
| --- | --- | --- | --- | --- |
| Factual | a central feature in the maturation of hearing is a transition in the electrical signature of cochlear hair cells from spontaneous calcium... | cochlear hair cells are high-frequency sensory receptors... | cochlear hair cells are high-frequency sensory receptors... | - |
| Non-Factual | (same document) | (same summary) | cochlear hair cells are low-frequency sensory receptors... | According to the source document, cochlear hair cells are high-frequency sensory receptors... |

Example showing how hallucinations are introduced into medical summaries
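
A minimal sketch of this corruption step follows. `contradict` stands in for the LLM edit the authors describe (e.g. flipping "high-frequency" to "low-frequency"); the sentence splitting and explanation format are our assumptions.

```python
import random

def make_summary_samples(source_document, summary, contradict,
                         rng=random.Random(0)):
    """Build one factual and one hallucinated sample from a genuine summary.

    `contradict(sentence)` is a placeholder for the LLM call that rewrites a
    sentence so it asserts contradictory information.
    """
    factual = {
        "context": source_document,
        "statement": summary,
        "is_factual": True,
    }
    # pick one sentence of the genuine summary and corrupt it
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    target = rng.choice(sentences)
    non_factual = {
        "context": source_document,
        "statement": contradict(target),
        "is_factual": False,
        "explanation": f"According to the source document, {target}.",
    }
    return factual, non_factual
```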

Dataset Composition and Scale

MedHal incorporates a diverse range of datasets spanning multiple tasks and content types. The resulting dataset includes over 348,000 samples, significantly larger than existing medical hallucination datasets.

| Dataset | Task | S | Content Type | Source | # Samples | # Generated |
| --- | --- | --- | --- | --- | --- | --- |
| MedMCQA | QA | ✗ | Medical Content | Pal et al. (2022) | 183,000 | 70,730 |
| MedNLI | NLI | ✗ | Clinical Notes | Herlihy & Rudinger (2021) | 11,232 | 7,488 |
| ACM | IE | ✓ | Clinical Notes | Bonnet & Boulenger (2024) | 22,000 | 73,040 |
| MedQA | QA | ✗ | Medical Content | Jin et al. (2020) | 12,723 | 18,906 |
| SumPubMed | Sum | ✗ | Clinical Trials | Gupta et al. (2021) | 33,772 | 178,657 |

Overview of the datasets used to generate MedHal (S indicates whether the source dataset is synthetic; # Generated is the number of MedHal samples derived from each source)

Evaluating Model Performance on MedHal

The researchers evaluated several models on MedHal, including general-purpose LLMs and medical-specific models. They also fine-tuned a small (1 billion parameter) Llama model on the MedHal dataset to create a baseline detector.
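
To illustrate what such a detector sees at fine-tuning or inference time, the sketch below serializes a MedHal sample into an instruction prompt. The template wording is our assumption; the exact prompt used for MedHal-Llama is not reproduced in this summary.

```python
def build_prompt(statement, context=None):
    """Serialize a MedHal sample into an illustrative instruction prompt."""
    parts = []
    if context:
        parts.append(f"Context:\n{context}\n")
    parts.append(f"Statement:\n{statement}\n")
    parts.append(
        "Is the statement factual? Answer Yes or No. "
        "If No, explain the factual inconsistency."
    )
    return "\n".join(parts)

print(build_prompt(
    statement="The patient is 10 years old.",
    context="A 10-year-old girl first noted a swollen left knee ...",
))
```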

Two types of metrics were used to evaluate performance:

  • Factuality metrics: precision, recall, and F1-score
  • Explanation metrics: BLEU and ROUGE scores to evaluate the quality of explanations
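
The authors' exact evaluation tooling is not stated in this summary, but both metric families can be computed with common libraries. Here is a sketch using scikit-learn, NLTK, and the rouge-score package, under the assumption that "factual" is the positive class.

```python
from sklearn.metrics import precision_recall_fscore_support
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def factuality_metrics(y_true, y_pred):
    # binary precision/recall/F1, treating "factual" (1) as the positive class
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {"precision": p, "recall": r, "f1": f1}

def explanation_metrics(reference, generated):
    # sentence-level BLEU with smoothing, plus ROUGE-1/ROUGE-2 F-measures
    bleu = sentence_bleu(
        [reference.split()], generated.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
    rouge = scorer.score(reference, generated)
    return {
        "bleu": bleu,
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
    }
```
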
| FT | Type | Model | P | R | F1 | BLEU | R1 | R2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✗ | General | Llama-1B-Instruct | 0.53 | 0.32 | 0.40 | 0.01 | 0.13 | 0.03 |
| ✗ | General | Llama-8B-Instruct | 0.59 | 0.62 | 0.60 | 0.05 | 0.24 | 0.14 |
| ✗ | Medical | BioMistral-7B | 0.56 | 0.43 | 0.49 | 0.03 | 0.22 | 0.11 |
| ✗ | Medical | MedLlama-8B | 0.52 | 0.59 | 0.55 | 0.03 | 0.21 | 0.08 |
| ✗ | Medical | OpenBioLLM-8B | 0.52 | 0.77 | 0.62 | 0.04 | 0.21 | 0.10 |
| ✓ | General | MedHal-Llama (Ours) | 0.75 | 0.77 | 0.76 | 0.45 | 0.70 | 0.59 |
Results of various models on the MedHal test set (FT: fine-tuned; factuality metrics P: precision, R: recall, F1: F1-score; explanation metrics BLEU, R1: ROUGE-1, R2: ROUGE-2)

Performance comparison across different source datasets in MedHal (figure: accuracy of models by dataset source)

Surprising Insights from Model Performance

The evaluation revealed several unexpected findings. General-purpose models demonstrated surprising aptitude for identifying medical inaccuracies, with Llama-8B-Instruct achieving the highest precision (0.59) among non-fine-tuned models.

In contrast, specialized medical models—despite extensive domain-specific training—often performed worse at hallucination detection. Only OpenBioLLM-8B demonstrated relatively strong performance with an F1-score of 0.62, though it showed higher recall than precision, suggesting a tendency to overclassify statements as factual.

Another unexpected finding was that general-purpose models outperformed medical models in generating explanations for hallucinations. Manual evaluation revealed this was partly because medical models often failed to follow instructions to explain their judgments.

Task-Specific Performance Variations

Performance varied significantly across different data sources and tasks. Information extraction proved challenging for all models—even the fine-tuned MedHal-Llama performed only slightly better than random chance. The researchers hypothesize this might be due to the synthetic nature of the information extraction dataset.

Interestingly, the researchers found no significant performance difference between tasks requiring contextual analysis (like summarization) and tasks operating on isolated statements (like question-answering). This suggests that the presence or absence of context doesn't substantially impact hallucination detection capability.

The Impressive Impact of Fine-Tuning

One of the most striking findings was the performance of the fine-tuned MedHal-Llama model. Despite having only 1 billion parameters, this model significantly outperformed much larger models (7-8B parameters), including those with extensive medical domain fine-tuning.

The MedHal-Llama model achieved a 0.76 F1-score on factuality detection—substantially higher than the best non-fine-tuned model (0.62). It also excelled at generating explanations for hallucinated content, with explanation metrics far above every other model (e.g., 0.45 vs. 0.05 BLEU and 0.70 vs. 0.24 ROUGE-1 against the best baseline).

This finding suggests that specialized medical fine-tuning alone doesn't guarantee strong performance in hallucination detection. A smaller model specifically trained on hallucination detection can outperform larger, more generally trained medical models.

Limitations and Future Directions

The researchers acknowledge several limitations of their approach. Most significantly, MedHal relies on generated content, with validity assessed primarily through visual inspection of samples. Ideally, comprehensive evaluation by medical professionals would further validate the dataset's accuracy and clinical relevance.

Additionally, the dataset is imbalanced across different tasks, with summarization samples over-represented. Future work should focus on balancing the dataset to ensure more equitable evaluation.

Advancing Medical AI Safety

MedHal represents a significant step forward in addressing the critical challenge of hallucination detection in medical AI. By providing a large-scale, diverse dataset that spans multiple medical text types and tasks, it enables more effective training and evaluation of hallucination detection models.

The impressive performance of the fine-tuned MedHal-Llama model demonstrates that even relatively small models can achieve strong performance when specifically trained on this task. This approach could significantly reduce the reliance on costly, time-consuming expert review in medical AI evaluation while improving safety.

As the field of medical hallucination detection continues to evolve alongside other domain-specific approaches like vision-based medical hallucination detection, resources like MedHal will play a crucial role in developing more reliable and trustworthy medical AI systems.
