Medical Hallucination Detection

A Comprehensive Benchmark for Detecting Hallucinations in Medical LLM Outputs

Shrey Pandit1 , Jiawei Xu1 , Junyuan Hong1 , Zhangyang Wang1 , Tianlong Chen2 , Kaidi Xu3 , Ying Ding1

1University of Texas at Austin | 2UNC Chapel Hill | 3Drexel University

[Figure: MedHallu overview]

Abstract

Advancements in Large Language Models (LLMs) and their increasing use in medical question answering necessitate rigorous evaluation of their reliability. A critical challenge is hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 on the “hard” category of hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to the ground truth. Our experiments also show that incorporating domain-specific knowledge and introducing a “not sure” answer option improve precision and F1 scores by up to 38% relative to baselines.

Introduction

We present the Medical Hallucination detection dataset (MedHallu), a comprehensive corpus of 10,000 medical question-answer pairs derived from the established PubMedQA dataset. Each pair is meticulously annotated to distinguish accurate responses from hallucinated content. Furthermore, MedHallu is stratified into easy, medium, and hard detection tiers based on the subtlety of hallucinations, enabling granular evaluation of model capabilities.
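To make the detection task concrete, below is a minimal sketch of how a MedHallu-style example could be represented and turned into a binary detection prompt. The field names (question, knowledge, answer, label, difficulty) and the prompt wording are illustrative assumptions, not the official release schema.

```python
from dataclasses import dataclass


@dataclass
class MedHalluExample:
    """One MedHallu-style sample (field names are illustrative assumptions)."""
    question: str    # medical question derived from PubMedQA
    knowledge: str   # supporting context passage (used in the "with knowledge" setting)
    answer: str      # either the ground-truth answer or a generated hallucination
    label: int       # 1 = hallucinated, 0 = faithful
    difficulty: str  # "easy" | "medium" | "hard"


def build_detection_prompt(ex: MedHalluExample, with_knowledge: bool) -> str:
    """Format a binary hallucination-detection query for an LLM judge."""
    context = f"World knowledge:\n{ex.knowledge}\n\n" if with_knowledge else ""
    return (
        f"{context}Question:\n{ex.question}\n\n"
        f"Candidate answer:\n{ex.answer}\n\n"
        "Is the candidate answer faithful to medical facts, or is it a hallucination? "
        "Reply with exactly one word: 'Yes' (hallucination) or 'No' (not a hallucination)."
    )


if __name__ == "__main__":
    demo = MedHalluExample(
        question="Does drug X reduce systolic blood pressure in adults?",
        knowledge="A randomized trial reported a modest reduction in systolic blood pressure ...",
        answer="Drug X permanently eliminates hypertension after a single dose.",
        label=1,
        difficulty="hard",
    )
    print(build_detection_prompt(demo, with_knowledge=True))
```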

[Figure: Hallucination generation pipeline]

Methodology

The proposed framework is a multi-stage pipeline for robust hallucinated-sample generation: (1) stochastic sampling of candidate hallucinated responses conditioned on in-context examples and precise hallucination definitions, (2) LLM-based quality filtering, (3) correctness checking via bidirectional entailment and LLM prompting, and (4) iterative refinement via TextGrad. Inspired by HaluEval, if no high-quality sample survives these checks, we fall back to the generated sample that is most semantically similar to the ground truth. This multi-layered approach enables comprehensive identification and evaluation of linguistic hallucinations while minimizing false positives. A minimal sketch of this control flow follows.
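The sketch below mirrors the stated control flow, not the authors' implementation. The LLM sampler, quality filter, entailment checker, and similarity function are injected as callables (hypothetical names) rather than tied to specific models, and the TextGrad refinement step is omitted.

```python
from typing import Callable, List, Optional


def is_bidirectionally_entailed(a: str, b: str, entails: Callable[[str, str], bool]) -> bool:
    """Treat two answers as equivalent only if each entails the other."""
    return entails(a, b) and entails(b, a)


def generate_hallucinated_answer(
    question: str,
    ground_truth: str,
    sample_candidates: Callable[[str], List[str]],  # step 1: stochastic LLM sampling
    passes_quality_filter: Callable[[str], bool],   # step 2: LLM-based quality check
    entails: Callable[[str, str], bool],            # step 3: bidirectional entailment
    similarity: Callable[[str, str], float],        # fallback: semantic similarity
) -> Optional[str]:
    """Sketch of the multi-stage hallucination-generation loop described above.

    Step 4 (TextGrad-style refinement) is omitted, and difficulty grading of the
    retained samples is not shown.
    """
    candidates = sample_candidates(question)
    kept = []
    for cand in candidates:
        if not passes_quality_filter(cand):
            continue
        # Reject candidates that are actually correct, i.e. bidirectionally
        # entailed by the ground-truth answer.
        if is_bidirectionally_entailed(cand, ground_truth, entails):
            continue
        kept.append(cand)
    if kept:
        # The paper further grades survivors into easy/medium/hard tiers;
        # here we simply return the first one.
        return kept[0]
    # HaluEval-inspired fallback: if no candidate survives the filters, return
    # the candidate most semantically similar to the ground truth.
    if candidates:
        return max(candidates, key=lambda c: similarity(c, ground_truth))
    return None
```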

Results

Below is a comprehensive table comparing the performance of various LLMs with and without knowledge on the MedHallu dataset (10,000 samples).

Model | Without Knowledge (Overall F1, Overall P, Easy F1, Med F1, Hard F1) | With Knowledge (Overall F1, Overall P, Easy F1, Med F1, Hard F1) | Δ Knowledge
GPT-4o* | 0.737 0.723 0.844 0.758 0.625 | 0.877 0.882 0.947 0.880 0.811 | 0.140
GPT-4o mini | 0.607 0.772 0.783 0.603 0.446 | 0.841 0.820 0.914 0.854 0.761 | 0.234
Qwen2.5-14B-Instruct | 0.619 0.691 0.773 0.611 0.483 | 0.852 0.857 0.935 0.856 0.769 | 0.233
Gemma-2-9b-Instruct | 0.515 0.740 0.693 0.512 0.347 | 0.838 0.809 0.918 0.848 0.758 | 0.323
Llama-3.1-8B-Instruct | 0.522 0.791 0.679 0.515 0.372 | 0.797 0.775 0.880 0.796 0.722 | 0.275
DeepSeek-R1-Distill-Llama-8B | 0.514 0.570 0.589 0.515 0.444 | 0.812 0.864 0.895 0.794 0.751 | 0.298
Qwen2.5-7B-Instruct | 0.553 0.745 0.733 0.528 0.402 | 0.839 0.866 0.923 0.832 0.770 | 0.286
Qwen2.5-3B-Instruct | 0.606 0.495 0.667 0.602 0.556 | 0.676 0.514 0.693 0.677 0.661 | 0.070
Llama-3.2-3B-Instruct | 0.499 0.696 0.651 0.467 0.384 | 0.734 0.775 0.822 0.723 0.664 | 0.235
Gemma-2-2b-Instruct | 0.553 0.620 0.680 0.524 0.457 | 0.715 0.786 0.812 0.705 0.631 | 0.162
Average (General LLMs, w/o GPT-4o) | 0.533 0.686 0.674 0.517 0.412 | 0.784 0.789 0.864 0.781 0.716 | 0.251
OpenBioLLM-Llama3-8B | 0.484 0.490 0.494 0.474 0.483 | 0.424 0.567 0.438 0.412 0.423 | -0.060
BioMistral-7B | 0.570 0.518 0.627 0.563 0.525 | 0.648 0.516 0.652 0.660 0.634 | 0.078
Llama-3.1-8B-UltraMedical | 0.619 0.657 0.747 0.596 0.524 | 0.773 0.679 0.832 0.777 0.718 | 0.153
Llama3-Med42-8B | 0.416 0.829 0.600 0.379 0.264 | 0.797 0.856 0.898 0.794 0.707 | 0.381
Average (Medical Fine-Tuned LLMs) | 0.522 0.623 0.617 0.503 0.449 | 0.660 0.654 0.705 0.660 0.620 | 0.138

“Overall P” denotes precision; Easy, Med, and Hard F1 are F1 scores on the easy, medium, and hard hallucination tiers. “Δ Knowledge” is the change in Overall F1 when knowledge is provided.
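For reference, these metrics can be reproduced with standard binary-classification scoring. The sketch below assumes a label/prediction encoding (1 = hallucinated, 0 = faithful) and a per-example difficulty tag; the authors' exact evaluation script may differ.

```python
from sklearn.metrics import f1_score, precision_score


def detection_scores(labels, preds, difficulties):
    """Overall F1/precision and per-tier F1 for binary hallucination detection.

    labels, preds:  1 = hallucinated, 0 = faithful (encoding assumed here).
    difficulties:   "easy" | "medium" | "hard" per example.
    """
    scores = {
        "overall_f1": f1_score(labels, preds),
        "overall_precision": precision_score(labels, preds),
    }
    for tier in ("easy", "medium", "hard"):
        idx = [i for i, d in enumerate(difficulties) if d == tier]
        scores[f"{tier}_f1"] = (
            f1_score([labels[i] for i in idx], [preds[i] for i in idx])
            if idx else float("nan")
        )
    return scores


# "Δ Knowledge" in the table is the gain in Overall F1 from providing knowledge:
# delta_knowledge = with_knowledge["overall_f1"] - without_knowledge["overall_f1"]
```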

Conclusion

We introduce MedHallu, a comprehensive benchmark comprising 10,000 rigorously curated medical question-answer pairs with hallucinated answers. MedHallu integrates fine-grained categorization of medical hallucination types, a hallucination generation framework that balances difficulty levels while mitigating single-LLM bias through multi-model majority voting, and a systematic evaluation of hallucination detection across diverse LLM configurations.

Our evaluation reveals that existing LLMs exhibit significant limitations in detecting medical hallucinations, struggling in particular with “hard” hallucinated answers, which are semantically closest to the ground truth. We also provide insights into improving LLMs’ hallucination detection: when supporting knowledge is provided, general-purpose LLMs can outperform medically fine-tuned models, and allowing models to decline to answer via a “not sure” option improves precision in critical applications.
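One way to operationalize the “not sure” option is to treat it as an abstention at scoring time and compute precision only over confident predictions. The protocol below (excluding abstentions and reporting the abstention rate) is an assumption for illustration, not necessarily the paper's exact setup.

```python
from sklearn.metrics import f1_score, precision_score


def score_with_abstention(labels, preds):
    """Score hallucination detection when a 'not sure' abstention is allowed.

    labels: 1 = hallucinated, 0 = faithful.
    preds:  "yes" (hallucination), "no" (faithful), or "not sure".
    Assumed protocol: metrics are computed over the non-abstained subset,
    with the abstention rate reported alongside them.
    """
    answered = [(lab, 1 if p == "yes" else 0)
                for lab, p in zip(labels, preds) if p != "not sure"]
    if not answered:
        return {"precision": float("nan"), "f1": float("nan"), "abstention_rate": 1.0}
    y_true, y_pred = zip(*answered)
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "abstention_rate": 1 - len(answered) / len(labels),
    }
```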

As the largest open medical hallucination benchmark to date, MedHallu serves as a valuable resource for evaluating LLMs’ medical hallucination detection abilities and offers insights into the cautious use of LLMs in high-stakes medical domains.


BibTeX Citation

@misc{pandit2025medhallucomprehensivebenchmarkdetecting,
  title={MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models}, 
  author={Shrey Pandit and Jiawei Xu and Junyuan Hong and Zhangyang Wang and Tianlong Chen and Kaidi Xu and Ying Ding},
  year={2025},
  eprint={2502.14302},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.14302},
}