A Pilot Study on Doubt Robustness of LLMs in Clinical Prediction Explanation

A Pilot Study on Doubt Robustness of LLMs in Clinical Prediction Explanation

Juhwan Choi, Sangchul Hahn, Eunho Yang

We study large language models (LLMs) as clinical explanation generators and evaluate their robustness to user doubt in interactive settings. Using an in-hospital mortality prediction task on the MIMIC-III dataset, we examine how simple challenge prompts affect the consistency of LLM-generated explanations. We adopt the concept of doubt robustness and assess it by prompting models to explain risk predictions and indicate agreement, followed by doubt-inducing queries. Our results show that instruction-tuned models frequently reverse their initial stance, while reasoning-enhanced models exhibit improved but still limited stability. Further analysis suggests that LLMs rely heavily on model outputs rather than ground-truth labels, reducing explanation faithfulness. These findings highlight the need for robustness-oriented evaluation of clinical explanation systems.

Status-Aware Self-Supervised Forecasting for Irregular Clinical Time Series

Evaluating the Predictive Potential of an AI-Driven Deep Learning Model for Pneumonia-Associated Sepsis

PUBLICATIONS

A Pilot Study on Doubt Robustness of LLMs in Clinical Prediction Explanation