The deployment of large language models in clinical psychiatry represents a compelling but precarious frontier. Unlike image classification or document summarization, psychiatric risk assessment demands not just accuracy but calibrated stability: the model's predictions must remain consistent when presented with clinically irrelevant information, and its reasoning must withstand scrutiny from domain experts. Yet as LLMs increasingly power downstream clinical tasks, from triage systems to risk stratification, a fundamental question remains unanswered: how reliably do these systems actually perform in the psychiatric domain, where diagnostic uncertainty is inherent and high-stakes decisions rest on nuanced clinical judgment?
This new work from arXiv (2604.22063) tackles precisely this gap through a systematic reliability audit framework. Rather than asking whether LLMs can predict hospitalization risk—they demonstrably can—the authors ask a more incisive question: how sensitive are these predictions to contextual noise and prompt framing? For clinical deployment, this distinction is not academic; it is foundational. A model that performs well on clean, curated datasets may catastrophically fail when exposed to the messy reality of electronic health records, where irrelevant historical notes, administrative metadata, and tangential patient information inevitably contaminate the signal.
The experimental design reflects sophisticated thinking about real-world failure modes. The researchers constructed synthetic patient profiles (n=50) containing 15 clinically relevant features—symptom severity, psychiatric history, medication status, and similar variables that legitimately inform hospitalization decisions. Critically, they then injected up to 50 medically insignificant features: irrelevant biographical details, unrelated medical history, procedural notes without clinical bearing. This injection of noise mirrors real clinical practice, where LLMs encounter dense, heterogeneous documentation. The audit tested four production-grade models (Gemini 2.5 Flash, LLaMA 3.3 70B, Claude Sonnet 4.6, GPT-4o mini) across four distinct prompt framings: neutral clinical language, logical-reasoning scaffolding, human-impact framing, and clinical-judgment-focused prompts. This factorial design isolates both model-intrinsic sensitivity and prompt-dependent variability.
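The paper does not ship code, but the structure of the audit is straightforward to picture. The sketch below is an illustrative reconstruction rather than the authors' implementation: the feature names, prompt templates, risk scale, and the `query_model` wrapper are all placeholder assumptions.

```python
import itertools
import random

# Illustrative placeholders; the paper's actual variables are not reproduced here.
EXAMPLE_PROFILE = {"symptom_severity": "moderate", "prior_hospitalizations": 2,
                   "medication_adherence": "partial"}          # ...15 clinically relevant fields
NOISE_FEATURES = ["favorite_hobby", "insurance_carrier", "clinic_parking_note"]  # ...up to 50 irrelevant fields

MODELS = ["gemini-2.5-flash", "llama-3.3-70b", "claude-sonnet-4.6", "gpt-4o-mini"]
FRAMINGS = ["neutral", "logical_reasoning", "human_impact", "clinical_judgment"]

def build_prompt(profile: dict, framing: str) -> str:
    """Render one patient profile under one of the four prompt framings (hypothetical templates)."""
    header = {
        "neutral": "Estimate this patient's hospitalization risk as a percentage.",
        "logical_reasoning": "Reason step by step, then estimate hospitalization risk as a percentage.",
        "human_impact": "Keeping the patient's wellbeing in mind, estimate hospitalization risk as a percentage.",
        "clinical_judgment": "Using your best clinical judgment, estimate hospitalization risk as a percentage.",
    }[framing]
    body = "\n".join(f"- {k}: {v}" for k, v in profile.items())
    return f"{header}\n{body}"

def run_audit(profiles, query_model, noise_levels=(0, 10, 25, 50)):
    """Factorial sweep over model x framing x noise level, recording one risk estimate per profile."""
    records = []
    for model, framing, n_noise in itertools.product(MODELS, FRAMINGS, noise_levels):
        for profile in profiles:
            noisy = dict(profile)
            for feat in random.sample(NOISE_FEATURES, k=min(n_noise, len(NOISE_FEATURES))):
                noisy[feat] = "unremarkable"  # clinically insignificant filler
            risk = query_model(model, build_prompt(noisy, framing))  # caller-supplied API wrapper
            records.append({"model": model, "framing": framing, "n_noise": n_noise, "risk": risk})
    return records
```

A run would then look like `run_audit([EXAMPLE_PROFILE], query_model=my_api_call)`, where `my_api_call` is whatever wrapper handles authentication and response parsing for the provider under test.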
The results are sobering. Across all four models and all prompt conditions, the inclusion of clinically insignificant variables produced statistically significant increases in both absolute mean predicted hospitalization risk and output variability. In other words, the presence of noise didn't merely add random jitter—it systematically inflated risk predictions while simultaneously destabilizing them. This dual failure (bias + variance) is particularly dangerous in clinical contexts. A clinician might tolerate a model that occasionally overpredicts risk; they cannot tolerate a model whose predictions drift unpredictably based on spurious features. The authors quantify this through measures of "attributional stability," essentially asking: does the model's reasoning remain consistent when irrelevant information is added? The answer, empirically, is no.
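The exact statistics live in the paper, but the dual bias-and-variance comparison can be illustrated with a small helper. The sketch below assumes paired predictions on the same profiles with and without injected noise, and uses a paired t-test plus Levene's test as stand-ins; the authors' tests, and their attributional-stability metric in particular, may differ.

```python
import numpy as np
from scipy import stats

def bias_variance_shift(clean_risks, noisy_risks):
    """Compare risk predictions for the same profiles with and without injected noise.

    clean_risks / noisy_risks: paired arrays of predicted risk for identical patients,
    before and after clinically insignificant features are added.
    """
    clean = np.asarray(clean_risks, dtype=float)
    noisy = np.asarray(noisy_risks, dtype=float)

    # Bias: did mean predicted risk shift? Paired t-test on per-profile deltas.
    _, p_bias = stats.ttest_rel(noisy, clean)

    # Variance: did the spread of predictions widen? Levene's test tolerates non-normal data.
    _, p_var = stats.levene(noisy, clean)

    return {
        "mean_shift": float(noisy.mean() - clean.mean()),
        "p_bias": float(p_bias),
        "std_clean": float(clean.std(ddof=1)),
        "std_noisy": float(noisy.std(ddof=1)),
        "p_variance": float(p_var),
    }
```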
The prompt-design findings add another layer of concern. Different prompt framings produced model-dependent trajectories of instability: the same model responded differently to noise depending on how the clinical task was linguistically framed, and different models showed distinct sensitivity profiles. This interaction effect (model × prompt × noise) suggests that no universal mitigation strategy exists. You cannot simply craft a better prompt and expect robust performance across architectures. Differences between the proprietary models (Claude, GPT, Gemini) and the open-weight LLaMA appear to manifest as different failure modes under noisy conditions.
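One plausible way to probe such an interaction, assuming audit records shaped like those in the earlier sketch, is a factorial ANOVA; the paper's actual analysis may use different machinery.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def interaction_anova(records):
    """Three-way factorial ANOVA on audit records (model, framing, n_noise, risk).

    A significant model:framing:n_noise term would mean noise sensitivity depends jointly
    on which model is used and how the task is framed, i.e. no single prompt fix transfers.
    """
    df = pd.DataFrame(records)
    fit = ols("risk ~ C(model) * C(framing) * n_noise", data=df).fit()
    return sm.stats.anova_lm(fit, typ=2)
```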
Within the broader landscape of AI-in-healthcare research, this work occupies critical terrain. The field has rightfully focused on demographic bias, fairness metrics, and out-of-distribution robustness. But this study identifies a distinct vulnerability: contextual fragility. LLMs are trained on vast, heterogeneous text corpora where spurious correlations abound. When deployed on structured clinical tasks, their susceptibility to irrelevant features reflects deeper issues with how they weight information and construct reasoning chains. The psychiatric domain amplifies these concerns because psychiatric diagnosis itself is inherently probabilistic and context-dependent; the models are being asked to perform a task that challenges human clinicians, yet without the metacognitive awareness that allows experts to recognize uncertainty.
CuraFeed Take: This paper should function as a circuit-breaker for premature clinical deployment. The findings are not surprising to researchers familiar with adversarial robustness and prompt sensitivity, but their systematic documentation in the psychiatric domain is valuable precisely because psychiatry has become a proving ground for clinical LLM applications. The real insight here is methodological: the authors have operationalized "reliability auditing" as a pre-deployment requirement. This framework—varying irrelevant features, testing multiple prompt framings, measuring both bias and variance—should become standard practice before any LLM touches clinical data. What's particularly noteworthy is that the instability persists across model scale and architecture, suggesting the problem is not solvable through simple scaling or fine-tuning. The field needs either fundamentally different architectures for clinical reasoning (perhaps neuro-symbolic hybrids that explicitly separate clinical knowledge graphs from language understanding), or far more stringent validation protocols that treat LLM outputs as weak signals requiring human verification rather than as standalone decision-support tools. The models that performed "best" here still failed the stability test. That should concern anyone betting on LLM-driven psychiatric triage systems in the next 24 months.