ChatGPT Health: Stress Test Reveals Triage Risks & Safety Concerns

The recently launched ChatGPT Health, OpenAI’s foray into consumer health tools, is already facing scrutiny. A rigorous stress test, published in February 2026 in Nature Medicine, reveals significant gaps in its ability to accurately assess medical urgency, particularly at the extremes of the clinical spectrum. The findings, from a study conducted by researchers at Mount Sinai, raise crucial questions about the readiness of such systems for widespread public use.

Evaluating Triage Accuracy: A Complex Test

The study involved presenting ChatGPT Health with 60 clinician-authored patient scenarios, spanning 21 different medical conditions and factoring in 16 variables. This resulted in a total of 960 responses analyzed by the research team. The goal was to evaluate how well the system could triage patients – that is, determine the appropriate level of care and timeframe for medical attention. The results showed an “inverted U-shaped pattern” of performance: accuracy was highest for cases in the middle of the urgency range and dropped off considerably for either very minor or very serious conditions. Specifically, the system miscategorized 35% of non-urgent presentations and 48% of emergency situations.
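The response count is consistent with every scenario having been run once under each of the 16 variations. That fully crossed design is an inference from the totals, not something the article spells out. A minimal sketch of the arithmetic, with names chosen here purely for illustration:

```python
# Counts reported in the article. The fully crossed design (every scenario
# run under every condition) is an assumption inferred from the totals.
N_SCENARIOS = 60   # clinician-authored patient scenarios
N_CONDITIONS = 16  # variations applied to each scenario


def total_responses(n_scenarios: int, n_conditions: int) -> int:
    """Number of responses when every scenario is run under every condition."""
    return n_scenarios * n_conditions


print(total_responses(N_SCENARIOS, N_CONDITIONS))  # 960
```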

Perhaps most concerning was the finding that ChatGPT Health under-triaged 52% of what researchers defined as “gold-standard” emergency cases. This meant the system recommended a 24-to-48-hour evaluation for patients who, based on clinical expertise, should have been directed immediately to the emergency department. Examples included individuals presenting with diabetic ketoacidosis – a life-threatening complication of diabetes – and those experiencing impending respiratory failure. However, the system demonstrated accuracy in triaging more commonly recognized emergencies like stroke and anaphylaxis.

The Impact of Context and Bias

The study also explored how external factors influenced ChatGPT Health’s recommendations. Researchers found that when the information provided to the system included family or friends minimizing the patient’s symptoms – a test of the system’s susceptibility to “anchoring bias” – triage recommendations shifted significantly towards less urgent care. The odds of a less urgent recommendation increased by a factor of 11.7 (95% confidence interval: 3.7 to 36.6). This highlights the vulnerability of these systems to the way information is presented, and the potential for delayed care if patients rely solely on the tool’s assessment.
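An odds ratio multiplies odds, not probabilities, so its practical size depends on the starting point. The sketch below converts the reported figures into probabilities for a few baseline rates. The baselines are hypothetical values chosen for illustration, and the function name is not taken from the study.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability after multiplying the baseline odds by odds_ratio."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Reported in the article: odds ratio and its 95% confidence interval.
OR_POINT, OR_LOW, OR_HIGH = 11.7, 3.7, 36.6

# Hypothetical baseline rates of a less urgent recommendation (not study data).
for baseline in (0.05, 0.10, 0.25):
    point = apply_odds_ratio(baseline, OR_POINT)
    low = apply_odds_ratio(baseline, OR_LOW)
    high = apply_odds_ratio(baseline, OR_HIGH)
    print(f"baseline {baseline:.0%} -> {point:.0%} (interval {low:.0%} to {high:.0%})")
```

Under these assumptions, a 10% baseline chance of a less urgent recommendation would rise to roughly 57%, not to 117%, which is why odds ratios should not be read as simple multipliers of risk.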

Interestingly, the study found no significant effects related to patient race, gender, or barriers to care, although the researchers noted that the confidence intervals did not entirely rule out the possibility of clinically meaningful differences. This suggests that, at least within the parameters of this study, the system did not exhibit overt biases based on these demographic factors. However, the authors caution that further investigation is needed to fully understand potential disparities.

Crisis Intervention: An Unpredictable Response

The response of ChatGPT Health to presentations involving suicidal ideation was also found to be inconsistent. Crisis intervention messages were activated unpredictably, sometimes firing when patients described no specific method for self-harm, and other times remaining silent when a clear plan was articulated. This erratic behavior raises serious safety concerns about the system’s ability to provide appropriate support to individuals in crisis. The full study details are available in Nature Medicine.

What Does This Mean for Consumers?

These findings do not necessarily mean that ChatGPT Health is inherently unsafe, but they underscore the critical need for caution and validation before widespread deployment. It’s important to remember that these tools are not a substitute for professional medical advice. Triage, even for experienced clinicians, is a complex process that requires nuanced judgment and a thorough understanding of a patient’s individual circumstances. ChatGPT Health, like other AI-powered tools, is susceptible to errors and limitations.

The concept of triage itself involves categorizing patients based on the severity of their condition to prioritize care. Under-triage, as seen in this study, means that patients with serious conditions may not receive the timely attention they need, potentially leading to adverse outcomes. Conversely, over-triage – recommending emergency care for non-emergency situations – can strain healthcare resources.
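The distinction between under-triage and over-triage can be made concrete by comparing a recommendation with a clinician-assigned reference level on an ordered scale. The sketch below is illustrative only: the four-level scale, the function names, and the sample cases are assumptions made here, not the study’s actual categories or data.

```python
from enum import IntEnum


class Urgency(IntEnum):
    """Hypothetical ordinal triage scale; a higher value means more urgent."""
    SELF_CARE = 0
    ROUTINE = 1
    WITHIN_24_48_HOURS = 2
    EMERGENCY = 3


def classify(recommended: Urgency, gold_standard: Urgency) -> str:
    """Label a recommendation relative to the clinician reference level."""
    if recommended < gold_standard:
        return "under-triage"
    if recommended > gold_standard:
        return "over-triage"
    return "correct"


def under_triage_rate(pairs) -> float:
    """Share of (recommended, gold_standard) pairs that are under-triaged."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no cases supplied")
    misses = sum(1 for rec, gold in pairs if classify(rec, gold) == "under-triage")
    return misses / len(pairs)


# Made-up cases for illustration only; these are not study data.
cases = [
    (Urgency.WITHIN_24_48_HOURS, Urgency.EMERGENCY),
    (Urgency.EMERGENCY, Urgency.EMERGENCY),
    (Urgency.EMERGENCY, Urgency.ROUTINE),
    (Urgency.SELF_CARE, Urgency.SELF_CARE),
]
for rec, gold in cases:
    print(f"{rec.name} vs {gold.name}: {classify(rec, gold)}")
print(f"under-triage rate: {under_triage_rate(cases):.0%}")
```

Treating urgency as an ordered scale keeps the two error directions separate, which matters because they carry different costs: one risks delayed care, the other strains resources.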

Ongoing Research and Future Directions

The research team at Mount Sinai intends to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools. Their ongoing work will expand to include areas such as pediatric care, medication safety, and the use of non-English languages. This is crucial, as the performance of these systems may vary significantly across different populations and clinical contexts.

The study’s authors emphasize the need for prospective validation – real-world testing of the system’s performance in a clinical setting – before it is widely adopted. This would involve monitoring the system’s recommendations and comparing them to the assessments of qualified healthcare professionals. Further reporting on the study highlights the importance of understanding the limitations of these tools and ensuring that they are used responsibly.

Next Steps: Validation and Refinement

The path forward involves a multi-faceted approach. OpenAI and other developers of AI-powered health tools must prioritize rigorous testing and validation. Regulatory bodies may need to establish standards and guidelines for the development and deployment of these systems. And, perhaps most importantly, consumers need to be educated about the capabilities and limitations of these tools, and encouraged to consult with qualified healthcare professionals for any medical concerns.