ChatGPT Health Under-Triages Medical Emergencies, Study Warns
ChatGPT Health, an AI platform launched by OpenAI in January, regularly misses the need for urgent medical care and frequently fails to detect suicidal ideation, according to a new study. Experts warn this could "feasibly lead to unnecessary harm and death," as the system under-triaged more than half of the cases presented in simulations.
Independent Safety Evaluation Reveals Critical Flaws
The first independent safety evaluation of ChatGPT Health, published in the February edition of Nature Medicine, involved 60 realistic patient scenarios covering conditions from mild illnesses to emergencies. Lead author Dr. Ashwin Ramaswamy and his team generated nearly 1,000 responses by asking the platform for advice under various conditions, such as changing patient gender or adding test results.
Three independent doctors reviewed each scenario and agreed on the necessary level of care based on clinical guidelines. When comparing ChatGPT Health's recommendations to these assessments, the platform performed well in textbook emergencies like stroke or severe allergic reactions but struggled in other situations.
Alarming Failure Rates in Critical Cases
In 51.6% of cases where immediate hospital visits were medically necessary, ChatGPT Health advised staying home or booking a routine appointment. Alex Ruani, a doctoral researcher in health misinformation mitigation at University College London, described this as "unbelievably dangerous." She emphasized, "If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life."
In one asthma scenario, the platform identified early warning signs of respiratory failure but still recommended waiting rather than seeking emergency treatment. Ruani noted that in simulations, more than eight times out of 10 (84%), ChatGPT Health sent a suffocating woman to a future appointment she wouldn't live to see. Conversely, in 64.8% of cases where no urgent care was needed, the platform incorrectly advised seeking immediate medical attention.
Suicidal Ideation Detection Inconsistent
Dr. Ramaswamy expressed particular concern over the platform's under-reaction to suicidal ideation. In a scenario involving a 27-year-old patient describing thoughts of taking pills, the crisis-intervention banner appeared every time when symptoms were presented alone. However, when normal lab results were added, the banner vanished in all 16 attempts. Ramaswamy stated, "A crisis guardrail that depends on whether you mentioned your labs is not ready, and it's arguably more dangerous than having no guardrail at all."
External Influences and Safety Concerns
The study also found that ChatGPT Health was nearly 12 times more likely to downplay symptoms if a "friend" in the scenario suggested it was nothing serious. Ruani highlighted the need for urgent development of clear safety standards and independent auditing mechanisms to reduce preventable harm.
Prof. Paul Henman, a digital sociologist and policy expert at the University of Queensland, called the paper "really important," noting that reliance on the tool could lead to unnecessary medical presentations or failures to obtain urgent care. He also raised concerns about legal liability, referencing ongoing cases against tech companies over suicide and self-harm linked to AI chatbots.
OpenAI's Response and Future Implications
A spokesperson for OpenAI welcomed independent research but argued that the study did not reflect typical real-life usage, adding that the model is continuously updated. Ruani countered that "a plausible risk of harm is enough to justify stronger safeguards and independent oversight."
With over 40 million people reportedly asking ChatGPT for health-related advice daily, the findings underscore critical gaps in AI-driven healthcare tools. Experts stress the need for transparency in training data and guardrails to prevent potential tragedies.
