Why consumer AI chatbots stumble on medical diagnosis
Don’t trust your favorite consumer-grade LLM chatbot with your health decisions. If you can’t supply complete and accurate information in the first place, don’t expect an accurate diagnosis. Despite warnings across these products to consult a doctor or other medical professional, many people place false hope in a chatbot anyway.
New research makes a simple point: these chatbots are flawed when used as stand-ins for doctors. The study shows they break down particularly when patient information is incomplete. That matters because early clinical reasoning is messy and often starts with gaps.
Researchers found that leading large language models tend to collapse to a single diagnosis too quickly when data is thin. They rarely offer a robust differential list at the open-ended beginning of a case. That narrowing raises the odds of missing important alternative possibilities.
The limitation is structural: models can often name a final diagnosis once all the pieces are on the table, but they struggle when the puzzle is only partly built. In real-world care you rarely begin with perfect information. Relying on a chatbot’s early suggestions can steer decisions the wrong way.
“These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information,” said Arya Rao, the study’s lead author and a researcher at the Massachusetts-based Mass General Brigham healthcare system.
The paper, in JAMA Network Open, tested models using 29 clinical vignettes taken from a standard medical reference. Investigators revealed case details in stages—history, exam findings, then lab results—to mimic how clinicians gather information. They asked the LLMs diagnostic questions at each step and tracked how often answers were incomplete or wrong.
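To make the staged-disclosure setup concrete, here is a minimal sketch of how such an evaluation loop might be structured. This is not the study’s actual code: the vignette fields, the query_model stub and the crude coverage score are all hypothetical placeholders standing in for a real LLM API call and the paper’s grading rubric.

```python
# Illustrative sketch of a staged-disclosure evaluation loop (hypothetical,
# not the study's code). Each vignette is revealed stage by stage and the
# model is asked for a differential diagnosis after every stage.

STAGES = ["history", "exam_findings", "lab_results"]

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM API call; returns a canned answer here."""
    return "possible diagnoses: community-acquired pneumonia, bronchitis"

def evaluate(vignette: dict, reference_diagnoses: set) -> dict:
    """Ask for a differential after each disclosure stage and score coverage."""
    disclosed = []
    results = {}
    for stage in STAGES:
        disclosed.append(f"{stage}: {vignette[stage]}")
        prompt = ("Given the information so far, list the most likely "
                  "diagnoses:\n" + "\n".join(disclosed))
        answer = query_model(prompt)
        # Crude scoring: fraction of reference diagnoses the answer mentions.
        hits = sum(1 for dx in reference_diagnoses if dx.lower() in answer.lower())
        results[stage] = hits / len(reference_diagnoses)
    return results

example = {"history": "3 days of fever and productive cough",
           "exam_findings": "crackles at the right lung base",
           "lab_results": "elevated white cell count"}
print(evaluate(example, {"community-acquired pneumonia", "bronchitis"}))
```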
The team evaluated 21 LLMs, including models from OpenAI, Anthropic, Google, xAI and DeepSeek. When the models were asked to produce a differential diagnosis from limited information, failure rates exceeded 80 percent for every model. Those are not minor misses; they reflect an inability to cover plausible alternatives early on.
As more case details were revealed, performance improved: failure rates for final diagnoses dropped below 40 percent, and the top systems exceeded 90 percent accuracy once the full dataset was provided. That split shows these tools can help confirm a well-specified conclusion but are weak at exploratory clinical reasoning.
Anthropic says Claude is trained to direct people who ask medical questions to professionals. Google says Gemini has reminders built into its app to prompt users to double-check information. OpenAI’s usage policy says its services should not be used to provide medical advice requiring a licence without appropriate professional involvement.
xAI did not respond to a request for comment. DeepSeek could not be reached for comment. Companies are also building specialized medical models like Google’s Articulate Medical Intelligence Explorer (AMIE) and tools such as MedFound aimed at clinical tasks.
Early results from evaluations of purpose-built medical models such as AMIE were promising, said Sanjay Kinra, a clinical epidemiologist at the London School of Hygiene & Tropical Medicine. But those tools were unlikely to match how doctors’ clinical assessments “rely heavily on the look and feel of the patient”, he added.
Kinra also noted a practical angle: “Nevertheless, they may have a role to play, particularly in situations or geographies in which access to doctors is limited. So we urgently need research studies with actual patients from those settings.”
The takeaway is practical and narrow: these chatbots can assist when cases are well defined, but they are not replacements for hands-on clinical care. Users and developers should treat early diagnostic suggestions from consumer LLMs as provisional. Clinical outcomes depend on human judgment, bedside assessment and follow-up testing, not just a confident-sounding reply.
