Frontier AI Models Are Doing Something Absolutely Bizarre When Asked to Diagnose Medical X-Rays

Hallucinations have plagued OpenAI ever since it launched its blockbuster ChatGPT chatbot back in 2022.

The propensity of large language models to sound both plausible and confident about outputs that are totally wrong remains a major thorn in the side of execs who claim the AI boom is both bigger and faster than the industrial revolution.

The problem still haunts even the most sophisticated AI models today, and experts warn it's unlikely to be resolved any time soon, if ever.

It's a particularly troublesome reality in a healthcare setting, from Google's AI Overviews feature giving out dangerous "health" advice to hospitals deploying transcription tools that invent nonexistent medications and more.

And when it comes to analyzing radiology scans, an application for AI long championed by its advocates in the healthcare industry, the situation becomes even more concerning.

As detailed in a new, yet-to-be-peer-reviewed paper, a team of researchers at Stanford University found that frontier AI models readily generated "detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided."

In other words, the AI models happily came up with answers to questions about a supposedly accompanying image, even when the researchers never showed them one.

To distinguish the behavior from hallucinations, which involve AI models arbitrarily filling in gaps within a logical framework, the team coined a new term for the phenomenon: "mirage reasoning."

The effect "involves constructing a false epistemic frame, i.e., describing a multi-modal input never provided by the user and basing the rest of the conversation on that, therefore changing the context of the task at hand," the researchers wrote in their paper.

The damning findings suggest AI models cheat by diving into the data they were given and filling in the rest based on probability, even if it's almost entirely conjecture.

"What we try to show is that even on the best benchmarks, although a question would seem unsolvable for a human, the LLMs might still be able to leverage question-level and dataset-level patterns behind it and use general statistics and prevalence data to answer them right, while also learning to talk 'as if' they were seeing the image," coauthor and Stanford PhD student Mohammad Asadi told Futurism.

In other words, "we are underestimating how much information could be hidden in a sentence or a question if you (the LLM) are trained on all of the internet," he added. "To conclude, we believe that the AI models are able to use their super-human memory and language skills to hide their weaknesses in multimodal understanding (and by talking like [they] are actually doing multi-modal reasoning)."

Asadi and his colleagues are calling for an overhaul of existing benchmarks to avoid negative consequences, particularly "in medical contexts where miscalibrated AI carries the greatest consequence."

In one experiment, the team came up with a new benchmark that consists of visual questions across "medicine, science, technical, and general visual understanding," but with the images removed.

They found that all of the frontier models they tested, including OpenAI's GPT-5, Google's Gemini 3 Pro, and Anthropic's Claude Opus 4.5, confidently provided "descriptions of visual details."

"In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images," the researchers wrote in the paper.

In another experiment, the team challenged the AI models to "guess answers without image access, rather than being implicitly prompted to assume images were present," which resulted in a major hit to performance, suggesting they fared much better when not made aware they were lacking vital data.

"Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided," the researchers wrote.
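The difference between the two regimes comes down to prompt framing. The templates below are a hypothetical illustration, not the authors' actual prompts: one framing implicitly assumes an attached image, while the other explicitly tells the model no image is available.

```python
# Illustrative sketch of the two prompt regimes described in the paper.
# These templates are assumptions for demonstration, not the study's
# actual wording.

# "Mirage" regime: the question is posed as if an image were attached,
# inviting the model to describe a scan it never received.
MIRAGE_TEMPLATE = (
    "Look at the chest X-ray and answer the question.\n"
    "Question: {question}\n"
    "Answer:"
)

# Explicit-guessing regime: the model is told up front that no image
# exists, which the paper found triggers more conservative behavior.
EXPLICIT_GUESS_TEMPLATE = (
    "No image is provided. Using only prior knowledge and general "
    "statistics, guess the most likely answer.\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(question: str, regime: str) -> str:
    """Render a text-only prompt in either the 'mirage' or 'guess' regime."""
    template = MIRAGE_TEMPLATE if regime == "mirage" else EXPLICIT_GUESS_TEMPLATE
    return template.format(question=question)
```

Under the paper's findings, the same question routed through these two framings can produce very different accuracy, even though the model receives no image in either case.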

"The benchmark tested in the super-guesser experiment, ReXVQA, is actually one of the best and most comprehensive benchmarks for chest radiology available, spanning a wide range of tasks and questions," Asadi told Futurism.

To address the issue, the researchers argued that "improved benchmarks would need to be evaluated more rigorously." However, that could prove difficult, as "on some level, every benchmark will inevitably become susceptible to this over time, since the test set questions might leak into the large [pretraining] data the moment they appear on the internet."

Asadi and his colleagues came up with a new framework, dubbed "B-Clean," which involves identifying and removing any "compromised questions, including, but not limited to, vision-independent, prior knowledge answerable, and data-contaminated questions." The idea is to ultimately test models on the remaining questions that "none of the candidate models could answer without visual input, enabling a fair, vision-grounded comparison."
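The core filtering step described above can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' implementation: the function name, data shapes, and the `ask_without_image` callable are all assumptions made for the example.

```python
# Hypothetical sketch of a B-Clean-style filter: discard any question
# that some candidate model can answer correctly WITHOUT the image
# (via priors, prevalence statistics, or data contamination), keeping
# only questions that genuinely require visual input.

def b_clean(questions, models, ask_without_image):
    """Return the vision-grounded subset of a benchmark.

    questions: list of dicts with "prompt" and "answer" keys (assumed shape)
    models: list of candidate model identifiers
    ask_without_image: callable(model, prompt) -> predicted answer string,
        querying a model with the text alone (placeholder, not a real API)
    """
    vision_grounded = []
    for q in questions:
        # A question is "compromised" if ANY model gets it right blind.
        solvable_blind = any(
            ask_without_image(m, q["prompt"]) == q["answer"] for m in models
        )
        if not solvable_blind:
            vision_grounded.append(q)
    return vision_grounded
```

The design choice is conservative on purpose: a single blind success by any candidate model is enough to disqualify a question, so the surviving set supports a fair comparison of actual visual understanding.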

While Asadi admitted that it's "hard to discuss every possible real-world implication," it's an alarming finding that comes as hospital execs continue to push for replacing radiologists with AI.

If "deployed without sufficient guardrails in place, this might result in alarming false positives at any instance where there is a failure in the multimodal processing, especially in the currently growing 'agentic systems' in which such a mistake from a small model could propagate through the whole system and cause unforeseen outcomes," Asadi told Futurism.

Itโ€™s part of a much broader breakdown in trust when it comes to handing over high-risk tasks to AI.

"Another implication is that, now that we know an AI can say 'I see evidence of malignant melanoma on your skin' without even having access to any images, how much can we trust it when it says the same while actually seeing the image?" Asadi posited. "We definitely need more effort being put in safety and alignment of such models, and might need to think twice before deploying them in user/patient-facing systems."

"On a high level, I would [say] our message is that although AI is great, its superhuman capabilities in some skills (such as language) should not be mistaken for an ability in other tasks," he concluded. "The number one [takeaway] would be that just because the AI is saying, very convincingly, that it is seeing something, it doesn't mean that it is actually seeing that."

More on AI and radiology: Doctors Horrified After Google's Healthcare AI Makes Up a Body Part That Does Not Exist in Humans
