Frontier AI Models Are Doing Something Absolutely Bizarre When Asked to Diagnose Medical X-Rays

Hallucinations have plagued OpenAI ever since it launched its blockbuster ChatGPT chatbot back in 2022.

The propensity of large language models to sound both plausible and confident about outputs that are totally wrong remains a major thorn in the side of execs who claim the AI boom is both bigger and faster than the industrial revolution.

The problem still haunts even the most sophisticated AI models today, and experts warn it's unlikely to be resolved any time soon, if ever.

It's a particularly troublesome reality in a healthcare setting, from Google's AI Overviews feature giving out dangerous "health" advice to hospitals deploying transcription tools that invent nonexistent medications and more.

And when it comes to analyzing radiology scans, an application for AI long championed by its advocates in the healthcare industry, the situation becomes even more concerning.

As detailed in a new, yet-to-be-peer-reviewed paper, a team of researchers at Stanford University found that frontier AI models readily generated "detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided."

In other words, the AI models happily came up with answers to questions about a supposedly accompanying image, even when the researchers never showed them one.

To distinguish the behavior from hallucinations, which involve AI models arbitrarily filling in gaps within a logical framework, the team coined a new term for the phenomenon: "mirage reasoning."

The effect "involves constructing a false epistemic frame, i.e., describing a multi-modal input never provided by the user and basing the rest of the conversation on that, therefore changing the context of the task at hand," the researchers wrote in their paper.

The damning findings suggest AI models cheat by diving into the data they were given and filling in the rest based on probability, even if it's almost entirely conjecture.

"What we try to show is that even on the best benchmarks, although a question would seem unsolvable for a human, the LLMs might still be able to leverage question-level and dataset-level patterns behind it and use general statistics and prevalence data to answer them right, while also learning to talk 'as if' they were seeing the image," coauthor and Stanford PhD student Mohammad Asadi told Futurism.

In other words, "we are underestimating how much information could be hidden in a sentence or a question if you (the LLM) are trained on all of the internet," he added. "To conclude, we believe that the AI models are able to use their super-human memory and language skills to hide their weaknesses in multimodal understanding (and by talking like [they] are actually doing multi-modal reasoning)."

Asadi and his colleagues are calling for an overhaul of existing benchmarks to avoid negative consequences, particularly "in medical contexts where miscalibrated AI carries the greatest consequence."

In one experiment, the team came up with a new benchmark that consists of visual questions across "medicine, science, technical, and general visual understanding," but with the images removed.

They found that all of the frontier models they tested, including OpenAI's GPT-5, Google's Gemini 3 Pro, and Anthropic's Claude Opus 4.5, confidently provided "descriptions of visual details."

"In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images," the researchers wrote in the paper.

In another experiment, the team challenged the AI models to "guess answers without image access, rather than being implicitly prompted to assume images were present," which resulted in a major hit to performance, suggesting they fared much better when not made aware they were lacking vital data.

"Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided," the researchers wrote.
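The difference between the two regimes comes down to prompt framing. The templates below are a hypothetical illustration, not the authors' actual prompts: one framing implicitly assumes an attached image, while the other explicitly tells the model no image is available.

```python
# Illustrative sketch of the two prompt regimes described in the paper.
# These templates are assumptions for demonstration, not the study's
# actual wording.

# "Mirage" regime: the question is posed as if an image were attached,
# inviting the model to describe a scan it never received.
MIRAGE_TEMPLATE = (
    "Look at the chest X-ray and answer the question.\n"
    "Question: {question}\n"
    "Answer:"
)

# Explicit-guessing regime: the model is told up front that no image
# exists, which the paper found triggers more conservative behavior.
EXPLICIT_GUESS_TEMPLATE = (
    "No image is provided. Using only prior knowledge and general "
    "statistics, guess the most likely answer.\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(question: str, regime: str) -> str:
    """Render a text-only prompt in either the 'mirage' or 'guess' regime."""
    template = MIRAGE_TEMPLATE if regime == "mirage" else EXPLICIT_GUESS_TEMPLATE
    return template.format(question=question)
```

Under the paper's findings, the same question routed through these two framings can produce very different accuracy, even though the model receives no image in either case.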

"The benchmark tested in the super-guesser experiment, ReXVQA, is actually one of the best and most comprehensive benchmarks for chest radiology available, spanning a wide range of tasks and questions," Asadi told Futurism.

To address the issue, the researchers argued that "improved benchmarks would need to be evaluated more rigorously." However, that could prove difficult, as "on some level, every benchmark will inevitably become susceptible to this over time, since the test set questions might leak into the large [pretraining] data the moment they appear on the internet."

Asadi and his colleagues came up with a new framework, dubbed "B-Clean," which involves identifying and removing any "compromised questions, including, but not limited to, vision-independent, prior knowledge answerable, and data-contaminated questions." The idea is to ultimately test models on the remaining questions that "none of the candidate models could answer without visual input, enabling a fair, vision-grounded comparison."
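The core filtering step described above can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' implementation: the function name, data shapes, and the `ask_without_image` callable are all assumptions made for the example.

```python
# Hypothetical sketch of a B-Clean-style filter: discard any question
# that some candidate model can answer correctly WITHOUT the image
# (via priors, prevalence statistics, or data contamination), keeping
# only questions that genuinely require visual input.

def b_clean(questions, models, ask_without_image):
    """Return the vision-grounded subset of a benchmark.

    questions: list of dicts with "prompt" and "answer" keys (assumed shape)
    models: list of candidate model identifiers
    ask_without_image: callable(model, prompt) -> predicted answer string,
        querying a model with the text alone (placeholder, not a real API)
    """
    vision_grounded = []
    for q in questions:
        # A question is "compromised" if ANY model gets it right blind.
        solvable_blind = any(
            ask_without_image(m, q["prompt"]) == q["answer"] for m in models
        )
        if not solvable_blind:
            vision_grounded.append(q)
    return vision_grounded
```

The design choice is conservative on purpose: a single blind success by any candidate model is enough to disqualify a question, so the surviving set supports a fair comparison of actual visual understanding.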

While Asadi admitted that it's "hard to discuss every possible real-world implication," it's an alarming finding that comes as hospital execs continue to push for replacing radiologists with AI.

If "deployed without sufficient guardrails in place, this might result in alarming false positives at any instance where there is a failure in the multimodal processing, especially in the currently growing 'agentic systems' in which such a mistake from a small model could propagate through the whole system and cause unforeseen outcomes," Asadi told Futurism.

Itโ€™s part of a much broader breakdown in trust when it comes to handing over high-risk tasks to AI.

"Another implication is that, now that we know an AI can say 'I see evidence of malignant melanoma on your skin' without even having access to any images, how much can we trust it when it says the same while actually seeing the image?" Asadi posited. "We definitely need more effort being put in safety and alignment of such models, and might need to think twice before deploying them in user/patient-facing systems."

"On a high level, I would [say] our message is that although AI is great, its superhuman capabilities in some skills (such as language) should not be mistaken for an ability in other tasks," he concluded. "The number one [takeaway] would be that just because the AI is saying, very convincingly, that it is seeing something, it doesn't mean that it is actually seeing that."

More on AI and radiology: Doctors Horrified After Google's Healthcare AI Makes Up a Body Part That Does Not Exist in Humans
