Can you spot a deepfake X-ray? Neither can your radiologist.

These images aren’t hard to generate: The researchers used simple prompts to get ChatGPT to spit out X-rays with a specified anatomical location, disorder, and level of noise. But as easy as it is for the models to create convincing radiographs, they can’t reliably detect them. Four multimodal models, including the one that generated the images, accurately distinguished deepfakes just 57 percent to 85 percent of the time.

“If you give back to ChatGPT the same image, it won’t be able to say for sure this is AI and this is not AI,” said lead author Mickael Tordjman, a postdoctoral fellow in the BioMedical Engineering and Imaging Institute at Mount Sinai Hospital in New York. “Which is kind of disturbing.”

While there isn’t evidence that deepfake radiographs have caused disruption in health systems yet, some radiologists say now is the time to make sure that can’t happen. “The widespread availability of deepfake technology risks accelerating the erosion of public trust in medical institutions,” wrote two radiologists in an associated editorial.

STAT spoke with Tordjman about where deepfakes may introduce risk in medicine, and who’s responsible for minimizing it. The following interview has been condensed for clarity and length.

What inspired you to test radiologists’ ability to detect deepfake X-rays?

ChatGPT released this new image generation tool that attracted a lot of interest online. We experimented and tried to see if the images made by ChatGPT were realistic or not — and we noticed that they were surprisingly very realistic. They were almost as good as dedicated models that can generate images, but are trained on millions and millions of images.

And so we thought, maybe we should ask radiologists: Can you differentiate real and fake images? We found these synthetic radiographs generated by LLMs, in particular ChatGPT, are not really easy to differentiate from real radiographs — for experienced or non-experienced radiologists, but also for the LLMs themselves.

So why wasn’t ChatGPT able to consistently identify its own output? There’s no signature?

No, there are no metadata. One of my friends told me today that maybe they are in the process of adding some kind of metadata to show which images were generated by the LLM, but at the time of the study there was no metadata embedded in the image. So if the image is realistic enough, even large language models won’t be able to say this is AI or this is not — especially because most of these models don’t have a memory.

When I tried the quiz, I scored around 75 percent. I’m obviously not a radiologist, so I was surprised I performed similarly to the trained radiologists in your study.

Did you look into the article before trying to do this?

By looking at the articles, the figures, the description, you already kind of trained. So you have improved a lot compared to people that are not aware of any of these signs that can give away the AI image. Some of the radiologists weren’t even aware that ChatGPT could generate images.

You generated the deepfakes yourself, you’ve seen a lot of them. Do you think you’d catch 100 percent of them at this point?

A few days ago I tried just for fun. I was around 85 percent to 90 percent. So even myself, I cannot tell which ones are real or fake for sure.

It was also surprising that radiologists weren’t much better at distinguishing ChatGPT-generated X-rays from those created by a dedicated radiology chest X-ray model. What distinguishes those two approaches?

For most of the image generators that were trained specifically on medical imaging, they are trained on one specific organ. For example, in this study we included RoentGen, which is trained on chest X-ray: It can only generate chest X-ray. They use large databases, either public databases or private databases from hospitals to train.

With very large language models with image generation capabilities, you can generate a chest X-ray — but also hand X-rays, shoulder X-rays, whatever you want. And for ChatGPT or Gemini, we don’t really know what they are using in their training dataset.

That’s striking, that something explicitly trained for radiology looks so similar to something that’s just trained on whatever happens to be on the internet.

All of these large language models are trained on such a vast amount of data. They have millions and millions of whatever is available on the internet. Of course they will be very good even at tasks they were not initially designed to do.

How much of an issue is this realistically going to create within the medical field? Are radiologists really the ones that will need to contend with accurately identifying deepfake medical images?

Radiologists are the ones that are more exposed. But any M.D. or anyone who is prepared to see medical images should be aware of this. Anyone could be confronted at some point with a fake X-ray and should be able to differentiate real and fake.

There’s a lot of hacking of health systems going on in the past few years, where hackers try to steal patients’ data. And in the future, it’s very possible that they will try to inject fake medical data — which will be even worse than stealing, because then you will not be able to differentiate which part of the medical chart is real and which part is synthetic.

In the absence of really deliberate, malicious activity like that, there’s no normal way that a deepfake medical image could be introduced into the medical record, right?

For now there is no way. But this is the issue with the hacker world, the technology is always evolving. Some people use deepfakes to trick people to give them money, so I don’t see why it would not happen with medical images as well.

Don’t give them any ideas.

Yeah, of course.

Is there any evidence that synthetic images like these are already causing this kind of harm or disruption?

I don’t think so. But there are companies already working on solutions to detect synthetic medical images. So I think this is an issue that everyone is considering very seriously. If they are developing solutions, they are expecting that risk.

You also raised the issue of potential fraud in your paper.

Yes, so this has happened a few times, where a person tried to make some fake insurance claims, for example, and then they used a radiograph from someone else. Now you don’t need to use a radiograph from someone else, you can just generate a fake radiograph with the specific disorder you want.

Are there other areas of medicine where you think deepfakes are a particular concern?

Any kind of medical data could be introduced in patient charts. Let’s say a PDF of a clinical note of a doctor could be completely faked and introduced within the patient chart. This is something that could be created really easily with these LLMs.

So there are radiology AI companies that market the ability to create synthetic datasets for research purposes — say you want to study a rare finding that you don’t have many real radiographs of. What’s the distinction between that and ChatGPT’s synthetic images?

The difference is that the radiology AI companies are more advanced, of course, compared to what ChatGPT, Gemini, or whatever LLM can generate. And they will be, probably, more realistic and more dedicated to a specific disorder.

There are a lot of good reasons to use synthetic images. You can increase the data set to train AI models to do some tasks — for example in medical images to segment lesions or do fracture recognition. Or even for education purposes — if you want to train a student, you can create a fake X-ray with a specific disorder and show the anatomy. You don’t need to find the specific case in your system because you can generate it.

Are there any risks to the validity of those specialized radiology models if they were to be polluted with deepfakes? And what’s the likelihood of that happening — you’d probably have to have a malicious actor injecting the deepfakes there too, right?

You told me not to give them ideas, but you are currently giving them ideas. But yes, of course, if someone was to generate fake images, they could completely disturb an existing AI tool and decrease its performance, because it will learn based on fake pathologies. Let’s say the malicious actor would introduce some fake images that are not realistic at all, then the performance will be completely destroyed.

Tell me about what happens on the other side of these LLMs right now — when a patient uploads their own real X-ray for the model to interpret.

ChatGPT, Gemini, or Claude, if you give them an X-ray, they won’t give you a medical answer that is correct. Even based on their training dataset that is huge, they are still not good enough to give a good interpretation. Sometimes they don’t even give the right anatomy. If you put in an arm X-ray, they could say, “Oh, this is a leg.”

So radiologists couldn’t tell the difference, models couldn’t tell the difference. So what steps does the field reasonably need to take to protect against those risks that we’ve talked about?

The best solution would be watermarking. You can add and embed in the image some kind of hidden clue that the image was created by a large language model like ChatGPT. You could also be able to differentiate the real image from the fake image by putting a watermark with the name of the hospital or with the names of the technicians that did the X-ray.

With models improving, it will be harder and harder to differentiate them. So we will need at some point some kind of watermark.

Until then, what do you hope people will take away from your quiz?

Specialist education is a good solution in the meantime. There is a way to improve really quickly when looking at these fake X-rays. Once you are aware of these clues, it’s kind of easy to differentiate most of them. Then if you are a radiologist — but even if you are not, like, let’s say you’re an orthopedic surgeon or rheumatologist or neurologist — you can say, OK, this is weird, let me take a closer look.
