Image-driven fact-checking of AI-generated chest radiology reports
Authors
Mahmood, Raziuddin
Issue Date
2025-08
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Biomedical engineering
Abstract
With the developments in radiology artificial intelligence (AI), many researchers have turned to the problem of automated reporting of imaging studies. The goal of such work is to
produce a preliminary read of imaging studies in locations such as emergency rooms where
a radiologist may not be readily available, or to present a preliminary structured report
to radiologists to reduce their dictation workload. An automatically produced structured
report could also be more consistent and easier to read, leading to improved accuracy and
lower overall costs of radiology reads in clinical workflows.
Among the imaging areas where this has been found most useful are chest X-rays,
which are the most common imaging modality read by radiologists in hospitals and tele-
radiology practices today. With the recent rise of generative AI, a number of researchers and
corporations are attempting to generate preliminary reports for chest X-ray images thanks
to the availability of relatively large datasets such as MIMIC and CheXpert that come with
their companion reports for training large vision-language models (VLMs). These newly
emerged VLMs can generate longer and more natural sentences when prompted with good
radiology-specific linguistic cues. However, despite their powerful language generation capabilities, ensuring that there are no hallucinations, that is, incorrect mentions of findings or of their descriptions, has proven difficult for these models, limiting their clinical applicability. While methods for
hallucination removal and fact-checking exist for large language models, with strategies such
as direct preference optimization (DPO) or proximal policy optimization (PPO), and reward
models, they are mostly applicable during training or fine-tuning of the models. On the
other hand, methods that check facts during inference time often consult external general
knowledge or detect errors through analysis of produced text either by themselves or through
an LLM serving as a judge. In radiology report generation, however, neither is possible since
the report has to be specific to the patient and consistent with the evidence seen in the
imaging. Since the automated reporting LLMs themselves have hallucinations, there are no
teacher LLMs that are good enough to correct automatically generated radiology reports.
Further, they may not be able to corroborate their deductions with the patient-specific image. Finally, any fact-checking should be agnostic to the report generation tool, so that it remains versatile in clinical deployments where different vendors, each with separately evolving capabilities, may be in use. Thus, there is a need to develop an independent fact-checking method for use during clinical inference to bootstrap radiology report generation and increase its adoption in clinical workflows.
This Master’s thesis investigates the hypothesis that it is possible to develop such independent discriminative neural networks as fact-checking models for use during inference to
detect and correct errors in automatically generated reports. The key idea explored in the
thesis is that by creating a synthetic dataset of real and fake findings derived from ground
truth reports and pairing them with the corresponding chest X-ray images, a fact-checking
classifier could be trained to distinguish correct descriptions of findings from incorrect ones when paired with the corresponding images. Such
an independently developed classifier can then be used to detect and correct errors in the
reports generated by automated radiology reporting tools.
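Conceptually, such a fact-checker is a binary classifier over image-finding pairs. The following minimal sketch trains a tiny logistic-regression model on toy two-dimensional features standing in for image-text agreement scores; the feature design and training loop here are purely illustrative assumptions, not the thesis's actual neural architecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_fact_checker(examples, epochs=200, lr=0.5):
    """Logistic regression over (feature_vector, label) pairs,
    where label 1 = finding consistent with the image, 0 = fake."""
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(model, x):
    """Probability that the finding is consistent with the image."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy features standing in for (image, finding-text) embedding agreement:
# high scores for real pairs, low scores for perturbed fakes.
data = [([0.9, 0.8], 1), ([0.85, 0.9], 1), ([0.1, 0.2], 0), ([0.2, 0.1], 0)]
model = train_fact_checker(data)
```

In the thesis setting, the toy feature vectors would be replaced by learned representations of the chest X-ray and the candidate finding description.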
To verify this hypothesis, the thesis is divided into four investigations. First, by examining several radiology reporting methods, we analyze the types of errors made by report generators and identify four major error types: irrelevant predictions; polarity reversals or omissions; incorrect location predictions; and other errors such as incorrect severity assessments. We then simulate these errors to create a large synthetic dataset by perturbing findings and their locations in ground-truth reports, yielding real and fake finding-location pairs matched with images. Next, we build a discriminative classifier to detect and remove finding errors in reports using two different methods, one based on the findings alone and the other also capturing their spatial locations.
Finally, we develop methods to correct the automated report while still ensuring language
correctness by careful prompting of a large language model using information derived from
the fact-checking model.
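The error simulation step could be sketched as follows. The tuple representation of a finding, the vocabulary lists, and the perturbation functions are hypothetical illustrations chosen for this sketch, not the thesis's actual implementation.

```python
import random

# A finding is represented here as (name, location, polarity);
# this tuple schema is an assumption made for illustration.
IRRELEVANT_FINDINGS = ["rib fracture", "pneumothorax"]
LOCATIONS = ["left lower lobe", "right upper lobe", "cardiac silhouette"]

def perturb(finding, error_type, rng):
    """Return a 'fake' version of a ground-truth finding."""
    name, location, polarity = finding
    if error_type == "irrelevant":
        # Introduce a finding unsupported by the image.
        return (rng.choice(IRRELEVANT_FINDINGS), location, polarity)
    if error_type == "polarity":
        # Reverse presence/absence (polarity reversal or omission).
        return (name, location, "absent" if polarity == "present" else "present")
    if error_type == "location":
        # Move the finding to a wrong anatomical location.
        wrong = [l for l in LOCATIONS if l != location]
        return (name, rng.choice(wrong), polarity)
    if error_type == "severity":
        # Other errors, e.g. an incorrect severity qualifier.
        return (name + " (severe)", location, polarity)
    raise ValueError(error_type)

def build_pairs(report_findings, image_id, seed=0):
    """Pair each real finding (label 1) with one perturbed fake (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for f in report_findings:
        pairs.append((image_id, f, 1))
        etype = rng.choice(["irrelevant", "polarity", "location", "severity"])
        pairs.append((image_id, perturb(f, etype, rng), 0))
    return pairs

pairs = build_pairs([("pleural effusion", "left lower lobe", "present")], "img_001")
```

Each perturbation type corresponds to one of the four major error categories identified in the first investigation, so a classifier trained on such pairs sees examples of every error mode.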
Throughout, we conduct experiments with multiple benchmark datasets and perform ablation studies to select relevant architectural configurations and document the overall
improvement in the quality of the report by the use of our fact-checking model to detect
and correct errors. We develop a novel measure for assessing report correctness that leverages both clinical accuracy and phrase grounding accuracy, and we generate explainable visualizations showing how the reported findings and their locations deviate from those predicted by the fact-checking model.
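The exact definition of the correctness measure is specific to the thesis; one plausible form, shown here purely as an illustration, blends a clinical-accuracy term with a phrase-grounding term based on bounding-box overlap. The weighting parameter `alpha` and the input schema are assumptions of this sketch.

```python
def grounding_iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

def report_correctness(findings, alpha=0.5):
    """Hypothetical combined score. `findings` is a list of
    (clinically_correct: bool, reported_box, reference_box) tuples."""
    if not findings:
        return 0.0
    clinical = sum(ok for ok, _, _ in findings) / len(findings)
    grounding = sum(grounding_iou(rb, gb) for _, rb, gb in findings) / len(findings)
    return alpha * clinical + (1 - alpha) * grounding
```

A report whose findings are both clinically correct and well localized scores near 1.0, while a report with wrong findings in wrong locations scores near 0.0, which is the behavior any such combined measure would need.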
The overall results indicated that it was possible to develop a fact-checking model using an independently collected dataset of real and fake findings to simulate the errors made by
report generators. The resulting fact-checking model was over 90% accurate as tested on
multiple benchmark datasets and led to improvement in the quality of the automatically
generated reports in the range of 7-29%. A high degree of concordance was found between
the use of our fact-checking model and ground truth for verification of automated reports
leading us to also conclude that the fact-checking model has the potential to serve as a
surrogate ground truth during clinical inference. This demonstrates the further utility of our model as an additional validation checkpoint in making AI models robust and ready for clinical workflows.
Description
August 2025
School of Engineering
Publisher
Rensselaer Polytechnic Institute, Troy, NY