Image-driven fact-checking of AI-generated chest radiology reports

Authors
Mahmood, Raziuddin
Issue Date
2025-08
Type
Electronic thesis
Thesis
Language
en_US
Keywords
Biomedical engineering
Abstract
With the developments in radiology artificial intelligence (AI), many researchers have turned to the problem of automated reporting of imaging studies. The goal of such work is to produce a preliminary read of imaging studies in settings such as emergency rooms, where a radiologist may not be readily available, or to present a preliminary structured report to radiologists to reduce their dictation workload. An automatically produced structured report could also be more consistent and easier to read, leading to improved accuracy and lower overall costs of radiology reads in clinical workflows. Among the imaging areas where this has been found most useful are chest X-rays, the most common imaging modality read by radiologists in hospitals and teleradiology practices today. With the recent rise of generative AI, a number of researchers and corporations are attempting to generate preliminary reports for chest X-ray images, aided by the availability of relatively large datasets such as MIMIC and CheXpert that come with companion reports for training large vision-language models (VLMs). These newly emerged VLMs can generate longer and more natural sentences when prompted with good radiology-specific linguistic cues. However, despite their powerful language generation capabilities, ensuring that these models produce no hallucinations, i.e., incorrect mentions of findings or their descriptions, has proven difficult, limiting their clinical applicability. While methods for hallucination removal and fact-checking exist for large language models, with strategies such as direct preference optimization (DPO), proximal policy optimization (PPO), and reward models, they are mostly applicable during training or fine-tuning. Methods that check facts at inference time, on the other hand, typically consult external general knowledge or detect errors through analysis of the produced text, either by the model itself or through an LLM serving as a judge.
In radiology report generation, however, neither approach is possible, since the report must be specific to the patient and consistent with the evidence seen in the imaging. Because the automated reporting LLMs themselves hallucinate, there are no teacher LLMs good enough to correct automatically generated radiology reports; further, such judges may not be able to corroborate their deductions with the patient-specific image. Finally, any fact-checking should be agnostic to the radiology report generation tool, to give versatility during clinical deployment, where different vendor choices may be prevalent with separately evolving capabilities over time. Thus, there is a need to develop an independent fact-checking method for use during clinical inference to bootstrap radiology report generation and increase its adoption in clinical workflows. This Master's thesis investigates the hypothesis that it is possible to develop such independent discriminative neural networks as fact-checking models for use during inference to detect and correct errors in automatically generated reports. The key idea explored in the thesis is that, by creating a synthetic dataset of real and fake findings derived from ground-truth reports and pairing them with the corresponding chest X-ray images, a fact-checking classifier can be trained to distinguish correct from incorrect descriptions of findings when they are paired with the corresponding images. Such an independently developed classifier can then be used to detect and correct errors in the reports generated by automated radiology reporting tools. To verify this hypothesis, the thesis is divided into four investigations.
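The core idea above, a discriminator that accepts or rejects each generated finding against the patient's image, can be sketched as follows. This is a minimal, hypothetical illustration, not the thesis implementation: the trained image model is stubbed with a precomputed evidence lookup, and all names are invented for the sketch.

```python
# Hypothetical sketch of inference-time fact-checking: a trained
# discriminative model scores each generated (finding, polarity)
# pair against the image. The trained model is stubbed here with a
# precomputed evidence lookup so the control flow is runnable.

def fact_check(image_evidence: dict, finding: str, negated: bool) -> bool:
    """Accept the pair only if its polarity matches the image evidence."""
    present = image_evidence.get(finding, False)
    return present != negated

def filter_report(image_evidence: dict, findings: list) -> list:
    """Drop findings the fact-checker flags as inconsistent."""
    return [(f, neg) for f, neg in findings
            if fact_check(image_evidence, f, neg)]

evidence = {"pleural effusion": True, "pneumothorax": False}
report = [("pleural effusion", False),  # correct positive mention
          ("pneumothorax", False)]      # hallucinated positive mention
print(filter_report(evidence, report))  # → [('pleural effusion', False)]
```

In the thesis the stubbed lookup is replaced by a learned classifier over image and text features, but the report-filtering flow is the same generator-agnostic wrapper.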
First, by examining several radiology reporting methods, we analyze the types of errors made by report generators and identify four major error types: irrelevant predictions; polarity reversals or omissions; incorrect location predictions; and other errors such as incorrect severity assessments. We then simulate these errors to create a large synthetic dataset by perturbing findings and their locations in ground-truth reports, yielding real and fake finding-location pairs matched with images. Next, we build a discriminative classifier to detect and remove finding errors in reports using two different methods, one based on the findings alone and the other also capturing their spatial locations. Finally, we develop methods to correct the automated report while still ensuring language correctness by carefully prompting a large language model with information derived from the fact-checking model. Throughout, we conduct experiments on multiple benchmark datasets and perform ablation studies to select relevant architectural configurations, documenting the overall improvement in report quality achieved by using our fact-checking model to detect and correct errors. A novel measure was developed for assessing report correctness, leveraging both clinical accuracy and phrase-grounding accuracy. Explainable visualizations were generated to show the deviation of the reported findings from the findings and locations predicted by the fact-checking model. The overall results indicated that it is possible to develop a fact-checking model using an independently collected dataset of real and fake findings that simulates the errors made by report generators. The resulting fact-checking model was over 90% accurate when tested on multiple benchmark datasets and improved the quality of the automatically generated reports in the range of 7-29%.
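The error-simulation step described above can be sketched roughly as follows, assuming a simplified representation of each ground-truth finding as a (finding, location, negated) triple. The error menu, function names, and vocabularies here are illustrative assumptions, not the thesis code.

```python
import random

# Hypothetical sketch of the synthetic-data step: perturb ground-truth
# (finding, location, negated) triples to manufacture labeled "fake"
# training examples mirroring the observed error types.

ERROR_TYPES = ("polarity_flip", "location_swap", "irrelevant_finding")

def perturb(triple, findings_vocab, locations_vocab, rng):
    """Return a corrupted copy of one real triple plus the error applied."""
    finding, location, negated = triple
    err = rng.choice(ERROR_TYPES)
    if err == "polarity_flip":
        return (finding, location, not negated), err
    if err == "location_swap":
        others = [l for l in locations_vocab if l != location]
        return (finding, rng.choice(others), negated), err
    others = [f for f in findings_vocab if f != finding]
    return (rng.choice(others), location, negated), err

rng = random.Random(0)
real = ("opacity", "left lower lobe", False)
fake, err = perturb(real,
                    ["opacity", "pleural effusion", "cardiomegaly"],
                    ["left lower lobe", "right upper lobe"], rng)
print(err, fake)  # a labeled fake example to pair with the same image
```

Each real triple keeps its image and the label "real"; each perturbed triple keeps the same image and the label "fake", giving the discriminative classifier matched positive and negative examples per study.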
A high degree of concordance was found between the use of our fact-checking model and ground truth for verification of automated reports, leading us to conclude that the fact-checking model has the potential to serve as a surrogate for ground truth during clinical inference. This demonstrates further utility of our model as an additional validation checkpoint in making AI models robust and ready for clinical workflows.
Description
August 2025
School of Engineering
Publisher
Rensselaer Polytechnic Institute, Troy, NY