A vision-language foundation model for the generation of realistic chest X-ray images

Christian Bluethgen^[1]*	Pierre Chambon^[1]*	Jean-Benoit Delbrouck^[1]	Juan M.Z. Chaves^[1]	Rogier Van der Sluijs^[1]
Małgorzata Połacin^[2]	Tanishq Mathew Abraham^[3,4]	Shivanshu Purohit^[4]	Curtis P. Langlotz^[1,2]	Akshay Chaudhari^[1,2]

^[1] Center for Artificial Intelligence in Medicine and Imaging (AIMI), Stanford University
^[2] Stanford Medicine, Department of Radiology, Stanford University
^[3] University of California, Davis
^[4] MedARC, Stability AI
^* equal contributions

[Paper]

[BibTeX]

[Weights]

Abstract

Multimodal models trained on large natural image-text pair datasets have exhibited astounding abilities in generating high-quality images. Medical imaging data is fundamentally different to natural images, and the language used to succinctly capture relevant details in medical data uses a different, narrow but semantically rich, domain-specific vocabulary. Not surprisingly, multi-modal models trained on natural image-text pairs do not tend to generalize well to the medical domain. Developing generative imaging models faithfully representing medical concepts while providing compositional diversity could mitigate the existing paucity of high-quality, annotated medical imaging datasets.

In this work, we develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest x-rays (CXR) and their corresponding radiology (text) reports. We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts. We assess the model outputs quantitatively using image quality metrics, and evaluate image quality and text-image alignment by human domain experts.

RoentGen is able to create visually convincing, diverse synthetic CXR images. The output can be controlled to a high extent by using free-form text prompts including radiology-specific language. Fine-tuning this model on a fixed training set and using it as a data augmentation method, we measure a 5% improvement of a classifier trained jointly on synthetic and real images, and a 3% improvement when trained on a larger but purely synthetic training set. Finally, we observe that this fine-tuning distills in-domain knowledge in the text-encoder and can improve its representation capabilities of certain diseases like pneumothorax by 25%.

Synthetic images created by prompting RoentGen for typical CXR abnormalities. The generated CXRs feature high levels of detail: When prompted for ”edema” (top right), perihilar haziness (white arrowheads) and peribronchial cuffing (black arrowhead), both features seen in pulmonary edema, can be observed. For "pneumothorax" (bottom row, third image from the left), a fine line representing the visceral pleural lining of the partially collapsed lung can be delineated (dashed line).

Text-conditioned synthesis of CXR. Each image was hand-picked out of four generated CXR per respective prompt. Here, presence or absence of a finding (pleural effusions, dotted ROI added for visualization) and dimensions like size and laterality were controlled via prompting. Note that the model correctly incorporated the radiological convention of displaying the right patient side on the left side of the image, and vice versa.

Paper

C. Bluethgen, P. Chambon,
JB Delbrouck, R. Van der Sluijs,
M. Polacin, J.M.Z. Chaves,
T. Abraham, and S. Purohit
C.P. Langlotz, A. Chaudhari
RoentGen: Vision-Language Foundation Model for Chest X-ray Generation
2024.

[Bibtex]

Acknowledgements

C.B. received research support from the Swiss Society of Radiology and from the Promedica Foundation, Switzerland. A.S.C. receives research support from the National Institutes of Health (grant numbers R01 HL167974, R01 AR077604, R01 EB002524, R01 AR079431, R01 HL169345 and P41 EB027060) and from GE Healthcare and Philips.

Disclaimer

The information provided on this project site, including the generated images, does not contain or is substitute for any professional medical or health advice. The information is provided in good faith for general informational, research and educational purposes only. The authors make no warranty of any kind, express or implied, regarding the accuracy, validity, reliability, or completeness of any information on the site. Before taking any actions based on the presented information, we encourage you to consult with appropriate medical professionals. THE USE OR RELIANCE OF ANY INFORMATION CONTAINED ON THE SITE IS SOLELY AT YOUR OWN RISK.

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.
Code used for the image gallery: SimpleLightbox