RoentGen: Vision-Language Foundation Model
for Chest X-ray Generation

Christian Bluethgen[1]*
Jean-Benoit Delbrouck[1]
Juan M.Z. Chaves[1] Rogier Van der Sluijs[1]
Małgorzata Połacin[2] Tanishq Mathew Abraham[3,4] Shivanshu Purohit[4]
Curtis P.

[1] Center for Artificial Intelligence in Medicine and Imaging (AIMI), Stanford University
[2] Stanford Medicine, Department of Radiology, Stanford University
[3] University of California, Davis
[4] Stability AI
* equal contributions



Multimodal models trained on large natural image-text pair datasets have exhibited astounding abilities in generating high-quality images. Medical imaging data is fundamentally different from natural images, and the language used to succinctly capture relevant details in medical data draws on a narrow but semantically rich, domain-specific vocabulary. Not surprisingly, multimodal models trained on natural image-text pairs tend not to generalize well to the medical domain. Developing generative imaging models that faithfully represent medical concepts while providing compositional diversity could mitigate the existing paucity of high-quality, annotated medical imaging datasets.

In this work, we develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest X-rays (CXR) and their corresponding radiology (text) reports. We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts. We assess the model outputs quantitatively using image quality metrics, and have human domain experts evaluate image quality and text-image alignment.
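For orientation, the objective commonly used to train or fine-tune a text-conditioned latent diffusion model can be written as below. This is the standard noise-prediction loss from the latent diffusion literature, given here as a sketch; this page does not state the exact objective, noise schedule, or which components were updated, so the symbols are assumptions: $x$ is a CXR, $y$ its report text, $\mathcal{E}$ the image encoder of the autoencoder, $\tau_\phi$ the text encoder, and $\epsilon_\theta$ the denoising network.

```latex
% Standard latent diffusion training objective (assumed, not quoted from this page)
z_0 = \mathcal{E}(x), \qquad
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad
\epsilon \sim \mathcal{N}(0, I)

\mathcal{L}(\theta, \phi) =
  \mathbb{E}_{x,\, y,\, \epsilon,\, t}
  \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, \tau_\phi(y)\right) \right\rVert_2^2 \right]
```

Under this reading, adapting the model to CXR amounts to minimizing $\mathcal{L}$ on image-report pairs, starting from the pre-trained weights.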

RoentGen is able to create visually convincing, diverse synthetic CXR images. The output can be controlled to a high extent by using free-form text prompts that include radiology-specific language. Fine-tuning this model on a fixed training set and using it as a data augmentation method, we measure a 5% improvement for a classifier trained jointly on synthetic and real images, and a 3% improvement when it is trained on a larger but purely synthetic training set. Finally, we observe that this fine-tuning distills in-domain knowledge into the text encoder and can improve its ability to represent certain diseases, such as pneumothorax, by 25%.
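The augmentation setup described above, a classifier trained on real images plus synthetic ones, or on synthetic images alone, could be assembled roughly as in the following sketch. It only illustrates how the two pools might be combined; the function name, the `(image_id, label)` sample representation, and the sampling scheme are hypothetical and are not taken from the authors' code.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical sample representation: (image identifier, class label).
Sample = Tuple[str, int]


def mix_real_and_synthetic(
    real: Sequence[Sample],
    synthetic: Sequence[Sample],
    n_synthetic: int,
    seed: int = 0,
) -> List[Sample]:
    """Return all real samples plus n_synthetic randomly chosen synthetic ones, shuffled."""
    if not 0 <= n_synthetic <= len(synthetic):
        raise ValueError("n_synthetic must be between 0 and len(synthetic)")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    chosen = rng.sample(list(synthetic), n_synthetic)
    combined = list(real) + chosen
    rng.shuffle(combined)
    return combined


if __name__ == "__main__":
    real = [("real_0", 0), ("real_1", 1)]
    synthetic = [("syn_0", 0), ("syn_1", 1), ("syn_2", 1)]
    # Joint training set: every real sample plus two synthetic ones.
    print(mix_real_and_synthetic(real, synthetic, n_synthetic=2))
    # Purely synthetic training set: pass an empty real pool.
    print(mix_real_and_synthetic([], synthetic, n_synthetic=3))
```

Passing an empty `real` list corresponds to the purely synthetic setting; the classifier and its evaluation are outside the scope of this sketch.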

Synthetic images created by prompting RoentGen for typical CXR abnormalities. The generated CXRs feature high levels of detail: when prompted for "edema" (top right), perihilar haziness (white arrowheads) and peribronchial cuffing (black arrowhead), both features seen in pulmonary edema, can be observed. For "pneumothorax" (bottom row, third image from the left), a fine line representing the visceral pleural lining of the partially collapsed lung can be delineated (dashed line).

Text-conditioned synthesis of CXR. Each image was hand-picked out of four CXRs generated for the respective prompt. Here, the presence or absence of a finding (pleural effusions, dotted ROI added for visualization) and dimensions like size and laterality were controlled via prompting. Note that the model correctly incorporated the radiological convention of displaying the right patient side on the left side of the image, and vice versa.
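As a rough illustration of this kind of prompt control, the sketch below composes a short prompt from a finding, its presence, size, and laterality, and maps a patient side to the image side under the radiological display convention. The helper names and the prompt wording are hypothetical; they are not the prompts used for the figure.

```python
from typing import Optional


def image_side(patient_side: str) -> str:
    """Map a patient side to the side of the displayed image.

    Radiological convention: the patient's right appears on the image's left.
    """
    mapping = {"right": "left", "left": "right"}
    if patient_side not in mapping:
        raise ValueError("patient_side must be 'left' or 'right'")
    return mapping[patient_side]


def build_prompt(
    finding: str,
    present: bool = True,
    size: Optional[str] = None,
    laterality: Optional[str] = None,
) -> str:
    """Compose a free-text prompt; size and laterality are ignored if the finding is absent."""
    if not present:
        return f"No {finding}"
    parts = [p for p in (size, laterality, finding) if p]
    text = " ".join(parts)
    return text[0].upper() + text[1:]


if __name__ == "__main__":
    print(build_prompt("pleural effusion", size="large", laterality="right"))
    print(build_prompt("pleural effusion", present=False))
    print(image_side("right"))
```

A prompt built with `laterality="right"` refers to the patient's right side, so the finding would be expected on the left side of the generated image.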


P. Chambon, C. Bluethgen, J.-B. Delbrouck, R. Van der Sluijs, M. Połacin, J.M.Z. Chaves, T. Abraham, S. Purohit, C.P. Langlotz, and A. Chaudhari.
RoentGen: Vision-Language Foundation Model for Chest X-ray Generation.
2022. (hosted on arXiv)



Research reported in this publication was made possible in part by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health, which funded PC under contracts 75N92020C00008 and 75N92020C00021. CB received support from the Swiss Society of Radiology and the Kurt and Senta Herrmann-Foundation, independent of this work. We thank Stability AI for providing computational resources for this work. We also acknowledge the support of the Wu Tsai Human Performance Alliance at Stanford University and the Joe and Clara Tsai Foundation.


The information provided on this project site, including the generated images, does not contain, and is not a substitute for, professional medical or health advice. The information is provided in good faith for general informational, research and educational purposes only. The authors make no warranty of any kind, express or implied, regarding the accuracy, validity, reliability, or completeness of any information on the site. Before taking any actions based on the presented information, we encourage you to consult with appropriate medical professionals. USE OF OR RELIANCE ON ANY INFORMATION CONTAINED ON THE SITE IS SOLELY AT YOUR OWN RISK.
