Adamson |
Desai |
Dominic |
Bluethgen |
Varma |
Wood |
Syed |
Boutin |
Stevens |
Vasanawala |
Pauly |
Gunel |
Chaudhari |
Stanford University
Purpose. Commonly used MR image quality (IQ) metrics have poor concordance with radiologist-perceived diagnostic IQ. Here, we develop and explore deep feature distances (DFDs), distances computed in a lower-dimensional feature space encoded by a convolutional neural network (CNN), as improved perceptual IQ metrics for MR image reconstruction. We further explore the impact of distribution shifts between the images used to train the DFD CNN encoder and the images on which the IQ metric is evaluated. Methods. We compare commonly used IQ metrics (PSNR and SSIM) to two “out-of-domain” DFDs with encoders trained on natural images, an “in-domain” DFD trained on MR images alone, and two domain-adjacent DFDs trained on large medical imaging datasets. We additionally compare these with several state-of-the-art but less commonly reported IQ metrics: visual information fidelity (VIF), noise quality metric (NQM), and the high-frequency error norm (HFEN). IQ metric performance is assessed via correlations with five expert radiologist reader scores of perceived diagnostic IQ of various accelerated MR image reconstructions. We characterize the behavior of these IQ metrics under common distortions expected during image acquisition, including their sensitivity to acquisition noise. Results. All DFDs and HFEN correlate more strongly with radiologist-perceived diagnostic IQ than SSIM, PSNR, and other state-of-the-art metrics, with correlations comparable to radiologist inter-reader variability. Surprisingly, out-of-domain DFDs perform comparably to in-domain and domain-adjacent DFDs. Conclusion. A suite of IQ metrics, including DFDs and HFEN, should be used alongside commonly reported IQ metrics for a more holistic evaluation of MR image reconstruction perceptual quality. We also observe that general vision encoders are capable of assessing visual IQ even for MR images. |
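To make the idea of a deep feature distance concrete, here is a minimal sketch: both images pass through the same encoder, and the distance is taken between the resulting feature maps rather than between pixels. The encoder below (`toy_encoder`, a few random 3x3 filters, a ReLU, and average pooling) is a hypothetical stand-in for a trained CNN, and the channel-wise unit normalization is an assumed design choice, so this is not any of the DFDs evaluated in the study.

```python
import numpy as np


def toy_encoder(image, filters):
    """Hypothetical stand-in for a trained CNN encoder: one valid 3x3
    cross-correlation per filter, a ReLU, then 2x2 average pooling."""
    k = filters.shape[-1]
    # Gather every k x k patch so the filtering becomes a tensor product.
    patches = np.lib.stride_tricks.sliding_window_view(image, (k, k))
    feats = np.einsum("ijkl,ckl->cij", patches, filters)
    feats = np.maximum(feats, 0.0)  # ReLU
    c, fh, fw = feats.shape
    fh, fw = fh - fh % 2, fw - fw % 2  # crop to an even size before pooling
    feats = feats[:, :fh, :fw].reshape(c, fh // 2, 2, fw // 2, 2)
    return feats.mean(axis=(2, 4))


def deep_feature_distance(reference, reconstruction, filters, eps=1e-10):
    """Mean squared distance between channel-normalized feature maps."""
    f_ref = toy_encoder(reference, filters)
    f_rec = toy_encoder(reconstruction, filters)
    # Assumed choice: scale each spatial location to unit length across channels.
    f_ref = f_ref / (np.linalg.norm(f_ref, axis=0, keepdims=True) + eps)
    f_rec = f_rec / (np.linalg.norm(f_rec, axis=0, keepdims=True) + eps)
    return float(np.mean(np.sum((f_ref - f_rec) ** 2, axis=0)))


rng = np.random.default_rng(0)
filters = rng.standard_normal((4, 3, 3))   # random, untrained filters
reference = rng.random((32, 32))           # synthetic image, not MR data
noisy = reference + 0.1 * rng.standard_normal((32, 32))

print(deep_feature_distance(reference, reference, filters))  # identical inputs
print(deep_feature_distance(reference, noisy, filters))      # perturbed input
```

Swapping `toy_encoder` for an encoder trained on natural, MR, or other medical images would be what distinguishes out-of-domain, in-domain, and domain-adjacent DFDs in the sense used above.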
![]() |
Correlations between mean reader scores and commonly used (PSNR, SSIM, and NRMSE), less commonly used state-of-the-art (VIF, NQM, HFEN), and DFD IQ metrics, with DFDs grouped by encoder training data domain. Aliasing reader score SROCC values are shown on the left, with cartilage and meniscus reader score SROCCs on the right. DFDs and HFEN outperform VIF and NQM, which in turn outperform SSIM and PSNR. The DFD and HFEN correlations are comparable to the inter-reader variability between the five readers (shown in black, ±1 standard deviation), and out-of-domain DFDs perform comparably to in-domain and domain-adjacent DFDs as MR image reconstruction IQ metrics. |
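For readers unfamiliar with SROCC (Spearman rank-order correlation coefficient), the following dependency-free sketch computes it as the Pearson correlation of the ranks, with tied values sharing their average rank. The metric values and reader scores are made-up numbers for illustration, not data from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: a distance-type metric and mean reader scores.
metric_values = [0.10, 0.25, 0.40, 0.55, 0.80]
reader_scores = [4.6, 4.0, 3.2, 3.4, 1.8]
print(srocc(metric_values, reader_scores))  # approximately -0.9
```

The sign is negative here because the hypothetical metric is a distance (lower is better) while higher reader scores indicate better quality; only the strength of the monotonic relationship matters when comparing metrics.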
![]() |
Example commonly used (SSIM, PSNR), state-of-the-art (VIF), out-of-domain DFD (LPIPS), in-domain DFD (SSFD), and domain-adjacent DFD (RINFD) IQ metric values versus mean reader scores for aliasing (top) and cartilage/meniscus assessment (bottom). Each point corresponds to the center slice of one reconstruction; each of the 61 MR images is reconstructed at accelerations of 2 (blue), 4 (black), and 6 (orange) with a UNet (circles) and an unrolled network (X's). Higher reader scores correspond to better radiologist-perceived IQ. |
|
![]() |
Correlations between mean reader scores and commonly used (SSIM, PSNR, NRMSE), less commonly used state-of-the-art (VIF, NQM, HFEN), and DFD IQ metrics (grouped by encoder training data domain), as a function of acquisition noise in the reference data. Complex Gaussian noise is added to the original k-space data by scaling the inter-channel covariance matrix. Aliasing reader score SROCC values are shown on top, with cartilage and meniscus reader score SROCCs on the bottom. DFDs and HFEN outperform other IQ metrics as acquisition noise increases, and LPIPS, RINFD, and HFEN still match inter-reader variability (shown in black, ±1 standard deviation) even when using reference scans with four times the original acquisition noise. |
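One way to add channel-correlated complex Gaussian noise to multi-coil k-space is sketched below. It assumes the inter-channel noise covariance matrix is already known and that the added noise has covariance `scale * covariance`; the function name, the array layout, and how `scale` relates to "four times the original acquisition noise" are assumptions for illustration, not the study's exact procedure.

```python
import numpy as np


def add_kspace_noise(kspace, covariance, scale, rng):
    """Add zero-mean complex Gaussian noise with covariance scale * covariance.

    kspace:     complex array of shape (channels, ky, kx)
    covariance: Hermitian positive-definite (channels, channels) matrix
    scale:      assumed multiplier on the covariance of the added noise
    """
    chol = np.linalg.cholesky(scale * np.asarray(covariance))
    # Unit-variance circular complex white noise, independent across channels.
    white = (rng.standard_normal(kspace.shape)
             + 1j * rng.standard_normal(kspace.shape)) / np.sqrt(2.0)
    # Mix channels with the Cholesky factor to impose the target covariance.
    colored = np.tensordot(chol, white, axes=(1, 0))
    return kspace + colored


rng = np.random.default_rng(0)
covariance = np.array([[1.0, 0.3], [0.3, 2.0]])   # hypothetical 2-coil covariance
kspace = np.zeros((2, 200, 200), dtype=complex)   # empty k-space isolates the noise
noisy = add_kspace_noise(kspace, covariance, scale=4.0, rng=rng)

# The empirical covariance of the added noise should be close to 4 * covariance.
samples = noisy.reshape(2, -1)
empirical = samples @ samples.conj().T / samples.shape[1]
print(np.round(empirical.real, 2))
```

Because acquired k-space already contains noise with the original covariance, the total noise level after this step depends on both the original and the added component.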
|
![]() |
IQ metrics versus perturbation extent for increasing Gaussian noise, blurring, translation, and motion artifacts (bottom); perturbation extents were chosen to achieve comparable changes in SSIM. Example images demonstrate each perturbation type at the extent indicated by the arrow, where the roll image is the difference between the perturbed and original scans (top). All IQ metrics are sensitive to clinically realistic image corruptions, while DFDs are relatively insensitive to pixel shifts (compare HFEN, yellow, with RINFD, green). |
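The sensitivity of pixel-wise metrics to small translations can be seen in a deliberately contrived sketch. It applies a one-pixel circular shift and additive Gaussian noise to a synthetic striped image and compares PSNR; the helper names, the peak-value convention (maximum of the reference), and the test image are assumptions for illustration only.

```python
import numpy as np


def psnr(reference, test):
    """Peak signal-to-noise ratio in dB, using the reference maximum as peak."""
    mse = np.mean((reference - test) ** 2)
    return float(10.0 * np.log10(reference.max() ** 2 / mse))


def translate(image, shift):
    """Circularly shift the image along the column axis by `shift` pixels."""
    return np.roll(image, shift, axis=1)


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return image + sigma * rng.standard_normal(image.shape)


rng = np.random.default_rng(0)
# Worst case for a shift: vertical stripes alternating 0 and 1 every pixel.
stripes = np.tile(np.array([0.0, 1.0]), (64, 32))

shifted = translate(stripes, 1)
noisy = add_gaussian_noise(stripes, 0.05, rng)

print(psnr(stripes, shifted))  # every pixel flips, so MSE is 1 and PSNR is 0 dB
print(psnr(stripes, noisy))    # much higher PSNR despite visible noise
```

A one-pixel shift leaves the image content unchanged to a viewer yet drives a pixel-wise metric to its floor in this example, which is the kind of behavior the perturbation experiment above probes.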
![]() |
All data used in this study have been made public. Above is an example of the six reconstruction techniques (left), and plots of their IQ metrics versus the mean of the five radiologists' reader scores for the presence of aliasing and the diagnostic quality of the cartilage and meniscus (right). The center slices from all 366 MR images, (3 accelerations) x (2 methods) x (61 patients), along with the corresponding reader scores from each of the five radiologists, are available for download here. |
![]() |
Philip M. Adamson, Arjun D. Desai, Jeffrey Dominic, Maya Varma, Christian Bluethgen, Jeff P. Wood, Ali B. Syed, Robert D. Boutin, Kathryn J. Stevens, Shreyas Vasanawala, John M. Pauly, Beliz Gunel, and Akshay S. Chaudhari. Using Deep Feature Distances for Evaluating the Perceptual Quality of MR Image Reconstructions. Magnetic Resonance in Medicine, 2025. (Link) |
Acknowledgements. This work was supported by NIH R01 AR077604, R01 EB002524, and R01 AR079431. It was also supported by the Radiological Sciences Laboratory Seed Grant from Stanford University, and the NSF Graduate Research Fellowship under Grant No. DGE-2146755. |