- Title
External Validation of an Ensemble Model for Automated Mammography Interpretation by Artificial Intelligence.
- Authors
Hsu, William; Hippe, Daniel S.; Nakhaei, Noor; Wang, Pin-Chieh; Zhu, Bing; Siu, Nathan; Ahsen, Mehmet Eren; Lotter, William; Sorensen, A. Gregory; Naeim, Arash; Buist, Diana S. M.; Schaffter, Thomas; Guinney, Justin; Elmore, Joann G.; Lee, Christoph I.
- Abstract
This diagnostic study evaluates an ensemble artificial intelligence model for automated interpretation of screening mammography in a diverse population.
Key Points:
Question: Will a high-performing ensemble of artificial intelligence (AI) models for automated interpretation of screening mammography generalize to a diverse population?
Findings: In this diagnostic study using 37 317 examinations from 26 817 women seen at a geographically distributed screening program, a previously validated ensemble model had a decline in performance compared with its reported performance in other, more homogeneous cohorts. When combined with a radiologist assessment, ensemble performance was similar to that of the radiologist, but worse performance was noted in subgroups, particularly Hispanic women and women with a personal history of breast cancer.
Meaning: These findings suggest that AI models, including those trained on large data sets or constructed using ensemble methods, may be at risk of underspecification and poor generalizability.
Importance: With a shortfall in fellowship-trained breast radiologists, mammography screening programs are looking toward artificial intelligence (AI) to increase efficiency and diagnostic accuracy. External validation studies provide an initial assessment of how promising AI algorithms perform in different practice settings.
Objective: To externally validate an ensemble deep-learning model using data from a high-volume, distributed screening program of an academic health system with a diverse patient population.
Design, Setting, and Participants: In this diagnostic study, an ensemble learning method, which reweights outputs of the 11 highest-performing individual AI models from the Digital Mammography Dialogue on Reverse Engineering Assessment and Methods (DREAM) Mammography Challenge, was used to predict the cancer status of an individual using a standard set of screening mammography images. This study was conducted using retrospective patient data collected between 2010 and 2020 from women aged 40 years and older who underwent a routine breast screening examination and participated in the Athena Breast Health Network at the University of California, Los Angeles (UCLA).
Main Outcomes and Measures: Performance of the challenge ensemble method (CEM) and the CEM combined with radiologist assessment (CEM+R) were compared with diagnosed ductal carcinoma in situ and invasive cancers within a year of the screening examination using performance metrics, such as sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC).
Results: Evaluated on 37 317 examinations from 26 817 women (mean [SD] age, 58.4 [11.5] years), individual model AUROC estimates ranged from 0.77 (95% CI, 0.75-0.79) to 0.83 (95% CI, 0.81-0.85). The CEM model achieved an AUROC of 0.85 (95% CI, 0.84-0.87) in the UCLA cohort, lower than the performance achieved in the Kaiser Permanente Washington (AUROC, 0.90) and Karolinska Institute (AUROC, 0.92) cohorts. The CEM+R model achieved a sensitivity (0.813 [95% CI, 0.781-0.843] vs 0.826 [95% CI, 0.795-0.856]; P = .20) and specificity (0.925 [95% CI, 0.916-0.934] vs 0.930 [95% CI, 0.929-0.932]; P = .18) similar to the radiologist performance. The CEM+R model had significantly lower sensitivity (0.596 [95% CI, 0.466-0.717] vs 0.850 [95% CI, 0.766-0.923]; P < .001) and specificity (0.803 [95% CI, 0.734-0.861] vs 0.945 [95% CI, 0.936-0.954]; P < .001) than the radiologist in women with a prior history of breast cancer and Hispanic women (0.894 [95% CI, 0.873-0.910] vs 0.926 [95% CI, 0.919-0.933]; P = .004).
Conclusions and Relevance: This study found that the high performance of an ensemble deep-learning model for automated screening mammography interpretation did not generalize to a more diverse screening cohort, suggesting that the model experienced underspecification. This study suggests the need for model transparency and fine-tuning of AI models for specific target populations prior to their clinical adoption.
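The performance metrics named in the abstract (sensitivity, specificity, and AUROC) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the decision threshold, and the plain weighted average used to stand in for "reweighting" model outputs are hypothetical, and the abstract does not describe the study's actual ensemble weighting or analysis code.

```python
from typing import List, Sequence, Tuple


def sensitivity_specificity(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold count as positive.

    labels: 1 = cancer diagnosed, 0 = no cancer (hypothetical encoding).
    Requires at least one positive and one negative label.
    """
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking. This pairwise form is O(n^2) and is
    meant for small illustrative inputs only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def weighted_ensemble(
    model_scores: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Combine per-model scores with a normalized weighted average.

    A stand-in for ensemble reweighting; not the method used in the study.
    """
    total = sum(weights)
    n_exams = len(model_scores[0])
    return [
        sum(w * m[i] for w, m in zip(weights, model_scores)) / total
        for i in range(n_exams)
    ]


if __name__ == "__main__":
    # Invented toy data, not study data.
    toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
    toy_scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
    print(sensitivity_specificity(toy_labels, toy_scores, threshold=0.5))
    print(auroc(toy_labels, toy_scores))
```

Sensitivity and specificity depend on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is why a study can report both.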
- Subjects
DEEP learning; RESEARCH; CONFIDENCE intervals; MATHEMATICAL models; ARTIFICIAL intelligence; MAMMOGRAMS; RETROSPECTIVE studies; AUTOMATION; THEORY; DESCRIPTIVE statistics; DATA analysis software; SENSITIVITY & specificity (Statistics)
- Publication
JAMA Network Open, 2022, Vol 5, Issue 11, p. e2242343
- ISSN
2574-3805
- Publication type
Article
- DOI
10.1001/jamanetworkopen.2022.42343