- Title
Comparison of machine-learning and logistic regression models for prediction of 30-day unplanned readmission in electronic health records: A development and validation study.
- Authors
Iwagami, Masao; Inokuchi, Ryota; Kawakami, Eiryo; Yamada, Tomohide; Goto, Atsushi; Kuno, Toshiki; Hashimoto, Yohei; Michihata, Nobuaki; Goto, Tadahiro; Shinozaki, Tomohiro; Sun, Yu; Taniguchi, Yuta; Komiyama, Jun; Uda, Kazuaki; Abe, Toshikazu; Tamiya, Nanako
- Abstract
It is expected, but not established, that machine-learning models outperform regression models, such as a logistic regression (LR) model, especially as the number and types of predictor variables in electronic health records (EHRs) increase. We aimed to compare the predictive performance of gradient-boosted decision tree (GBDT), random forest (RF), deep neural network (DNN), and LR with the least absolute shrinkage and selection operator (LR-LASSO) for unplanned readmission. We used EHRs of patients discharged alive from 38 hospitals in 2015–2017 for derivation and in 2018 for validation, including basic characteristics; diagnosis, surgery, procedure, and drug codes; and blood-test results. The outcome was 30-day unplanned readmission. We created six patterns of data tables with different numbers of binary variables (those present in ≥5% of patients, ≥1% of patients, or ≥10 patients), each with and without blood-test results. For each pattern, we used the derivation data to fit the machine-learning and LR models and the validation data to evaluate the performance of each model. The incidence of the outcome was 6.8% (23,108/339,513 discharges) and 6.4% (7,507/118,074 discharges) in the derivation and validation datasets, respectively. For the first data table, with the smallest number of variables (102 variables present in ≥5% of patients, without blood-test results), the c-statistic was highest for GBDT (0.740), followed by RF (0.734), LR-LASSO (0.720), and DNN (0.664). For the last data table, with the largest number of variables (1,543 variables present in ≥10 patients, including blood-test results), the c-statistic was highest for GBDT (0.764), followed by LR-LASSO (0.755), RF (0.751), and DNN (0.720); the difference between GBDT and LR-LASSO was small, and their 95% confidence intervals overlapped.
In conclusion, GBDT generally outperformed LR-LASSO in predicting unplanned readmission, but the difference in c-statistic became smaller as the number of variables increased and blood-test results were added.
Author summary: Whether machine-learning models can outperform traditional statistical models, such as a logistic regression (LR) model, for predicting hospital readmission from electronic health records (EHRs) has been controversial. Therefore, this study aimed to systematically compare the performance of several machine-learning models and an LR model in predicting 30-day unplanned readmission. We created six patterns of data tables according to the number of binary predictor variables (those present in ≥5% of patients, ≥1% of patients, or ≥10 patients), with and without blood-test results, expecting that some machine-learning models would outperform the LR model more prominently as the data became richer. We found that the gradient-boosted decision tree (one of the machine-learning models) generally outperformed the LR model. However, contrary to our expectation, the difference in predictive performance between them was smaller in the last data table, which had the largest number of variables (1,543 variables including blood-test results). Thus, this study concludes that the advantage of machine-learning methods over traditional statistical models may not grow as the information in EHRs becomes richer. Future studies should focus on other potential predictors in EHRs, such as images and processed natural language, to demonstrate superior performance of machine-learning methods over traditional statistical models.
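The prevalence-based variable selection and the c-statistic mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`select_binary_variables`, `c_statistic`, `incidence_percent`), the toy table, and the toy scores are hypothetical. Only the two incidence counts are taken from the abstract.

```python
from typing import Dict, List, Sequence


def select_binary_variables(
    table: Dict[str, Sequence[int]],
    min_fraction: float = 0.0,
    min_count: int = 0,
) -> List[str]:
    """Keep binary (0/1) variables present in enough patients.

    A variable is kept when the number of patients with the variable is
    at least `min_count` and its share of all patients is at least
    `min_fraction`. Hypothetical helper, for illustration only.
    """
    kept = []
    for name, column in table.items():
        count = sum(column)
        if count >= min_count and count / len(column) >= min_fraction:
            kept.append(name)
    return kept


def c_statistic(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Share of (event, non-event) pairs in which the event has the higher
    predicted score; tied pairs count as one half.

    This pairwise form is quadratic in the sample size, so it suits small
    illustrative inputs only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def incidence_percent(events: int, discharges: int) -> float:
    """Outcome incidence as a percentage, rounded to one decimal place."""
    return round(100.0 * events / discharges, 1)


# Toy table of 10 patients (made-up data, not from the study).
toy_table = {
    "code_common": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # present in 5 of 10
    "code_rare": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],    # present in 1 of 10
}
kept_at_20_percent = select_binary_variables(toy_table, min_fraction=0.2)

# Toy outcome labels and predicted risks (made-up data).
toy_auc = c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Incidence figures recomputed from the counts reported in the abstract.
derivation_incidence = incidence_percent(23108, 339513)
validation_incidence = incidence_percent(7507, 118074)
```

In practice, a library routine such as scikit-learn's `roc_auc_score` would be the more usual way to obtain the c-statistic for a binary outcome; the pairwise version above is only meant to make the definition explicit.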
- Subjects
JAPAN; RANDOM forest algorithms; PREDICTION models; BLOOD testing; RECEIVER operating characteristic curves; RESEARCH funding; PATIENT readmissions; LOGISTIC regression analysis; RETROSPECTIVE studies; DISCHARGE planning; HOSPITALS; DESCRIPTIVE statistics; LONGITUDINAL method; ELECTRONIC health records; ARTIFICIAL neural networks; RESEARCH methodology; MACHINE learning; DECISION trees; CONFIDENCE intervals; DATA analysis software
- Publication
PLoS Digital Health, 2024, Vol 3, Issue 8, p1
- ISSN
2767-3170
- Publication type
Article
- DOI
10.1371/journal.pdig.0000578