Handorf, Elizabeth; Yin, Yinuo; Slifker, Michael; Lynch, Shannon

doi:10.1186/s12874-020-01183-9

Title: Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches.
Authors: Handorf, Elizabeth; Yin, Yinuo; Slifker, Michael; Lynch, Shannon
Abstract:
Background: Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but rarely is the full dataset utilized for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome.
Methods: We compared several popular machine learning methods, including penalized regressions (e.g. lasso, elastic net) and tree ensemble methods. Via simulation, we assessed the methods' ability to identify census variables truly associated with binary and continuous outcomes while minimizing false positive results (10 true associations, 1000 total variables). We applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer.
Results: In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, hierarchical clustering based on Spearman's correlation with sparse group lasso regression performed the best overall. Bayesian Additive Regression Trees outperformed other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings.
Conclusions: This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods. Sparse clustered regression models performed best, as they identified many true-positive variables while controlling false-positive discoveries.
Subjects: REGRESSION trees; MACHINE learning; HIERARCHICAL clustering (Cluster analysis); ALACHLOR; RANK correlation (Statistics); COMPUTER simulation; RESEARCH; RESEARCH methodology; MEDICAL cooperation; EVALUATION research; COMPARATIVE studies; RESEARCH funding; PROBABILITY theory
Publication: BMC Medical Research Methodology, 2020, Vol 20, Issue 1, p1
ISSN: 1471-2288
Publication type: journal article
DOI: 10.1186/s12874-020-01183-9
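
Note: the abstract weighs true-positive discovery against false-positive control in a simulation with 10 true associations among 1000 variables. The sketch below shows one way such a selection result could be scored. It is illustrative only: the function name, the example variable indices, and the use of F1 as the "combined measure of accuracy" are assumptions, since this record does not say which combined measure the authors used.

```python
def selection_metrics(selected, truth):
    """Score a set of selected variable indices against the known true set.

    Hypothetical helper, not taken from the paper.
    """
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)   # true variables that were selected
    fp = len(selected - truth)   # null variables that were selected
    fn = len(truth - selected)   # true variables that were missed
    sensitivity = tp / len(truth) if truth else 0.0
    # False discovery proportion: share of selected variables that are null.
    fdp = fp / len(selected) if selected else 0.0
    # F1 is an assumed stand-in for a combined accuracy measure.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "sensitivity": sensitivity, "fdp": fdp, "f1": f1}


# Simulated setting from the abstract: variables 0..999, of which 0..9 are true.
truth = range(10)

# Made-up selections: a conservative method and a liberal one.
conservative = [0, 1, 2, 3, 4, 5, 500]
liberal = list(range(10)) + list(range(100, 120))

conservative_scores = selection_metrics(conservative, truth)
liberal_scores = selection_metrics(liberal, truth)
print(conservative_scores)
print(liberal_scores)
```

In this made-up example the liberal selection recovers every true variable but scores lower on F1 than the conservative one, because two thirds of what it selects is false. This is the kind of trade-off the abstract describes between elastic net and lasso.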
