- Title
A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history.
- Authors
Maurits, Marc P; Korsunsky, Ilya; Raychaudhuri, Soumya; Murphy, Shawn N; Smoller, Jordan W; Weiss, Scott T; Huizinga, Thomas W J; Reinders, Marcel J T; Karlson, Elizabeth W; van den Akker, Erik B; Knevel, Rachel
- Abstract
Objective: To facilitate patient disease subset and risk factor identification by constructing a pipeline which is generalizable, provides easily interpretable results, and allows replication by overcoming electronic health records (EHRs) batch effects.
Material and Methods: We used 1872 billing codes in EHRs of 102 880 patients from 12 healthcare systems. Using tools borrowed from single-cell omics, we mitigated center-specific batch effects and performed clustering to identify patients with highly similar medical history patterns across the various centers. Our visualization method (PheSpec) depicts the phenotypic profile of clusters, applies a novel filtering of noninformative codes (Ranked Scope Pervasion), and indicates the most distinguishing features.
Results: We observed 114 clinically meaningful profiles, for example, linking prostate hyperplasia with cancer and diabetes with cardiovascular problems and grouping pediatric developmental disorders. Our framework identified disease subsets, exemplified by 6 "other headache" clusters, where phenotypic profiles suggested different underlying mechanisms: migraine, convulsion, injury, eye problems, joint pain, and pituitary gland disorders. Phenotypic patterns replicated well, with high correlations of ≥0.75 to an average of 6 (2-8) of the 12 different cohorts, demonstrating the consistency with which our method discovers disease history profiles.
Discussion: Costly clinical research ventures should be based on solid hypotheses. We repurpose methods from single-cell omics to build these hypotheses from observational EHR data, distilling useful information from complex data.
Conclusion: We establish a generalizable pipeline for the identification and replication of clinically meaningful (sub)phenotypes from widely available high-dimensional billing codes. This approach overcomes datatype problems and produces comprehensive visualizations of validation-ready phenotypes.
- Subjects
RESEARCH; RESEARCH methodology; DIABETES; EVALUATION research; COMPARATIVE studies; RESEARCH funding; CLUSTER analysis (Statistics); PHENOTYPES
- Publication
Journal of the American Medical Informatics Association, 2022, Vol 29, Issue 5, p761
- ISSN
1067-5027
- Publication type
journal article
- DOI
10.1093/jamia/ocac008