We found a match
Your institution may have access to this item. Find your institution then sign in to continue.
- Title
CTD: An information-theoretic algorithm to interpret sets of metabolomic and transcriptomic perturbations in the context of graphical models.
- Authors
Thistlethwaite, Lillian R.; Petrosyan, Varduhi; Li, Xiqi; Miller, Marcus J.; Elsea, Sarah H.; Milosavljevic, Aleksandar
- Abstract
We consider the following general family of algorithmic problems that arises in transcriptomics, metabolomics and other fields: given a weighted graph G and a subset of its nodes S, find subsets of S that show significant connectedness within G. A specific solution to this problem may be defined by devising a scoring function, the Maximum Clique problem being a classic example, where S includes all nodes in G and where the score is defined by the size of the largest subset of S fully connected within G. Major practical obstacles for the plethora of algorithms addressing this type of problem include computational efficiency and, particularly for more complex scores which take edge weights into account, the computational cost of permutation testing, a statistical procedure required to obtain a bound on the p-value for a connectedness score. To address these problems, we developed CTD "Connect the Dots", a fast algorithm based on data compression that detects highly connected subsets within S. CTD provides information-theoretic upper bounds on p-values when S contains a small fraction of nodes in G without requiring computationally costly permutation testing. We apply the CTD algorithm to interpret multi-metabolite perturbations due to inborn errors of metabolism and multi-transcript perturbations associated with breast cancer in the context of disease-specific Gaussian Markov Random Field networks learned directly from respective molecular profiling data. Author summary: A frequently encountered "omic" analysis problem is to identify a subset of nodes within a weighted graph G that is both highly connected in G and belongs to S, a subset of nodes in G. For example, G may represent a biological pathway, kinetic network model, biological interaction network, or a network learned directly from data, where edges represent co-variation relationships between abundances of molecular variables. S may be the set of molecular variables that are perturbed in an individual case or in a set of disease cases relative to controls. In this work, we develop a novel information-theoretic formulation of this problem and a local search algorithm that obviate the need for computationally costly permutation testing, a statistical procedure which is typically required to establish rigorous p-value bounds for other scoring-based methods.
- Subjects
GAUSSIAN Markov random fields; WEIGHTED graphs; INBORN errors of metabolism; ALGORITHMS; DATA compression; METABOLOMICS; BIOLOGICAL networks
- Publication
PLoS Computational Biology, 2021, Vol 17, Issue 1, p1
- ISSN
1553-734X
- Publication type
Article
- DOI
10.1371/journal.pcbi.1008550