- Title
On the Best Way to Cluster NCI-60 Molecules.
- Authors
Hernández-Hernández, Saiveth; Ballester, Pedro J.
- Abstract
Machine learning-based models have been widely used in the early drug-design pipeline. To validate these models, cross-validation strategies have been employed, including those using clustering of molecules in terms of their chemical structures. However, the poor clustering of compounds will compromise such validation, especially on test molecules dissimilar to those in the training set. This study aims at finding the best way to cluster the molecules screened by the National Cancer Institute (NCI)-60 project by comparing hierarchical, Taylor–Butina, and uniform manifold approximation and projection (UMAP) clustering methods. The best-performing algorithm can then be used to generate clusters for model validation strategies. This study also aims at measuring the impact of removing outlier molecules prior to the clustering step. Clustering results are evaluated using three well-known clustering quality metrics. In addition, we compute an average similarity matrix to assess the quality of each cluster. The results show variation in clustering quality from method to method. The clusters obtained by the hierarchical and Taylor–Butina methods are more computationally expensive to use in cross-validation strategies, and both cluster the molecules poorly. In contrast, the UMAP method provides the best quality, and therefore we recommend it to analyze this highly valuable dataset.
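The abstract mentions an average similarity matrix used to assess the quality of each cluster but does not define it. The sketch below shows one plausible reading, under stated assumptions: fingerprints are represented as sets of on-bit indices, similarity is the Tanimoto coefficient, and entry (i, j) is the mean pairwise similarity between members of cluster i and cluster j. The function names and the toy data are hypothetical and are not taken from the article.

```python
from itertools import product


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_similarity_matrix(fingerprints, labels):
    """Return (cluster_ids, matrix); matrix[i][j] is the mean pairwise similarity
    between members of cluster i and cluster j.

    Assumption: diagonal entries skip self-pairs, and a singleton cluster
    is assigned 1.0 on the diagonal.
    """
    cluster_ids = sorted(set(labels))
    members = {
        c: [fp for fp, lab in zip(fingerprints, labels) if lab == c]
        for c in cluster_ids
    }
    matrix = []
    for ci in cluster_ids:
        row = []
        for cj in cluster_ids:
            if ci == cj:
                group = members[ci]
                # Unique unordered pairs within the same cluster.
                sims = [
                    tanimoto(group[p], group[q])
                    for p in range(len(group))
                    for q in range(p + 1, len(group))
                ]
                row.append(sum(sims) / len(sims) if sims else 1.0)
            else:
                # All cross-cluster pairs.
                sims = [tanimoto(x, y) for x, y in product(members[ci], members[cj])]
                row.append(sum(sims) / len(sims))
        matrix.append(row)
    return cluster_ids, matrix


# Toy fingerprints (hypothetical, not NCI-60 data): two tight, well-separated clusters.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}]
labels = [0, 0, 1, 1]
ids, avg = average_similarity_matrix(fps, labels)
```

Under this reading, a good clustering shows high diagonal values (similar molecules within a cluster) and low off-diagonal values (dissimilar molecules across clusters).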
- Subjects
MOLECULES; MODEL validation; CHEMICAL structure; HIERARCHICAL clustering (Cluster analysis); SMALL molecules
- Publication
Biomolecules (2218-273X), 2023, Vol 13, Issue 3, p498
- ISSN
2218-273X
- Publication type
Article
- DOI
10.3390/biom13030498