- Title
ProsmORF-pred: a machine learning-based method for the identification of small ORFs in prokaryotic genomes.
- Authors
Khanduja, Akshay; Kumar, Manish; Mohanty, Debasisa
- Abstract
Small open reading frames (smORFs) encoding proteins shorter than 100 amino acids (aa) are known to be important regulators of key cellular processes. However, their computational identification remains a challenge. Based on a comprehensive analysis of known prokaryotic small ORFs, we have developed the ProsmORF-pred resource, which uses a machine learning (ML)-based method to predict smORFs in prokaryotic genome sequences. ProsmORF-pred consists of two ML models: one recognizes initiation sites in the nucleic acid sequences upstream of putative start codons, and the other uses translated amino acid sequences to identify functional protein-like sequences. The nucleotide sequence-based initiation site recognition model has been trained using longer ORFs (>100 aa) in the same genome, while the ML model for identification of protein-like sequences has been trained using annotated smORFs from Escherichia coli. Comprehensive benchmarking of ProsmORF-pred reveals that its performance is comparable to other state-of-the-art approaches on the annotated smORF set derived from 32 prokaryotic genomes. Its performance is distinctly superior to that of tools such as PRODIGAL and RANSEPS for prediction of newly identified smORFs in the 10–30 aa length range, where prediction of smORFs has been a major challenge. Apart from identifying smORFs in genomic sequences, ProsmORF-pred can also aid in functional annotation of the predicted smORFs based on sequence similarity and genomic neighbourhood similarity searches in ProsmORFDB, a well-curated database of known smORFs. ProsmORF-pred, along with its backend database ProsmORFDB, is available as a user-friendly web server (http://www.nii.ac.in/prosmorfpred.html).
- Subjects
PROKARYOTIC genomes; MACHINE learning; INTERNET servers; AMINO acid sequence; PROTEOMICS; DATABASES
- Publication
Briefings in Bioinformatics, 2023, Vol 24, Issue 3, p1
- ISSN
1467-5463
- Publication type
Article
- DOI
10.1093/bib/bbad101
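
The abstract describes two models that operate on candidate ORFs: one looks at the nucleotides upstream of a putative start codon, the other at the translated sequence. As a rough illustration of the kind of input such models would need, the sketch below enumerates candidate smORFs on both strands and keeps the upstream window for each. It is not the ProsmORF-pred implementation. The start codon set, the 30-nucleotide upstream window, the 10 to 99 aa default length bounds, and all function and field names are assumptions made for this example.

```python
# Illustrative sketch only: candidate smORF enumeration.
# This is NOT the ProsmORF-pred code; codon sets, window size and names are assumed.
from typing import Iterator, NamedTuple

START_CODONS = {"ATG", "GTG", "TTG"}  # assumed common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Candidate(NamedTuple):
    strand: str     # "+" or "-"
    start: int      # 0-based start codon index on the scanned strand
    length_aa: int  # codons from the start codon up to, not including, the stop
    upstream: str   # nucleotides immediately 5' of the start codon
    orf: str        # nucleotides from the start codon through the stop codon


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq: str, strand: str, min_aa: int, max_aa: int,
                upstream_len: int) -> Iterator[Candidate]:
    """Yield every start-to-first-in-frame-stop ORF within the length bounds."""
    n = len(seq)
    for i in range(n - 2):
        if seq[i:i + 3] not in START_CODONS:
            continue
        # Walk codon by codon in the frame of this start until the first stop.
        for j in range(i + 3, n - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                length_aa = (j - i) // 3
                if min_aa <= length_aa <= max_aa:
                    upstream = seq[max(0, i - upstream_len):i]
                    yield Candidate(strand, i, length_aa, upstream, seq[i:j + 3])
                break


def find_candidates(genome: str, min_aa: int = 10, max_aa: int = 99,
                    upstream_len: int = 30) -> Iterator[Candidate]:
    """Enumerate candidate smORFs on both strands of a nucleotide sequence."""
    seq = genome.upper()
    yield from scan_strand(seq, "+", min_aa, max_aa, upstream_len)
    yield from scan_strand(reverse_complement(seq), "-", min_aa, max_aa, upstream_len)


if __name__ == "__main__":
    toy = "AAAAAAATGGCAGCAGCAGCATAAAAAA"  # made-up sequence, not real genomic data
    for cand in find_candidates(toy, min_aa=3, upstream_len=4):
        print(cand)
```

Every in-frame start codon yields its own candidate, so nested candidates sharing one stop codon are all reported. Choosing among them is exactly the kind of decision an initiation site model would be expected to make; a downstream classifier could then score the `upstream` window and the translated `orf` separately.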