- Title
EPGA: de novo assembly using the distributions of reads and insert size.
- Authors
Junwei Luo; Jianxin Wang; Zhen Zhang; Fang-Xiang Wu; Min Li; Yi Pan
- Abstract
Motivation: In genome assembly, the primary issue is how to determine upstream and downstream sequence regions of sequence seeds for constructing long contigs or scaffolds. When extending one sequence seed, repetitive regions in the genome always cause multiple feasible extension candidates which increase the difficulty of genome assembly. The universally accepted solution is choosing one based on read overlaps and paired-end (mate-pair) reads. However, this solution faces difficulties with regard to some complex repetitive regions. In addition, sequencing errors may produce false repetitive regions and uneven sequencing depth leads some sequence regions to have too few or too many reads. All the aforementioned problems prohibit existing assemblers from getting satisfactory assembly results.
Results: In this article, we develop an algorithm, called extract paths for genome assembly (EPGA), which extracts paths from De Bruijn graph for genome assembly. EPGA uses a new score function to evaluate extension candidates based on the distributions of reads and insert size. The distribution of reads can solve problems caused by sequencing errors and short repetitive regions. Through assessing the variation of the distribution of insert size, EPGA can solve problems introduced by some complex repetitive regions. For solving uneven sequencing depth, EPGA uses relative mapping to evaluate extension candidates. On real datasets, we compare the performance of EPGA and other popular assemblers. The experimental results demonstrate that EPGA can effectively obtain longer and more accurate contigs and scaffolds.
- Subjects
GENOMICS; SEQUENCE analysis; TISSUE scaffolds; COMPUTER algorithms; COMPUTATIONAL complexity
- Publication
Bioinformatics, 2015, Vol 31, Issue 6, p825
- ISSN
1367-4803
- Publication type
Article
- DOI
10.1093/bioinformatics/btu762
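- Illustration
The abstract notes that repetitive regions give a sequence seed several feasible extension candidates. The toy sketch below shows that effect on a small De Bruijn graph. It is not EPGA's implementation and does not include its score function; the function names `build_de_bruijn` and `extension_candidates`, the two reads, and the choice k = 4 are all assumptions made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extension_candidates(graph, seed, k):
    """Return the bases that could extend the seed by one position."""
    node = seed[-(k - 1):]  # last (k-1)-mer of the seed
    return sorted(successor[-1] for successor in graph.get(node, ()))


# Hypothetical reads that share the repeat "GTTG" but diverge after it.
reads = ["ACGTTGCA", "GGGTTGAC"]
graph = build_de_bruijn(reads, k=4)

# Inside the shared repeat the seed has two candidates; before it, only one.
print(extension_candidates(graph, "ACGTTG", k=4))  # ['A', 'C']
print(extension_candidates(graph, "ACGT", k=4))    # ['T']
```

An assembler has to choose among such candidates. According to the abstract, EPGA does so by scoring them with the distributions of reads and insert size, which this sketch leaves out.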