We found a match
Your institution may have access to this item. Find your institution then sign in to continue.
- Title
kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers.
- Authors
Jiang, Peng; Luo, Jie; Wang, Yiqi; Deng, Pingji; Schmidt, Bertil; Tang, Xiangjun; Chen, Ningjiang; Wong, Limsoon; Zhao, Liang
- Abstract
Motivation K -mers along with their frequency have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc. attracting intensive studies in k -mer counting. However, the output of k -mer counters itself is large; very often, it is too large to fit into main memory, leading to highly narrowed usability. Results We introduce a novel idea of encoding k -mers as well as their frequency, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure to encode counted k -mers by coupled-bit arrays—one for k -mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 as compared with raw input, with 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. Availability and implementation The source codes of our algorithm are available at github.com/lzhLab/kmcEx. Supplementary information Supplementary data are available at Bioinformatics online.
- Subjects
INTERNET servers; SEQUENCE alignment; ERROR correction (Information theory); MAGNITUDE (Mathematics); DATA structures
- Publication
Bioinformatics, 2019, Vol 35, Issue 23, p4871
- ISSN
1367-4803
- Publication type
Article
- DOI
10.1093/bioinformatics/btz299