- Title
Merging short and stranded long reads improves transcript assembly.
- Authors
Kainth, Amoldeep S.; Haddad, Gabriela A.; Hall, Johnathon M.; Ruthenburg, Alexander J.
- Abstract
Long-read RNA sequencing has arisen as a counterpart to short-read sequencing, with the potential to capture full-length isoforms, albeit at the cost of lower depth. Yet this potential is not fully realized due to inherent limitations of current long-read assembly methods and underdeveloped approaches to integrate short-read data. Here, we critically compare the existing methods and develop a new integrative approach to characterize a particularly challenging pool of low-abundance long noncoding RNA (lncRNA) transcripts from short- and long-read sequencing in two distinct cell lines. Our analysis reveals severe limitations in each of the sequencing platforms. For short-read assemblies, coverage declines at transcript termini resulting in ambiguous ends, and uneven low coverage results in segmentation of a single transcript into multiple transcripts. Conversely, long-read sequencing libraries lack depth and strand-of-origin information in cDNA-based methods, culminating in erroneous assembly and quantitation of transcripts. We also discover a cDNA synthesis artifact in long-read datasets that markedly impacts the identity and quantitation of assembled transcripts. Towards remediating these problems, we develop a computational pipeline to "strand" long-read cDNA libraries that rectifies inaccurate mapping and assembly of long-read transcripts. Leveraging the strengths of each platform and our computational stranding, we also present and benchmark a hybrid assembly approach that drastically increases the sensitivity and accuracy of full-length transcript assembly on the correct strand and improves detection of biological features of the transcriptome. When applied to a challenging set of under-annotated and cell-type variable lncRNA, our method resolves the segmentation problem of short-read sequencing and the depth problem of long-read sequencing, resulting in the assembly of coherent transcripts with precise 5' and 3' ends. 
Our workflow can be applied to existing datasets for superior demarcation of transcript ends and refined isoform structure, which can enable better differential gene expression analyses and molecular manipulations of transcripts.
Author summary: The study of transcriptomes is pertinent to development, disease and response to stimuli. Transcriptomes have largely been studied via short-read RNA-sequencing. Despite its wide use, this technology falls short of illuminating coherent transcripts, particularly for poorly annotated or low-coverage regions. The advent of long-read sequencing has enabled improved analyses of genomes and transcriptomes; however, a systematic direct comparison of short- and long-read RNA-seq and their seamless amalgamation has not been done yet. Here, we demonstrate that short-read RNA-seq provides higher depth for transcript quantitation whereas long-read RNA-seq provides better qualitative information. We report a widespread cDNA synthesis artifact that can markedly impact transcript assembly and quantitation. We develop a computational pipeline to infer strand-of-origin in the long-read cDNA libraries and demonstrate that its application markedly rectifies the erroneous assembly of transcripts. Combining the stranded long reads with the short reads in our new hybrid pipeline, TASSEL, leads to substantial improvement in the assembly of transcripts. This pipeline can be applied to a wide range of datasets, enabling improved characterization for downstream experimentation.
- Subjects
LINCRNA; GENE expression; RNA sequencing; NUCLEOTIDE sequencing; GENOMES; TRANSCRIPTOMES
- Publication
PLoS Computational Biology, 2023, Vol 19, Issue 10, p1
- ISSN
1553-734X
- Publication type
Article
- DOI
10.1371/journal.pcbi.1011576