Amplicon sequencing is widely used to target regions or genes of interest, improving coverage and allowing identification of tissue-specific and lowly expressed transcript isoforms. Recently, this method was successfully coupled with long-read nanopore sequencing to identify 38 novel exons and 241 novel isoforms in CACNA1C [1]. However, a standardised workflow for analysing amplicon sequencing data does not exist. Here, we present a novel workflow, “discoAnt”, which optimises existing alignment, transcript identification and quantification programs for a gene-specific approach.
discoAnt compiles primary genomic alignments generated with minimap2, which are used to construct transcripts with bambu. These transcripts are filtered using in-house scripts based on the positions of the primers designed to target the gene of interest and on abundance values estimated from the merged alignments. The high-confidence transcripts form an improved meta-transcriptome for the gene, which acts as a reference for quantifying transcripts across all samples with salmon.
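The filtering step described above can be illustrated with a minimal sketch. It is not discoAnt's in-house code: the `Transcript` record, the `filter_transcripts` function, the `min_abundance` threshold and the `tolerance` window are all hypothetical names and values chosen for illustration. The sketch assumes a transcript is kept only if it reaches both primer sites (within a small tolerance) and its estimated abundance meets a minimum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record for one candidate transcript."""
    transcript_id: str
    start: int        # leftmost genomic coordinate
    end: int          # rightmost genomic coordinate
    abundance: float  # abundance estimated from the merged alignments


def filter_transcripts(transcripts, fwd_primer, rev_primer,
                       min_abundance=1.0, tolerance=10):
    """Keep transcripts that span both primer sites and are abundant enough.

    fwd_primer and rev_primer are (start, end) genomic intervals, with the
    forward primer upstream of the reverse primer. The thresholds are
    illustrative assumptions, not values taken from discoAnt.
    """
    kept = []
    for tx in transcripts:
        # Transcript must begin at (or within `tolerance` bases of) the
        # start of the forward primer.
        reaches_fwd = tx.start <= fwd_primer[0] + tolerance
        # Transcript must extend to (or within `tolerance` bases of) the
        # end of the reverse primer.
        reaches_rev = tx.end >= rev_primer[1] - tolerance
        if reaches_fwd and reaches_rev and tx.abundance >= min_abundance:
            kept.append(tx)
    return kept


# Toy input: tx2 starts far downstream of the forward primer and tx3 has
# too little support, so only tx1 passes both filters.
candidates = [
    Transcript("tx1", 100, 900, 50.0),
    Transcript("tx2", 300, 900, 50.0),
    Transcript("tx3", 100, 900, 0.2),
]
high_confidence = filter_transcripts(candidates, (100, 120), (880, 900))
print([tx.transcript_id for tx in high_confidence])  # prints ['tx1']
```

Requiring both primer sites is one plausible way to discard truncated or off-target models in amplicon data; the actual criteria and thresholds used by the workflow may differ.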
We used Spike-in RNA Variants (SIRVs) to benchmark StringTie2, FLAIR and bambu, with SIRV5 as the gene of interest. Only bambu identified all five targeted SIRV5 transcripts (compared with four for FLAIR and one for StringTie2). We also observed that one, 255 and six of the transcripts identified by StringTie2, FLAIR and bambu, respectively, were false-positive novel transcripts. discoAnt additionally optimises bambu parameters and applies further filters to reduce the false discovery rate, identify accurate splice sites and generate a high-confidence transcript list that ensures accurate quantification. We tested discoAnt with two more positive controls (SIRV6 and SIRV7) and with GRIA1 to confirm the improvement in transcript identification compared with other programs.
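To put the SIRV5 counts above on a common scale, the following sketch derives precision and recall for each tool. These metrics are not reported in the text; they are computed here under the assumption that the true-positive and false-positive counts quoted above make up each tool's complete set of identified transcripts, and that the five targeted SIRV5 transcripts are the full expected set. The function name `precision_recall` is illustrative.

```python
def precision_recall(true_positives, false_positives, n_expected):
    """Return (precision, recall) from simple benchmark counts."""
    identified = true_positives + false_positives
    precision = true_positives / identified if identified else 0.0
    recall = true_positives / n_expected if n_expected else 0.0
    return precision, recall


# (true positives, false-positive novel transcripts) as quoted in the text.
sirv5_counts = {
    "StringTie2": (1, 1),
    "FLAIR": (4, 255),
    "bambu": (5, 6),
}
N_TARGETED = 5  # targeted SIRV5 transcripts

for tool, (tp, fp) in sirv5_counts.items():
    precision, recall = precision_recall(tp, fp, N_TARGETED)
    print(f"{tool}: precision={precision:.3f}, recall={recall:.2f}")
```

Under these assumptions, bambu recovers every targeted transcript but still reports more false-positive novel transcripts than true positives, which is the gap the additional discoAnt filters are intended to close.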
discoAnt is a fast, reproducible and easy-to-use workflow to identify, filter, quantify and annotate transcripts in long-read amplicon sequencing analyses.