Table S1: Results for dengue virus serotyping by NanoPipe based on "2D", "high-quality" nanopore reads generated from four patients' serum samples. The numbers indicate how many of the nanopore reads aligned to the respective serotype. Reads from both the pass and the fail folder were used. The serotype identified via the LAMP method is given for comparison.
Sample | Identification by LAMP | Pass: ser. 1 | Pass: ser. 2 | Pass: ser. 3 | Pass: ser. 4 | Pass: total | Pass: output | Fail: ser. 1 | Fail: ser. 2 | Fail: ser. 3 | Fail: ser. 4 | Fail: total | Fail: output |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | ser. 1 | 57 | 0 | 0 | 0 | 61 | here | 2027 | 0 | 4 | 1 | 3562 | here |
B | ser. 2 | 0 | 508 | 0 | 0 | 863 | here | 0 | 5550 | 2 | 6 | 7696 | here |
C | ser. 3 | 0 | 0 | 1798 | 0 | 7300 | here | 0 | 0 | 5777 | 0 | 7362 | here |
D | ser. 4 | 0 | 0 | 0 | 389 | 400 | here | 0 | 0 | 0 | 1627 | 2277 | here |
Table S2: Number of potential SNPs detected by NanoPipe in "2D", "high-quality" nanopore reads for Plasmodium falciparum. Known SNPs are those that are also given in the publication by Nair et al. (2014), in the MalariaGEN v.3 dataset (Amato et al., 2015), or in both resources.
Sample | Pass: Detected SNPs | Pass: Known SNPs | Pass: output | Fail: Detected SNPs | Fail: Known SNPs | Fail: output |
---|---|---|---|---|---|---|
A | 29 | 8 | here | 39 | 8 | here |
B | 26 | 12 | here | 58 | 13 | here |
C | 30 | 13 | here | 83 | 18 | here |
Table S3: Parameters used in NanoPipe for the lastal program in the LAST software, and default values for version 587 of lastal.
Parameter | Discovery task: Plasmodium polymorphisms | Discovery task: Dengue virus serotype classification | Discovery task: Provide target file | LAST default parameters |
---|---|---|---|---|
Substitution Matrix | ATMAP | - | - | - |
Match Score (-r) | - | 6 | 6 | 6 |
Mismatch Cost (-q) | - | 12 | 12 | 18 |
Gap Existence Cost (-a) | 12 | 12 | 12 | 21 |
Gap Extension Cost (-b) | 3 | 4 | 4 | 9 |
Insertion Existence Cost (-A) | 15 | 15 | 15 | (a) |
Query Letters per Random Alignment (-D) | 1.00E+07 | 1.00E+06 | 1.00E+06 | 1.00E+06 |
Maximum Initial Matches per Query Position (-m) | 100 | 100 | 100 | 10 |
Output Type (-j) | 4 | 4 | 4 | 3 |
Input Format (-Q) | 1 | 1 | 1 | - |
Table S4: Target sequences used for P. falciparum sequence analysis and their chromosome positions, strand, length and GC content.
target name | gene name | chromosome | start | end | strand | length | GC % |
---|---|---|---|---|---|---|---|
apocytochrome_b | apocytochrome_b | M76611 | 3438 | 4680 | + | 1243 | 27.35 |
PfATPase6_1 | PfATPase6 | Pf3D7_01_v3 | 267134 | 269239 | - | 2106 | 26.54 |
PfATPase6_2 | PfATPase6 | Pf3D7_01_v3 | 264641 | 267470 | + | 2830 | 23.36 |
pfmrp1_1 | pfmrp1 | Pf3D7_01_v3 | 464622 | 467289 | + | 2668 | 21.51 |
pfmrp1_2 | pfmrp1 | Pf3D7_01_v3 | 466960 | 470216 | - | 3257 | 22.6 |
DHFR-TS | DHFR-TS | Pf3D7_04_v3 | 747923 | 749956 | + | 2034 | 24.14 |
PfTCTP | PfTCTP | Pf3D7_05_v3 | 467406 | 468316 | - | 911 | 20.86 |
pfmdr1 | pfmdr1 | Pf3D7_05_v3 | 957756 | 962218 | + | 4463 | 24.22 |
PfCRT_1 | PfCRT | Pf3D7_07_v3 | 403089 | 404828 | + | 1740 | 19.31 |
PfCRT_2 | PfCRT | Pf3D7_07_v3 | 404757 | 406466 | - | 1710 | 19.36 |
DHPS | DHPS | Pf3D7_08_v3 | 548039 | 550780 | + | 2742 | 21.33 |
ABC_transporter | ABC_transporter | Pf3D7_08_v3 | 670708 | 675573 | - | 4866 | 19.77 |
K13-propeller | K13-propeller | Pf3D7_13_v3 | 1724572 | 1727035 | - | 2464 | 24.92 |
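The length column in Table S4 is consistent with inclusive start and end coordinates (for example, 4680 - 3438 + 1 = 1243 for apocytochrome_b). The following Python sketch shows how the two derived columns could be recomputed. The function names are hypothetical, and because the source does not state how the GC % values were obtained, ignoring non-ACGT letters is an assumption.

```python
def target_length(start: int, end: int) -> int:
    """Length of a target, assuming inclusive start and end coordinates."""
    return end - start + 1


def gc_percent(sequence: str) -> float:
    """Percentage of G and C among the A/C/G/T letters of a sequence.

    Assumption: letters other than A, C, G and T are ignored.
    """
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return round(100.0 * gc / acgt, 2)
```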
The user can select one of three analysis tasks ("Plasmodium polymorphisms", "Dengue virus serotype classification" or "Provide target file"). Depending on the task chosen, the alignment parameters for LAST are adapted (see section 2.2 and Table S3) and the option to upload a "target file" becomes available.
NanoPipe accepts fastq and fast5 files as input, which can be provided as a single file or as a compressed directory (as a .zip or .tar.gz archive). Files in fast5 format are converted into fastq format using poretools (Loman & Quinlan, 2014) with this command:
poretools fastq --type 2D --high-quality input.fast5
The option --type 2D restricts the output to reads that Metrichor labels as "two directions", i.e. reads for which the MinION™ has read both the "template" and the "complement" strand. With the option --high-quality, only reads whose complement strand has more events than the template strand are kept, i.e. reads for which the complete sequence of the complement strand is available.
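The conversion step could be wrapped in Python roughly as follows. This is a sketch, not NanoPipe's actual code: the function names are hypothetical, and it assumes that poretools is on the PATH and writes the fastq records to standard output.

```python
import subprocess
from typing import List


def poretools_fastq_command(fast5_path: str) -> List[str]:
    """Build the poretools call given in the text for one fast5 input."""
    return ["poretools", "fastq", "--type", "2D", "--high-quality", fast5_path]


def convert_fast5(fast5_path: str, fastq_path: str) -> None:
    """Run poretools and redirect its standard output into fastq_path."""
    with open(fastq_path, "w") as out:
        subprocess.run(poretools_fastq_command(fast5_path), stdout=out, check=True)
```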
If the task "Provide target file" is selected, the user has to upload a file with target sequences in plain text fasta format. NanoPipe then runs the program lastdb from the LAST aligner software package (Kiełbasa et al., 2011), without specifying any additional parameters, to prepare these target sequences for the subsequent alignment to the query files.
For the other tasks ("Plasmodium polymorphisms" and "Dengue virus serotype classification") the prepared target sequences are stored in the NanoPipe software's data folder.
The alignment of the nanopore reads to the target sequences is performed with the program lastal from the LAST aligner software package. The command used is:
lastal parameter lastdb_path input.fastq
where parameter stands for the parameters which are set specifically for the analysis task chosen, lastdb_path stands for the target sequences that were prepared using the lastdb program and input.fastq is the query file specified by the user. The resulting alignment in maf format is then used as an input to the last-split program, without giving any additional options.
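As an illustration, the alignment step for the "Dengue virus serotype classification" task could be assembled as below, using the values listed in Table S3. The function names are hypothetical, and piping lastal directly into last-split is an assumption; the text only states that the maf output serves as the input to last-split.

```python
import subprocess
from typing import List, Sequence

# Dengue serotyping options as listed in Table S3.
DENGUE_LASTAL_OPTIONS = [
    "-r", "6", "-q", "12", "-a", "12", "-b", "4", "-A", "15",
    "-D", "1000000", "-m", "100", "-j", "4", "-Q", "1",
]


def lastal_command(options: Sequence[str], lastdb_path: str, fastq_path: str) -> List[str]:
    """Build 'lastal parameter lastdb_path input.fastq' as an argument list."""
    return ["lastal", *options, lastdb_path, fastq_path]


def align_and_split(options: Sequence[str], lastdb_path: str,
                    fastq_path: str, maf_path: str) -> None:
    """Pipe lastal output through last-split (no extra options) into maf_path."""
    lastal = subprocess.Popen(
        lastal_command(options, lastdb_path, fastq_path), stdout=subprocess.PIPE
    )
    with open(maf_path, "w") as out:
        subprocess.run(["last-split"], stdin=lastal.stdout, stdout=out, check=True)
    lastal.stdout.close()
    if lastal.wait() != 0:
        raise subprocess.CalledProcessError(lastal.returncode, "lastal")
```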
The analysis of the alignments is performed with Python and R scripts and divided into several steps:
The output of last-split is converted into a tabular format for easier subsequent analyses. This tabular format is filtered, so that for each nanopore read only the alignment with the highest "bit score" is kept.
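The filtering step can be sketched as follows. The record layout (read name, target name, score) is an assumption, since the columns of NanoPipe's tabular format are not specified here; on ties, this sketch keeps the first alignment encountered.

```python
from typing import Dict, Iterable, List, NamedTuple


class Alignment(NamedTuple):
    # Hypothetical minimal record; the real table likely has more columns.
    read_name: str
    target_name: str
    score: float


def best_alignment_per_read(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep, for each read, only the alignment with the highest score."""
    best: Dict[str, Alignment] = {}
    for aln in alignments:
        current = best.get(aln.read_name)
        if current is None or aln.score > current.score:
            best[aln.read_name] = aln
    return list(best.values())
```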
Using this data, a Python script then counts the nucleotides in the aligned nanopore reads at each target sequence position. This results in a tabular file that has as many rows as there are nucleotides in the target sequences, where each row gives the count of each nucleotide at that position.
Afterwards, an R script calls the consensus sequence for each target sequence by finding the most common nucleotide at each target sequence position. If there is no single most common nucleotide, the IUPAC nucleotide ambiguity notation is used. A consensus nucleotide is only given if at least 10 nanopore reads align to this position in the target sequence. If there are fewer than 10 nanopore reads or there is no single most common nucleotide, "N" is given as the consensus.
SNPs are called similarly, by filtering the dataset for positions in the alignment that are covered by at least 10 nanopore reads and at which the consensus nucleotide is not identical to the original target sequence nucleotide.
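A minimal sketch of such a counting step is given below. It assumes that each alignment is available as a tuple of target name, 0-based target start, and the two gapped alignment rows (as in maf); this input layout and the function name are assumptions, not NanoPipe's actual interface. Insertions in the read are skipped because they have no target position, and deletions are counted as "-".

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Tuple

# (target name, 0-based start on target, gapped target row, gapped read row)
AlignedPair = Tuple[str, int, str, str]


def count_nucleotides(
    alignments: Iterable[AlignedPair],
) -> DefaultDict[Tuple[str, int], Counter]:
    """Count read letters observed at every (target, position)."""
    counts: DefaultDict[Tuple[str, int], Counter] = defaultdict(Counter)
    for target_name, target_start, target_row, read_row in alignments:
        pos = target_start
        for target_letter, read_letter in zip(target_row, read_row):
            if target_letter == "-":
                continue  # insertion in the read: no target position to count at
            counts[(target_name, pos)][read_letter.upper()] += 1
            pos += 1
    return counts
```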
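The consensus rule can be illustrated in Python (the original step is an R script, so this is a re-expression, not the authors' code). The text mentions both an IUPAC ambiguity code and "N" for ties; this sketch assumes that ties are reported with the ambiguity code and that depth is the number of A/C/G/T observations at the position. The function name is hypothetical.

```python
from typing import Mapping

# IUPAC ambiguity codes for sets of two or more nucleotides.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_consensus(counts: Mapping[str, int], min_depth: int = 10) -> str:
    """Return the consensus letter for one target position."""
    depth = sum(counts.get(base, 0) for base in "ACGT")
    if depth < min_depth:
        return "N"  # too few reads to call a consensus
    top = max(counts.get(base, 0) for base in "ACGT")
    tied = frozenset(base for base in "ACGT" if counts.get(base, 0) == top)
    if len(tied) == 1:
        return next(iter(tied))
    return IUPAC[tied]  # assumption: ties are reported as an ambiguity code
```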
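The SNP filter described here could look like the following sketch. It assumes that the target sequence, the consensus and the per-position depth are available as equally long sequences and reports 1-based positions; both the function name and these conventions are assumptions.

```python
from typing import List, Sequence, Tuple


def call_snps(
    reference: str,
    consensus: str,
    depths: Sequence[int],
    min_depth: int = 10,
) -> List[Tuple[int, str, str]]:
    """Return (position, target letter, consensus letter) for each candidate SNP."""
    snps = []
    for position, (ref, cons, depth) in enumerate(
        zip(reference.upper(), consensus.upper(), depths), start=1
    ):
        # Keep positions with enough reads whose consensus differs from the target.
        if depth >= min_depth and cons != ref:
            snps.append((position, ref, cons))
    return snps
```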
NanoPipe identifies those P. falciparum SNPs in the query sequences that also appear in the MalariaGEN dataset (Amato et al., 2015) or in the publication by Nair et al. (2014). The NanoPipe software's data folder contains a tabular file with all SNPs from these sources that fall within the analyzed target sequences. This file is intersected with the polymorphisms identified in the query sequences, which produces a new table listing all "known SNPs" found in the query sequences.
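The intersection can be sketched as a simple set lookup. The key used to match SNPs (target name and position) is an assumption, since the columns of the known-SNP file are not described here; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple

Snp = Tuple[str, int]  # assumed key: (target name, position on target)


def known_snps(detected: Iterable[Snp], known: Iterable[Snp]) -> List[Snp]:
    """Return the detected SNPs that also occur in the known-SNP table."""
    known_set = set(known)
    return [snp for snp in detected if snp in known_set]
```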
To obtain the number of nanopore reads that aligned to each target sequence, a short shell command is run on the filtered alignment file (see section 2.3.1); it counts how often each target sequence's name appears in the file.
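The same count can be expressed in Python instead of shell. The sketch assumes a tab-separated file with one alignment per line and the target name in a known column; the column index and function name are hypothetical.

```python
from collections import Counter
from typing import Iterable


def reads_per_target(lines: Iterable[str], target_column: int = 0) -> Counter:
    """Count how often each target name occurs in tab-separated lines."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        counts[fields[target_column]] += 1
    return counts
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `reads_per_target(open("filtered.tsv"))` with a file name of your choosing.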
The lengths of the nanopore reads are plotted with R. In addition, "logo plots" are generated with R to give an overview of the alignment and the sequencing depth. In the logo plots, each position on the x axis represents one nucleotide in the target sequence. The aligning nanopore reads are displayed on the y axis, with each nucleotide shown in one color (red for "A", green for "C", yellow for "G", blue for "T" and grey for a gap). Thus, if the bar at a given target position is colored completely red, all nanopore reads aligning at this position contain an "A". The target sequence is shown color-coded in the blocks at the bottom, and the consensus is shown directly above the target sequence.
To show the results on the NanoPipe website, a Python script generates the HTML code for the results display.
Amato, R., Miotto, O., Woodrow, C., Almagro-Garcia, J., Sinha, I., Campino, S., … Kwiatkowski, D. P. (2015). Genomic epidemiology of the current wave of artemisinin resistant malaria. doi:10.1101/019737
Kiełbasa, S. M., Wan, R., Sato, K., Horton, P., & Frith, M. C. (2011). Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3), 487–93. doi:10.1101/gr.113985.110
Loman, N. J., & Quinlan, A. R. (2014). Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics (Oxford, England), 30(23), 3399–3401. doi:10.1093/bioinformatics/btu555
Nair, S., Nkhoma, S. C., Serre, D., Zimmerman, P. A., Gorena, K., Daniel, B. J., … Cheeseman, I. H. (2014). Single-cell genomics for dissection of complex malaria infections. Genome Research, 24(6), 1028–38. doi:10.1101/gr.168286.113