Institute of Bioinformatics Münster
Supplementary Materials

Table S1: Results for dengue virus serotyping by NanoPipe based on "2D", "high-quality" nanopore reads generated from serum samples of four patients. The numbers indicate how many nanopore reads aligned to each serotype. Reads from both the pass and the fail folder were used. The serotype identified by the LAMP method is also given.

                                Pass folder                                    Fail folder
Sample  Identification by LAMP  ser. 1  ser. 2  ser. 3  ser. 4  total  output  ser. 1  ser. 2  ser. 3  ser. 4  total  output
A       ser. 1                      57       0       0       0     61  here      2027       0       4       1   3562  here
B       ser. 2                       0     508       0       0    863  here         0    5550       2       6   7696  here
C       ser. 3                       0       0    1798       0   7300  here         0       0    5777       0   7362  here
D       ser. 4                       0       0       0     389    400  here         0       0       0    1627   2277  here

Table S2: Number of potential SNPs detected by NanoPipe in "2D", "high-quality" nanopore reads for Plasmodium falciparum. Known SNPs are those that are also listed in the publication by Nair et al. (2014), in the MalariaGEN v.3 dataset (Amato et al., 2015), or in both resources.

        Pass folder                        Fail folder
Sample  Detected SNPs  Known SNPs  output  Detected SNPs  Known SNPs  output
A       29             8           here    39             8           here
B       26             12          here    58             13          here
C       30             13          here    83             18          here

Table S3: Parameters used in NanoPipe for the lastal program in the LAST software, and default values for version 587 of lastal.

                                                 Discovery Task
Parameter                                        Plasmodium polymorphisms  Dengue virus serotype classification  Provide target file  LAST default parameters
Substitution Matrix                              ATMAP                     -                                     -                    -
Match Score (-r)                                 -                         6                                     6                    6
Mismatch Cost (-q)                                                         12                                    12                   18
Gap Existence Cost (-a)                          12                        12                                    12                   21
Gap Extension Cost (-b)                          3                         4                                     4                    9
Insertion Existence Cost (-A)                    15                        15                                    15                   (a)
Query Letters per Random Alignment (-D)          1.00E+07                  1.00E+06                              1.00E+06             1.00E+06
Maximum Initial Matches per Query Position (-m)  100                       100                                   100                  10
Output Type (-j)                                 4                         4                                     4                    3
Input Format (-Q)                                1                         1                                     1                    -

Table S4: Target sequences used for P. falciparum sequence analysis and their chromosome positions, strand, length and GC content.

target name      gene name        chromosome   start    end      strand  length  GC %
apocytochrome_b  apocytochrome_b  M76611       3438     4680     +       1243    27.35
PfATPase6_1      PfATPase6        Pf3D7_01_v3  267134   269239   -       2106    26.54
PfATPase6_2      PfATPase6        Pf3D7_01_v3  264641   267470   +       2830    23.36
pfmrp1_1         pfmrp1           Pf3D7_01_v3  464622   467289   +       2668    21.51
pfmrp1_2         pfmrp1           Pf3D7_01_v3  466960   470216   -       3257    22.6
DHFR-TS          DHFR-TS          Pf3D7_04_v3  747923   749956   +       2034    24.14
PfTCTP           PfTCTP           Pf3D7_05_v3  467406   468316   -       911     20.86
pfmdr1           pfmdr1           Pf3D7_05_v3  957756   962218   +       4463    24.22
PfCRT_1          PfCRT            Pf3D7_07_v3  403089   404828   +       1740    19.31
PfCRT_2          PfCRT            Pf3D7_07_v3  404757   406466   -       1710    19.36
DHPS             DHPS             Pf3D7_08_v3  548039   550780   +       2742    21.33
ABC_transporter  ABC_transporter  Pf3D7_08_v3  670708   675573   -       4866    19.77
K13-propeller    K13-propeller    Pf3D7_13_v3  1724572  1727035  -       2464    24.92

Details on NanoPipe

Data uploading and conversion

The user can select one of three analysis tasks ("Plasmodium polymorphisms", "Dengue virus serotype classification" or "Provide target file"). Depending on the task chosen, the alignment parameters for LAST are adapted (see section 2.2 and Table S3) and the option to upload a "target file" becomes available.

NanoPipe accepts fastq and fast5¹ files as input, which can be provided as a single file or as a compressed directory (.zip or .tar.gz archive). Files in fast5 format are converted into fastq format using poretools (Loman & Quinlan, 2014) with this command:

poretools fastq --type 2D --high-quality input.fast5

The option --type 2D specifies that only reads labelled by Metrichor as "two directions" are used; these are the reads for which the MinION™ has read both the "template" and the "complement" strand. With the option --high-quality, only those reads are kept for which the complement strand has more events than the template strand, i.e. for which the complete sequence of the complement strand is available.

If the task "Provide target file" is selected, the user has to upload a file with target sequences in plain-text fasta format. NanoPipe then runs the program lastdb from the LAST aligner software package (Kiełbasa et al., 2011), without specifying any additional parameters, to prepare these target sequences for the subsequent alignment of the query files.

For the other tasks ("Plasmodium polymorphisms" and "Dengue virus serotype classification") the prepared target sequences are stored in the NanoPipe software's data folder.

Alignment of nanopore reads against the target sequences

The alignment of the nanopore reads to the target sequences is performed with the program lastal from the LAST aligner software package. The command used is:

lastal parameter lastdb_path input.fastq

where parameter stands for the parameters which are set specifically for the analysis task chosen, lastdb_path stands for the target sequences that were prepared using the lastdb program and input.fastq is the query file specified by the user. The resulting alignment in maf format is then used as an input to the last-split program, without giving any additional options.
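As an illustration of how the placeholder "parameter" might be filled in, the following minimal Python sketch assembles the command line for the "Dengue virus serotype classification" task from the values in Table S3. The function name, the file names and the use of a shell pipe between lastal and last-split are assumptions made for this example; the text only states that the maf output is passed to last-split.

```python
import shlex

# Values for the "Dengue virus serotype classification" task, copied from
# Table S3 (-D is printed there as 1.00E+06).
DENGUE_PARAMS = [
    ("-r", "6"),        # match score
    ("-q", "12"),       # mismatch cost
    ("-a", "12"),       # gap existence cost
    ("-b", "4"),        # gap extension cost
    ("-A", "15"),       # insertion existence cost
    ("-D", "1000000"),  # query letters per random alignment
    ("-m", "100"),      # maximum initial matches per query position
    ("-j", "4"),        # output type
    ("-Q", "1"),        # input format
]


def build_alignment_command(params, lastdb_path, fastq_path):
    """Return a shell command string (hypothetical helper, not NanoPipe code)."""
    lastal_cmd = ["lastal"]
    for flag, value in params:
        lastal_cmd += [flag, value]
    lastal_cmd += [lastdb_path, fastq_path]
    quoted = " ".join(shlex.quote(part) for part in lastal_cmd)
    # Assumption: the maf output is piped directly into last-split.
    return quoted + " | last-split"


print(build_alignment_command(DENGUE_PARAMS, "target_db", "input.fastq"))
```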

Alignment analysis

The analysis of the alignments is performed with Python and R scripts and divided into several steps:

Alignment file conversion and filtering

The output of last-split is converted into a tabular format for easier subsequent analyses. This tabular format is filtered, so that for each nanopore read only the alignment with the highest "bit score" is kept.
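The filtering step can be sketched as follows. The column names "read", "target" and "score" are hypothetical, since the layout of the tabular file is not specified here, and ties are resolved by keeping the first alignment seen, which is also an assumption.

```python
def keep_best_alignment(rows):
    """Keep, for each read, only the row with the highest bit score."""
    best = {}
    for row in rows:
        current = best.get(row["read"])
        # Strict comparison: on a tie the earlier row is kept (assumption).
        if current is None or row["score"] > current["score"]:
            best[row["read"]] = row
    return list(best.values())


rows = [
    {"read": "read_1", "target": "pfmdr1", "score": 310},
    {"read": "read_1", "target": "DHPS", "score": 120},
    {"read": "read_2", "target": "DHPS", "score": 95},
]
for row in keep_best_alignment(rows):
    print(row["read"], row["target"], row["score"])
```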

Counting of aligned nucleotides in nanopore reads

Using these data, a Python script then counts the nucleotides in the aligned nanopore reads at each target sequence position. This results in a tabular file with one row per nucleotide of the target sequences, where each row gives the count of each nucleotide observed at that position.
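A minimal sketch of such a counting step is given below. It assumes that each alignment is available as a tuple of target name, zero-based start position on the target, and the two gapped alignment strings (as in maf records); this representation and the function name are illustrative and not taken from NanoPipe.

```python
from collections import Counter, defaultdict


def count_nucleotides(alignments):
    """Count read symbols (A, C, G, T or '-') at each target position."""
    counts = defaultdict(Counter)
    for target, start, target_aln, read_aln in alignments:
        pos = start
        for t_char, r_char in zip(target_aln, read_aln):
            if t_char == "-":
                # Insertion in the read: no target position to assign it to.
                continue
            counts[(target, pos)][r_char.upper()] += 1
            pos += 1
    return counts


alignments = [
    ("PfCRT_1", 0, "AC-GT", "ACTG-"),
    ("PfCRT_1", 1, "CGT", "CGA"),
]
counts = count_nucleotides(alignments)
for key in sorted(counts):
    print(key, dict(counts[key]))
```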

Consensus calling

Afterwards, an R script calls the consensus sequence for each target sequence by finding the most common nucleotide at each target sequence position. If there is no single most common nucleotide, the IUPAC nucleotide ambiguity notation is used. A consensus nucleotide is only given if at least 10 nanopore reads align to this position in the target sequence. If there are fewer than 10 nanopore reads or there is no single most common nucleotide, "N" is given as the consensus.
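The consensus rule can be illustrated with the short Python sketch below (the pipeline itself uses an R script). The text mentions both IUPAC ambiguity codes and "N" for positions without a single most common nucleotide; this sketch assumes that ties are reported with IUPAC codes, that ties involving a gap are reported as "N", and that gaps count towards the read depth. All three choices are assumptions.

```python
# IUPAC ambiguity codes for sets of two or more nucleotides.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_consensus(counts, min_depth=10):
    """Return the consensus symbol for one target position."""
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too few aligned reads at this position
    top = max(counts.values())
    winners = frozenset(base for base, n in counts.items() if n == top)
    if len(winners) == 1:
        return next(iter(winners))
    # Tie: use the ambiguity code, or "N" if the tie involves a gap.
    return IUPAC.get(winners, "N")


print(call_consensus({"A": 12, "C": 1}))
print(call_consensus({"A": 6, "G": 6}))
print(call_consensus({"A": 4}))
```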

SNP calling

SNPs are called similarly, by filtering the dataset for positions in the alignment that are covered by at least 10 nanopore reads and at which the consensus nucleotide differs from the nucleotide of the original target sequence.
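Expressed as code, this filter might look like the sketch below. The inputs (target sequence, consensus sequence and per-position read depth as parallel sequences) and the one-based positions in the output are assumptions chosen for the example.

```python
def call_snps(target_seq, consensus_seq, depths, min_depth=10):
    """Return (position, target nucleotide, consensus nucleotide) tuples."""
    snps = []
    columns = zip(target_seq, consensus_seq, depths)
    for pos, (ref, cons, depth) in enumerate(columns, start=1):
        # A position is reported only with sufficient depth and a
        # consensus that differs from the target sequence.
        if depth >= min_depth and cons != ref:
            snps.append((pos, ref, cons))
    return snps


print(call_snps("ACGT", "ACTN", [12, 12, 15, 3]))
```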

Intersecting called SNPs with SNPs from literature ("known SNPs")

NanoPipe identifies those P. falciparum SNPs in the query sequences that also appear in the MalariaGEN dataset (Amato et al., 2015) or in the publication by Nair et al. (2014). The NanoPipe software's data folder contains a tabular file with all SNPs from these sources that fall within the analyzed target sequences. This file is intersected with the polymorphisms identified in the query sequences, which generates a new table listing all "known SNPs" found in the query sequences.
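The intersection itself is a simple set operation, as in the sketch below. Identifying a SNP by target name, position and alternative nucleotide is an assumption, and the entries shown are made-up placeholders rather than real SNPs from either resource.

```python
def intersect_with_known(called_snps, known_snps):
    """Return the called SNPs that also occur in the list of known SNPs."""
    known = set(known_snps)
    return [snp for snp in called_snps if snp in known]


# Placeholder tuples: (target name, position, alternative nucleotide).
called = [("target_x", 100, "T"), ("target_x", 250, "G"), ("target_y", 7, "A")]
known = [("target_x", 250, "G"), ("target_y", 7, "C")]
print(intersect_with_known(called, known))
```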

Calculation of number of nanopore reads and read lengths per target

To get the number of nanopore reads that aligned to each target sequence, a short shell command is used on the filtered alignment file (see section 2.3.1) that counts how often each target sequence's name appears in the file.
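NanoPipe uses a shell command for this count; an equivalent in Python, shown only for illustration, is a frequency count over the target column of the filtered alignment table. The column name "target" is hypothetical.

```python
from collections import Counter


def reads_per_target(rows):
    """Count how many filtered alignments (one per read) hit each target."""
    return Counter(row["target"] for row in rows)


filtered = [
    {"read": "read_1", "target": "pfmdr1"},
    {"read": "read_2", "target": "DHPS"},
    {"read": "read_3", "target": "pfmdr1"},
]
for target, n in sorted(reads_per_target(filtered).items()):
    print(target, n)
```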

Results display

The lengths of the nanopore reads are plotted with R. In addition, "logo plots" that give an overview of the alignment and the sequencing depth are generated with R. In the logo plots, each position on the x axis represents one nucleotide in the target sequence. The aligning nanopore reads are displayed on the y axis, with each nucleotide shown in one color (red for "A", green for "C", yellow for "G", blue for "T" and grey for a gap). Thus, if the bar at a given target position is colored completely red, all nanopore reads aligning to this position contain an "A". The target sequence is shown color-coded in the blocks at the bottom, and the consensus is shown directly above the target sequence.

To show the results on the NanoPipe website, a Python script generates the HTML code for the results display.

References

Amato, R., Miotto, O., Woodrow, C., Almagro-Garcia, J., Sinha, I., Campino, S., … Kwiatkowski, D. P. (2015). Genomic epidemiology of the current wave of artemisinin resistant malaria. doi:10.1101/019737

Kiełbasa, S. M., Wan, R., Sato, K., Horton, P., & Frith, M. C. (2011). Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3), 487–93. doi:10.1101/gr.113985.110

Loman, N. J., & Quinlan, A. R. (2014). Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics (Oxford, England), 30(23), 3399–3401. doi:10.1093/bioinformatics/btu555

Nair, S., Nkhoma, S. C., Serre, D., Zimmerman, P. A., Gorena, K., Daniel, B. J., … Cheeseman, I. H. (2014). Single-cell genomics for dissection of complex malaria infections. Genome Research, 24(6), 1028–38. doi:10.1101/gr.168286.113


  1. fast5 is an HDF-based file format used by Oxford Nanopore Technologies for the reads sequenced by the MinION™

2015-11-04 11:33