Institute of Bioinformatics Münster

Supplementary Materials

Table S1: Results for dengue virus serotyping by NanoPipe based on "2D", "high-quality" nanopore reads generated from four patients' serum samples. The numbers indicate how many of the nanopore reads aligned to the respective serotype. Reads from both the pass and the fail folder were used. The serotype identified via the LAMP method is given.

Sample	Identification by LAMP	Pass folder							Fail folder
		ser. 1	ser. 2	ser. 3	ser. 4	total	output	ser. 1	ser. 2	ser. 3	ser. 4	total	output
A	ser. 1	57	0	0	0	61	here	2027	0	4	1	3562	here
B	ser. 2	0	508	0	0	863	here	0	5550	2	6	7696	here
C	ser. 3	0	0	1798	0	7300	here	0	0	5777	0	7362	here
D	ser. 4	0	0	0	389	400	here	0	0	0	1627	2277	here

Table S2: Number of potential SNPs detected by NanoPipe in "2D", "high-quality" nanopore reads for Plasmodium falciparum. Known SNPs are those that are also given in the publication by Nair et al. (2014), in the MalariaGEN v.3 dataset (Amato et al., 2015), or in both resources.

Sample	Pass folder				Fail folder
	Detected SNPs	Known SNPs	output	Detected SNPs	Known SNPs	output
A	29	8	here	39	8	here
B	26	12	here	58	13	here
C	30	13	here	83	18	here

Table S3: Parameters used in NanoPipe for the lastal program in the LAST software, and default values for version 587 of lastal.

	Discovery Task
	Plasmodium polymorphisms	Dengue virus serotype classification	Provide target file	LAST default parameters
Substitution Matrix	ATMAP	-	-	-
Match Score (-r)	-	6	6	6
Mismatch Cost (-q)	-	12	12	18
Gap Existence Cost (-a)	12	12	12	21
Gap Extension Cost (-b)	3	4	4	9
Insertion Existence Cost (-A)	15	15	15	(a)
Query Letters per Random Alignment (-D)	1.00E+07	1.00E+06	1.00E+06	1.00E+06
Maximum Initial Matches per Query Position (-m)	100	100	100	10
Output Type (-j)	4	4	4	3
Input Format (-Q)	1	1	1	-

Table S4: Target sequences used for P. falciparum sequence analysis and their chromosome positions, strand, length and GC content.

target name	gene name	chromosome	start	end	strand	length	GC %
apocytochrome_b	apocytochrome_b	M76611	3438	4680	+	1243	27.35
PfATPase6_1	PfATPase6	Pf3D7_01_v3	267134	269239	-	2106	26.54
PfATPase6_2	PfATPase6	Pf3D7_01_v3	264641	267470	+	2830	23.36
pfmrp1_1	pfmrp1	Pf3D7_01_v3	464622	467289	+	2668	21.51
pfmrp1_2	pfmrp1	Pf3D7_01_v3	466960	470216	-	3257	22.6
DHFR-TS	DHFR-TS	Pf3D7_04_v3	747923	749956	+	2034	24.14
PfTCTP	PfTCTP	Pf3D7_05_v3	467406	468316	-	911	20.86
pfmdr1	pfmdr1	Pf3D7_05_v3	957756	962218	+	4463	24.22
PfCRT_1	PfCRT	Pf3D7_07_v3	403089	404828	+	1740	19.31
PfCRT_2	PfCRT	Pf3D7_07_v3	404757	406466	-	1710	19.36
DHPS	DHPS	Pf3D7_08_v3	548039	550780	+	2742	21.33
ABC_transporter	ABC_transporter	Pf3D7_08_v3	670708	675573	-	4866	19.77
K13-propeller	K13-propeller	Pf3D7_13_v3	1724572	1727035	-	2464	24.92

Details on NanoPipe

Data uploading and conversion

The user can select one of three analysis tasks ("Plasmodium polymorphisms", "Dengue virus serotype classification" or "Provide target file"). Depending on the task chosen, the alignment parameters for LAST are adapted (see section 2.2 and Table S3) and the option to upload a "target file" becomes available.

NanoPipe accepts fastq and fast5¹ files as input, which can be provided as a single file or as a compressed directory (.zip or .tar.gz archive). Files in fast5 format are converted into fastq format using poretools (Loman & Quinlan, 2014) with the following command:

poretools fastq --type 2D --high-quality input.fast5

The option --type 2D specifies that only reads labelled by Metrichor as "twodirections" are used; these are the reads for which the MinION^TM has read both the "template" and the "complement" strand. With the option --high-quality, only those reads are kept for which the complement strand has more events than the template strand, i.e. for which the complete sequence of the complement strand is available.

If the task "Provide target file" is selected, the user has to upload a file with target sequences in plain text fasta format. NanoPipe then runs the program lastdb from the LAST aligner software package (Kiełbasa et al., 2011), without specifying any additional parameters, to prepare these target sequences for the subsequent alignment of the query files.

For the other tasks ("Plasmodium polymorphisms" and "Dengue virus serotype classification") the prepared target sequences are stored in the NanoPipe software's data folder.

Alignment of nanopore reads against the target sequences

The alignment of the nanopore reads to the target sequences is performed with the program lastal from the LAST aligner software package. The command used is:

lastal parameter lastdb_path input.fastq

where parameter stands for the parameters which are set specifically for the analysis task chosen, lastdb_path stands for the target sequences that were prepared using the lastdb program and input.fastq is the query file specified by the user. The resulting alignment in maf format is then used as an input to the last-split program, without giving any additional options.
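As an illustration of how the task-specific parameters could be turned into a lastal call, the following Python sketch transcribes the parameter sets from Table S3 and assembles the command line. It is not NanoPipe's actual code: the dictionary keys, the function names, and the use of the lastal option -p to select the ATMAP substitution matrix (Table S3 does not name the flag) are assumptions.

```python
import subprocess

# Parameter sets transcribed from Table S3; keys are lastal flags.
# Assumption: the ATMAP substitution matrix is selected with "-p".
LASTAL_PARAMS = {
    "plasmodium": {"-p": "ATMAP", "-a": 12, "-b": 3, "-A": 15,
                   "-D": 10000000, "-m": 100, "-j": 4, "-Q": 1},
    "dengue": {"-r": 6, "-q": 12, "-a": 12, "-b": 4, "-A": 15,
               "-D": 1000000, "-m": 100, "-j": 4, "-Q": 1},
}
# Table S3 lists the same values for "Provide target file" as for dengue.
LASTAL_PARAMS["target_file"] = dict(LASTAL_PARAMS["dengue"])


def lastal_command(task, lastdb_path, fastq_path):
    """Build the lastal argument list for one analysis task."""
    cmd = ["lastal"]
    for flag, value in LASTAL_PARAMS[task].items():
        cmd += [flag, str(value)]
    return cmd + [lastdb_path, fastq_path]


def align(task, lastdb_path, fastq_path, out_path):
    """Run lastal and pipe its MAF output through last-split (no options)."""
    with open(out_path, "w") as out:
        lastal = subprocess.Popen(
            lastal_command(task, lastdb_path, fastq_path),
            stdout=subprocess.PIPE,
        )
        subprocess.run(["last-split"], stdin=lastal.stdout, stdout=out, check=True)
        lastal.stdout.close()
        if lastal.wait() != 0:
            raise RuntimeError("lastal failed")
```

Calling align() requires an installed LAST package; lastal_command() alone can be used to inspect the command that would be run.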

Alignment analysis

The analysis of the alignments is performed with Python and R scripts and divided into several steps:

Alignment file conversion and filtering

The output of last-split is converted into a tabular format for easier subsequent analyses. This tabular format is filtered, so that for each nanopore read only the alignment with the highest "bit score" is kept.
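The filtering step can be sketched in Python as follows. The row layout (dictionaries with the keys 'read', 'target' and 'score') is a hypothetical stand-in for NanoPipe's tabular format, and keeping the first alignment when scores tie is an assumption.

```python
def best_alignment_per_read(rows):
    """Keep, for each read, only the alignment with the highest score.

    rows: iterable of dicts with the hypothetical keys 'read', 'target'
    and 'score'. On equal scores the alignment seen first is kept.
    """
    best = {}
    for row in rows:
        current = best.get(row["read"])
        if current is None or row["score"] > current["score"]:
            best[row["read"]] = row
    return list(best.values())
```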

Counting of aligned nucleotides in nanopore reads

Using these data, a Python script then counts the nucleotides in the aligned nanopore reads at each target sequence position. This results in a tabular file with one row per nucleotide in the target sequences, where each row gives the number of times each nucleotide was observed.
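A minimal sketch of such a counting step is given below. It assumes that each alignment is available as a target name, a 0-based target start and the two gapped alignment strings (as in MAF), that read insertions relative to the target are skipped, and that read gaps are counted as "-". These choices are assumptions and are not taken from the NanoPipe source.

```python
from collections import Counter, defaultdict


def count_nucleotides(alignments):
    """Count read symbols observed at each target position.

    alignments: iterable of (target_name, target_start, target_aln, read_aln),
    with a 0-based target_start and '-' marking gaps in both strings.
    Returns a dict mapping (target_name, position) to a Counter.
    """
    counts = defaultdict(Counter)
    for name, start, target_aln, read_aln in alignments:
        pos = start
        for t, r in zip(target_aln, read_aln):
            if t == "-":
                # Insertion in the read: there is no target position to count at.
                continue
            counts[(name, pos)][r.upper()] += 1
            pos += 1
    return counts
```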

Consensus calling

Afterwards, an R script calls the consensus sequence for each target sequence by finding the most common nucleotide at each target sequence position. If there is no single most common nucleotide, the IUPAC nucleotide ambiguity notation is used. A consensus nucleotide is only given if at least 10 nanopore reads align to this position in the target sequence. If there are fewer than 10 nanopore reads or there is no single most common nucleotide, "N" is given as the consensus.
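The rule for a single position can be sketched as follows. The sketch is written in Python rather than R and reflects one possible reading of the description: tied nucleotides are reported with the matching IUPAC ambiguity code, and "N" is returned when fewer than 10 reads cover the position. Ignoring gap counts when choosing the nucleotide while still counting them towards the depth is an assumption.

```python
# IUPAC codes for every non-empty set of nucleotides.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_base(counts, min_depth=10):
    """Return the consensus symbol for one target position.

    counts: mapping from symbol ('A', 'C', 'G', 'T', optionally '-') to
    the number of reads showing it at this position.
    """
    depth = sum(counts.values())  # assumption: gaps count towards depth
    if depth < min_depth:
        return "N"
    bases = {b: n for b, n in counts.items() if b in {"A", "C", "G", "T"}}
    top = max(bases.values(), default=0)
    if top == 0:
        return "N"  # only gaps were observed
    tied = frozenset(b for b, n in bases.items() if n == top)
    return IUPAC[tied]
```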

SNP calling

SNPs are called similarly, by filtering the dataset for alignment positions that are covered by at least 10 nanopore reads and at which the consensus nucleotide differs from the original target sequence nucleotide.
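Expressed as code, this filter might look like the sketch below. The function name, the 1-based positions in the output and the per-position depth list are illustrative assumptions.

```python
def call_snps(reference, consensus, depth, min_depth=10):
    """List positions where the consensus differs from the target sequence.

    reference, consensus: strings of equal length; depth: number of reads
    covering each position. Returns (1-based position, reference, consensus).
    """
    snps = []
    for i, (ref, cons, d) in enumerate(zip(reference, consensus, depth)):
        if d >= min_depth and cons.upper() != ref.upper():
            snps.append((i + 1, ref, cons))
    return snps
```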

Intersecting called SNPs with SNPs from literature ("known SNPs")

NanoPipe identifies those P. falciparum SNPs in the query sequences that also appear in the MalariaGEN dataset (Amato et al., 2015), in the publication by Nair et al. (2014), or in both. The NanoPipe software's data folder contains a tabular file with all SNPs from these sources that fall within the analyzed target sequences. This file is intersected with the polymorphisms identified in the query sequences, which yields a new table of all "known SNPs" found in the query sequences.
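The intersection can be illustrated with the following sketch. Matching on the pair of target name and position, and the dictionary keys 'target' and 'position', are assumptions; the actual file layout used by NanoPipe is not described here.

```python
def intersect_with_known(called, known):
    """Return the called SNPs whose site also occurs in the known-SNP table.

    called, known: lists of dicts with the hypothetical keys 'target'
    and 'position'; any further keys are carried through unchanged.
    """
    known_sites = {(k["target"], k["position"]) for k in known}
    return [s for s in called if (s["target"], s["position"]) in known_sites]
```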

Calculation of number of nanopore reads and read lengths per target

To get the number of nanopore reads that aligned to each target sequence, a short shell command counts how often each target sequence's name appears in the filtered alignment file (see section 2.3.1).
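NanoPipe uses a shell command for this count; an equivalent Python sketch, assuming one row per read with a hypothetical 'target' key, is:

```python
from collections import Counter


def reads_per_target(filtered_rows):
    """Count how many filtered alignments (one per read) hit each target."""
    return Counter(row["target"] for row in filtered_rows)
```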

Results display

The lengths of the nanopore reads are plotted with R. Furthermore, "logo plots" that give an overview of the alignment and the sequencing depth are also generated with R. In the logo plots, each position on the x axis represents one nucleotide in the target sequence. The aligning nanopore reads are displayed on the y axis, with each nucleotide shown in one color (red for "A", green for "C", yellow for "G", blue for "T" and grey for a gap). Thus, if the bar at a given target position is colored completely red, all nanopore reads aligning at this position contain an "A". The target sequence is shown color-coded in the blocks at the bottom, and the consensus is shown directly above the target sequence.

To show the results on the NanoPipe website, a Python script generates the HTML code for the results display.

References

Amato, R., Miotto, O., Woodrow, C., Almagro-Garcia, J., Sinha, I., Campino, S., … Kwiatkowski, D. P. (2015). Genomic epidemiology of the current wave of artemisinin resistant malaria. doi:10.1101/019737

Kiełbasa, S. M., Wan, R., Sato, K., Horton, P., & Frith, M. C. (2011). Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3), 487–93. doi:10.1101/gr.113985.110

Loman, N. J., & Quinlan, A. R. (2014). Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics (Oxford, England), 30(23), 3399–3401. doi:10.1093/bioinformatics/btu555

Nair, S., Nkhoma, S. C., Serre, D., Zimmerman, P. A., Gorena, K., Daniel, B. J., … Cheeseman, I. H. (2014). Single-cell genomics for dissection of complex malaria infections. Genome Research, 24(6), 1028–38. doi:10.1101/gr.168286.113

¹ fast5 is an HDF-based file format used by Oxford Nanopore Technologies for the reads sequenced by the MinION^TM.

2015-11-04 11:33
