Nanopore · Plotting

Per base sequence content and quality

I wrote a script to produce QC plots analogous to the “Per base sequence quality” and ”Per base sequence content” from FastQC for Nanopore sequencing data. Since there is no fixed read length this representation is less straightforward, so it’s necessary to make a selection of the data. My script starts from a fastq file, which I parse using Biopython SeqIO. I start with making a histogram of the read lengths, to select the bin containing the most reads to represent the data of the entire run. This is probably not the most perfect approach and I welcome suggestions.

Below are the modules used in the script and some minimal argparse code for working conveniently with command line arguments. The rest of the functions are organised by the main()function. The full code of the script is at the bottom of the post.

I start with looping over the fastq file, using a list comprehension to get the length of all reads in a numpy array.

Based on these lengths a histogram is made using matplotlib. Matplotlib produces a histogram by passing the data to numpy.histogram, which I also use to return the number of reads per bin and the bin edges. Using a different algorithm for creating bins would result in more or less bins, and would yield different results in the downstream functions. The histogram is shown below. The index of the bin containing the most reads (about 4800) is found with np.argmax(), which is then used to get the edges of this bin.

LengthHistogram.png

In the following function, I loop a second time over the fastq file, this time to extract information of the reads belonging to the lucky bin. Using a list comprehension both the sequence and integer quality scores are returned in a tuple per read.

The read information is separately passed to the next functions, in the first the nucleotide diversity or fractions per position are plotted. The python base function zip() creates a generator in which the first nucleotides of each reads are combined in a new list, e.g. [[A,C,G,T],[C,G,A,A],[G,C,C,T]] becomes [[A,C,G],[C,G,C],[G,A,C], [T,A,T]], and as such per position the fraction per position can be calculated.  The zip() function will stop when the shortest element (or read) is exhausted, and therefore all plots are limited to the shortest read length in this bin, which is just less than 800 nucleotides (and therefore rather short for Nanopore sequencing data). The unpack/splat (*) operator is required to unpack the list fqlists in the separate lists.

The resulting plot is shown below. It’s clear that the first nucleotides are non-random, suggesting a bias in the fragmentation or base calling. Another striking observation is the overrepresentation of thymidine bases in the entire read, which is unlikely.

PerBaseSequenceContent

Generating a plot with the average quality per position is done using a similar function.

The result of this is shown below, showing a uniform quality score for most of the read but a remarkable low quality at the start of the reads. This seems like a good suggestion for cropping a part of the read, about 30-40 nucleotides.

PerBaseSequenceQuality

The full code of this script is shown below:

Advertisements

One thought on “Per base sequence content and quality

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s