For my NanoPlot tool, I have been calculating the average basecall quality of a read by simply taking the arithmetic mean of its Phred scores. I recently also added an option to generate plots based on the sequencing_summary.txt file generated by the albacore basecaller, which completely avoids parsing the fastq file and calculating the mean. But as you can see below, this doesn't yield an equivalent result when plotting the log-transformed read length versus average read quality. The first plot shows the quality as calculated from the fastq file; the second plot was generated using the albacore summary. This difference in average quality was also observed by Ola Wallerman.
Forrest Brennen from Oxford Nanopore Technologies explained what’s going wrong on the community forum:
What you’d like the mean_qscore for a read to represent is the mean error rate for that read, but this means you can’t just do a simple arithmetic mean of all the qscores, because it won’t be a representation of the mean error rate then.
For example, say you have three bases, with error rates of 1%, 10%, and 50% (and qscores of 20, 10, and ~3 respectively).
The arithmetic mean of the qscores is 33 / 3 = 11.
The arithmetic mean of the error rates is 0.61 / 3 ~= 0.2.
The qscore for the arithmetic mean of the error rates is -10 * log10(0.2) = 6.99.
If you take the qscore for each base, convert it back to an error probability, take the mean of those, and then convert that mean error back into a qscore, you should get something very close to what albacore provides.
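To make the distinction concrete, here is a minimal Python sketch of the arithmetic in that example (the printed values differ slightly from the rounded numbers above, since the qscore for a 50% error rate is ~3.01 rather than exactly 3):

```python
import math

# Error rates from the example above: 1%, 10% and 50%
errors = [0.01, 0.10, 0.50]

# The corresponding qscores: -10 * log10(error), i.e. ~[20, 10, 3]
qscores = [-10 * math.log10(e) for e in errors]

# Arithmetic mean of the qscores: ~11
print(sum(qscores) / len(qscores))

# Qscore of the mean error rate: ~6.92 (6.99 when 0.61/3 is rounded to 0.2)
print(-10 * math.log10(sum(errors) / len(errors)))
```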
Looks like I need to adapt my script! Initially, my function for averaging quality scores (in the nanomath module) was simply the following:
```python
def aveQual(quals):
    '''
    Receive the integer quality scores of a read
    and return the average quality for that read.
    '''
    return sum(quals) / len(quals)
```
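As a quick sanity check, applying this naive mean to the three example qscores from the forum post reproduces the arithmetic mean calculated there:

```python
print(aveQual([20, 10, 3]))  # 11.0, the arithmetic mean of the qscores
```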
Adapting it to the function below resulted in the plot shown after the code, which indeed shrinks the quality scores but still doesn't entirely match the albacore average basecall quality.
```python
import math

def aveQual(quals):
    '''
    Receive the integer quality scores of a read and return the average quality
    for that read: first convert the Phred scores to error probabilities,
    calculate the average error probability, and convert that average back
    to the Phred scale.
    '''
    return -10 * math.log(sum([10**(q / -10) for q in quals]) / len(quals), 10)
```
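Running the corrected function on the same three example qscores now gives a value close to the 6.99 from the forum post (the small difference comes from q = 3 being a rounded stand-in for a 50% error rate):

```python
print(aveQual([20, 10, 3]))  # ~6.91, close to the 6.99 calculated above
```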
With this change in average read quality scores, I should revisit my conclusion on the accuracy of the basecall quality scores. Now, for aligned reads, the peak average quality is about 13, which theoretically corresponds to an accuracy of 95%. In the quality scores from the albacore summary above, the peak is just below 12, which would correspond to ~93% accuracy. Those numbers are getting closer to the ~88% obtained by calculating the percent identity from the alignment.
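For reference, the quality-to-accuracy conversion behind these numbers follows directly from the Phred definition; a small hypothetical helper (phred_to_accuracy is not part of nanomath) makes it explicit:

```python
def phred_to_accuracy(q):
    '''Hypothetical helper: expected basecall accuracy for a Phred score q.'''
    return 1 - 10 ** (-q / 10)

print(phred_to_accuracy(13))  # ~0.950
print(phred_to_accuracy(12))  # ~0.937
```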