A relevant parameter when looking at sequencing and alignment quality is the edit distance to the reference genome, roughly equivalent to the accuracy of the reads. (Roughly, because we ignore true variants between the sample and reference). The edit distance or Levenshtein distance can be defined as the number of single letter (nucleotide) changes that have to be made to one string (read) for it to be equal to another string (reference genome). Since the error profile of Nanopore sequencing is dominated by insertions and deletions, the edit distance isn’t just the number of single nucleotide mismatches.
This technology results in a wide range of read lengths, therefore I scale the edit distance to the length of the aligned fragment. Longer reads shouldn’t get penalised more than shorter reads. It’s important to take the alignment length and not the read length since the ends of reads can be clipped substantially by the aligner, sometimes tens of kb. The experiments below were done using MinION data from the human genome in a bottle sample NA12878.
My scripts are in Python, so I’ll add some code snippets to this post. For parsing bam files and extracting the relevant bits of information, I use pysam which is pretty convenient and well-documented. Those snippets are not the full script, but just a minimal example to get the job done. A full example of the code is at the bottom of the post.
Getting this information from an alignment done using bwa mem is trivial since bwa sets the NM bam tag which is an integer with precisely what I need: the edit distance. A function to extract the NM tag using a list comprehension is added below.
For aligners such as GraphMap this is less trivial since the NM tag is not set. However, another tag comes to the rescue: the MD tag, which stores a string containing matching numbers of nucleotides and non-matching nucleotides. Interesting information about the MD tag can be found here. I found it a quite tough representation of the read to wrap my head around, which resulted in some wrong interpretations.
My first naive implementation counted the number of matching nucleotides (the integers in the MD string) and subtracted those matches from the total alignment length to get the number of mismatched nucleotides. The list comprehension is quite long, but essentially I split the MD string on all occurrences of A, C, T, G or ^ (indicating a deletion) and sum the obtained integers. This sum is subtracted from the aligned read length and divided by the same.
As a sanity check I looped over a bwa mem aligned bam file to extract both the scaled NM tag and scaled MD-derived edit distance and plot those against each other, for which you can see the code below followed by the images. Between gathering the information from the bam file in lists and plotting the data I convert my lists to a numpy array and create a pandas DataFrame (see bottom of post). Since the first plot is rather overcrowded I also added a kernel density estimation to get a better idea of the density of the dots.
The Pearson correlation coefficient is not too bad, but obviously, we want to get exactly the same as the NM tag. It’s clear that my implementation of getting the edit distance from the MD tag returns an underestimation of the edit distance. After thinking a while I decided to create a question on biostars where Santosh Anand joined my quest. He suggested counting the mismatches in the MD string, rather than the matches, which I implemented below. So this time I split the MD string on all numbers and sum the length of the mismatched nucleotides. The plots obtained by this approach are shown below the code.
That’s an improvement, but we’re not yet there. At this point Santosh asked for a few examples of reads which showed a clear deviation in NM and MD-derived edit distance. I wrote a function to do just that, which prints out:
- The NM tag
- The MD tag derived edit distance
- The MD tag
- The CIGAR string
This is an example of a read showing a clear mismatch between the NM tag and the MD-derived edit distance.
The hero on my quest then found that we were still missing the insertions, which are not present in the MD tag but can only be counted by parsing the CIGAR string. I copied the image below from his post.
So my successful implementation also parses the CIGAR string to get the insertions and add those to the mismatches. In my final code piece below I show the full code of my evaluation of the edit distances, including some multiprocessing code to process chromosomes in parallel and converting the lists to numpy arrays in a pandas DataFrame.
The kernel density estimate plot crashes on an exact correlation, so this time you only get the scatter plot. Problem solved and everyone lived happily ever after.
Big thanks to Santosh Anand!