
Getting the edit distance from a bam alignment: a journey

A relevant parameter when looking at sequencing and alignment quality is the edit distance to the reference genome, roughly equivalent to the accuracy of the reads. (Roughly, because we ignore true variants between the sample and the reference.) The edit distance or Levenshtein distance is the minimum number of single-letter (nucleotide) edits, i.e. substitutions, insertions and deletions, that have to be made to one string (the read) for it to be equal to another string (the reference genome). Since the error profile of Nanopore sequencing is dominated by insertions and deletions, the edit distance isn't just the number of single nucleotide mismatches.
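Purely to make that definition concrete, here is a naive dynamic programming sketch of the Levenshtein distance; it is not used anywhere in the analysis below, since the aligner has already done this work for us.

def levenshtein(read, reference):
    '''
    Minimum number of substitutions, insertions and deletions
    needed to turn the read into the reference (naive dynamic programming).
    '''
    previous = list(range(len(reference) + 1))
    for i, readbase in enumerate(read, start=1):
        current = [i]
        for j, refbase in enumerate(reference, start=1):
            current.append(min(previous[j] + 1,      # extra base in the read (insertion)
                               current[j - 1] + 1,   # missing base in the read (deletion)
                               previous[j - 1] + (readbase != refbase)))  # match or substitution
        previous = current
    return previous[-1]


print(levenshtein("GATTTACA", "GATTACA"))  # 1: one extra T in the read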

This technology results in a wide range of read lengths, so I scale the edit distance to the length of the aligned fragment: longer reads shouldn't get penalised more than shorter reads. It's important to take the alignment length and not the read length, since the ends of reads can be clipped substantially by the aligner, sometimes by tens of kb. The experiments below were done using MinION data from the human Genome in a Bottle sample NA12878.

My scripts are in Python, so I'll add some code snippets to this post. For parsing bam files and extracting the relevant bits of information, I use pysam, which is pretty convenient and well-documented. These snippets are not the full script, but minimal examples to get the job done. A full example of the code is at the bottom of the post.
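As an aside on the read length versus alignment length distinction above, pysam exposes both on every record; this minimal sketch (alignment.bam is just a placeholder file name) prints the two values per read. Soft-clipped bases are included in query_length but not in query_alignment_length.

import pysam

samfile = pysam.AlignmentFile("alignment.bam", "rb")  # placeholder file name
for read in samfile.fetch():
    print(read.query_name,
          read.query_length,            # length of the read sequence stored in the bam, soft clips included
          read.query_alignment_length)  # length of the aligned part only, clipped ends excluded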

Getting this information from an alignment done using bwa mem is trivial, since bwa sets the NM bam tag, an integer containing precisely what I need: the edit distance. A function to extract the NM tag using a list comprehension is added below.


import pysam


def extractNMFromBam(bam):
    '''
    loop over a bam file and get the edit distance to the reference genome
    stored in the NM tag
    scale by aligned read length
    '''
    samfile = pysam.AlignmentFile(bam, "rb")
    return [read.get_tag("NM") / read.query_alignment_length for read in samfile.fetch()]


For aligners such as GraphMap this is less trivial, since the NM tag is not set. However, another tag comes to the rescue: the MD tag, which stores a string encoding the runs of matching nucleotides, the mismatched nucleotides and the deleted reference bases. Interesting information about the MD tag can be found here. I found it quite a tough representation to wrap my head around, which resulted in some wrong interpretations.
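To make that representation a bit more concrete, here is a minimal sketch that splits a made-up MD string into its components (decodeMD is just a helper for this post, not part of pysam):

import re


def decodeMD(md):
    '''
    Split an MD string into its components: runs of matching bases (numbers),
    mismatched reference bases (single letters) and deletions (prefixed with ^).
    '''
    return re.findall(r'(\d+|\^[ACGTN]+|[ACGTN])', md)


# 10 matches, a mismatch where the reference has an A, 5 matches,
# a deletion of the reference bases AC and 6 more matches:
print(decodeMD("10A5^AC6"))  # ['10', 'A', '5', '^AC', '6']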

My first naive implementation counted the number of matching nucleotides (the integers in the MD string) and subtracted those matches from the total alignment length to get the number of mismatched nucleotides. The list comprehension is quite long, but essentially I split the MD string on all occurrences of A, C, T, G or ^ (indicating a deletion) and sum the obtained integers. This sum is subtracted from the aligned read length and divided by the same.


import pysam
import re


def extractMDFromBam(bam):
    '''
    loop over a bam file and get the edit distance to the reference genome
    matches are counted from the MD tag and subtracted from the aligned read length
    scale by aligned read length
    '''
    samfile = pysam.AlignmentFile(bam, "rb")
    return [(read.query_alignment_length - sum([int(item) for item in re.split('[ACTG^]', read.get_tag("MD")) if not item == ''])) / read.query_alignment_length
            for read in samfile.fetch()]

As a sanity check, I looped over a bwa mem aligned bam file to extract both the scaled NM tag and the scaled MD-derived edit distance and plotted those against each other; the code is below, followed by the images. Between gathering the information from the bam file in lists and plotting the data, I convert the lists to numpy arrays and create a pandas DataFrame (see the bottom of the post). Since the scatter plot is rather overcrowded, I also added a kernel density estimate to get a better idea of the density of the dots.
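For completeness, this is roughly how the lists end up in the DataFrame used by the plotting function: a minimal sketch reusing extractNMFromBam and extractMDFromBam from the snippets above; the full multiprocessing version is at the bottom of the post.

import numpy as np
import pandas as pd


def buildDataFrame(bam):
    '''Combine both edit distance definitions in a pandas DataFrame, one row per read.'''
    datadf = pd.DataFrame()
    datadf["editDistancesNM"] = np.array(extractNMFromBam(bam))
    datadf["editDistancesMD"] = np.array(extractMDFromBam(bam))
    return datadf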


import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats


def makePlot(datadf):
    plot = sns.jointplot(
        x='editDistancesNM',
        y='editDistancesMD',
        data=datadf,
        kind="scatter",
        color="#4CB391",
        stat_func=stats.pearsonr,
        space=0,
        joint_kws={"s": 1},
        size=10)
    plot.savefig('EditDistancesCompared_scatter.png', format='png', dpi=1000)
    plot = sns.jointplot(
        x='editDistancesNM',
        y='editDistancesMD',
        data=datadf,
        kind="kde",
        color="#4CB391",
        stat_func=stats.pearsonr,
        space=0,
        size=10)
    plot.savefig('EditDistancesCompared_kde.png', format='png', dpi=1000)

The Pearson correlation coefficient is not too bad, but obviously we want to reproduce the NM tag exactly. It's clear that my implementation of getting the edit distance from the MD tag underestimates the edit distance. After thinking about it for a while, I created a question on Biostars, where Santosh Anand joined my quest. He suggested counting the mismatches in the MD string rather than the matches, which I implemented below: this time I split the MD string on all numbers and the ^ character, and sum the lengths of the remaining stretches of mismatched and deleted nucleotides. The plots obtained by this approach are shown below the code.


import pysam
import re


def extractMDFromBam(bam):
    '''
    loop over a bam file and get the edit distance to the reference genome
    mismatches are stored in the MD tag
    scale by aligned read length
    '''
    samfile = pysam.AlignmentFile(bam, "rb")
    return [sum([len(item) for item in re.split('[0-9^]', read.get_tag("MD"))]) / read.query_alignment_length
            for read in samfile.fetch()]

That's an improvement, but we're not there yet. At this point Santosh asked for a few examples of reads which showed a clear deviation between the NM and MD-derived edit distances. I wrote a function to do just that, which prints out:

  • The NM tag
  • The MD tag derived edit distance
  • The MD tag
  • The CIGAR string


import re
import pysam


def extractDisagreement(bam):
    samfile = pysam.AlignmentFile(bam, "rb")
    for read in samfile.fetch():
        NMdef = read.get_tag("NM") / read.query_alignment_length
        MDdef = sum([len(item) for item in re.split('[0-9^]', read.get_tag("MD"))]) / read.query_alignment_length
        if NMdef - MDdef > 0.2:
            print('\t'.join(
                [
                    str(read.get_tag("NM")),
                    str(sum([len(item) for item in re.split('[0-9^]', read.get_tag("MD"))])),
                    read.get_tag("MD"),
                    read.cigarstring,
                ])
            )

This is an example of a read showing a clear discrepancy between the NM tag and the MD-derived edit distance:

73
34
21G1C3A1T1A2G4A8T2G4A2G0G0G19^G2G10T6C1A2A0G0G0A5T6^GA5A0G0C0C1G3A5A6C1C2
1684H20M3I7M3I10M1I5M2I1M1I1M3I6M1I4M1I13M2I5M4I3M7I6M1D6M2I1M2I26M1I3M1I2M1I3M2D16M1I6M3I10M10942H

The hero of my quest then found that we were still missing the insertions, which are not present in the MD tag and can only be counted by parsing the CIGAR string. Indeed, the CIGAR string above contains 39 inserted bases, and 34 + 39 = 73, exactly the NM tag. I copied the image below from his post.

[Image: CountingInsertions, illustrating how insertions are counted from the CIGAR string]
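In isolation, the fix boils down to adding the inserted bases from the CIGAR to the mismatches and deletions from the MD tag; a minimal sketch for a single pysam AlignedSegment, using cigartuples in which operation code 1 denotes an insertion:

import re


def editDistanceFromMDandCigar(read):
    '''
    Edit distance for a single pysam AlignedSegment:
    mismatches and deletions from the MD tag, insertions from the CIGAR.
    '''
    mismatches_and_deletions = sum([len(item) for item in re.split('[0-9^]', read.get_tag("MD"))])
    insertions = sum([length for operation, length in read.cigartuples if operation == 1])  # 1 = insertion (I)
    return mismatches_and_deletions + insertions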

So my successful implementation also parses the CIGAR string to get the insertions and adds those to the mismatches and deletions. In my final code piece below I show the full code of my evaluation of the edit distances, including some multiprocessing code to process chromosomes in parallel and the conversion of the lists to numpy arrays in a pandas DataFrame.

The kernel density estimate plot crashes on an exact correlation, so this time you only get the scatter plot. Problem solved and everyone lived happily ever after.

Big thanks to Santosh Anand!

[Image: EditDistancesCompared_scatter, the NM-based versus MD/CIGAR-based edit distances now fall on a perfect diagonal]

 


import sys
import os
import re
import seaborn as sns
import pandas as pd
import numpy as np
from multiprocessing import Pool
from scipy import stats
import matplotlib.pyplot as plt
import pysam


def processBam(bam):
    '''
    Processing function: calls pool of worker functions
    to extract from a bam file two definitions of the edit distances to the reference genome scaled by read length
    Returned in a pandas DataFrame
    '''
    samfile = pysam.AlignmentFile(bam, "rb")
    if not samfile.has_index():
        pysam.index(bam)
        samfile = pysam.AlignmentFile(bam, "rb")  # Need to reload the samfile after creating index
    chromosomes = samfile.references
    datadf = pd.DataFrame()
    pool = Pool(processes=12)
    params = zip([bam] * len(chromosomes), chromosomes)
    try:
        output = [results for results in pool.imap(extractFromBam, params)]
    except KeyboardInterrupt:
        print("Terminating worker threads")
        pool.terminate()
        pool.join()
        sys.exit()
    datadf["editDistancesNM"] = np.array([x for y in [elem[0] for elem in output] for x in y])
    datadf["editDistancesMD"] = np.array([x for y in [elem[1] for elem in output] for x in y])
    return datadf


def extractFromBam(params):
    '''
    Worker function per chromosome
    loop over a bam file and create tuple with lists containing metrics:
    two definitions of the edit distances to the reference genome scaled by aligned read length
    '''
    bam, chromosome = params
    samfile = pysam.AlignmentFile(bam, "rb")
    editDistancesNM = []
    editDistancesMD = []
    for read in samfile.fetch(reference=chromosome, multiple_iterators=True):
        editDistancesNM.append(read.get_tag("NM") / read.query_alignment_length)
        editDistancesMD.append(
            (sum([len(item) for item in re.split('[0-9^]', read.get_tag("MD"))]) +  # Parse MD string to get mismatches/deletions
             sum([item[1] for item in read.cigartuples if item[0] == 1]))  # Parse cigar to get insertions
            / read.query_alignment_length)
    return (editDistancesNM, editDistancesMD)


def makePlot(datadf):
    try:
        plot = sns.jointplot(
            x='editDistancesNM',
            y='editDistancesMD',
            data=datadf,
            kind="kde",
            color="#4CB391",
            stat_func=stats.pearsonr,
            space=0,
            size=10)
        plot.savefig('EditDistancesCompared_kde.png', format='png', dpi=1000)
    except:  # throws an error with perfect correlation! 😀
        pass
    plot = sns.jointplot(
        x='editDistancesNM',
        y='editDistancesMD',
        data=datadf,
        kind="scatter",
        color="#4CB391",
        stat_func=stats.pearsonr,
        space=0,
        joint_kws={"s": 1},
        size=10)
    plot.savefig('EditDistancesCompared_scatter.png', format='png', dpi=1000)


df = processBam(sys.argv[1])
makePlot(df)
