arshajii / ema-paper-data

Data and resources for EMA paper

Home Page: http://ema.csail.mit.edu

Question on homologous regions

vladsavelyev opened this issue:

Hi,

I ran BWA, LongRanger, and EMA on the NA12878 WGS dataset from https://support.10xgenomics.com/genome-exome/datasets/2.1.4/NA12878_WGS_v2 and looked at the challenging regions listed in your notebook:

C4A 6:31965242

[Screenshot: c4a-6-31965242]

AMY1 1:104197843

[Screenshot: amy-1-104197843]

CYP2D7 22:42537120

[Screenshot: cyp2d7-22-42537120]

It looks like EMA has higher coverage in those regions; however, all of the extra alignments are secondary (shaded in the screenshots). Is that the expected picture for EMA? Could those secondary alignments not be resolved using the linked-read information? As far as I understand, secondary alignments are ignored by variant callers. They also seem to add noise to the variation signal, as in this CYP2D6 screenshot (more vertical colored lines in the coverage tracks):

[Screenshot: cyp2d6]

Please correct me if I'm wrong. I see you evaluated those regions with a more sophisticated strategy, checking against the NA12878 assemblies, so I'm probably not looking at these regions the right way.

I believe those are actually low-MAPQ alignments (EMA doesn't output separate secondary alignments right now, aside from the XA tag which AFAIK doesn't show up in IGV). I was actually planning on refining how MAPQs are assigned to reads in these homologous regions, which should fix this.

That's right, those are primary alignments; sorry for the confusion. Most of them have better matches somewhere else, but I guess we can trust them since you proved they are supported by the assemblies. That's good news, then; looking forward to the refined MAPQ.
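
For anyone who wants to check the same distinction without relying on how a viewer shades reads, here is a minimal sketch that separates true secondary alignments from low-MAPQ primary alignments using only the SAM FLAG and MAPQ fields. The function name `classify_sam_record`, the MAPQ cutoff of 10, and the example records are assumptions made for illustration; they are not taken from EMA or its notebook. The check for an `XA:Z:` tag only tests for presence and assumes nothing about the tag's contents.

```python
# SAM FLAG bits defined by the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify_sam_record(line, low_mapq_threshold=10):
    """Classify one SAM alignment line.

    Returns (kind, has_xa), where kind is one of "unmapped", "secondary",
    "supplementary", "primary_low_mapq", or "primary".
    The threshold is an illustrative assumption, not an EMA default.
    """
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: FLAG
    mapq = int(fields[4])   # column 5: MAPQ
    tags = fields[11:]      # optional fields start at column 12
    has_xa = any(tag.startswith("XA:Z:") for tag in tags)

    if flag & UNMAPPED:
        kind = "unmapped"
    elif flag & SECONDARY:
        kind = "secondary"
    elif flag & SUPPLEMENTARY:
        kind = "supplementary"
    elif mapq < low_mapq_threshold:
        kind = "primary_low_mapq"
    else:
        kind = "primary"
    return kind, has_xa


# Synthetic records for illustration only (not real NA12878 reads).
low_mapq_primary = "\t".join([
    "read1", "99", "6", "31965242", "0", "10M", "=", "31965400", "168",
    "ACGTACGTAC", "FFFFFFFFFF", "XA:Z:placeholder",
])
true_secondary = "\t".join([
    "read2", "355", "6", "31965242", "0", "10M", "=", "31965400", "168",
    "ACGTACGTAC", "FFFFFFFFFF",
])
confident_primary = "\t".join([
    "read3", "99", "6", "31965242", "60", "10M", "=", "31965400", "168",
    "ACGTACGTAC", "FFFFFFFFFF",
])

for record in (low_mapq_primary, true_secondary, confident_primary):
    print(record.split("\t")[0], classify_sam_record(record))
```

Piping the output of `samtools view` for a region of interest through this function line by line should show whether the shaded reads carry the secondary flag or are simply primary alignments with a low MAPQ.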