Question on homologous regions
vladsavelyev opened this issue
Hi,
I ran BWA, LongRanger, and EMA on the NA12878 WGS dataset from https://support.10xgenomics.com/genome-exome/datasets/2.1.4/NA12878_WGS_v2, and looked at the challenging regions listed in your notebook:
C4A 6:31965242
AMY1 1:104197843
CYP2D7 22:42537120
It looks like EMA has higher coverage in those regions; however, all of the extra alignments are secondary (shaded in the screenshot). Is that the expected picture for EMA? Could those secondary alignments not be resolved using the linked-read information? As far as I understand, secondary alignments are ignored by variant callers. They also seem to add noise to the variation signal, as in this CYP2D6 screenshot (more vertical colored lines in the coverage tracks):
Please correct me if I'm wrong. I see you evaluated those regions with a more sophisticated strategy, checking against the NA12878 assemblies, so I'm probably not looking at these regions the right way.
I believe those are actually low-MAPQ alignments (EMA doesn't output separate secondary alignments right now, aside from the XA tag, which AFAIK doesn't show up in IGV). I was actually planning on refining how MAPQs are assigned to reads in these homologous regions, which should fix this.
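For anyone who wants to check this outside IGV, here is a minimal sketch of how a single SAM record can be told apart as secondary versus low-MAPQ primary. It relies only on the SAM flag bits (0x100 for secondary, 0x800 for supplementary), the MAPQ column, and a BWA-style `XA:Z:` tag. The function name `classify_sam_line`, the `low_mapq=10` cutoff, and the example record (including its alternative-hit positions) are made up for illustration and are not taken from EMA's output.

```python
# Illustrative helper: the name, the MAPQ cutoff, and the example record
# below are assumptions for this sketch, not part of EMA.
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def classify_sam_line(line, low_mapq=10):
    """Return (kind, mapq, number_of_XA_hits) for one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    # Optional tags start at column 12 and look like TAG:TYPE:VALUE.
    tags = {}
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    # XA lists alternative hits as "chr,pos,CIGAR,NM;" entries.
    xa = tags.get("XA", "")
    n_alt = len([hit for hit in xa.split(";") if hit])
    if flag & SECONDARY:
        kind = "secondary"
    elif flag & SUPPLEMENTARY:
        kind = "supplementary"
    elif mapq < low_mapq:
        kind = "low-MAPQ primary"
    else:
        kind = "primary"
    return kind, mapq, n_alt


# Synthetic record: a properly paired primary read with MAPQ 0 and two XA hits.
example = "\t".join([
    "read1", "99", "22", "42537120", "0", "100M", "=", "42537400", "380",
    "*", "*", "XA:Z:22,+42540000,100M,1;22,+42545000,100M,2;",
])
print(classify_sam_line(example))
```

A record like the synthetic one above has no secondary bit set, so it is a primary alignment even though its MAPQ is 0 and it carries alternative hits in `XA`.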
That's right, those are primary alignments; sorry for the confusion (though most of them have better matches somewhere else, I guess we can trust them, since you proved they are supported by the assemblies). That's good news then; looking forward to the refined MAPQs.