philres / ngmlr

NGMLR is a long-read mapper designed to align PacBio or Oxford Nanopore reads (standard and ultra-long) to a reference genome, with a focus on reads that span structural variations.

Mapping to repeats leads to deletions with low allele frequency

flashton2003 opened this issue

Hello,

I'm analysing some Cryptococcus neoformans (a haploid fungus) PacBio genome data. I noticed something strange when I was looking at some deletions which had low allele frequency. When only part of a repeated region was deleted, sometimes NGMLR was not consistent with how it split the read. Here is a clear example.

[Image: Screenshot 2020-02-04 at 15 42 02]

There is a TTCTTCCCCC motif repeated four times in the reference genome. Most of the reads which map there only support there being one TTCTT part of the motif left (probably CCCCCTTCTTCCCCC), but the reads are mapped to different 'ends' of the 4-fold repeat in the reference genome. This means that the allele frequency is not as high as it should be, because each end of the deletion is only supported by around half the reads.
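To illustrate the ambiguity described above, here is a minimal Python sketch. Only the `TTCTTCCCCC` motif and its 4-fold repetition come from the post; the flanking sequences, the assumption that exactly three copies are deleted, and the helper names (`apply_deletion`, `left_align`, `right_align`) are hypothetical and chosen for illustration. It shows that a deletion anchored at either 'end' of the repeat produces the same read sequence, so an aligner has no sequence-based reason to prefer one placement.

```python
MOTIF = "TTCTTCCCCC"
# Hypothetical flanks; only the 4-fold motif is taken from the post.
LEFT_FLANK = "GATTACA"
RIGHT_FLANK = "ACGTACG"
REF = LEFT_FLANK + MOTIF * 4 + RIGHT_FLANK


def apply_deletion(ref, start, length):
    """Return ref with `length` bases removed, starting at 0-based `start`."""
    return ref[:start] + ref[start + length:]


def left_align(ref, start, length):
    """Shift a deletion left for as long as the resulting sequence is unchanged."""
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


def right_align(ref, start, length):
    """Shift a deletion right for as long as the resulting sequence is unchanged."""
    while start + length < len(ref) and ref[start] == ref[start + length]:
        start += 1
    return start


# Assumption: three of the four motif copies (30 bp) are deleted.
DEL_LEN = 3 * len(MOTIF)
left_end = len(LEFT_FLANK)                # deletion placed at the left end
right_end = len(LEFT_FLANK) + len(MOTIF)  # deletion placed at the right end

# Both placements yield exactly the same sequence.
same = apply_deletion(REF, left_end, DEL_LEN) == apply_deletion(REF, right_end, DEL_LEN)
print(same)
print(left_align(REF, right_end, DEL_LEN), right_align(REF, left_end, DEL_LEN))
```

Every start position between the left-aligned and right-aligned coordinates describes the same deletion, which is why reads supporting one event can end up split across several apparent breakpoints.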

When I looked at the variants Sniffles called, quite a lot of my deletions with low allele frequencies were in repeat regions.

I just wondered if there was a way to place these reads in repeat regions more consistently, as this would lead to more variants passing an allele frequency threshold of 80%.
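One way to picture what "more consistent" placement could mean is to normalise deletion calls after the fact and pool their read support. The sketch below is an illustration only, not how NGMLR or Sniffles work: the function names, the reference sequence, the read counts, and the coverage value are all hypothetical. It also assumes that the reads supporting each placement are disjoint and that coverage is uniform across the repeat.

```python
from collections import defaultdict


def left_align(ref, start, length):
    """Shift a deletion left for as long as the resulting sequence is unchanged."""
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


def pooled_allele_frequency(ref, calls, coverage):
    """Sum read support over calls that left-align to the same deletion.

    `calls` is a list of (start, length, supporting_reads) tuples with
    0-based starts; `coverage` is the total read depth over the region.
    """
    support = defaultdict(int)
    for start, length, reads in calls:
        support[(left_align(ref, start, length), length)] += reads
    return {key: n / coverage for key, n in support.items()}


# Hypothetical reference: flanks around the 4-fold motif from the post.
ref = "GATTACA" + "TTCTTCCCCC" * 4 + "ACGTACG"
# Hypothetical calls: the same 30 bp deletion reported at both ends of the repeat.
calls = [(7, 30, 11), (17, 30, 10)]
pooled = pooled_allele_frequency(ref, calls, coverage=22)
print(pooled)
```

With these made-up numbers, each call alone has an allele frequency of about 0.5, while the pooled call is supported by 21 of 22 reads.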

Best,

Phil Ashton

Dear Phil,
thanks for reaching out. Yes, this is a problem. Most of the time, some randomness is needed in the alignment backtracking procedure to avoid accumulating artifacts. In these regions, however, it is less favorable.

Can you tell me if you tried the newer version of Sniffles and still get a low frequency in such a region? I tried to improve this recently.
Thanks
Fritz

Hi Fritz,

I thought you might have come across this issue; it seems quite common in my data. Perhaps these repeat regions are susceptible to indels?

I'm using v1.0.11, which I think is the most up to date version?

Best,

Phil

Hi Phil,
It's a common problem; I am investigating STR regions especially.
Go to the Sniffles GitHub and try v1.14, which improved a lot in GT and in estimating the frequency.
Cheers
Fritz

Oh, my bad, 1.11 is the newest. Sorry, I'm jet-lagged in Brussels at the moment...

Ah, no worries.

Any thoughts on alternative filtering criteria, other than AF, which might help us include some of these ones?

I will need to think about it. I have been up since yesterday...