Prepare a genome with or without repeat-masking
LiShuhang-gif opened this issue
Hi, I was trying to run tandem_genotypes to detect tandem repeats in my ONT data, but I have some questions about preparing a genome. I see there are two options in this step: preparing a genome with or without repeat-masking. If I care more about effectiveness and accuracy than running time, should I prepare the genome without repeat-masking? Or which option do you recommend? Thanks a lot.
For whole human genome sequencing, we usually do it "with" repeat masking. That has worked fine in several published papers. So that's what I'd recommend, really.
For best possible accuracy/sensitivity, it's better to do it without repeat masking. But that uses much more time and memory.
For a smaller genome (e.g. bacterial) I'd do it without masking.
> For best possible accuracy/sensitivity, it's better to do it without repeat masking. But that uses much more time and memory.
I have a 20G query genome, and repeat annotation is still running. Can I do pairwise genome alignment using the unmasked genome? It seems workable, although it would take more time and memory.
Pairwise genome alignment is a bit different from aligning long reads (in the preceding comments).
The preceding comments are also a bit out of date. Now I might suggest -uRY4 instead of masking, see:
https://www.biorxiv.org/content/10.1101/2022.05.30.494079v1
You can surely do unmasked pairwise genome alignment, if you use an option such as -uRY to reduce the run time and memory use. If you don't use such an option, it might or might not be feasible: it depends on how big the other genome is, how closely-related, and how repetitive.
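As a rough illustration of the -uRY4 suggestion above, here is a minimal Python sketch that only builds (and prints) a lastdb command line for an unmasked genome. The database name `mydb`, the file name `genome.fa`, the thread count, and the helper `lastdb_command` are all hypothetical placeholders, not taken from this thread; `-P` is lastdb's option for the number of parallel threads. For pairwise genome alignment, a different seeding scheme from the same family may be more suitable, so treat the default here as an assumption.

```python
import shlex


def lastdb_command(db_name, genome_fasta, seed_scheme="RY4", threads=8):
    """Build (but do not run) a lastdb argument list.

    All names and defaults are illustrative assumptions:
    - seed_scheme is passed to lastdb's -u option (e.g. RY4, as suggested above)
    - threads is passed to lastdb's -P option
    No repeat-masking option is added, so the genome is indexed as given.
    """
    return [
        "lastdb",
        f"-P{threads}",
        f"-u{seed_scheme}",
        db_name,
        genome_fasta,
    ]


# Hypothetical database and file names, for illustration only.
cmd = lastdb_command("mydb", "genome.fa")
print(shlex.join(cmd))
```

The printed line can be checked by eye and then pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once the paths are real.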
Thanks, let me give it a try