shahab-sarmashghi / RESPECT

Estimating repeat spectra and genome length from low-coverage genome skims

Massive genome size estimates at high HCRM

mason-linscott opened this issue · comments

Hi Shahab,

I have used RESPECT in the past for relatively high-coverage (6-10x) genome skims, and it worked great for land snails with HCRM ranging from 400 to 1600. However, my lower-coverage (2x) dataset of 44 land snail species seems to be having problems. At this coverage, genome size estimates inflate rapidly once HCRM exceeds 800. I have attached the output file.

If you have any advice, I would appreciate it.
RESPECT_31k_estimated_params.txt

Hi Mason,

Sorry for the delayed response.

I wouldn't say it's because of the HCRM values. Something has gone wrong and caused the coverage, length, and other estimates (including HCRM) to be off for 6 samples compared to the rest. So the question now is whether the samples are problematic or RESPECT is failing.

To test the former, have you preprocessed the samples and removed possible contamination? Also, regarding the higher-coverage dataset that worked fine: are these the same samples, which were fine at 6-10X and failed at 2X, or are they from different individuals?

To test the latter possibility, that the RESPECT algorithm is failing (assuming the samples are fine and there is no sequencing artifact, contamination, etc.), I see that all problematic samples have an estimated error rate of 0, which is weird. RESPECT is probably estimating a negative error rate, which then gets clipped to 0. We can also take a look at the optimization logs under the tmp directory to see the sequence of solutions tried by the RESPECT algorithm.

If it is indeed an algorithm failure, one possible fix is to set the error rate to the average error rate of the other samples, for which we think the estimates are reasonably good. Currently that option is not exposed to the user and you would need to hardcode it, but if you want to try it I can create a new branch and make it available. If that solves the issue, I can merge it into the master branch and make it part of the software.
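To make the fallback idea concrete, here is a rough Python sketch of how one might flag the samples whose error rate came out as 0 and compute the average error rate from the remaining ones. It is only an illustration, not part of RESPECT: the column names `sample` and `sequencing_error_rate`, the helper `fallback_error_rate`, and the three example rows are all assumptions made up for this sketch, so adjust them to match the headers in your own output file.

```python
import csv
import io
from statistics import mean


def fallback_error_rate(rows, error_col="sequencing_error_rate", sample_col="sample"):
    """Return (suspect sample names, mean error rate of the other samples).

    A sample counts as suspect when its estimated error rate is 0 or less,
    which is the symptom described above. Column names are assumptions.
    """
    suspect, trusted_rates = [], []
    for row in rows:
        rate = float(row[error_col])
        if rate <= 0.0:
            suspect.append(row[sample_col])
        else:
            trusted_rates.append(rate)
    if not trusted_rates:
        raise ValueError("no sample has a positive error rate to average")
    return suspect, mean(trusted_rates)


# Hypothetical tab-separated rows standing in for the real output file.
example = (
    "sample\tsequencing_error_rate\n"
    "snail_A\t0.004\n"
    "snail_B\t0.0\n"
    "snail_C\t0.006\n"
)

rows = list(csv.DictReader(io.StringIO(example), delimiter="\t"))
suspect, fallback = fallback_error_rate(rows)
print("suspect samples:", suspect)
print("fallback error rate:", fallback)
```

To run it on a real file, replace the `io.StringIO(example)` part with an opened file handle. The resulting average is the value that would be hardcoded for the suspect samples in the fix described above.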

BTW, can I ask what the fifth column is? It has no header, and it doesn't seem to be part of the standard RESPECT output.