shahab-sarmashghi / RESPECT

Estimating repeat spectra and genome length from low-coverage genome skims

How to interpret uniqueness ratio for genome duplication

000generic opened this issue

Hi!

I'm using RESPECT to evaluate 10 octopus genomes - genomes that are expected to be human-sized or larger and repeat rich. Years ago people thought the genomes would be duplicated - but this has not held up since the first cephalopod genome came out in 2015 for Octopus bimaculoides.

To get a sense of how RESPECT works and what its accuracy is like, I characterized an Octopus bimaculoides NCBI SRR PE fastq data set - split up and run separately as PE forward/1 and as PE reverse/2 - each at 4-5x coverage - to compare to the published genome. RESPECT's prediction of genome size in bimaculoides seems fairly accurate - and I think its HCRM values are indicating the genome is repeat rich - which would be true. However, the uniqueness ratio is very low and, if I understand correctly, this could indicate genome duplication. If the published genome indicates no genome duplication - what else could be causing the low uniqueness ratio that RESPECT is finding - or how much trust would you have in interpreting a low uniqueness ratio (much lower than the 0.8 cutoff indicated in the README) as an indication of a genome duplication?

Or am I misunderstanding how to interpret the output?

One thing I may be doing incorrectly is running RESPECT on the PE forward read fastq file and then on the PE reverse read fastq file - rather than including the pairs together. Maybe this is throwing things off somehow. I feel like it shouldn't - I did it this way as an easy method to get close to the recommended 4x coverage for RESPECT.
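As a rough check on that coverage reasoning, sequencing depth can be approximated as (number of reads × average read length) / genome length. The Python sketch below is only illustrative: the function names are hypothetical, and the 3 Gb genome length and 150 bp read length are placeholder assumptions rather than measured values.

```python
import math


def estimated_coverage(num_reads, read_length, genome_length):
    """Approximate depth: total sequenced bases divided by genome length."""
    return num_reads * read_length / genome_length


def reads_for_coverage(target_coverage, read_length, genome_length):
    """Smallest number of reads that reaches the target depth."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    # Placeholder assumptions: a roughly human-sized genome and 150 bp reads.
    genome_length = 3_000_000_000
    read_length = 150
    needed = reads_for_coverage(4.0, read_length, genome_length)
    print(needed, estimated_coverage(needed, read_length, genome_length))
```

Under these assumptions, a single file of forward (or reverse) reads reaches the target depth once it holds the number of reads returned by `reads_for_coverage`.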

Any comments or suggestions would be greatly appreciated!

Thank you very much - and thank you for such a fantastic tool. It seems potentially very effective and is simple to work with in a great way - and is a pleasure to use - I'm just unsure about the interpretation of things afterwards.

Eric

respect -i trimomatic-octopus-bimaculoides-1-paired.fastq trimomatic-octopus-bimaculoides-2-paired.fastq --debug

sample  input_type      sequence_type   coverage        genome_length   uniqueness_ratio        HCRM    sequencing_error_rate   average_read_length
trimomatic-octopus-bimaculoides-1-paired.fastq  sequence        genome-skim     4.85    2916125971      0.34    363.86  0.0043  149.7477
trimomatic-octopus-bimaculoides-2-paired.fastq  sequence        genome-skim     5.27    2686674457      0.37    385.22  0.0070  149.7511
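To make the comparison against the 0.8 cutoff explicit, here is a minimal Python sketch that parses the table above and lists the samples whose uniqueness ratio falls below it. It assumes the table is whitespace-delimited with no spaces inside sample names; `parse_respect_table` and `below_cutoff` are hypothetical helper names, not part of RESPECT, and the cutoff is the README value as quoted in this issue.

```python
CUTOFF = 0.8  # README cutoff as quoted in this issue (assumption: still current)

TABLE = """\
sample input_type sequence_type coverage genome_length uniqueness_ratio HCRM sequencing_error_rate average_read_length
trimomatic-octopus-bimaculoides-1-paired.fastq sequence genome-skim 4.85 2916125971 0.34 363.86 0.0043 149.7477
trimomatic-octopus-bimaculoides-2-paired.fastq sequence genome-skim 5.27 2686674457 0.37 385.22 0.0070 149.7511
"""


def parse_respect_table(text):
    """Turn the whitespace-delimited table into a list of row dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def below_cutoff(rows, cutoff=CUTOFF):
    """Return the sample names whose uniqueness ratio is under the cutoff."""
    return [row["sample"] for row in rows
            if float(row["uniqueness_ratio"]) < cutoff]


if __name__ == "__main__":
    for name in below_cutoff(parse_respect_table(TABLE)):
        print(name)
```

Both runs fall well under the cutoff, which is the observation the question above is about; the sketch only flags the values and says nothing about why they are low.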

Hi!

Thank you so much for all the detailed feedback! That all makes sense - and definitely helps clarify things for me. It would be fantastic if you were to take a look at Octopus. The 2015 paper is here:

https://www.nature.com/articles/nature14668

There is a newly released version of the genome in NCBI RefSeq that is now chromosome-scale. The SRR that I used was SRR20907430.

I can imagine it may be hard to have a clear cutoff for detecting genome duplication based on just high-scoring kmers. I wonder, what if instead you considered the more general distribution? Or six-frame translated the reads and evaluated them in protein space? Then typical repeats vs genome-scale single-copy coding sequence vs genome-scale duplicated coding sequence might become apparent. And protein space could allow things to go deeper into time.

Or six-frame translate the reads, collect all of the protein sequence, and then evaluate kmers of just that for duplication? Maybe this would clean out repeats and their variability between species/genomes - but still allow genome-scale-ish evaluation. I could imagine lots of incorrect protein fragments, but that might not matter too much if they don't overwhelm things.
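To make the idea concrete, here is a rough Python sketch of six-frame translation followed by kmer counting in protein space. It is not something RESPECT does, as far as this thread shows. It assumes the standard genetic code, treats any codon containing a non-ACGT base as an unknown residue `X`, and splits translated frames at stop codons; all function names and the default `k` are hypothetical choices.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translate(seq):
    """Three forward frames followed by three reverse-complement frames."""
    seq = seq.upper()
    return [translate(strand[offset:])
            for strand in (seq, reverse_complement(seq))
            for offset in range(3)]


def protein_kmer_counts(reads, k=5):
    """Count protein kmers over all six frames, never spanning a stop codon."""
    counts = Counter()
    for read in reads:
        for protein in six_frame_translate(read):
            for fragment in protein.split("*"):
                for i in range(len(fragment) - k + 1):
                    kmer = fragment[i:i + k]
                    if "X" not in kmer:
                        counts[kmer] += 1
    return counts
```

The resulting counts include kmers from the incorrect frames as well, which is the noise mentioned above; whether the real coding signal stands out from it at 4-5x coverage is an open question that this sketch does not answer.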

It will be interesting to see how genome duplication evaluation works out for RESPECT in the end - and thank you again for such a useful tool.

Good luck :)