How to interpret uniqueness ratio for genome duplication

Question

How to interpret uniqueness ratio for genome duplication

000generic opened this issue 2 years ago · comments

Hi!

I'm using RESPECT to evaluate 10 octopus genomes - genomes that are expected to be human-sized or larger and repeat rich. Years ago people thought the genomes would be duplicated - but this has not held up since the first cephalopod genome came out in 2015 for Octopus bimaculoides.

To get a sense of how RESPECT works and what its accuracy is like, I characterized a Octopus bimaculoides NCBI SRR PE fastq data set - split up and run separately as PE forward/1 and as PE reverse/2 - each at 4-5x coverage - to compare to the published genome. RESPECT's prediction of genome size in bimaculoides seems fairly accurate - and I think it's HCRM values are indicating the genome is repeat rich - which would be true. However, the uniqueness ratio is very low and if I understand correctly, this could indicate genome duplication. If the published genome indicates no genome duplication - what else could be causing the low uniqueness ratio that RESPECT is finding - or how much trust would you have in interpreting a low uniqueness ratio (much lower than the 0.8 cutoff indicated in the README) as an indication of a genome duplication?

Or am I misunderstanding how to interpret the output?

One thing that maybe I am doing incorrectly is that I am running RESPECT on a PE forward read fastq file and then on a PE reverse read fastq file - rather than including pairs together. But maybe this is throwing things off somehow. I feel like it shouldn't - I did it this way as an easy method to get close to the recommended 4x coverage for RESPECT.

Any comments or suggestions would be greatly appreciated!

Thank you very much - and thank you for such a fantastic tool. It seems potentially very effective and is simple to work with in a great way - and is a pleasure to use - I'm just unsure on the interpretation of things afterwards.

Eric

respect -i trimomatic-octopus-bimaculoides-1-paired.fastq trimomatic-octopus-bimaculoides-2-paired.fastq --debug

sample  input_type      sequence_type   coverage        genome_length   uniqueness_ratio        HCRM    sequencing_error_rate   average_read_length
trimomatic-octopus-bimaculoides-1-paired.fastq  sequence        genome-skim     4.85    2916125971      0.34    363.86  0.0043  149.7477
trimomatic-octopus-bimaculoides-2-paired.fastq  sequence        genome-skim     5.27    2686674457      0.37    385.22  0.0070  149.7511

Shahab Sarmashghi · Answer 1 · Tue Nov 01 2022 23:27:55 GMT+0800 (China Standard Time)

Hi Eric, Thank you! I am glad that you have found our tool useful and easy to use. CCing Vineet Bafna and Siavash Mirarab who are PI's that co-lead genome skimming related projects at UCSD. Regarding the way you are running RESPECT, I don't think there is a problem with it. Usually in the realm of genome skimming we work with low coverage data, and since our methods are alignment-free and do not exploit PE data, we merge read pairs in the preprocessing step (using BBtools) to get higher effective coverage (rather than running each read pair separately). That being said, if you have higher coverage data, there is no problem with running them separately as you did. Ideally, we would like to get identical results from forward and reverse reads, but there could be read-pair-specific sequencing artifacts that are not fully resolved in the preprocessing and result in differences. In your case, the difference is less than %10, which is within the typical estimation error of RESPECT, so this could just be due to noise and variation in the output of its (randomized) iterative algorithm. Now regarding the interpretation of repetitiveness, you are right, 0.35 uniqueness ratio is on the quite low end of the spectrum (Figure 5 of RESPECT paper). This shows that at the k-mer level (by default 31-mer substrings of the genome), only 35% of them are unique and 65% are at higher copies, which include both low-copy (2,3,.. copies) and high-copy (thousands of copies) repeats. To get some resolution on this, we define and estimate HCRM, which quantifies the contribution of high copy repeats. Based on that, we assume when the uniqueness ratio is low (so we have many repeats) and HCRM is low (so not many are from high copy repeats), the repeats come from low copy repeats which we hypothesized can be a signature of genome duplication. We indeed observed that r1/L < 0.8 and HCRM < 200 is enriched for species with known WGD in their evolutionary history (Figure 5 of RESPECT paper). For your sample, HCRM is ~375, and that *could* mean that the repetitiveness is mostly from high copy repeats (e.g., DNA transposons and retrotransposons). *However*, we do see some species like Atlantic salmon that have known WGD and still their HCRM is higher than 200, especially when the r1/L is well below 0.8, which is also the case with your sample. That suggests WGD is still a possibility. An important point to consider when interpreting these results is that, we use k-mers, and so the signature of old WGD in distant ancestral lineages is preserved much longer among short-length k-mers, compared to other methods that are based on finding gene duplication through alignment and paralog detection etc. If you know how the publishers of the genome concluded there is no genome duplication, or can point us to the corresponding study, we can look into it and discuss it further. In the end I should say that our work is by no means a comprehensive effort at finding WGDs or certainly ruling in favor or against them. However, we think there is a potential to use k-mer based methods to dissect the repetitive structure of the genome, and identify signatures of whole genome doubling or other repeat-forming events across species. Vineet and Siavash are actively leading related repeat-aware computational methods, and applying those to study the underlying science is certainly within their (future) research interests. Best, Shahab

…

On Mon, Oct 31, 2022 at 7:48 AM Eric Edsinger ***@***.***> wrote: Hi! I'm using RESPECT to evaluate 10 octopus genomes - genomes that are expected to be human-sized or larger and repeat rich. Years ago people thought the genomes would be duplicated - but this has not held up since the first cephalopod genome came out in 2015 for Octopus bimaculoides. To get a sense of how RESPECT works and what its accuracy is like, I characterized two Octopus bimaculoides NCBI SRR data sets at 4-5x coverage to compare to the published genome. RESPECT's prediction of genome size seems fairly accurate - and I think it's HCRM values are indicating the genome is repeat rich - which would be true. However, the uniqueness ratio is very low and if I understand correctly this could indicate genome duplication. If the published genome indicates no genome duplication - what else could be causing the low uniqueness ratio that RESPECT is finding - or how much trust would you have in interpreting a low uniqueness ratio (much lower than the 0.8 cutoff indicated in the README) as an indication of a genome duplication? Or am I misunderstanding how to interpret the output? One thing that maybe I am doing incorrectly is that I am running RESPECT on a PE forward read fastq file and then on a PE reverse read fastq file - rather than including pairs together. But maybe this is throwing things off somehow. I feel like it shouldn't - and just did it this way as an easy method to get close to the recommended 4x coverage for RESPECT. Any comments or suggestions would be greatly appreciated! Thank you very much - and thank you for such a fantastic tool. It seems potentially very effective and is simple to work with in a great way - and is a pleasure to use - I'm just unsure on the interpretation of things afterwards. Eric respect -i trimomatic-octopus-bimaculoides-1-paired.fastq trimomatic-octopus-bimaculoides-2-paired.fastq --debug sample input_type sequence_type coverage genome_length uniqueness_ratio HCRM sequencing_error_rate average_read_length trimomatic-octopus-bimaculoides-1-paired.fastq sequence genome-skim 4.85 2916125971 0.34 363.86 0.0043 149.7477 trimomatic-octopus-bimaculoides-2-paired.fastq sequence genome-skim 5.27 2686674457 0.37 385.22 0.0070 149.7511 — Reply to this email directly, view it on GitHub <#10>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEMUG3TP3I57WUOKAONYGZTWF6W2RANCNFSM6AAAAAARS74E3E> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

Shahab Sarmashghi · Answer 2 · Wed Nov 02 2022 00:18:18 GMT+0800 (China Standard Time)

Thank you, Shahab. Eric, Shahab's email is long and covers a lot of ground, but very briefly, we are not quite ready to call WGDs yet. However, we are working to identify the repeat content of a genome using genome skims. If you could point us to the original paper describing the genome of this organism, we will add it to our list that needs to be re-analyzed as we sharpen our tools. Sincerely, On Tue, Nov 1, 2022 at 8:27 AM Shahab Sarmashghi < ***@***.***> wrote:

…

Hi Eric, Thank you! I am glad that you have found our tool useful and easy to use. CCing Vineet Bafna and Siavash Mirarab who are PI's that co-lead genome skimming related projects at UCSD. Regarding the way you are running RESPECT, I don't think there is a problem with it. Usually in the realm of genome skimming we work with low coverage data, and since our methods are alignment-free and do not exploit PE data, we merge read pairs in the preprocessing step (using BBtools) to get higher effective coverage (rather than running each read pair separately). That being said, if you have higher coverage data, there is no problem with running them separately as you did. Ideally, we would like to get identical results from forward and reverse reads, but there could be read-pair-specific sequencing artifacts that are not fully resolved in the preprocessing and result in differences. In your case, the difference is less than %10, which is within the typical estimation error of RESPECT, so this could just be due to noise and variation in the output of its (randomized) iterative algorithm. Now regarding the interpretation of repetitiveness, you are right, 0.35 uniqueness ratio is on the quite low end of the spectrum (Figure 5 of RESPECT paper). This shows that at the k-mer level (by default 31-mer substrings of the genome), only 35% of them are unique and 65% are at higher copies, which include both low-copy (2,3,.. copies) and high-copy (thousands of copies) repeats. To get some resolution on this, we define and estimate HCRM, which quantifies the contribution of high copy repeats. Based on that, we assume when the uniqueness ratio is low (so we have many repeats) and HCRM is low (so not many are from high copy repeats), the repeats come from low copy repeats which we hypothesized can be a signature of genome duplication. We indeed observed that r1/L < 0.8 and HCRM < 200 is enriched for species with known WGD in their evolutionary history (Figure 5 of RESPECT paper). For your sample, HCRM is ~375, and that *could* mean that the repetitiveness is mostly from high copy repeats (e.g., DNA transposons and retrotransposons). *However*, we do see some species like Atlantic salmon that have known WGD and still their HCRM is higher than 200, especially when the r1/L is well below 0.8, which is also the case with your sample. That suggests WGD is still a possibility. An important point to consider when interpreting these results is that, we use k-mers, and so the signature of old WGD in distant ancestral lineages is preserved much longer among short-length k-mers, compared to other methods that are based on finding gene duplication through alignment and paralog detection etc. If you know how the publishers of the genome concluded there is no genome duplication, or can point us to the corresponding study, we can look into it and discuss it further. In the end I should say that our work is by no means a comprehensive effort at finding WGDs or certainly ruling in favor or against them. However, we think there is a potential to use k-mer based methods to dissect the repetitive structure of the genome, and identify signatures of whole genome doubling or other repeat-forming events across species. Vineet and Siavash are actively leading related repeat-aware computational methods, and applying those to study the underlying science is certainly within their (future) research interests. Best, Shahab On Mon, Oct 31, 2022 at 7:48 AM Eric Edsinger ***@***.***> wrote: > Hi! > > I'm using RESPECT to evaluate 10 octopus genomes - genomes that are > expected to be human-sized or larger and repeat rich. Years ago people > thought the genomes would be duplicated - but this has not held up since > the first cephalopod genome came out in 2015 for Octopus bimaculoides. > > To get a sense of how RESPECT works and what its accuracy is like, I > characterized two Octopus bimaculoides NCBI SRR data sets at 4-5x coverage > to compare to the published genome. RESPECT's prediction of genome size > seems fairly accurate - and I think it's HCRM values are indicating the > genome is repeat rich - which would be true. However, the uniqueness ratio > is very low and if I understand correctly this could indicate genome > duplication. If the published genome indicates no genome duplication - what > else could be causing the low uniqueness ratio that RESPECT is finding - or > how much trust would you have in interpreting a low uniqueness ratio (much > lower than the 0.8 cutoff indicated in the README) as an indication of a > genome duplication? > > Or am I misunderstanding how to interpret the output? > > One thing that maybe I am doing incorrectly is that I am running RESPECT > on a PE forward read fastq file and then on a PE reverse read fastq file - > rather than including pairs together. But maybe this is throwing things off > somehow. I feel like it shouldn't - and just did it this way as an easy > method to get close to the recommended 4x coverage for RESPECT. > > Any comments or suggestions would be greatly appreciated! > > Thank you very much - and thank you for such a fantastic tool. It seems > potentially very effective and is simple to work with in a great way - and > is a pleasure to use - I'm just unsure on the interpretation of things > afterwards. > > Eric > > respect -i trimomatic-octopus-bimaculoides-1-paired.fastq > trimomatic-octopus-bimaculoides-2-paired.fastq --debug > > sample input_type sequence_type coverage genome_length uniqueness_ratio HCRM sequencing_error_rate average_read_length > trimomatic-octopus-bimaculoides-1-paired.fastq sequence genome-skim 4.85 2916125971 0.34 363.86 0.0043 149.7477 > trimomatic-octopus-bimaculoides-2-paired.fastq sequence genome-skim 5.27 2686674457 0.37 385.22 0.0070 149.7511 > > — > Reply to this email directly, view it on GitHub > <https://urldefense.com/v3/__https://github.com/shahab-sarmashghi/RESPECT/issues/10__;!!Mih3wA!DjVgvznAyRBB7uWAzwffmbnusztFx47kmIbGPA-evki9mnJpNE6saOQAPn6YcNUeNmskKIx4NQ5VNck2V2xsi5V0RA$>, > or unsubscribe > <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEMUG3TP3I57WUOKAONYGZTWF6W2RANCNFSM6AAAAAARS74E3E__;!!Mih3wA!DjVgvznAyRBB7uWAzwffmbnusztFx47kmIbGPA-evki9mnJpNE6saOQAPn6YcNUeNmskKIx4NQ5VNck2V2z0DmJ_kA$> > . > You are receiving this because you are subscribed to this thread.Message > ID: ***@***.***> >

Eric Edsinger · Answer 3 · Wed Nov 02 2022 14:38:41 GMT+0800 (China Standard Time)

Hi!

Thank you so much for all the detailed feedback! That all makes sense - and definitely helps clarify things for me. That would be fantastic if you were to take a look at Octopus. The 2015 paper is here:

https://www.nature.com/articles/nature14668

There is a newly released version of the genome in NCBI RefSeq that is now chromosome-scale. The SRR that I used was SRR20907430.

I can imagine it may be hard to have a clear cutoff for detecting genome duplication based on just high scoring kmers. I wonder, what if instead you considered the more general distribution. Or 6-translated and evaluated in protein space. Then typical repeats vs genome-scale single-copy coding sequence vs genome-scale duplicated coding sequence might become apparent? And protein space could allow things to go deeper into time.

Or 6-way translated and collected all protein sequence - and then evaluated kmers of just it for duplication....? Maybe this would clean out repeats and their variability between species/genomes - but still allow genome-scale-ish evaluation. Lots of incorrect protein fragments I could imagine but might not matter too much if they don't over whelm things.

It will be interesting to see how genome duplication evaluation works out for RESPECT in the end - and thank you again for such a useful tool.

Good luck :)

Shahab Sarmashghi · Answer 4 · Wed Nov 02 2022 22:05:21 GMT+0800 (China Standard Time)

Hi Eric, Thank you for sharing the link to the paper, and your ideas about WGD identification. You are right, it's hard to decide on it using a simple cutoff on high copy repeats, and more detailed analysis is needed. One suggestion: since there is a new assembly available, you can run RESPECT on both assemblies and see if there is any difference in terms of the uniqueness ratios. That can show how well repeats are captured in the original assembly. You can also compare those with estimates from reads. Best, Shahab

…

On Wed, Nov 2, 2022 at 2:38 AM Eric Edsinger ***@***.***> wrote: Hi! Thank you so much for all the detailed feedback! That all makes sense - and definitely helps clarify things for me. That would be fantastic if you were to take a look at Octopus. The 2015 paper is here: https://www.nature.com/articles/nature14668 There is a newly released version of the genome in NCBI RefSeq that is now chromosome-scale. The SRR that I used was SRR20907430. I can imagine it may be hard to have a clear cutoff for detecting genome duplication based on just high scoring kmers. I wonder, what if instead you considered the more general distribution. Or 6-translated and evaluated in protein space. Then typical repeats vs genome-scale single-copy coding sequence vs genome-scale duplicated coding sequence might become apparent? And protein space could allow things to go deeper into time. Or 6-way translated and collected all protein sequence - and then evaluated kmers of just it for duplication....? It would be interesting to see how it works out for RESPECT in the end - and thank you again for such a useful tool. Good luck :) — Reply to this email directly, view it on GitHub <#10 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEMUG3UOO5F4CLMCDCP2ONLWGID7XANCNFSM6AAAAAARS74E3E> . You are receiving this because you commented.Message ID: ***@***.***>