shahab-sarmashghi / RESPECT

Estimating repeat spectra and genome length from low-coverage genome skims

Behavior of RESPECT on diploid, heterozygous organisms?

jflot opened this issue

Hello Shahab,
I have been trying RESPECT on some heterozygous genomes from diploid organisms. In such cases RESPECT seems to return the diploid genome size. For instance, for the rotifer Adineta vaga (the published reference genome has a haploid assembly size of 100 Mb and a diploid genome size of 200 Mb; Simion et al. 2021), using the first 20% of read1 from https://www.ncbi.nlm.nih.gov/sra/ERX295060[accn], I get the following output:

| sample | input_type | sequence_type | coverage | genome_length | uniqueness_ratio | HCRM | sequencing_error_rate | average_read_length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ERR321927_1_twentieth.fastq | sequence | genome-skim | 2.44 | 207372387 | 0.46 | 56.72 | 0.0028 | 100.9993 |

Can you confirm that this is the intended behavior of RESPECT? If so, it might be worth mentioning in the README, since most users would probably expect RESPECT to return the size of a collapsed haploid assembly rather than that of a phased diploid one.
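
As a quick sanity check, the reported estimate can be compared with the published haploid assembly size. The sketch below is only illustrative: it assumes the header and row are whitespace-delimited as quoted above, takes the 100 Mb figure from this post, and uses variable names that are made up for the example rather than taken from RESPECT.

```python
# Header and data row copied from the RESPECT output quoted in this post.
header = (
    "sample input_type sequence_type coverage genome_length "
    "uniqueness_ratio HCRM sequencing_error_rate average_read_length"
)
row = (
    "ERR321927_1_twentieth.fastq sequence genome-skim 2.44 207372387 "
    "0.46 56.72 0.0028 100.9993"
)

# Pair each column name with its value (assumes whitespace-delimited fields).
record = dict(zip(header.split(), row.split()))
estimated_length = int(record["genome_length"])

# Haploid assembly size quoted in the post (100 Mb); illustrative constant.
HAPLOID_ASSEMBLY_BP = 100_000_000

ratio = estimated_length / HAPLOID_ASSEMBLY_BP
print(f"estimate / haploid assembly size = {ratio:.2f}")
```

A ratio near 2 is what one would expect if the estimate tracks the diploid genome size rather than the collapsed haploid one.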

You are right, we should mention that. The genome length is computed from the k-mer distribution. If within-species heterozygosity is low, the estimate will be close to the haploid length (slightly above it), but if heterozygosity is high, it will approach the diploid genome length.
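
To give a rough feel for this, here is a toy model in Python. It is not RESPECT's actual algorithm. It assumes a repeat-free genome in which heterozygous sites are independent SNPs occurring at a per-base rate `heterozygosity`, so a k-mer is shared between the two haplotypes with probability `(1 - heterozygosity) ** k`. Shared k-mers are counted once and diverged k-mers twice. The function name and the default `k=31` are assumptions made for illustration.

```python
def expected_kmer_based_length(haploid_length, heterozygosity, k=31):
    """Toy estimate of the genome length seen through distinct k-mers.

    Assumes independent SNPs at rate `heterozygosity` and no repeats.
    """
    # Probability that a k-mer contains no heterozygous site.
    shared_fraction = (1.0 - heterozygosity) ** k
    # Shared k-mers contribute once, diverged k-mers once per haplotype.
    return haploid_length * (2.0 - shared_fraction)


if __name__ == "__main__":
    haploid_length = 100_000_000  # 100 Mb, as in the Adineta vaga example
    for h in (0.0, 0.001, 0.01, 0.1):
        length = expected_kmer_based_length(haploid_length, h)
        print(f"heterozygosity={h:<6} length ~ {length / 1e6:.1f} Mb")
```

Under these assumptions the value starts at the haploid length when heterozygosity is zero and rises toward twice that length as heterozygosity increases, which matches the qualitative behavior described above.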

Added a note to the README