niu-lab / gclust

genome sized sequences clustering

The coverage problem and (maybe) wrong cluster problem

liaoherui opened this issue

commented

Thanks for your wonderful tool!

I have two problems:

  1. Are there any parameters related to alignment coverage?
    For example:
    [Image: WeChat picture 20200218223602]
    As the picture shows, the query genome and the target genome share 99.46% identity but only 84% coverage. When I set "-memiden" to 99, they are still assigned to the same cluster.

    So, is there a parameter for filtering by a "coverage" threshold?

  2. In my experiment, there are two highly similar genomes. Their identity and coverage are shown in the picture below:
    [Image: WeChat picture 20200218224234]
    However, when I set "-memiden" to 99, they are assigned to different clusters, which confuses me. I am not sure what is going on.

(All the alignments in the pictures were done with the online megablast alignment tool.)

Thank you very much for your interest in our software. For your problem, the possible causes and solutions are as follows:

  1. The way our software calculates identity may differ slightly from blastn. We define the extended MEMs identity (eMEMi) as follows:
    eMEMi = Nmatch / Lquery, where "Nmatch" is the number of matched nucleotides within extended MEMs and "Lquery" is the length of the shorter sequence.

Therefore, the calculation assumes that the query sequence is shorter than the representative sequence (what you call the target sequence). You can check the identity of the two sequences with our software to confirm the clustering results.
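To make the definition above concrete, here is a minimal Python sketch of the eMEMi formula. It is not Gclust's implementation: the function names and the example numbers are hypothetical, and it assumes the number of matched nucleotides within extended MEMs has already been obtained by some other means.

```python
def emem_identity(n_match: int, len_a: int, len_b: int) -> float:
    """Return eMEMi = Nmatch / Lquery, where Lquery is the shorter length.

    n_match is assumed to be the number of matched nucleotides within
    extended MEMs, as in the definition quoted in this thread.
    """
    l_query = min(len_a, len_b)
    if l_query <= 0:
        raise ValueError("sequence lengths must be positive")
    return n_match / l_query


def passes_threshold(n_match: int, len_a: int, len_b: int,
                     memiden_percent: float) -> bool:
    """Check eMEMi (as a percentage) against a -memiden style cutoff."""
    return 100.0 * emem_identity(n_match, len_a, len_b) >= memiden_percent


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the genomes in this issue.
    print(passes_threshold(9950, 10000, 10800, 99.0))
    print(passes_threshold(8400, 10000, 10800, 99.0))
```

Because the denominator is the length of the shorter sequence only, this value is not the same quantity as the identity or query coverage reported by megablast, so the two tools can disagree on the same pair.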

  2. You mentioned two alignments: ZKV_420 vs. Query_57683, and ZKV184 vs. Query_8205. Are these public accession numbers? We could not find them online. If convenient, please provide the accession numbers.

I hope this helps. If you still have questions, please feel free to email lirl@sccas.cn or niubf@cnic.cn. We are happy to discuss this with you. Thank you very much!

commented

Hi, sclirl, thanks for your fast reply!

For Problem 1, I think I understand what you mean. So in theory, if I set "-memiden" to 99, then every genome in a cluster should have eMEMi >= 99% to the representative genome (the longest one), right?

About Problem 2, I have uploaded the FASTA file I used for the experiment. It contains 648 complete viral genomes; ZKV_2 and ZKV_184 are the pair shown in the Problem 2 picture. They are highly similar, but Gclust assigns them to different clusters.
The command I used is:
gclust -minlen 41 -both -nuc -threads 16 -chunk 400 -loadall -memiden 99 -rebuild -ext 1 -sparse 4 ZKV_rebuild_gclust_remove100.fasta > ZKV_rebuild.gclust.cutoff_99.cls

You can download the data and see what's going on in this case.
ZKV_rebuild_gclust_remove100.zip

commented

Hi, sclirl

Sorry, I found the likely reason for Problem 2: I forgot to sort the genomes before running Gclust. After the sorting step, ZKV_2 and ZKV_184 are clustered correctly.
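For readers who hit the same issue, a sorting step could look roughly like the Python sketch below. It assumes that "sorting" means ordering the genomes from longest to shortest, which fits the statement in this thread that the representative genome is the longest one, but the thread does not spell this out. The function names are hypothetical, and this is not the script shipped with Gclust.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def sort_fasta_by_length(in_path, out_path, width=70):
    """Write records to out_path ordered from longest to shortest.

    Returns the headers in their new order. Ties keep their input
    order because Python's sort is stable.
    """
    records = sorted(read_fasta(in_path),
                     key=lambda rec: len(rec[1]), reverse=True)
    with open(out_path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n")
            # Wrap sequence lines at a fixed width.
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
    return [header for header, _ in records]


# Example call (file names are placeholders):
# sort_fasta_by_length("genomes.fasta", "genomes.sorted.fasta")
```

The sketch loads all records into memory, which is fine for a few hundred viral genomes but may not suit much larger inputs.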

However, for ZKV_26 and ZKV_184, the problem persists even after I sort the genomes. They are very similar (>99% query coverage and >99% identity with online megablast), but they are still assigned to different clusters, which confuses me.

Hi, liaoherui,
Yes, your understanding of Problem 1 is correct: with "-memiden 99", every genome in a cluster should have eMEMi >= 99% to the representative genome (the longest one).

Problem 2: You can set the -minlen and -sparse parameters to smaller values. These two parameters have a large impact on the clustering result; the recommended values are -minlen 21 and -sparse 1 (or 2).
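As an illustration, the suggested values can be dropped into the command posted earlier in this thread. The Python sketch below only assembles and prints the argument list; it reuses the flags from that command unchanged, and the helper name and the input file name are hypothetical.

```python
import shlex


def build_gclust_command(fasta, memiden=99, minlen=21, sparse=1,
                         threads=16, chunk=400, ext=1):
    """Assemble a gclust argument list using flags seen in this thread.

    Defaults for minlen and sparse follow the values recommended above;
    the other defaults are copied from the command posted in the issue.
    """
    return [
        "gclust",
        "-minlen", str(minlen),
        "-both", "-nuc",
        "-threads", str(threads),
        "-chunk", str(chunk),
        "-loadall",
        "-memiden", str(memiden),
        "-rebuild",
        "-ext", str(ext),
        "-sparse", str(sparse),
        fasta,
    ]


if __name__ == "__main__":
    # "ZKV_sorted.fasta" is a placeholder for an already sorted input file.
    cmd = build_gclust_command("ZKV_sorted.fasta")
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell with the output redirected to a .cls file, as in the original command. Passing sparse=2 gives the alternative value mentioned above.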

If you have other questions, feel free to contact us at any time. Thank you!