[BUG] anvi-dereplicate-genomes python? error
natalia-rodilla opened this issue
Short description of the problem
anvi-dereplicate-genomes fails when some genomes are not grouped in a cluster with others. See discord: https://discord.com/channels/1002537821212512296/1205459293932093450
anvi'o version
Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.13
Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2
System info
Which operating system you are using? Ubuntu 18.04.6 LTS
How did you install anvi'o? Using conda
Detailed description of the issue
I tried to run anvi-dereplicate-genomes on a collection of (66) external genomes. I expected some of them to be clustered together and others to be the only ones in their cluster. I got an error instead (traceback below).
I tried using pyANI and got the same error. In neither case did I obtain the similarity results, although I had previously run anvi-compute-genome-similarity on the same set of external genomes successfully.
When I ran anvi-dereplicate-genomes with --representative-method set to length or centrality (default) on the same set of genomes, I didn't have any issues.
When I tried to subset some genomes for a reproducible example, I managed to narrow down the issue. If I test only genomes that cluster with others (in one or several clusters), it runs without problems. If I include genomes that end up alone in a cluster, the error appears, even if only one genome is not clustered with others.
Command, messages and traceback text:
anvi-dereplicate-genomes -e test9-external-genomes.txt -o derep99Q_test10/ --program fastANI --similarity-threshold 0.99 --representative-method Qscore
Run mode .....................................: fastANI
CITATION
===============================================
Anvi'o will use 'fastANI' by Jain et al. (DOI: 10.1038/s41467-018-07641-9) to
compute ANI. If you publish your findings, please do not forget to properly
credit their work.
[fastANI] Kmer size ..........................: 16
[fastANI] Fragment length ....................: 3,000
[fastANI] Min fraction of alignment ..........: 0.25
[fastANI] Num threads to use .................: 1
[fastANI] Log file path ......................: /tmp/tmp1nhwi5cf
fastANI similarity metric ....................: calculated
Number of genomes considered .................: 7
[09 Feb 24 16:11:21 Dereplication] All 21 pairwise comparisons have been made ETA: None
Traceback (most recent call last):
  File "/home/bioinfoteam/anaconda3/envs/anvio-8/bin/anvi-dereplicate-genomes", line 118, in <module>
    derep.process()
  File "/home/bioinfoteam/anaconda3/envs/anvio-8/lib/python3.10/site-packages/anvio/genomesimilarity.py", line 390, in process
    self.dereplicate()
  File "/home/bioinfoteam/anaconda3/envs/anvio-8/lib/python3.10/site-packages/anvio/genomesimilarity.py", line 511, in dereplicate
    self.cluster_to_representative = self.get_representative_for_each_cluster()
  File "/home/bioinfoteam/anaconda3/envs/anvio-8/lib/python3.10/site-packages/anvio/genomesimilarity.py", line 595, in get_representative_for_each_cluster
    representative_name = self.pick_representative_with_largest_Qscore(cluster)
  File "/home/bioinfoteam/anaconda3/envs/anvio-8/lib/python3.10/site-packages/anvio/genomesimilarity.py", line 549, in pick_representative_with_largest_Qscore
    return cluster[0]
TypeError: 'set' object is not subscriptable
Files / commands to reproduce the issue
Command:
anvi-dereplicate-genomes -e test9-external-genomes.txt -o derep99Q_test/ --program fastANI --similarity-threshold 0.99 --representative-method Qscore
I uploaded the files here: https://drive.google.com/drive/folders/1mHq6k2pNwlzufIlT_tSDnuKJpAtlSh6i?usp=drive_link
The "problematic" genome in this case is the last one in the genomes file.
Hi @natalia-rodilla,
Thank you very much for the detailed report and the test case. I was able to reproduce your error on my system.
As you have discerned already, it seems to be a bug that happens only when there is a single genome in a given cluster. This is where it is failing in the genomesimilarity.py code:
if len(cluster) == 1:
    return cluster[0]
The problem is that the pick_representative_with_largest_Qscore() function (which contains this code) seems to expect the cluster variable to be a list, when in fact it is a set, and sets do not support indexing.
The very simple fix was to cast cluster to a list before trying to extract the sole element, which I implemented in commit fbc6cf3.
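For anyone curious, here is a minimal standalone sketch of the failure and of the kind of fix described above. It is not the anvi'o code itself: the helper name pick_sole_member and the genome name are hypothetical, made up for illustration, and the exact change in the commit may be written differently.

```python
def pick_sole_member(cluster):
    """Return the only member of a single-genome cluster.

    Hypothetical helper for illustration; not part of anvi'o.
    """
    if len(cluster) != 1:
        raise ValueError("expected a cluster with exactly one genome")
    # Sets do not support indexing, so cluster[0] raises TypeError.
    # Casting to a list first makes the sole element reachable by index.
    return list(cluster)[0]


cluster = {"genome_A"}  # hypothetical genome name

# Reproduce the original failure: indexing a set directly.
try:
    cluster[0]
except TypeError as e:
    error_message = str(e)

print(error_message)
print(pick_sole_member(cluster))
```

An equivalent way to pull the single element out of a set without building a list would be next(iter(cluster)); either approach avoids the TypeError.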
This is the output I get on your test set after running with the fixed code in the development branch of anvi'o:
Run mode .....................................: fastANI
CITATION
===============================================
Anvi'o will use 'fastANI' by Jain et al. (DOI: 10.1038/s41467-018-07641-9) to
compute ANI. If you publish your findings, please do not forget to properly
credit their work.
[fastANI] Kmer size ..........................: 16
[fastANI] Fragment length ....................: 3,000
[fastANI] Min fraction of alignment ..........: 0.25
[fastANI] Num threads to use .................: 1
[fastANI] Log file path ......................: /var/folders/nc/7dlw5z2j16q3s14586qddwl8nhxpgh/T/tmpeywzl020
fastANI similarity metric ....................: calculated
Number of genomes considered .................: 7
Number of redundant genomes ..................: 5
Final number of dereplicated genomes .........: 2
ANI RESULTS
===============================================
* Matrix and clustering of 'ani' written to output directory
* Matrix and clustering of 'alignment fraction' written to output directory
* Matrix and clustering of 'mapping fragments' written to output directory
* Matrix and clustering of 'total fragments' written to output directory
* Cleaning up the temp directory (you can use `--debug` if you would like to keep
it for testing purposes)
The similarity scores output gets written, and the resulting representative genomes for each cluster in derep99Q_test10/GENOMES/ are BA_92.fa and UW8_POB.fa. :)
So if you install anvio-dev by following the instructions here: https://anvio.org/install/linux/dev/ and pull the latest commits to the repository, you should be able to run this program without this error.