frederic-mahe / mumu

C++ implementation of lulu, an R package for post-clustering curation of metabarcoding data

warning: one of these is not in the OTU table

mimouschka opened this issue

Hi @frederic-mahe,
thanks for MUMU, this is great and very well explained ;-)
Following the manual, I generated a match list using:

vsearch --usearch_global lulu_input/otu_refined.fasta --db lulu_input/otu_refined.fasta --self --id 0.84 --iddef 1 --userfields query+target+id --maxaccepts 0 --query_cov 0.9 --maxhits 10 --userout lulu/mumu.matchlist

and then ran mumu:

mumu --otu_table lulu_input/otu_counts_refined.tsv --match_list lulu/mumu.matchlist --new_otu_table lulu/OTU_table_mumu_84_0.9_1.tsv --log lulu/mumu.log --minimum_relative_cooccurence 0.90

I get many warnings such as:

warning: one of these is not in the OTU table: SWARM_8316 SWARM_2575 95.2

As I produced otu_refined.fasta and the otu_counts_refined table from the same phyloseq object, I don't really understand why some OTUs in the match list are missing from the OTU table...

Hi @mimouschka, there could be several causes for that particular issue. I don't have enough information to give you a more precise answer.

Could you please check whether SWARM_8316 and SWARM_2575 are present in otu_counts_refined.tsv? Check for exact full-length matches.

vsearch defines a fasta identifier as the string between the initial '>' symbol and the first space, tab, or end of line. If your OTU identifiers contain spaces or non-ASCII characters, vsearch will truncate them.
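For example, here is a quick command-line sketch to check for exact full-length matches (it assumes the OTU identifiers sit in the first tab-separated column of your table):

# extract the first column and count whole-line matches only (grep -x)
cut -f 1 otu_counts_refined.tsv | grep -x -c 'SWARM_8316'
cut -f 1 otu_counts_refined.tsv | grep -x -c 'SWARM_2575'
# a count of 0 means the identifier is absent or written differently (quotes, spaces, extra characters)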

hi again!
I do confirm that both SWARM_8316 and SWARM_2575 are present in the otu_counts_refined.tsv file.

My fasta headers are in the format ">SWARM_8316", so no spaces in headers.

Attached are the files (tsv and fasta):
mumu-test.zip

The issue comes from your OTU table otu_counts_refined.tsv.

In the fasta file and in the generated match list mumu.matchlist, OTU identifiers are written as SWARM_8316, whereas in the OTU table they are quoted: "SWARM_8316". mumu sees these as two different identifiers (which is correct). When exporting your table to a tsv file, you can choose whether or not to surround strings with quotes. Alternatively, you can remove the quotes after the fact:

sed -i 's/\"//g' otu_counts_refined.tsv
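To double-check, you can compare how the identifiers are written in both files and confirm that no quote characters remain after the fix (a quick sketch, using the same file names as above):

head -n 2 lulu/mumu.matchlist       # identifiers appear as SWARM_xxxx
head -n 2 otu_counts_refined.tsv    # identifiers appear quoted ("SWARM_xxxx") before the fix
grep -c '"' otu_counts_refined.tsv  # should print 0 once the quotes are removed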

If you're starting from a phyloseq object, I suggest you try the mumu wrapper mumu_pq.

Don't forget to close the issue if you consider it solved.

amazing, thanks so much! Yes, I think I forgot to use quote=F when exporting the tsv file, but your sed command worked perfectly too :)

Now, regarding your last comment on the mumu wrapper: I was actually thinking of using mumu right after swarm, since it seems I can increase the number of threads so that mumu can handle larger tables. Do you think that would work?
The main reason I first went into R before lulu was to clean up the data and get a smaller table to curate, so if I can spare myself the back and forth, that would be great :)

swarm and vsearch are multithreaded and can take advantage of multi-core CPUs. However, mumu is not yet multithreaded: there is a --threads option, but it has no effect; mumu always uses one thread. I developed mumu because lulu was too slow for the dataset I wanted to process (10,000 samples, 100,000 OTUs). mumu is currently fast enough to process such datasets in less than an hour, so there is little incentive to make it multithreaded.

That being said, I would like to make mumu faster at parsing input files. That could bring a 2x speed-up.