Maximum number of genomes tested?
dutchscientist opened this issue
I am doing a Scoary test with a 5,829-genome Roary file (~250 MB) and a custom tree. It works fine in the beginning, but crashes (out of memory?) when storing the pairs. The server I use runs Ubuntu 14.04 LTS (Biolinux 8) with a 4-core Xeon processor (8 threads), 32 GB RAM, and 32 GB swap.
Is there a maximum to the number of genomes for Scoary?
There's no limit by design, at least. It could be a memory issue. The largest data set I've run was around 3,100 genomes, and that worked fine. Are you getting an error message?
Just that Python (2.7) has crashed. If you want I can get the full message tonight.
It stops when counting the pairs, which is something I am not really interested in anyway. Is it possible to instruct Scoary to calculate only the p-value and report that back?
Not currently possible, but that would be a useful addition that I will put in the next version for sure.
A possible workaround: set a low p-value threshold and invoke only the individual (naïve) filtration measure. Scoary will then calculate pairwise comparisons only for genes with naïve p-values below the threshold, potentially saving a lot of memory.
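For example, a minimal invocation sketch, assuming the Scoary 1.6.x flag names (`-p` for the p-value cutoff, `-c I` for the individual/naïve filter, `-n` for the custom tree) and placeholder file names:

```
scoary -g gene_presence_absence.csv \
       -t traits.csv \
       -n custom_tree.nwk \
       -p 1E-50 \
       -c I
```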
Ah, that is an interesting suggestion, will try that out 👍
Was also planning to make subsets of the data, to see where it crashes.
Even with p=1E-50, still no joy. This is the error:
```
Storing results: ST45
Calculating max number of contrasting pairs for each nominally significant gene
100.00%Traceback (most recent call last):
  File "/usr/local/bin/scoary", line 11, in <module>
    load_entry_point('scoary==1.6.9', 'console_scripts', 'scoary')()
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 244, in main
    delimiter=args.delimiter)
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 813, in StoreResults
    num_threads, no_time, delimiter)
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 920, in StoreTraitResult
    Threadresults = list(Threadresults)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
RuntimeError: maximum recursion depth exceeded while calling a Python object
```
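One hedged workaround sketch for the recursion error specifically (it won't help if RAM is genuinely exhausted): raise CPython's recursion limit before Scoary starts. This assumes the `scoary` console script maps to `scoary.methods.main()`, as the traceback suggests, and that the limit is inherited by forked worker processes on Linux:

```python
import sys

# Assumption: the recursion depth scales with the depth of the phylogenetic
# tree, so a larger limit may get past the default of 1000. Set it before
# main() creates its multiprocessing pool so forked workers inherit it.
sys.setrecursionlimit(20000)

from scoary.methods import main

main()  # parses sys.argv just like the normal 'scoary' entry point
```

Run as e.g. `python run_scoary.py -g gene_presence_absence.csv -t traits.csv` (wrapper script name hypothetical).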
Went down to 1E-200 and let Scoary make its own tree, and then it works. Now trying less strict thresholds, starting at 1E-100, until it breaks again.
OK, it is a memory issue. If I let Scoary make the tree, the 250 MB file can be analysed at thresholds down to 1E-10. I then used a file double the size (same settings, but no paralog clustering in Roary); that one crashes at 1E-100, and when I check the memory, both the 32 GB RAM and the swap are full.
So to sum up, there are at least three things for me to do here:
- Implement a "summary statistics only" mode that skips the pairwise comparisons algorithm.
- Rewrite the code to be less memory-intensive. Currently a lot of metrics are stored in memory and only written to file at the end of the analysis. This could probably be improved by writing to temporary files, destroying objects when they are no longer needed, etc. (see the sketch after this list).
- Investigate why letting Scoary make the tree has an impact on memory consumption.
I have no clue as to why that matters.
I hope to be done with 1 fairly quickly, but 2 and 3 might take a bit longer (several months).
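A minimal sketch (Python 3) of the streaming pattern meant in point 2, under the assumption that results arrive as an iterable of per-gene rows; this is illustrative, not Scoary's actual code:

```python
import csv

def stream_results(result_iter, outfile):
    """Write each result row as soon as it is produced, so memory use
    stays flat instead of growing with the number of genes.

    result_iter: any iterator/generator yielding (gene, naive_p, pairwise_p)
    tuples -- a hypothetical stand-in for Scoary's per-gene results.
    """
    with open(outfile, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Gene", "Naive_p", "Pairwise_p"])  # illustrative header
        for row in result_iter:
            writer.writerow(row)  # the row can be garbage-collected after this
```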
Hi Ola,
don't worry too much about it! I thought it would be fun to push Roary and Scoary a bit with a very large dataset, but I'm not sure whether people will really use such datasets, or, if they do, whether they have a lot more power than my home setup.
I am using this as a testing ground, but will probably make the set smaller by using representatives of the groups and by making smaller subgroups. The -r/-w options of Scoary are great for that, as they produce a smaller Roary set rather than having to rerun Roary every time. :)
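For reference, a sketch of that reduction step, assuming the 1.6.x flag names (`-r` pointing to a file listing the isolates to keep, `-w` to write the reduced gene presence/absence file) and placeholder file names:

```
scoary -g gene_presence_absence.csv \
       -t traits.csv \
       -r representatives.csv \
       -w
```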
The --no_pairwise option is now implemented in the latest version (1.6.11). This is a solution to problem 1 referenced above. I will still have to fix the maximum recursion depth problem, but I'm moving that to a separate issue.
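With placeholder file names, usage is just the one extra flag:

```
scoary -g gene_presence_absence.csv -t traits.csv --no_pairwise
```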
Just a comment: the explanatory text doesn't seem to have been updated to include the option? About to try it!
You mean in the README? Yeah, that still has the help text from a previous version (1.6.10). But in the actual script the explanatory text (as seen with -h) should be included.
Thanks for submitting an issue and for your very useful suggestion! :-)
Also came across this issue: 700 genomes, but a large traits file.
Running on the cluster:

```
slurmstepd: error: Detected 1 oom-kill event(s) in step 1088755.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
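Not a fix for the underlying memory use, but one workaround on SLURM is simply to request more memory for the job. The directives are standard SLURM; the values and file names are illustrative:

```bash
#!/bin/bash
#SBATCH --job-name=scoary
#SBATCH --mem=64G            # ask the scheduler for more RAM than the default
#SBATCH --cpus-per-task=8

# --no_pairwise (available from Scoary 1.6.11) also cuts memory use
scoary -g gene_presence_absence.csv -t traits.csv --no_pairwise
```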