AdmiralenOla / Scoary

Pan-genome wide association studies

Maximum number of genomes tested?

dutchscientist opened this issue

I am doing a Scoary test with a 5,829-genome Roary file (~250 MB) and a custom tree. It works fine in the beginning but crashes (out of memory?) when storing the pairs. The server I use runs Ubuntu 14.04 LTS (Bio-Linux 8) with a 4-core Xeon processor (8 threads), 32 GB RAM and 32 GB swap.

Is there a maximum to the number of genomes for Scoary?

There's no limit by design, at least. It could be a memory issue. The largest dataset I've run was around 3,100 genomes, and that worked fine. Are you getting an error message?

Just that Python (2.7) has crashed. If you want, I can get the full message tonight.

It stops when counting the pairs, which is something I am not really interested in anyway. Is it possible to instruct Scoary to calculate only the p-values and report those back?

Not currently possible, but that would be a useful addition that I will put in the next version for sure.

A possible workaround: set a low p-value threshold and invoke only the Individual (naïve) filtration measure. Scoary will then calculate pairwise comparisons only for genes with naïve p-values below the threshold, potentially saving a lot of memory.
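On the command line that would look something like the sketch below (file names are placeholders; -p sets the p-value cutoff and -c I selects the Individual, i.e. naïve, filtration measure):

    # Only genes with a naive p-value below 1E-50 proceed to the
    # memory-hungry pairwise comparisons step.
    scoary -g gene_presence_absence.csv \
           -t traits.csv \
           -n custom_tree.nwk \
           -p 1E-50 \
           -c I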

Ah, that is an interesting suggestion, will try that out 👍

I was also planning to make subsets of the data, to see where it crashes.

Even with p=1E-50, still no joy. This is the error:

Storing results: ST45
Calculating max number of contrasting pairs for each nominally significant gene
100.00%
Traceback (most recent call last):
  File "/usr/local/bin/scoary", line 11, in <module>
    load_entry_point('scoary==1.6.9', 'console_scripts', 'scoary')()
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 244, in main
    delimiter=args.delimiter)
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 813, in StoreResults
    num_threads, no_time, delimiter)
  File "/usr/local/lib/python2.7/dist-packages/scoary-1.6.9-py2.7.egg/scoary/methods.py", line 920, in StoreTraitResult
    Threadresults = list(Threadresults)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
RuntimeError: maximum recursion depth exceeded while calling a Python object

I went down to 1E-200 and let Scoary make its own tree, and then it works. Now I'm working my way back up via 1E-100 until it breaks again.

OK, it is a memory issue. If I let Scoary make the tree, I can get the 250 MB file analysed down to 1E-10. I then used a file double the size (same settings, but no paralog clustering in Roary); that one crashes at 1E-100, and when I check the memory, both the 32 GB RAM and the swap are full.
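(For concreteness: letting Scoary build its own tree just means dropping the -n argument, so the runs that work look something like the sketch below, with placeholder file names.)

    # Without -n, Scoary infers a tree from the gene presence/absence
    # data itself instead of reading a user-supplied Newick file.
    scoary -g gene_presence_absence.csv -t traits.csv -p 1E-10 -c I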

So to sum up, there are at least three things for me to do here:

  1. Implementing a "summary statistics only" mode that skips the pairwise comparisons algorithm.
  2. Rewriting the code to be less memory-intensive. Currently a lot of metrics are stored in memory and only written to file at the end of the analysis. This could probably be improved by writing to temporary files, destroying objects when they are no longer needed, etc.
  3. Investigating why letting Scoary make the tree has an impact on memory consumption. I have no clue why that matters.

I hope to be done with 1 fairly quickly, but 2 & 3 might take a bit longer (several months).

Hi Ola,

Don't worry too much about it! I thought it would be fun to push Roary and Scoary a bit with a very large dataset, but I'm not sure whether people will really use such datasets or, if they do, whether they have a lot more computing power than my home setup.

I am using this as a testing ground, but will probably make the set smaller by using representatives of the groups and by making smaller subgroups. The -r/-w options of Scoary are great for that, as they produce a smaller Roary set rather than having to rerun Roary every time. :)
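Something like the following, if I read the options right (the list of representatives is a placeholder file; -r restricts the analysis to those isolates and -w writes out the reduced gene presence/absence table for reuse):

    # Restrict the analysis to a subset of isolates and save the reduced
    # Roary table, so Roary itself never has to be rerun.
    scoary -g gene_presence_absence.csv \
           -t traits.csv \
           -r representatives.csv \
           -w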

The --no_pairwise option is now implemented in the latest version (1.6.11). This solves problem 1 above. I still have to fix the maximum recursion depth problem, but I'm moving that to a separate issue.
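Usage is simply (file names are placeholders):

    # --no_pairwise skips the pairwise comparisons algorithm entirely,
    # reporting only the naive association statistics.
    scoary -g gene_presence_absence.csv -t traits.csv --no_pairwise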

Just a comment: the explanatory text has not been updated to include the option? I am about to try it!

You mean in the README? Yeah, that still has the help text from a previous version (1.6.10). But in the actual script the explanatory text (as shown by -h) should be included.

Thanks for submitting an issue and for your very useful suggestion! :-)

I also came across this issue: 700 genomes, but a large traits file.

Running on the cluster:
slurmstepd: error: Detected 1 oom-kill event(s) in step 1088755.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.