rdk / p2rank

P2Rank: a machine-learning-based tool for protein-ligand binding site prediction. Stand-alone command-line program / Java library for predicting ligand binding pockets from protein structure.

Home Page: https://rdk.github.io/p2rank/
What is a huge dataset?

skodapetr opened this issue · comments

In training-tutorial.md there is a note: '''Turn off for huge datasets that won't fit to memory.''' It would really help to give an estimate of how much memory P2Rank uses, as one might consider hundreds of proteins a small dataset that should fit into main memory. For example, if I have 2500 proteins and 32 GB of RAM, I would expect that to be fine, but is it?

commented

This doesn't really have a simple answer. It depends on many factors, including the size of the proteins, the density of SAS points, and the number of RF trees you want to train in parallel. Yes, 2500 proteins of average size (say, 2500 atoms each) will definitely fit into 32 GB of RAM, but that might not be enough RAM to train the RF in parallel on 16 threads.
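To make those knobs concrete, here is a minimal sketch of how one might dial the main memory drivers down for a training run. It is only an illustration: the dataset files are placeholders, and the parameter names (threads, rf_trees, cache_datasets) should be verified against the params documentation of your P2Rank version.

```
# Illustrative sketch only: train.ds / eval.ds are placeholder dataset files.
# Fewer threads means fewer RF trees grown at once (lower peak heap, longer
# run); cache_datasets 0 avoids keeping parsed structures in memory.
./prank traineval -t train.ds -e eval.ds \
    -threads 4 -rf_trees 100 -cache_datasets 0
```

Turning dataset caching off trades repeated parsing time for memory, and is presumably the switch the tutorial note you quoted refers to.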

I have reformulated the tutorials and added the section Required memory and memory/time trade-offs to training-tutorial.md. Check whether you are still missing any information there.

I don't have particular memory usage estimates at hand; I will try to measure them and include them at some point, or will welcome your contribution.
One recent estimate, though: training on the MG dataset of ~2200 proteins with 12 threads, default density, and no subsampling needed ~55 GB of RAM.
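Since P2Rank runs on the JVM, the peak heap is capped by the standard -Xmx flag. A minimal sketch, assuming you invoke the jar directly (the jar name and dataset files are placeholders; in practice you would set the same flag wherever your prank launcher script builds the java command line):

```
# -Xmx is a standard JVM flag: give the heap 64 GB for a large run such as
# the ~2200-protein / 12-thread example above. p2rank.jar, train.ds and
# eval.ds are placeholders for this sketch.
java -Xmx64G -jar p2rank.jar traineval -t train.ds -e eval.ds -threads 12
```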