Assembling big data

Question

lzaramela opened this issue 5 years ago

Hey,
I have a big dataset (>600M paired-end reads) and I am trying to generate a protein catalog using Plass. I am using version 2.c7e35 on a server with 900 GB of RAM. The run ends before completion because it exceeds the requested resources. Is it possible to tweak the parameters so that Plass allocates less memory?
Any input will be greatly appreciated.
Thanks,
Livia

Milot Mirdita · Answer 1 · Thu Jul 25 2019 02:39:34 GMT+0800 (China Standard Time)

Hi Livia,

Could you please post the log of the run? Plass should split up the work so it always fits into the available memory.

Best regards,
Milot

lzaramela · Answer 2 · Thu Jul 25 2019 02:53:29 GMT+0800 (China Standard Time)

Sure... here is the log file
PLASS_West.txt

I got the following message:
Execution terminated
Exit_status=271
resources_used.cput=46:09:04
resources_used.mem=531170280kb
resources_used.vmem=832592604kb
resources_used.walltime=42:54:33
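
For readers comparing these numbers with the 900 GB mentioned above, here is a minimal Python sketch that parses the summary and converts the memory figures to GiB. It assumes the scheduler reports `kb` as 1024 bytes and follows the Torque/PBS convention that an exit status above 256 encodes 256 plus a signal number; the scheduler is not named in this thread, so both are assumptions, and the helper names are made up for illustration.

```python
import re
import signal

# Scheduler summary copied from the post above.
LOG = """\
Exit_status=271
resources_used.cput=46:09:04
resources_used.mem=531170280kb
resources_used.vmem=832592604kb
resources_used.walltime=42:54:33
"""


def parse_summary(text):
    """Return the key=value pairs of the summary as a dict of strings."""
    return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)


def kb_to_gib(value):
    """Convert a string such as '531170280kb' to GiB (assumes 1 kb = 1024 bytes)."""
    match = re.fullmatch(r"(\d+)kb", value)
    if match is None:
        raise ValueError(f"unexpected memory value: {value!r}")
    return int(match.group(1)) / 1024**2


def signal_from_exit_status(status):
    """Assumed Torque/PBS convention: status > 256 means 256 + signal number."""
    if status > 256:
        return signal.Signals(status - 256).name
    return None


summary = parse_summary(LOG)
print(f"mem:  {kb_to_gib(summary['resources_used.mem']):.1f} GiB")
print(f"vmem: {kb_to_gib(summary['resources_used.vmem']):.1f} GiB")
print("signal:", signal_from_exit_status(int(summary["Exit_status"])))
```

Under those assumptions the job used roughly 507 GiB of resident memory and 794 GiB of virtual memory, and the exit status corresponds to SIGTERM, which is consistent with the scheduler stopping the job rather than Plass crashing on its own.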

Martin Steinegger · Answer 3 · Thu Jul 25 2019 09:46:16 GMT+0800 (China Standard Time)

Thanks a lot! How much memory does your machine have? Normally Plass tries to split the database if it does not fit in memory.

lzaramela · Answer 4 · Fri Jul 26 2019 01:12:06 GMT+0800 (China Standard Time)

It is a CentOS server, and I can use up to 900 GB of RAM.

Martin Steinegger · Answer 5 · Fri Jul 26 2019 05:29:47 GMT+0800 (China Standard Time)

So it seems that the extractorfs step is hanging. That step is mostly I/O-bound. Is it possible that the tmp folder is on some slow network share?
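
One way to answer the network-share question on a Linux machine is to look up which mount the tmp folder belongs to. The following is a minimal sketch, assuming `/proc/mounts` is available; the `NETWORK_FS` set is an illustrative, incomplete list of network filesystem types, and the folder name `tmp` stands in for whatever path was passed to Plass.

```python
import os

# Illustrative, non-exhaustive set of network filesystem type names.
NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs", "lustre", "gpfs",
              "beegfs", "ceph", "glusterfs", "fuse.sshfs"}


def parse_mounts(text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point containing an absolute path."""
    best_point, best_type = "", None
    for mount_point, fs_type in mounts:
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same mount point win.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_point):
            best_point, best_type = mount_point, fs_type
    return best_type


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    tmp_dir = os.path.realpath("tmp")  # hypothetical tmp folder given to Plass
    with open("/proc/mounts") as handle:
        fs_type = fs_type_for(tmp_dir, parse_mounts(handle.read()))
    kind = "network" if fs_type in NETWORK_FS else "probably local"
    print(f"{tmp_dir}: {fs_type} ({kind})")
```

If the reported type is a network filesystem, pointing the tmp folder at a local disk is the obvious thing to try first.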

One trick to reduce the number of sequences extracted is to increase the minimum ORF length with --min-length (default: 20).
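
To illustrate why a higher minimum length shrinks the extracted set, here is a toy Python sketch. It is not how Plass implements ORF extraction: it only scans one reading frame of a made-up sequence, treats every stop-free stretch as an ORF, and counts how many stretches survive a given threshold in codons. The function names and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_lengths(dna):
    """Lengths (in codons) of stop-free stretches in reading frame 0."""
    lengths, current = [], 0
    for i in range(0, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            if current:
                lengths.append(current)
            current = 0
        else:
            current += 1
    if current:
        lengths.append(current)
    return lengths


def count_kept(lengths, min_length):
    """Number of stretches with at least min_length codons."""
    return sum(1 for n in lengths if n >= min_length)


# Made-up sequence: stretches of 25, 45 and 10 codons separated by stops.
dna = "GCT" * 25 + "TAA" + "GCT" * 45 + "TAG" + "GCT" * 10
lengths = orf_lengths(dna)
for threshold in (20, 40):
    print(f"min length {threshold}: {count_kept(lengths, threshold)} kept")
```

In this toy case raising the threshold from 20 to 40 drops the 25-codon stretch, so fewer sequences go on to the later steps. How large the effect is on real reads depends on the data.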