Multiprocessing with apply_async does not work

Question

sanjaysrikakulam opened this issue 3 years ago

Sanjay Kumar Srikakulam commented 3 years ago

Sorry again, I found one more issue while trying to parallelize and share the index across multiple processes.

Here is the code:

from multiprocessing import Manager, Pool
from pyfastx import Fasta

def print_seq_names(fasta_obj, lock):
    for i in range(5):
        with lock:
            print(fasta_obj[i].name)

def error_call(err):
    print(err)

fasta_obj = Fasta("uniprot_sprot.fasta.gz")
pool = Pool(5)
lock = Manager().Lock()

for i in range(4):
    pool.apply_async(print_seq_names, args=(fasta_obj, lock), error_callback=error_call)

pool.close()
pool.join()

Error:
<multiprocessing.pool.ApplyResult object at 0x7f7605085090>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f20390>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f201d0>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f20390>
can't pickle Fasta objects

Is it not possible to share the Fasta object or the index or the identifier objects with multiple processes anymore?

Also, if I make the Fasta object and the identifier object global variables in my code (the above is a simplified sample), I see only one process/core running at a time (out of 64 cores in the real code) while the rest are in the sleep state. Do you know why this happens?

Any help here would be great as well.

Thank you!

P.S:
OS: CentOS 7
Python : 3.7.7
pyfastx: 0.8.3

Sanjay Kumar Srikakulam · Answer 1 · Wed Jun 30 2021 21:05:37 GMT+0800 (China Standard Time)

Hi @lmdu,

Any idea what is actually happening here with multiprocessing? I am writing a paper for my tool, which uses pyfastx and depends on parallelization. Any fix or suggestion to get this working would be really great!

Lianming Du · Answer 2 · Wed Jun 30 2021 21:22:48 GMT+0800 (China Standard Time)

I am sorry, but pyfastx does not support pickling, so you cannot pass a Fasta object as an argument to a multiprocessing worker. Implementing this is complicated, and I have not yet found a way to share a file handle between processes. I plan to add pickle support in pyfastx v0.9.0.
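
A common way around an unpicklable argument is to send each worker only the file path and open the resource once per process in a Pool initializer. The sketch below is a minimal illustration of that pattern, not pyfastx's own recommendation: it uses a plain text file handle (which also cannot be pickled) as a stand-in, and the names init_worker, read_line and run are hypothetical. With pyfastx, the assumption is that the open() call would be replaced by pyfastx.Fasta(path).

```python
import os
import sys
from multiprocessing import Pool

_handle = None  # one handle per worker process, set by init_worker


def init_worker(path):
    """Runs once in each worker: open the file there instead of pickling a handle."""
    global _handle
    # Stand-in resource (assumption). With pyfastx this line would be:
    #     _handle = pyfastx.Fasta(path)
    _handle = open(path, "r")


def read_line(index):
    """Task body: use the worker-local handle; only the small index is pickled."""
    _handle.seek(0)
    for current, line in enumerate(_handle):
        if current == index:
            return os.getpid(), line.rstrip("\n")
    raise IndexError(index)


def run(path, indices, processes=4):
    """Submit one task per index and collect (worker pid, line) pairs."""
    with Pool(processes, initializer=init_worker, initargs=(path,)) as pool:
        pending = [pool.apply_async(read_line, (i,)) for i in indices]
        return [p.get() for p in pending]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py some_text_file
    for pid, line in run(sys.argv[1], range(3)):
        print(pid, line)
```

Because only the path and the integer index cross the process boundary, nothing unpicklable is ever sent to the pool.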

Sanjay Kumar Srikakulam · Answer 3 · Wed Jun 30 2021 21:34:25 GMT+0800 (China Standard Time)

OK, thank you for the information. But will there be a memory overhead, if I create a fasta object in every child process?

Say my fasta/fastq index is of size 40 or 50GiB and I use 64 cores, so if each of my processes creates a fasta object, it means there will be a memory overhead, right?

Lianming Du · Answer 4 · Wed Jun 30 2021 21:53:07 GMT+0800 (China Standard Time)

There should be no memory overhead. Pyfastx does not load the entire index into memory.

Sanjay Kumar Srikakulam · Answer 5 · Wed Jun 30 2021 21:56:33 GMT+0800 (China Standard Time)

OK, I will check this and see whether each process loads something in memory when a fasta/fastq object is created in every child process.
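
One rough way to run this check is to compare the process's peak resident set size before and after opening the object. The sketch below assumes a Unix system (the resource module is not available on Windows); peak_rss_mib and measure are hypothetical helper names, and the result is only an approximation because peak RSS never decreases and memory-mapped or shared pages are counted per process.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the calling process, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure(open_resource):
    """Call open_resource() and return (result, growth of peak RSS in MiB)."""
    before = peak_rss_mib()
    handle = open_resource()
    return handle, peak_rss_mib() - before
```

Inside a worker, the assumed usage would be something like measure(lambda: pyfastx.Fasta(path)), with the growth value returned to the parent for comparison across processes.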

Sanjay Kumar Srikakulam · Answer 6 · Thu Jul 01 2021 00:41:43 GMT+0800 (China Standard Time)

Hi @lmdu,

I tried the apply_async technique from the pyfastx documentation, that is, re-creating the fasta/fastq object inside the worker process. There is no memory overhead, but only one or two of the 64 processes I started actually run, and the rest go to sleep. This won't really work for multiprocessing. I look forward to pyfastx v0.9.0.

Thank you for your support and quick response!
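
To confirm whether tasks really end up on only one or two workers, each task can report the PID of the process that ran it. The following is a minimal diagnostic sketch with a placeholder workload; tagged_task, tasks_per_worker and run_tagged are hypothetical names, and the sketch only shows how the work was distributed, not why.

```python
import os
import time
from collections import Counter
from multiprocessing import Pool


def tagged_task(item):
    """Stand-in task: report which process handled the item."""
    time.sleep(0.01)  # placeholder for real work
    return os.getpid(), item


def tasks_per_worker(results):
    """Count how many tasks each worker PID completed."""
    return Counter(pid for pid, _ in results)


def run_tagged(items, processes=4):
    """Submit every item with apply_async and collect (pid, item) pairs."""
    with Pool(processes) as pool:
        pending = [pool.apply_async(tagged_task, (item,)) for item in items]
        return [p.get() for p in pending]
```

Calling tasks_per_worker(run_tagged(range(200))) under an if __name__ == "__main__": guard gives a per-PID task count. If one PID dominates even with a real workload in place of the sleep, the tasks are not being spread across the pool.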