lmdu / pyfastx

A Python package for fast random access to sequences from plain and gzipped FASTA/Q files

Home Page: https://pyfastx.readthedocs.io

Multiprocessing with apply_async does not work

sanjaysrikakulam opened this issue

Hi @lmdu ,

Sorry again, I found one more issue while trying to parallelize and share the index across multiple processes.

Here is the code:

from multiprocessing import Manager, Pool
from pyfastx import Fasta

def print_seq_names(fasta_obj, lock):
    for i in range(5):
        with lock:
            print(fasta_obj[i].name)

def error_call(err):
    print(err)

fasta_obj = Fasta("uniprot_sprot.fasta.gz")
pool = Pool(5)
lock = Manager().Lock()

for i in range(4):
    pool.apply_async(print_seq_names, args=(fasta_obj, lock), error_callback=error_call)

pool.close()
pool.join()

Error:
<multiprocessing.pool.ApplyResult object at 0x7f7605085090>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f20390>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f201d0>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f20390>
can't pickle Fasta objects

Is it no longer possible to share the Fasta object, the index, or the identifier objects with multiple processes?

Also, if I make the fasta object and the identifier object global variables in my code (the above is a dummy sample), I see only 1 process/core running at a time (out of 64 cores in the real code), and the rest are in the sleep state. Do you know why this happens?

Any help here would be great as well,

Thank you!

P.S:
OS: CentOS 7
Python : 3.7.7
pyfastx: 0.8.3

Hi @lmdu,

Any idea on what is actually happening here in the multiprocessing stuff? I am writing a paper for my tool which uses pyfastx and depends on parallelization. Any fix or suggestion to get this working will be really great!

I am so sorry. Pyfastx does not support pickle, so a Fasta object cannot be passed as an argument to multiprocessing workers. Implementing this is very complicated. Moreover, I have not found a way to share a file handle between different processes. I will add pickle support in pyfastx v0.9.0.
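
For context on the error above: apply_async pickles its arguments to send them to a worker, and objects that wrap an open file handle generally cannot be pickled. The following is a minimal sketch of that general behaviour, not pyfastx code. `HoldsFileHandle`, `pickling_error`, and `demo` are hypothetical names introduced here for illustration only.

```python
import os
import pickle


def pickling_error(obj):
    """Return None if obj can be pickled, otherwise a short error description."""
    try:
        pickle.dumps(obj)
    except Exception as exc:  # TypeError or PicklingError, depending on the object
        return f"{type(exc).__name__}: {exc}"
    return None


class HoldsFileHandle:
    """Hypothetical stand-in for an object wrapping an open file (not pyfastx)."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "rb")

    def close(self):
        self.handle.close()


def demo():
    """Compare pickling a plain path string with pickling a handle wrapper."""
    wrapper = HoldsFileHandle(os.devnull)
    try:
        return pickling_error(os.devnull), pickling_error(wrapper)
    finally:
        wrapper.close()
```

Here `demo()` returns `None` for the path string and an error description for the wrapper, which suggests passing the file path to the workers rather than the opened object.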

OK, thank you for the information. But will there be a memory overhead, if I create a fasta object in every child process?

Say my fasta/fastq index is of size 40 or 50GiB and I use 64 cores, so if each of my processes creates a fasta object, it means there will be a memory overhead, right?

There may be no memory overhead. Pyfastx will not load the entire index into memory.

OK, I will check this and see whether each process loads something in memory when a fasta/fastq object is created in every child process.

Hi @lmdu,

I tried the apply_async technique from the pyfastx documentation, that is, re-creating the fasta/fastq object inside the worker process. There is no memory overhead, but only one or two of the 64 processes I start actually run, and the rest go into a sleep state. This won't really work for multiprocessing. I look forward to pyfastx v0.9.0.

Thank you for your support and quick response!
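
For readers following the thread, here is a minimal sketch of the "re-create the object inside the worker" approach discussed above, using a Pool initializer so each worker opens its own handle once. It assumes that `Fasta(path)`, `len(...)`, and `fasta[i].name` behave as in the code at the top of this issue. The helper names (`open_fasta`, `init_worker`, `names_in_range`, `chunk_bounds`, `parallel_names`) are hypothetical and not part of pyfastx. It does not address the sleeping-worker behaviour reported above.

```python
from multiprocessing import Pool

_reader = None  # one handle per worker process, set by init_worker


def open_fasta(path):
    """Open a FASTA file with pyfastx (imported lazily, inside the caller's process)."""
    from pyfastx import Fasta
    return Fasta(path)


def init_worker(path, opener=open_fasta):
    """Pool initializer: open a private handle instead of receiving a pickled one."""
    global _reader
    _reader = opener(path)


def names_in_range(bounds):
    """Worker task: return the sequence names for indices in [start, stop)."""
    start, stop = bounds
    return [_reader[i].name for i in range(start, stop)]


def chunk_bounds(total, n_chunks):
    """Split range(total) into contiguous (start, stop) pairs, at most n_chunks of them."""
    size = max(1, -(-total // n_chunks))  # ceiling division, at least 1
    return [(s, min(s + size, total)) for s in range(0, total, size)]


def parallel_names(path, n_workers=4):
    """Collect all sequence names in index order; only the path crosses process boundaries."""
    # Assumption: opening once in the parent lets the index exist before workers start.
    total = len(open_fasta(path))
    with Pool(n_workers, initializer=init_worker, initargs=(path,)) as pool:
        chunks = pool.map(names_in_range, chunk_bounds(total, n_workers * 4))
    return [name for chunk in chunks for name in chunk]
```

With pyfastx installed, this would be called as `parallel_names("uniprot_sprot.fasta.gz", n_workers=4)` under an `if __name__ == "__main__":` guard. Returning results from the workers instead of printing under a shared lock also avoids serializing the workers on that lock.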