huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.


Memory overhead in multiprocessing

jordane95 opened this issue · comments

When using the fastText filter, I find that the fastText model is copied into each process, which introduces significant memory overhead. However, to my knowledge, the fastText model is read-only and could be stored in shared memory accessible to all processes.

Can we optimize the current code to save memory? I found that mp.Manager can create shared memory and avoid copying, but it seems quite hard to integrate into the current code, because the manager is initialized at the executor level and not passed to each pipeline step.

Indeed, there might be some complications. I would be curious, however, what the performance (speed) implications of loading the model from shared memory would be. Have you tested this?

I have a question regarding memory overhead. I created and ran an executor to count tokens on approximately 2 TB of text (jsonl), but it gets stuck every time I run it. According to the memory and CPU usage data, memory usage fills the 256 GB I have available, and after it gets stuck, CPU usage drops from 99% to 0%.

The problem is that there are no error messages in the log, which makes the issue hard to diagnose. Does anyone have suggestions on how to address this?
I suspect this might be a memory overhead issue.
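One common cause of this symptom is accumulating documents in memory instead of streaming them. A minimal sketch of bounded-memory token counting over jsonl (not datatrove's pipeline code; the whitespace split stands in for a real tokenizer, and the `"text"` field name is an assumption):

```python
import json
from typing import Iterable


def count_tokens(lines: Iterable[str]) -> int:
    # Accepts any iterable of jsonl lines. Open file objects can be passed
    # directly, since Python file objects yield lines lazily, so memory use
    # stays bounded regardless of total corpus size.
    total = 0
    for line in lines:
        record = json.loads(line)
        # Hypothetical schema: each record holds its text under "text".
        total += len(record.get("text", "").split())
    return total


if __name__ == "__main__":
    sample = ['{"text": "a b c"}', '{"text": "d e"}']
    assert count_tokens(sample) == 5
```

If a step like this still exhausts 256 GB, the accumulation is likely happening elsewhere (e.g. results buffered per task before being written out).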