huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.


Memory overhead in multiprocessing

jordane95 opened this issue · comments

When using the fastText filter, I find that the fastText model is copied into each process, which introduces significant memory overhead. However, to my knowledge, the fastText model is read-only and could be stored in shared memory accessible to all processes.

Can we optimize the current code to save memory? I found that mp.Manager can create shared memory and avoid copying, but it seems quite hard to integrate into the current code, because the manager is initialized at the executor level and not passed to each pipeline step.

Indeed, there might be some complications. I would be curious, however, what the performance (speed) implications of loading the model from shared memory would be. Have you tested this?

I have a question regarding memory overhead. I created and ran an executor to count tokens on approximately 2 TB of text (jsonl), but it gets stuck every time I run it. According to the memory and CPU usage data, memory usage fills the 256 GB I have available, and after it gets stuck, CPU usage drops from 99% to 0%.

The problem is that there are no error messages in the log, which makes the issue hard to diagnose. Does anyone have suggestions on how to address this?
I suspect this might be a memory overhead issue.
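One common cause of this symptom is accumulating documents in memory instead of streaming them. A minimal sketch of bounded-memory token counting over jsonl (not datatrove's pipeline code; the whitespace split stands in for a real tokenizer, and the `"text"` field name is an assumption):

```python
import json
from typing import Iterable


def count_tokens(lines: Iterable[str]) -> int:
    # Accepts any iterable of jsonl lines. Open file objects can be passed
    # directly, since Python file objects yield lines lazily, so memory use
    # stays bounded regardless of total corpus size.
    total = 0
    for line in lines:
        record = json.loads(line)
        # Hypothetical schema: each record holds its text under "text".
        total += len(record.get("text", "").split())
    return total


if __name__ == "__main__":
    sample = ['{"text": "a b c"}', '{"text": "d e"}']
    assert count_tokens(sample) == 5
```

If a step like this still exhausts 256 GB, the accumulation is likely happening elsewhere (e.g. results buffered per task before being written out).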