huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

MinhashDedupCluster runs too slow

jordane95 opened this issue

Hi,

The cluster stage of MinhashDedup runs very slowly because it uses only one CPU to build the union set of duplicate documents. For example, with 1T data, we can compute the hash values in 2h with 1k cores, but the cluster stage alone takes more than 7h. Is there any strategy for accelerating this stage?
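To make the bottleneck concrete, here is a minimal sketch of the kind of union-find pass a clustering step like this typically performs. It is not datatrove's actual code: the names `find`, `union`, and `cluster` are hypothetical, and document IDs are assumed to be plain integers. Every `union` call reads and updates the same shared `parent` map, which is why this step does not split across cores as easily as hashing does.

```python
# Illustrative sketch only; names and ID format are assumptions,
# not datatrove's implementation.

def find(parent, x):
    """Return the root of x's set, compressing the path on the way."""
    root = x
    # Walk up to the root, registering unseen IDs as their own root.
    while parent.setdefault(root, root) != root:
        root = parent[root]
    # Point every node on the walked path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def union(parent, a, b):
    """Merge the sets containing a and b; the smaller root ID wins."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def cluster(pairs):
    """Group document IDs into duplicate clusters from (id, id) pairs."""
    parent = {}
    for a, b in pairs:
        union(parent, a, b)
    clusters = {}
    for doc in parent:
        clusters.setdefault(find(parent, doc), []).append(doc)
    return clusters


if __name__ == "__main__":
    duplicate_pairs = [(1, 2), (2, 3), (4, 5)]
    print(cluster(duplicate_pairs))
```

In a deduplication setting, one document per cluster (for instance the root) would be kept and the rest dropped.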

Currently there is nothing supported for accelerating this stage. You might get a performance boost if you read the files from a local path instead of from the cloud, but that's about it.