MinhashDedupCluster runs too slow

Question

jordane95 opened this issue 4 months ago

Hi,

I find that the cluster stage in MinhashDedup runs too slowly because it uses only one CPU to construct the union set of duplicate documents. For example, with 1T data, we can compute the hash values in 2h with 1k cores, but the cluster stage takes more than 7h. Is there any acceleration strategy for this stage?
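For context, building the "union set" of duplicates is typically a union-find (disjoint-set) pass over the duplicate pairs. The snippet below is only a minimal sketch of that idea, not the library's actual code: the function name `cluster_duplicates` and the input format (an iterable of `(doc_a, doc_b)` pairs) are assumptions made for illustration.

```python
from collections import defaultdict


def cluster_duplicates(pairs):
    """Group documents into duplicate clusters with union-find.

    `pairs` is a hypothetical input: an iterable of (doc_a, doc_b) tuples,
    each meaning the two documents were flagged as duplicates.
    """
    parent = {}

    def find(x):
        # Register unseen documents as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the cluster representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = defaultdict(list)
    for doc in list(parent):
        clusters[find(doc)].append(doc)
    return {root: sorted(members) for root, members in clusters.items()}


if __name__ == "__main__":
    print(cluster_duplicates([(1, 2), (2, 3), (5, 6)]))
```

Every pair updates one shared `parent` table, which is one reason a pass like this is usually run in a single process.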

Guilherme Penedo · Answer 1 · Mon Feb 26 2024 21:03:07 GMT+0800 (China Standard Time)

Currently there isn't anything supported for this stage. You might get a performance boost if you read the files from a local path instead of from the cloud, but that's about it.