MinhashDedupCluster runs too slow
jordane95 opened this issue
Zehan Li commented
Hi,
I find that the cluster stage in MinhashDedup runs too slowly because it uses only one CPU to construct the union set of duplicate documents. For example, with 1T of data, we can compute the hash values in 2 hours with 1k cores, but the cluster stage takes more than 7 hours. Is there any way to speed up this stage?
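For context, here is a minimal sketch of how a "union set" of duplicates is commonly built with a union-find (disjoint-set) structure. It is not datatrove's actual code: the `UnionFind` class, the `cluster_duplicates` helper, and the string document IDs are all hypothetical names chosen for illustration. It shows why the step tends to be serial: every `union` call reads and mutates the same shared parent map.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


class UnionFind:
    """Illustrative disjoint-set structure (hypothetical, not datatrove's)."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, x: Hashable) -> Hashable:
        # Register unseen items as their own root.
        self.parent.setdefault(x, x)
        # Walk up to the root of x's set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: Hashable, b: Hashable) -> None:
        # Merge the sets containing a and b.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_duplicates(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group document IDs into clusters given pairs flagged as duplicates."""
    uf = UnionFind()
    for a, b in pairs:  # sequential: each union depends on earlier ones
        uf.union(a, b)
    clusters: Dict[Hashable, List[str]] = {}
    for doc in list(uf.parent):
        clusters.setdefault(uf.find(doc), []).append(doc)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical duplicate pairs, as a signature-matching step might emit.
    print(cluster_duplicates([("a", "b"), ("b", "c"), ("d", "e")]))
```

Because all merges go through a single `parent` map, splitting the pairs across workers would still require a final merge of the partial structures, which is one reason this kind of stage is often left single-threaded.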
Guilherme Penedo commented
Currently, nothing is supported for speeding up this stage. You might get a performance boost if you read the files from a local path instead of from the cloud, but that's about it.