huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Migrate from sha1 to xxhash for deduplication methods

hynky1999 opened this issue

Problem

We currently hash with sha1 and keep only the first x bytes of the digest, which is a waste of resources:

  • we don't need cryptographic guarantees
  • we only take first x bytes

Instead, we should use a non-cryptographic hash function such as xxhash, which can be computed significantly faster.
This should significantly speed up the first phase of the deduplication process.
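
A minimal sketch of the two approaches side by side, to make the proposal concrete. It is not datatrove's actual code: the function names, the 8-byte default, and the big-endian integer conversion are assumptions made for illustration. The xxhash variant relies on the third-party `xxhash` package, so the import is optional and the sketch falls back to the sha1 version when the package is missing.

```python
import hashlib

try:
    import xxhash  # third-party package (pip install xxhash)
except ImportError:  # keep the sketch runnable without it
    xxhash = None


def sha1_prefix_hash(text: str, n_bytes: int = 8) -> int:
    """Current style (assumed): full 20-byte SHA-1 digest, truncated to n_bytes."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:n_bytes], "big")


def xxhash64_hash(text: str) -> int:
    """Proposed style: a 64-bit non-cryptographic hash, nothing is thrown away."""
    if xxhash is None:
        raise RuntimeError("the xxhash package is not installed")
    return xxhash.xxh64(text.encode("utf-8")).intdigest()


def pick_hash():
    """Hypothetical helper: prefer xxhash when available, else fall back to sha1."""
    return xxhash64_hash if xxhash is not None else sha1_prefix_hash


if __name__ == "__main__":
    hash_fn = pick_hash()
    docs = ["the quick brown fox", "the quick brown fox", "a different document"]
    hashes = [hash_fn(d) for d in docs]
    # Identical documents map to the same hash, which is all deduplication needs.
    print(hashes[0] == hashes[1], hashes[0] == hashes[2])
```

Both functions return an unsigned integer below 2**64 with the defaults used here, so the downstream dedup logic would not need to change. The hash values themselves differ between the two functions, so signatures written with one cannot be compared against signatures written with the other.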