huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.


Tokenization in Minhash deduplication

jordane95 opened this issue · comments

Hi,

I have noticed that the tokenization differs from the ones adopted in previous papers.

For example, this paper uses space tokenization, RefinedWeb states that they used the GPT-2 tokenizer, while datatrove uses nltk to extract n-grams.

I'm wondering whether the results obtained by different tokenization methods are consistent.

Hi,
We normalize the text before applying nltk word_tokenize, so our tokenization should not be too different from just using space tokenization.
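
As a rough illustration (not datatrove's actual code, and the normalization here is just an assumption for the example), this is the kind of comparison I mean, nltk `word_tokenize` versus plain space splitting on normalized text:

```python
# Minimal sketch: normalize text, tokenize, and build 5-grams.
# The normalization below (lowercasing + whitespace collapsing) is only illustrative.
import re

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # word_tokenize needs the punkt model


def normalize(text: str) -> str:
    # Illustrative normalization, not necessarily what datatrove does.
    return re.sub(r"\s+", " ", text.lower()).strip()


def ngrams(tokens: list[str], n: int = 5) -> list[tuple[str, ...]]:
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


text = "Hello,   my name is Frankenstein."
norm = normalize(text)

nltk_grams = ngrams(word_tokenize(norm))  # punctuation becomes separate tokens
space_grams = ngrams(norm.split())        # punctuation stays attached to words

print(nltk_grams)
print(space_grams)
```

For typical web text the two mostly differ in how punctuation is handled, so the resulting n-grams stay close.
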
Regarding GPT-2: it is one possible way to convert the n-grams (text) to numbers, but it ends up being slower (especially for very big documents) and can sometimes introduce ambiguity. Imagine the following pair of different 5-grams:

  • hello my name is Frankenstein
  • hello my name is Frankensteiner

If you tokenize them with GPT-2:

  • 31373 616 1438 318 45738
  • 31373 616 1438 318 45738 263

So in this case you would still have a match when checking tokens, even though the word itself is different. In other cases, the same word or part of a word may also map to a different token depending on the surrounding text.
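
You can reproduce this quickly with a sketch like the one below (it assumes the `transformers` library is installed; the exact IDs depend on the tokenizer):

```python
# Sketch of the ambiguity above: a different final word can still share the same
# leading token IDs, so comparing the first 5 token IDs would report a match
# even though the 5-grams differ as text.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

a = tok.encode("hello my name is Frankenstein")
b = tok.encode("hello my name is Frankensteiner")

print(a)
print(b)
# If "Frankensteiner" is split into "Frankenstein" + "er", then a == b[: len(a)],
# and a token-level 5-gram check would (wrongly) treat these as identical.
print(a == b[: len(a)])
```
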

Apart from these specific edge cases, I would say that for the most part the results would still be consistent with datatrove's implementation.