huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.


Tokenization in Minhash deduplication

jordane95 opened this issue · comments

Hi,

I have noticed that the tokenization differs from the ones adopted in previous papers.

For example, this paper uses space tokenization, RefinedWeb states that they used the GPT-2 tokenizer, while datatrove uses nltk to extract n-grams.

I'm wondering whether the results obtained by different tokenization methods are consistent.

Hi,
We normalize the text before applying nltk word_tokenize, so our tokenization should not be too different from just using space tokenization.
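
As a rough illustration (not datatrove's actual code, and the normalization here is just an assumption for the example), this is the kind of comparison I mean, nltk `word_tokenize` versus plain space splitting on normalized text:

```python
# Minimal sketch: normalize text, tokenize, and build 5-grams.
# The normalization below (lowercasing + whitespace collapsing) is only illustrative.
import re

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # word_tokenize needs the punkt model


def normalize(text: str) -> str:
    # Illustrative normalization, not necessarily what datatrove does.
    return re.sub(r"\s+", " ", text.lower()).strip()


def ngrams(tokens: list[str], n: int = 5) -> list[tuple[str, ...]]:
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


text = "Hello,   my name is Frankenstein."
norm = normalize(text)

nltk_grams = ngrams(word_tokenize(norm))  # punctuation becomes separate tokens
space_grams = ngrams(norm.split())        # punctuation stays attached to words

print(nltk_grams)
print(space_grams)
```

For typical web text the two mostly differ in how punctuation is handled, so the resulting n-grams stay close.
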
Regarding GPT-2: it is one possible way to convert the n-grams (text) to numbers, but it ends up being slower (especially for very big documents) and can sometimes introduce ambiguity. Imagine the following pair of different 5-grams:

  • hello my name is Frankenstein
  • hello my name is Frankensteiner

If you tokenize them with GPT-2:

  • 31373 616 1438 318 45738
  • 31373 616 1438 318 45738 263

So in this case you would still have a match when checking tokens, even though the word itself is different. In other cases, the same word or part of a word may also map to a different token depending on the surrounding text.
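
You can reproduce this quickly with a sketch like the one below (it assumes the `transformers` library is installed; the exact IDs depend on the tokenizer):

```python
# Sketch of the ambiguity above: a different final word can still share the same
# leading token IDs, so comparing the first 5 token IDs would report a match
# even though the 5-grams differ as text.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

a = tok.encode("hello my name is Frankenstein")
b = tok.encode("hello my name is Frankensteiner")

print(a)
print(b)
# If "Frankensteiner" is split into "Frankenstein" + "er", then a == b[: len(a)],
# and a token-level 5-gram check would (wrongly) treat these as identical.
print(a == b[: len(a)])
```
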

Apart from these specific edge cases, I would say that for the most part the results would still be consistent with datatrove's implementation.