huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.


Load local tokenizer

jordane95 opened this issue · comments

Due to some network issues, I need to first download the tokenizer and then load it from a local path. But the current tokenizer only supports identifier-based loading from the HF Hub. Would it be possible to add loading from a local path, like AutoTokenizer in the transformers lib?

If you replace Tokenizer.from_pretrained with Tokenizer.from_file in the source, does it work, or is the tokenizer not in the right format? If it works, I can add a check to see whether the tokenizer name is a valid path and, if so, load it with from_file.

Yeah, I'm currently using from_file and it works fine.

I will add a check then. Are you passing the path to the folder or to the json file directly?

json file
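The check discussed above could look something like the following. This is a hedged sketch, not datatrove's actual implementation; the helper name `load_tokenizer` is made up for illustration, and it assumes the local path points directly at a tokenizer.json file, as in this thread.

```python
import os

from tokenizers import Tokenizer


def load_tokenizer(name_or_path):
    """Hypothetical helper: load from a local tokenizer.json if the
    argument is an existing file, otherwise treat it as a Hub identifier."""
    if os.path.isfile(name_or_path):
        # local path to a tokenizer.json file
        return Tokenizer.from_file(name_or_path)
    # Hub identifier, e.g. "gpt2"
    return Tokenizer.from_pretrained(name_or_path)
```

With this dispatch, existing identifier-based configs keep working while a local `tokenizer.json` path is picked up automatically.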

Also, maybe we should add an option for a BPE tokenizer in MinhashDedup?

you mean instead of word_tokenize?

Yeah, maybe we could support both by changing the function a bit?
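Supporting both could look roughly like this. The function name `ngram_shingles` and its signature are hypothetical (MinhashDedup's real internals may differ); `text.split()` stands in for word_tokenize, and the optional tokenizer is assumed to be a `tokenizers.Tokenizer` (e.g. a BPE one).

```python
def ngram_shingles(text, n=5, tokenizer=None):
    """Hypothetical sketch: build the n-gram shingles MinHash hashes,
    from either word splitting or a subword (e.g. BPE) tokenizer."""
    if tokenizer is not None:
        # tokenizer is assumed to be a tokenizers.Tokenizer instance
        tokens = tokenizer.encode(text).tokens
    else:
        # simple whitespace split standing in for word_tokenize
        tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
```

Downstream MinHash hashing would stay unchanged, since it only sees the resulting shingle set.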

Could we use AutoTokenizer from transformers for this? It would be much more flexible than using the raw tokenizers class.

I would really like to avoid a dependency on transformers just for this, since it's a big lib. I'll make a PR with the Tokenizer.from_file change.

Take a look at the linked PR