Load local tokenizer
jordane95 opened this issue · comments
Due to some network issues, I need to first download the tokenizer and then load it from a local path. But the current tokenizer only supports identifier-based loading from HF. Would it be possible to add a load-from-local-path option, like AutoTokenizer in the transformers lib?
If you replace Tokenizer.from_pretrained with Tokenizer.from_file in the source, does it work, or is the tokenizer not in the right format? If it works, I can add a check to see if the tokenizer name is a valid path and, in that case, load it using from_file.
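A minimal sketch of the check described above (the helper name is hypothetical; it assumes the Hugging Face tokenizers library, with the import deferred to where it is needed):

```python
import os


def load_tokenizer(name_or_path: str):
    """Load from a local tokenizer.json if the argument is an existing
    file, otherwise treat it as a Hugging Face Hub identifier.

    Hypothetical helper sketching the proposed check, not datatrove's
    actual implementation.
    """
    from tokenizers import Tokenizer  # Hugging Face `tokenizers` package

    if os.path.isfile(name_or_path):
        # e.g. load_tokenizer("/path/to/tokenizer.json")
        return Tokenizer.from_file(name_or_path)
    # e.g. load_tokenizer("gpt2") falls through to Hub-based loading
    return Tokenizer.from_pretrained(name_or_path)
```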
Yeah, I'm currently using from_file and it works fine
I will add a check then. Are you passing the path to the folder or to the json file directly?
json file
Also maybe we should add option for BPE tokenizer in MinhashDedup?
You mean instead of word_tokenize?
Yeah, maybe we could support both by changing the function a bit?
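One way the function could support both (a sketch with illustrative names, not datatrove's actual MinhashDedup API; plain whitespace splitting stands in for word_tokenize, and the BPE branch assumes a tokenizers-style object whose encode() result has a .tokens list):

```python
from typing import Callable, List, Optional


def make_tokenize_fn(bpe_tokenizer: Optional[object] = None) -> Callable[[str], List[str]]:
    """Return the token function used for MinHash shingling.

    With a BPE tokenizer, use its subword tokens; otherwise fall back
    to word-level splitting (whitespace here, standing in for word_tokenize).
    """
    if bpe_tokenizer is not None:
        return lambda text: bpe_tokenizer.encode(text).tokens
    return lambda text: text.split()


# Default word-level behavior:
tokens = make_tokenize_fn()("hello world")  # -> ['hello', 'world']
```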
Could we use AutoTokenizer from transformers for this? It would be much more flexible than using the raw tokenizers class.
I would really like to avoid the dependency on transformers just for this, since it is a big lib. I'll make a PR with the Tokenizer.from_file change.
Take a look at the linked PR.