AI4Bharat / Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For the latest Indic-BERT v2, see: https://github.com/AI4Bharat/IndicBERT

Home Page: https://indicnlp.ai4bharat.org

Cannot instantiate Tokenizer

dragonsan17 opened this issue

I am using Hugging Face Transformers 4.0.0. When I instantiate the AutoTokenizer for IndicBERT, I get the following error:

My code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')

Error:
Couldn't instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

Hey, did you try installing the sentencepiece package? If not, you can do it with pip3 install sentencepiece

Hi. I had installed the sentencepiece package earlier. I made a new environment and reinstalled it. Now it works fine. Thanks a lot.

Hi. What changes to the environment helped? I am also getting this error.

Hi guys, many of my notebooks have stopped working due to this issue. How is it possible that they changed such an important piece of code? Do you have any solutions?

I installed sentencepiece, but it does not work! Thanks

Hey, it's working on my system. Can you try upgrading the pip, sentencepiece, and transformers libraries? Here are the versions that I have:

  • pip: 20.3.3
  • sentencepiece: 0.1.94
  • transformers: 4.0.1

I suspect that HF transformers now requires a newer version of sentencepiece. Could you check and tell me if updating resolves the issue?
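
For reference, something like this should upgrade all three at once (assuming pip3 points at the same environment you run transformers from):

pip3 install --upgrade pip sentencepiece transformers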

Hi @divkakwani, thanks for your answer. I have exactly the same versions, but it is still stuck. Try loading this model and tell me if it works (I am using Colab):

from transformers import AutoTokenizer

model_name = "Musixmatch/umberto-wikipedia-uncased-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=True)
tokenizer

Thanks in advance!

Here is the solution, sorry!

https://github.com/huggingface/transformers/releases/tag/v4.0.0

We must pass use_fast=False to the tokenizer!
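
For the IndicBERT tokenizer from the original report, that would be something like this (a minimal sketch; the same flag should apply to any sentencepiece-based model):

from transformers import AutoTokenizer

# Fall back to the slow, sentencepiece-based tokenizer instead of
# converting it to a fast tokenizer (the conversion is what fails).
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', use_fast=False)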

Thanks again!

Hey, thanks for posting the solution here. @ishmeetk Has your issue been resolved too?

Hi, I'm also having this problem. I am trying to instantiate
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-nl", use_fast=False)
but I get: "ValueError: This tokenizer cannot be instantiated. Please make sure you have sentencepiece installed in order to use this tokenizer."

But I have already installed sentencepiece. I have:

  • pip: 20.3.3
  • sentencepiece: 0.1.94
  • transformers: 4.1.1

The above code snippet with "Musixmatch/umberto-wikipedia-uncased-v1" also doesn't work for me.

Anyone have more ideas?

Hey @LauraIst, it's working for me. I have no idea what could be causing it. Can you try doing this in a virtualenv:

virtualenv venv
source venv/bin/activate
pip3 install transformers sentencepiece

and then try loading the model in the python3 REPL.

I've encountered the same issue and after some digging I've found the trick. Make sure that sentencepiece is imported before transformers.

The transformers.models.auto.tokenization_auto module initializes some auxiliary dictionary structures depending on what transformers.file_utils.is_sentencepiece_available() returns. If sentencepiece is not available at import time, then transformers will not see it even if it is made available later.
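
For example, the workaround would look like this (a minimal sketch; the only thing that matters is the order of the two imports):

# Import sentencepiece BEFORE transformers so that
# transformers.file_utils.is_sentencepiece_available() returns True
# when transformers builds its tokenizer tables at import time.
import sentencepiece  # noqa: F401
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')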

@bt2901 thank you for your answer. It works well.

Nothing worked for me. I have:

  • imported sentencepiece before transformers
  • set use_fast=False

I also hit the same problem. It works fine when I run it in the IDE, but after packaging with PyInstaller I get this error, even when I copy the folder into the dist directory or use --hidden-import. Does anyone know what is wrong?