AI4Bharat / Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For the latest Indic-BERT v2, see: https://github.com/AI4Bharat/IndicBERT

Home Page: https://indicnlp.ai4bharat.org

Cannot instantiate Tokenizer

dragonsan17 opened this issue

I am using Hugging Face Transformers 4.0.0. When I instantiate the AutoTokenizer for IndicBERT, I get the following error:

My code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')

Error:
Couldn't instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

Hey, did you try installing the sentencepiece package? If not, you can do it with pip3 install sentencepiece

Hi. I had installed the sentencepiece package earlier. I made a new environment and reinstalled it. Now it works fine. Thanks a lot.

Hi. What changes to the environment helped? I am also getting this error.

Hi guys, many of my notebooks have stopped working due to this issue. How is it possible that they changed such an important piece of code? Do you have any solutions?

I installed sentencepiece, but it does not work! Thanks

Hey, it's working on my system. Can you try upgrading the pip, sentencepiece, and transformers libraries? Here are the versions that I have:

  • pip: 20.3.3
  • sentencepiece: 0.1.94
  • transformers: 4.0.1

I suspect that HF transformers now requires a newer version of sentencepiece. Could you check and tell me if updating resolves the issue?
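
For reference, something like this should upgrade all three at once (assuming pip3 points at the same environment you run transformers from):

pip3 install --upgrade pip sentencepiece transformers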

Hi @divkakwani, thanks for your answer. I have exactly the same versions, but it is still stuck. Try loading this model and tell me if it works (I am using Colab):

from transformers import AutoTokenizer

model_name = "Musixmatch/umberto-wikipedia-uncased-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=True)
tokenizer

Thanks in advance!

Here is the solution, sorry!

https://github.com/huggingface/transformers/releases/tag/v4.0.0

We must pass use_fast=False to the tokenizer!
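
For the IndicBERT tokenizer from the original report, that would be something like this (a minimal sketch; the same flag should apply to any sentencepiece-based model):

from transformers import AutoTokenizer

# Fall back to the slow, sentencepiece-based tokenizer instead of
# converting it to a fast tokenizer (the conversion is what fails).
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert', use_fast=False)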

Thanks again!

Hey, thanks for posting the solution here. @ishmeetk Has your issue been resolved too?

Hi, I'm also having this problem. I am trying to instantiate
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-nl", use_fast=False)
but I get: "ValueError: This tokenizer cannot be instantiated. Please make sure you have sentencepiece installed in order to use this tokenizer."

But I have already installed sentencepiece. I have:

  • pip: 20.3.3
  • sentencepiece: 0.1.94
  • transformers: 4.1.1

The above code snippet with "Musixmatch/umberto-wikipedia-uncased-v1" also doesn't work for me.

Anyone have more ideas?

Hey @LauraIst, it's working for me. I have no idea what could be causing it. Can you try doing this in a virtualenv:

virtualenv venv
source venv/bin/activate
pip3 install transformers sentencepiece

and then try loading the model in the python3 REPL.

I've encountered the same issue and after some digging I've found the trick. Make sure that sentencepiece is imported before transformers.

The transformers.models.auto.tokenization_auto module initializes some auxiliary dictionary structures depending on what transformers.file_utils.is_sentencepiece_available() returns. If sentencepiece is not available at import time, then transformers will not see it even if it is made available later.
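
For example, the workaround would look like this (a minimal sketch; the only thing that matters is the order of the two imports):

# Import sentencepiece BEFORE transformers so that
# transformers.file_utils.is_sentencepiece_available() returns True
# when transformers builds its tokenizer tables at import time.
import sentencepiece  # noqa: F401
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')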

@bt2901 thank you for your answer. It works well.

Nothing worked for me. I have:

  • imported sentencepiece before transformers
  • set use_fast=False

I also hit the same problem. It works fine when I run it in the IDE, but after packaging with PyInstaller I get this error, even when I copy the folder into the dist directory or use --hidden-import. Does anyone know what is wrong?