parsbert with flair

Question

rezatakhshid opened this issue 3 years ago

Hi,
I'm getting this error when trying to load embeddings using flair. Any idea what's going on?
Am I using the right model? I just need the embedding vectors.

The code:

from flair.data import Sentence
from flair.models import SequenceTagger
from flair.embeddings import TransformerWordEmbeddings


bert_embedding = TransformerWordEmbeddings("HooshvareLab/bert-fa-base-uncased")
sentence = Sentence('علی اکبر به شهر تهران رفت')
bert_embedding.embed(sentence)

The error:

Traceback (most recent call last):
  File "/Users/reza/code/parsbert/playground.py", line 8, in <module>
    bert_embedding.embed(sentence)
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/token.py", line 923, in _add_embeddings_internal
    self._add_embeddings_to_sentence(sentence)
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/token.py", line 995, in _add_embeddings_to_sentence
    encoded_inputs = self.tokenizer.encode_plus(tokenized_string,
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2378, in encode_plus
    return self._encode_plus(
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 458, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 377, in _batch_encode_plus
    self.set_truncation_and_padding(
  File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 335, in set_truncation_and_padding
    self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)
OverflowError: int too big to convert

Mehrdad Farahani · Answer 1 · Fri May 14 2021 22:42:39 GMT+0800 (China Standard Time)

Hi @rezatakhshid ,

model_max_length is not set in the tokenizer configuration for that version (v2). The easiest and best fix is to use the newer release (v3):

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

bert_embedding = TransformerWordEmbeddings('HooshvareLab/bert-fa-zwnj-base')
sentence = Sentence('علی اکبر به شهر تهران رفت')
bert_embedding.embed(sentence)

Output:

Some weights of the model checkpoint at HooshvareLab/bert-fa-zwnj-base were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at HooshvareLab/bert-fa-zwnj-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[Sentence: "علی اکبر به شهر تهران رفت"   [− Tokens: 6]]
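
For background on the OverflowError in the question: when a tokenizer configuration has no model_max_length, transformers falls back to a very large sentinel value (VERY_LARGE_INTEGER = int(1e30) in tokenization_utils_base.py). That value is then passed as max_length to the fast tokenizer's enable_truncation, where it most likely cannot be converted to a native unsigned integer. The sketch below only illustrates the size mismatch in plain Python. The helper usable_max_length and the fallback of 512 (the usual position limit for BERT-base models) are assumptions made for illustration, not part of flair or transformers.

```python
# Sentinel that transformers uses when a tokenizer config has no
# model_max_length (VERY_LARGE_INTEGER in tokenization_utils_base.py).
VERY_LARGE_INTEGER = int(1e30)

# Largest value an unsigned 64-bit integer can hold.
U64_MAX = 2**64 - 1


def usable_max_length(model_max_length, fallback=512):
    """Hypothetical helper: return model_max_length if it fits in an
    unsigned 64-bit integer, otherwise return the fallback.

    The fallback of 512 is an assumption based on the usual BERT-base limit.
    """
    if model_max_length is None or model_max_length > U64_MAX:
        return fallback
    return model_max_length


# The sentinel is far larger than any native 64-bit integer.
print(VERY_LARGE_INTEGER > U64_MAX)
# An unset (sentinel) length is replaced by the fallback.
print(usable_max_length(VERY_LARGE_INTEGER))
# A properly configured length passes through unchanged.
print(usable_max_length(512))
```

To check a specific model, tokenizer.model_max_length can be inspected after loading it with AutoTokenizer.from_pretrained. With plain transformers, passing model_max_length=512 to from_pretrained is one way to override the value. Whether flair forwards that argument depends on the flair version, so switching to the v3 model as shown above remains the simpler route.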

Reza Takhshid · Answer 2 · Sat May 15 2021 03:35:46 GMT+0800 (China Standard Time)

Thanks @m3hrdadfi Jan.