parsbert with flair
rezatakhshid opened this issue · comments
Hi,
I'm getting this error when trying to load embeddings using flair. Any idea what's going on?
Am I using the right model? I just need to use the embedding vectors.
The code:
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.embeddings import TransformerWordEmbeddings
bert_embedding = TransformerWordEmbeddings("HooshvareLab/bert-fa-base-uncased")
sentence = Sentence('علی اکبر به شهر تهران رفت')
bert_embedding.embed(sentence)
The error:
Traceback (most recent call last):
File "/Users/reza/code/parsbert/playground.py", line 8, in <module>
bert_embedding.embed(sentence)
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/base.py", line 60, in embed
self._add_embeddings_internal(sentences)
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/token.py", line 923, in _add_embeddings_internal
self._add_embeddings_to_sentence(sentence)
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/flair/embeddings/token.py", line 995, in _add_embeddings_to_sentence
encoded_inputs = self.tokenizer.encode_plus(tokenized_string,
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2378, in encode_plus
return self._encode_plus(
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 458, in _encode_plus
batched_output = self._batch_encode_plus(
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 377, in _batch_encode_plus
self.set_truncation_and_padding(
File "/Users/reza/.pyenv/versions/parsbert/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 335, in set_truncation_and_padding
self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)
OverflowError: int too big to convert
Hi @rezatakhshid ,
The model_max_length hasn't been set in the tokenizer configuration for that version (v2); the easiest and best solution is to use the newer one (v3).
bert_embedding = TransformerWordEmbeddings('HooshvareLab/bert-fa-zwnj-base')
sentence = Sentence('علی اکبر به شهر تهران رفت')
bert_embedding.embed(sentence)
Some weights of the model checkpoint at HooshvareLab/bert-fa-zwnj-base were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at HooshvareLab/bert-fa-zwnj-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[Sentence: "علی اکبر به شهر تهران رفت" [− Tokens: 6]]
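If you need to stay on the older checkpoint, one possible workaround is to cap model_max_length yourself before embedding. This has not been verified in this thread. The sketch below assumes that transformers falls back to a very large sentinel value (int(1e30), as far as I know) when the tokenizer config does not set model_max_length, and that 512 is the right limit for a BERT-base model. The helper name cap_max_length and the stand-in tokenizer object are hypothetical and only there so the example runs without downloading anything.

```python
# Sketch of a workaround for tokenizers whose model_max_length was never configured.
# Assumption: transformers uses a huge sentinel (int(1e30)) in that case, which
# does not fit in the native integer that the fast tokenizer expects.
from types import SimpleNamespace

# Assumed limit for BERT-base checkpoints; check the model's own config to be sure.
BERT_MAX_POSITIONS = 512


def cap_max_length(tokenizer, limit=BERT_MAX_POSITIONS):
    """Set tokenizer.model_max_length to `limit` if it is missing or implausibly large."""
    current = getattr(tokenizer, "model_max_length", None)
    if current is None or current > 2**31 - 1:
        tokenizer.model_max_length = limit
    return tokenizer


# Hypothetical stand-in object so the sketch runs without a real tokenizer.
fake_tokenizer = SimpleNamespace(model_max_length=int(1e30))
cap_max_length(fake_tokenizer)
print(fake_tokenizer.model_max_length)
```

With flair, the same helper could presumably be applied to bert_embedding.tokenizer (the attribute that appears in the traceback above) right after constructing TransformerWordEmbeddings and before calling embed.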
Thanks @m3hrdadfi Jan.