hooshvare / parsbert

🤗 ParsBERT: Transformer-based Model for Persian Language Understanding

Home Page: https://doi.org/10.1007/s11063-021-10528-4

Fine-Tune on our dataset

miladfa7 opened this issue

How can I fine-tune the ParsBERT model on our own dataset?
Please help me.
Thanks

You can use this Colab to fine-tune on your dataset for text classification tasks. For other downstream tasks, I'm afraid you'll need to be patient; I'll add them soon!
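
For orientation, here is a minimal sketch of the kind of fine-tuning the Colab performs for text classification; the dataset variables (train_texts, train_labels) and the output path are placeholders, not part of the notebook:

from transformers import AutoTokenizer, TFBertForSequenceClassification
import tensorflow as tf

MODEL_NAME = 'HooshvareLab/bert-fa-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# train_texts (list of str) and train_labels (list of int) are hypothetical placeholders
encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128, return_tensors='tf')
dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), train_labels)).shuffle(1000).batch(16)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
model.fit(dataset, epochs=3)

# Save everything the rest of this thread expects: config.json, tf_model.h5, and (via the tokenizer) vocab.txt
model.save_pretrained('./my-parsbert-cls/')
tokenizer.save_pretrained('./my-parsbert-cls/')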

How can I get embeddings from the pretrained ParsBERT model?
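
One possible approach (a sketch, assuming sentence-level embeddings taken from the base model's last hidden state; mean pooling is just one common choice):

from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf

MODEL_NAME = 'HooshvareLab/bert-fa-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFAutoModel.from_pretrained(MODEL_NAME)  # add from_pt=True if only PyTorch weights are available

sentence = 'شعار ما هوش مصنوعی برای همه است'
inputs = tokenizer(sentence, return_tensors='tf')

# Last hidden state: (batch, seq_len, hidden_size), i.e. one vector per token
token_embeddings = model(inputs)[0]

# One sentence-level vector: mean over tokens, weighted by the attention mask
mask = tf.cast(inputs['attention_mask'], tf.float32)[..., tf.newaxis]
sentence_embedding = tf.reduce_sum(token_embeddings * mask, axis=1) / tf.reduce_sum(mask, axis=1)
print(sentence_embedding.shape)  # (1, 768)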

I used this Persian BERT classification model.
The model was saved with a config, but I want to load that model separately and predict a sentence's label with it.
I get this error:

raise ValueError ('No model found in config file.')
ValueError: No model found in config file.

How can I add a config file so that I do not get this error?

Did you fine-tune ParsBERT on your dataset? Which method did you use (PyTorch, TensorFlow, or a script)? If you didn't use the script technique, did you save your model, and what kinds of files are in your saved model directory?

https://github.com/hooshvare/parsbert/blob/master/notebooks/Taaghche_Sentiment_Analysis.ipynb

I made my model from this link.
Yes, I fine-tuned on my data and I have 3 labels.
Instead of pytorch_model.bin, I have two saved files, named tf_model.h5 and config.json.
I load my model this way:

from tensorflow.keras.models import load_model
model = load_model('tf_model.h5')

OK, then. Your model was fine-tuned with Transformers, so you can't load it with a plain Keras load_model; you must load your fine-tuned model using Transformers.
If you have tf_model.h5 in your saved directory, use this:

from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained(YOURSAVED_DIRECTORY)

Otherwise, if you have pytorch_model.bin:

from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained(YOURSAVED_DIRECTORY, from_pt=True)

Also, make sure you have config.json and vocab.txt in your directory!
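
If vocab.txt is missing, it's usually because only the model was saved and not the tokenizer. As a minimal sketch (the directory names are just examples), save both so the directory ends up with config.json, tf_model.h5, and vocab.txt:

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Re-save a fine-tuned checkpoint together with its tokenizer (paths are placeholders)
tokenizer = AutoTokenizer.from_pretrained('HooshvareLab/bert-fa-base-uncased')
model = TFAutoModelForSequenceClassification.from_pretrained('./your-saved-directory/')

model.save_pretrained('./bert-fa-cls-base-uncased/')      # writes config.json and tf_model.h5
tokenizer.save_pretrained('./bert-fa-cls-base-uncased/')  # writes vocab.txt and the tokenizer config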

With my model, only these two files are saved, named tf_model.h5 and config.json.
I do not have the vocab.txt file.

from transformers import TFAutoModelForSequenceClassification
tf_model = TFAutoModelForSequenceClassification.from_pretrained('tf_model.h5')

Running it this way, I got this error:
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

I used the link above; no file with that name (vocab.txt) was saved.

First of all, you can download vocab.txt from here:

https://cdn.huggingface.co/HooshvareLab/bert-base-parsbert-uncased/vocab.txt

Secondly, you must load the model from the saved directory, not just the h5 file! Suppose I have a directory named bert-fa-cls-base-uncased that includes:

+ bert-fa-cls-base-uncased
    - config.json
    - vocab.txt
    - tf_model.h5

You need to pass the directory, not the model file alone; I mean, load your model using this piece of code:

from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained("./bert-fa-cls-base-uncased/")

Please help me.
I take an input sentence and want to predict its label with the saved model, but I think the preprocessing and padding of my sentence are done wrong, because I cannot predict the label.

from transformers import BertConfig, BertTokenizer
MODEL_NAME_OR_PATH = 'HooshvareLab/bert-fa-base-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME_OR_PATH)
sample_comment= "شعار ما هوش مصنوعی برای همه است"
max_length=32
tokens = tokenizer.tokenize(sample_comment, padding=True, max_length=42)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
from transformers import TFAutoModelForSequenceClassification
tf_model = TFAutoModelForSequenceClassification.from_pretrained("./pytorch_model.bin/")
predictions = tf_model.predict(token_ids)
print(predictions)

The whole process is as simple as you think! But before diving into it, we need to set some ground rules:

  1. The fine-tuned model is saved in a directory, in this case bert-fa-base-uncased-sentiment-snappfood.
  2. The directory contains these files: config.json, tf_model.h5, vocab.txt.

I'm going to demonstrate all the steps using one of our models, bert-fa-base-uncased-sentiment-snappfood. The procedure is as follows:

  1. Load the packages
  2. Load the config, tokenizer, and the model
  3. The inference

There is also a preliminary Step 0 to download the mentioned model; in your case you don't need this part.

Step 0

!pip install -qU transformers

!mkdir -p /content/bert-fa-base-uncased-sentiment-snappfood
!wget https://s3.amazonaws.com/models.huggingface.co/bert/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood/config.json -qO /content/bert-fa-base-uncased-sentiment-snappfood/config.json
!wget https://cdn.huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood/tf_model.h5 -qO /content/bert-fa-base-uncased-sentiment-snappfood/tf_model.h5
!wget https://cdn.huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood/vocab.txt -qO /content/bert-fa-base-uncased-sentiment-snappfood/vocab.txt

!ls /content/bert-fa-base-uncased-sentiment-snappfood

Output

config.json  tf_model.h5  vocab.txt

Step 1

from transformers import TFBertForSequenceClassification
from transformers import AutoConfig
from transformers import AutoTokenizer

import tensorflow as tf
import numpy as np

Step 2

config = AutoConfig.from_pretrained('/content/bert-fa-base-uncased-sentiment-snappfood/')
tokenizer = AutoTokenizer.from_pretrained('/content/bert-fa-base-uncased-sentiment-snappfood/')

model = TFBertForSequenceClassification.from_pretrained('/content/bert-fa-base-uncased-sentiment-snappfood/')
model.summary()

Output

All model checkpoint weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the model checkpoint at /content/bert-fa-base-uncased-sentiment-snappfood/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bert (TFBertMainLayer)       multiple                  162841344 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
=================================================================
Total params: 162,842,882
Trainable params: 162,842,882
Non-trainable params: 0
_________________________________________________________________

Step 3

prompt = 'این خوراک بسیار خوب است'

inputs = tokenizer.encode(prompt, return_tensors="tf", max_length=128, padding=True, truncation=True)
logits = model(inputs)[0]
outputs = tf.keras.backend.softmax(logits)
prediction = tf.argmax(outputs, axis=1)
prediction = prediction[0].numpy()
scores = outputs[0].numpy()

labels = config.id2label
print(scores)
print(labels[prediction])

Output

[0.9952093  0.00479068]
HAPPY
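
As a follow-up sketch (the sentences below are just examples), the same saved model can score a batch of sentences in one call by passing a list to the tokenizer:

from transformers import AutoConfig, AutoTokenizer, TFBertForSequenceClassification
import tensorflow as tf

model_dir = '/content/bert-fa-base-uncased-sentiment-snappfood/'
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = TFBertForSequenceClassification.from_pretrained(model_dir)

# Example sentences (placeholders); the tokenizer pads them to a common length
sentences = ['این خوراک بسیار خوب است', 'کیفیت غذا افتضاح بود']
inputs = tokenizer(sentences, return_tensors='tf', padding=True, truncation=True, max_length=128)

logits = model(inputs)[0]
probs = tf.keras.backend.softmax(logits)
preds = tf.argmax(probs, axis=1).numpy()

for sentence, pred in zip(sentences, preds):
    print(sentence, '->', config.id2label[int(pred)])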

Thank you very much for your help

Hello Mehrdad,
I need your help again.
The accuracy of the model on 170,000 samples is 89%.
What I think is not good is the validation loss: as the number of epochs increases, the validation loss also increases.

I used 45,300 test samples for the model's predictions.
Unfortunately, the predictions were not good, and the model recognizes many negative sentences as positive.

اصلاح طلبی باید راهبرد راهگشایی برای برونرفت حاکمیت ازین بن بست سیاسی چهل ساله که ریشه همه یا کمینه بیشتر مشکلات کنونی کشوراست ارائه کند وانهم تناقض وتنافر بزرگ حاکمیتی یعنی جمهوریت وولایت مطلقه است تا موضع وسمت وسوی خود را شفاف وبوضوح بیان نکند ازاصلاح طلبی فقط همان نامش را یدک میکشد,positive,political

The sentence's label is positive, but the model predicts political.

My model has 3 labels:

label2id: {'negative': 0, 'political': 1, 'positive': 2}

id2label: {0: 'negative', 1: 'political', 2: 'positive'}

How do you think I can improve the accuracy and loss of the model to get better predictions?
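
A hedged suggestion: a per-class report often shows which labels get confused with each other. Assuming test_texts, test_labels, and a saved 3-label model directory (all placeholders here), a sketch with scikit-learn could look like this:

from transformers import AutoConfig, AutoTokenizer, TFBertForSequenceClassification
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf

model_dir = './my-parsbert-cls/'  # placeholder path to the saved 3-label model
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = TFBertForSequenceClassification.from_pretrained(model_dir)

# test_texts (list of str) and test_labels (list of int, using your label2id) are placeholders
preds = []
for start in range(0, len(test_texts), 32):  # batch the 45,300 sentences to keep memory bounded
    batch = test_texts[start:start + 32]
    inputs = tokenizer(batch, return_tensors='tf', padding=True, truncation=True, max_length=128)
    logits = model(inputs)[0]
    preds.extend(tf.argmax(logits, axis=1).numpy().tolist())

print(confusion_matrix(test_labels, preds))
print(classification_report(test_labels, preds,
                            target_names=[config.id2label[i] for i in range(config.num_labels)]))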

Hello Mehrdad,
Can you help me again?
What parameters can I change to get better accuracy?
I changed these parameters to some extent:
MAX_LEN = 64
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 16
TEST_BATCH_SIZE = 16
EPOCHS = 10
EEVERY_EPOCH = 500
LEARNING_RATE = 2e-5
CLIP = 0.0

But my validation loss increased and the accuracy decreased.

How can I increase my accuracy to get better predictions?

Epoch 1/10
3260/3260 [==============================] - 1011s 310ms/step - loss: 0.3203 - accuracy: 0.8780 - val_loss: 0.2650 - val_accuracy: 0.9039
Epoch 2/10
3260/3260 [==============================] - 1012s 310ms/step - loss: 0.1934 - accuracy: 0.9294 - val_loss: 0.2776 - val_accuracy: 0.9107
Epoch 3/10
3260/3260 [==============================] - 1013s 311ms/step - loss: 0.1207 - accuracy: 0.9580 - val_loss: 0.3280 - val_accuracy: 0.8958
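
The log above shows the classic overfitting pattern: val_loss is lowest after the first epoch and rises afterwards while the training loss keeps falling. One standard option is to stop training early and keep the best weights. A sketch with a Keras callback, assuming model, train_dataset, and valid_dataset are the objects built in your training script:

import tensorflow as tf

# Stop when validation loss stops improving and roll back to the best epoch's weights
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=1,               # val_loss already rises after epoch 1, so little patience is needed
    restore_best_weights=True,
)

# model, train_dataset, and valid_dataset are placeholders for your own objects
model.fit(
    train_dataset,
    validation_data=valid_dataset,
    epochs=10,                # the callback will usually end training after 2-3 epochs
    callbacks=[early_stopping],
)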