Fine-Tune on our dataset
miladfa7 opened this issue · comments
How can I fine-tune the ParsBERT model on our own dataset?
Please help me.
Thanks.
You can use this Colab to fine-tune on your dataset for text classification tasks. For other downstream tasks, I'm afraid you will need to be patient. I'll add others soon!
How can I get embeddings from the pretrained ParsBERT model?
I used this Persian BERT classification model.
The model was saved with a config.
But I want to load that model separately and predict the label of a sentence with it.
It raises this error:
raise ValueError('No model found in config file.')
ValueError: No model found in config file.
How can I add a config file so that I do not get this error?
Did you fine-tune ParsBERT on your dataset? Which method did you use (PyTorch, TensorFlow, or the script)? If you did not use the script, how did you save your model (which files do you have in your saved model directory)?
https://github.com/hooshvare/parsbert/blob/master/notebooks/Taaghche_Sentiment_Analysis.ipynb
I made my model from this link.
Yes, I fine-tuned on my data, and I have 3 labels.
I have two saved files in a directory named pytorch_model.bin:
tf_model.h5 and config.json
I load my model this way:
from keras.models import load_model
model = load_model('tf_model.h5')
OK, then. Your model was fine-tuned with Transformers, so you can't load it with a plain Keras load_model call.
You must load your fine-tuned model using transformers.
If you have tf_model.h5 in your saved directory, use this:
from transformers import TFAutoModelForSequenceClassification
tf_model = TFAutoModelForSequenceClassification.from_pretrained(YOURSAVED_DIRECTORY)
Otherwise, if you have pytorch_model.bin:
from transformers import TFAutoModelForSequenceClassification
tf_model = TFAutoModelForSequenceClassification.from_pretrained(YOURSAVED_DIRECTORY, from_pt=True)
Also, make sure you have config.json and vocab.txt in your directory!
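Before calling from_pretrained, it can help to confirm that the directory really holds the files listed above. The following is a minimal sketch using only the standard library; the helper name check_model_dir and the two file lists are illustrative assumptions, not part of transformers.

```python
from pathlib import Path
import tempfile

# Files named in the reply above; which weights file exists depends on how the model was saved.
REQUIRED_FILES = ("config.json", "vocab.txt")
WEIGHT_FILES = ("tf_model.h5", "pytorch_model.bin")


def check_model_dir(model_dir):
    """Return a list of problems found in a saved-model directory (empty if none)."""
    path = Path(model_dir)
    if not path.is_dir():
        return [f"{model_dir} is not a directory"]
    problems = [f"missing {name}" for name in REQUIRED_FILES if not (path / name).is_file()]
    if not any((path / name).is_file() for name in WEIGHT_FILES):
        problems.append("missing weights (tf_model.h5 or pytorch_model.bin)")
    return problems


# Demo on a throwaway directory that mimics a save without vocab.txt.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "tf_model.h5"):
        (Path(tmp) / name).write_bytes(b"")  # empty placeholder files
    demo_problems = check_model_dir(tmp)

print(demo_problems)  # ['missing vocab.txt']
```

An empty list means the directory at least has the expected layout; it says nothing about whether the files themselves are valid.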
With my model, only these two files were saved:
tf_model.h5 and config.json
I do not have a vocab.txt file.
from transformers import TFAutoModelForSequenceClassification
tf_model = TFAutoModelForSequenceClassification.from_pretrained('tf_model.h5')
This is how I got this error:
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
With the link I used, no file with this name was saved.
First of all, you can download vocab.txt from here:
https://cdn.huggingface.co/HooshvareLab/bert-base-parsbert-uncased/vocab.txt
Secondly, you must load the model from the saved directory, not just the h5 file! Suppose I have a directory named bert-fa-cls-base-uncased that includes:
+ bert-fa-cls-base-uncased
- config.json
- vocab.txt
- tf_model.h5
You need to pass the directory, not the model file alone; that is, load your model using this piece of code:
from transformers import TFAutoModelForSequenceClassification
tf_model = TFAutoModelForSequenceClassification.from_pretrained("./bert-fa-cls-base-uncased/")
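One plausible reading of the UnicodeDecodeError reported earlier in the thread: every HDF5 file starts with an 8-byte signature whose first byte is 0x89, which is not a valid first byte in UTF-8. If the .h5 path is read as a text file (for instance, as if it were the JSON config, which is an assumption about the library's behavior here), decoding fails at position 0. The sketch below reproduces only that decoding failure with a stand-in file; it does not call transformers.

```python
import tempfile
from pathlib import Path

# The 8-byte signature that starts every HDF5 file; its first byte is 0x89.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

with tempfile.TemporaryDirectory() as tmp:
    fake_h5 = Path(tmp) / "tf_model.h5"
    # Stand-in for real weights: only the signature matters for this demo.
    fake_h5.write_bytes(HDF5_SIGNATURE + b"\x00" * 8)
    try:
        fake_h5.read_text(encoding="utf-8")  # treat the binary file as text
        decode_error = None
    except UnicodeDecodeError as err:
        decode_error = str(err)

print(decode_error)
```

Passing the directory instead of the file avoids this, because the weights file is then opened as binary data rather than as text.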
Please help me.
I take an input sentence and want to predict its label with the saved model.
But I think the preprocessing and padding of my sentence are done wrong,
because I cannot predict the label.
from transformers import BertConfig, BertTokenizer
MODEL_NAME_OR_PATH = 'HooshvareLab/bert-fa-base-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME_OR_PATH)
sample_comment= "شعار ما هوش مصنوعی برای همه است"
max_length=32
tokens = tokenizer.tokenize(sample_comment, padding=True, max_length=42)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
from transformers import TFAutoModelForSequenceClassification
tf_model = TFAutoModelForSequenceClassification.from_pretrained("./pytorch_model.bin/")
predictions = tf_model.predict(token_ids)
print(predictions)
The whole process is as simple as you think! But before diving into it, we need to set some ground rules:
- The fine-tuned model is saved in a directory, in this case
bert-fa-base-uncased-sentiment-snappfood
- The directory contains these files:
config.json tf_model.h5 vocab.txt
I'm going to demonstrate all the steps using one of our models, bert-fa-base-uncased-sentiment-snappfood.
The procedure is as follows:
- Load the packages
- Load the config, tokenizer, and the model
- Run the inference
There is also a preliminary Step 0 that downloads the mentioned model; in your case you don't need this part.
Step 0
!pip install -qU transformers
!mkdir -p /content/bert-fa-base-uncased-sentiment-snappfood
!wget https://s3.amazonaws.com/models.huggingface.co/bert/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood/config.json -qO /content/bert-fa-base-uncased-sentiment-snappfood/config.json
!wget https://cdn.huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood/tf_model.h5 -qO /content/bert-fa-base-uncased-sentiment-snappfood/tf_model.h5
!wget https://cdn.huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood/vocab.txt -qO /content/bert-fa-base-uncased-sentiment-snappfood/vocab.txt
!ls /content/bert-fa-base-uncased-sentiment-snappfood
Output
config.json tf_model.h5 vocab.txt
Step 1
from transformers import TFBertForSequenceClassification
from transformers import AutoConfig
from transformers import AutoTokenizer
import tensorflow as tf
import numpy as np
Step 2
config = AutoConfig.from_pretrained('/content/bert-fa-base-uncased-sentiment-snappfood/')
tokenizer = AutoTokenizer.from_pretrained('/content/bert-fa-base-uncased-sentiment-snappfood/')
model = TFBertForSequenceClassification.from_pretrained('/content/bert-fa-base-uncased-sentiment-snappfood/')
model.summary()
Output
All model checkpoint weights were used when initializing TFBertForSequenceClassification.
All the weights of TFBertForSequenceClassification were initialized from the model checkpoint at /content/bert-fa-base-uncased-sentiment-snappfood/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBertMainLayer) multiple 162841344
_________________________________________________________________
dropout_37 (Dropout) multiple 0
_________________________________________________________________
classifier (Dense) multiple 1538
=================================================================
Total params: 162,842,882
Trainable params: 162,842,882
Non-trainable params: 0
_________________________________________________________________
Step 3
prompt = 'این خوراک بسیار خوب است'
inputs = tokenizer.encode(prompt, return_tensors="tf", max_length=128, padding=True, truncation=True)
logits = model(inputs)[0]
outputs = tf.keras.backend.softmax(logits)
prediction = tf.argmax(outputs, axis=1)
prediction = prediction[0].numpy()
scores = outputs[0].numpy()
labels = config.id2label
print(scores)
print(labels[prediction])
Output
[0.9952093 0.00479068]
HAPPY
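To make Step 3 concrete without loading a model, here is a minimal pure-Python sketch of the same post-processing: softmax over the logits, argmax, then a lookup in id2label. The logits are made up for illustration, and the second label name "SAD" is an assumption; only index 0 mapping to "HAPPY" is taken from the output above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Index 0 -> "HAPPY" matches the output above; "SAD" is an assumed name for index 1.
id2label = {0: "HAPPY", 1: "SAD"}
logits = [3.0, -2.0]  # hypothetical logits for a single sentence

scores = softmax(logits)
prediction = max(range(len(scores)), key=scores.__getitem__)  # argmax over classes

print(scores)
print(id2label[prediction])  # HAPPY
```

The scores always sum to 1, so the largest one can be read as the model's confidence in the predicted label.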
Thank you very much for your help
Hello Mehrdad,
I need your help again.
The accuracy of the model on 170,000 samples is 89%.
What I think is not very good is the validation loss: it increases as the number of epochs increases.
I used 45,300 test samples for model prediction.
Unfortunately, the predictions were not good: the model recognizes many negative sentences as positive.
اصلاح طلبی باید راهبرد راهگشایی برای برونرفت حاکمیت ازین بن بست سیاسی چهل ساله که ریشه همه یا کمینه بیشتر مشکلات کنونی کشوراست ارائه کند وانهم تناقض وتنافر بزرگ حاکمیتی یعنی جمهوریت وولایت مطلقه است تا موضع وسمت وسوی خود را شفاف وبوضوح بیان نکند ازاصلاح طلبی فقط همان نامش را یدک میکشد,positive,political
The sentence's true label is positive, but the model predicts political.
My model has 3 labels:
label2id: {'negative': 0, 'political': 1, 'positive': 2}
id2label: {0: 'negative', 1: 'political', 2: 'positive'}
How do you think I can improve the accuracy and loss of the model to get better predictions?
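A single accuracy figure can hide which classes get confused with each other. As a rough sketch of one way to look into this, the snippet below counts (true, predicted) pairs and computes per-class recall for the three labels listed above. The y_true and y_pred lists are hypothetical toy data, not results from this model.

```python
from collections import Counter

id2label = {0: "negative", 1: "political", 2: "positive"}

# Hypothetical gold and predicted label ids for eight test sentences.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 2, 2, 1, 1, 2, 1, 2]

# Count how often each (true, predicted) pair occurs.
confusion = Counter(zip(y_true, y_pred))


def recall(label_id):
    """Fraction of sentences with this true label that were predicted correctly."""
    total = sum(1 for t in y_true if t == label_id)
    return confusion[(label_id, label_id)] / total


# List only the mistakes, grouped by true label.
for (t, p), n in sorted(confusion.items()):
    if t != p:
        print(f"{id2label[t]} predicted as {id2label[p]}: {n}")

per_class_recall = {id2label[i]: recall(i) for i in id2label}
print(per_class_recall)
```

On real test data, a low recall for one class alongside a high overall accuracy would suggest looking at class balance or label quality for that class.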
Hello Mehrdad,
Can you help me again?
Which parameters can I change to get better accuracy?
I changed these parameters to some extent:
MAX_LEN = 64
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 16
TEST_BATCH_SIZE = 16
EPOCHS = 10
EEVERY_EPOCH = 500
LEARNING_RATE = 2e-5
CLIP = 0.0
But my validation loss increased and the accuracy decreased.
How can I increase my accuracy to get better predictions?
Epoch 1/10
3260/3260 [==============================] - 1011s 310ms/step - loss: 0.3203 - accuracy: 0.8780 - val_loss: 0.2650 - val_accuracy: 0.9039
Epoch 2/10
3260/3260 [==============================] - 1012s 310ms/step - loss: 0.1934 - accuracy: 0.9294 - val_loss: 0.2776 - val_accuracy: 0.9107
Epoch 3/10
3260/3260 [==============================] - 1013s 311ms/step - loss: 0.1207 - accuracy: 0.9580 - val_loss: 0.3280 - val_accuracy: 0.8958
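In the log above, the training loss keeps falling while val_loss rises after the first epoch, a pattern that is usually read as overfitting. One common response is to stop training once the validation loss stops improving and keep the best epoch. The sketch below applies that rule to the three logged val_loss values; the function name and the patience value of 2 are illustrative assumptions.

```python
# Validation losses copied from the three logged epochs.
val_losses = [0.2650, 0.2776, 0.3280]


def best_epoch_with_patience(losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based; stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None


best_epoch, stop_epoch = best_epoch_with_patience(val_losses, patience=2)
print(best_epoch, stop_epoch)  # 1 3
```

In a Keras training loop, the built-in tf.keras.callbacks.EarlyStopping callback (with monitor="val_loss" and restore_best_weights=True) offers this behavior, so setting EPOCHS = 10 need not mean training for all ten epochs.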