hooshvare / parsbert

🤗 ParsBERT: Transformer-based Model for Persian Language Understanding

Home Page: https://doi.org/10.1007/s11063-021-10528-4

Fine-Tune on our dataset

miladfa7 opened this issue

How can I fine-tune the ParsBERT model on our own dataset?
Please help me.
Thanks

You can use this Colab to fine-tune on your dataset for text classification tasks. For other downstream tasks, I'm afraid you'll need to be patient; I'll add them soon!
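
For orientation, here is a minimal sketch of the kind of fine-tuning the Colab performs for text classification; the dataset variables (train_texts, train_labels) and the output path are placeholders, not part of the notebook:

from transformers import AutoTokenizer, TFBertForSequenceClassification
import tensorflow as tf

MODEL_NAME = 'HooshvareLab/bert-fa-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# train_texts (list of str) and train_labels (list of int) are hypothetical placeholders
encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128, return_tensors='tf')
dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), train_labels)).shuffle(1000).batch(16)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
model.fit(dataset, epochs=3)

# Save everything the rest of this thread expects: config.json, tf_model.h5, and (via the tokenizer) vocab.txt
model.save_pretrained('./my-parsbert-cls/')
tokenizer.save_pretrained('./my-parsbert-cls/')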

How can I get embeddings from the pretrained ParsBERT model?
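
One possible approach (a sketch, assuming sentence-level embeddings taken from the base model's last hidden state; mean pooling is just one common choice):

from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf

MODEL_NAME = 'HooshvareLab/bert-fa-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFAutoModel.from_pretrained(MODEL_NAME)  # add from_pt=True if only PyTorch weights are available

sentence = 'شعار ما هوش مصنوعی برای همه است'
inputs = tokenizer(sentence, return_tensors='tf')

# Last hidden state: (batch, seq_len, hidden_size), i.e. one vector per token
token_embeddings = model(inputs)[0]

# One sentence-level vector: mean over tokens, weighted by the attention mask
mask = tf.cast(inputs['attention_mask'], tf.float32)[..., tf.newaxis]
sentence_embedding = tf.reduce_sum(token_embeddings * mask, axis=1) / tf.reduce_sum(mask, axis=1)
print(sentence_embedding.shape)  # (1, 768)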

I used this Persian BERT classification model.
The model was saved with a config, but I want to load that model separately and predict a sentence's label with it.
I get this error:

raise ValueError ('No model found in config file.')
ValueError: No model found in config file.

How can I add a config file so that I do not get this error?

Did you fine-tune ParsBERT on your dataset? Which method did you use (PyTorch, TensorFlow, or a script)? If you didn't use the script technique, did you save your model, and what kinds of files are in your saved model directory?

https://github.com/hooshvare/parsbert/blob/master/notebooks/Taaghche_Sentiment_Analysis.ipynb

I made my model from this link.
Yes, I fine-tuned on my data and I have 3 labels.
Instead of pytorch_model.bin, I have two saved files, named tf_model.h5 and config.json.
I load my model this way:

from tensorflow.keras.models import load_model
model = load_model('tf_model.h5')

OK, then. Your model was fine-tuned with Transformers, so you can't load it with a plain Keras load_model; you must load your fine-tuned model using Transformers.
If you have tf_model.h5 in your saved directory, use this:

from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained(YOURSAVED_DIRECTORY)

Otherwise, if you have pytorch_model.bin:

from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained(YOURSAVED_DIRECTORY, from_pt=True)

Also, make sure you have config.json and vocab.txt in your directory!
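
If vocab.txt is missing, it's usually because only the model was saved and not the tokenizer. As a minimal sketch (the directory names are just examples), save both so the directory ends up with config.json, tf_model.h5, and vocab.txt:

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Re-save a fine-tuned checkpoint together with its tokenizer (paths are placeholders)
tokenizer = AutoTokenizer.from_pretrained('HooshvareLab/bert-fa-base-uncased')
model = TFAutoModelForSequenceClassification.from_pretrained('./your-saved-directory/')

model.save_pretrained('./bert-fa-cls-base-uncased/')      # writes config.json and tf_model.h5
tokenizer.save_pretrained('./bert-fa-cls-base-uncased/')  # writes vocab.txt and the tokenizer config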

With my model, only these two files are saved, named tf_model.h5 and config.json.
I do not have the vocab.txt file.

from transformers import TFAutoModelForSequenceClassification
tf_model = TFAutoModelForSequenceClassification.from_pretrained('tf_model.h5')

Running it this way, I got this error:
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

I used the link above; no file with that name (vocab.txt) was saved.

First of all, you can download vocab.txt from here:

https://cdn.huggingface.co/HooshvareLab/bert-base-parsbert-uncased/vocab.txt

Secondly, you must load the model from the saved directory, not just the h5 file! Suppose I have a directory named bert-fa-cls-base-uncased that includes:

+ bert-fa-cls-base-uncased
    - config.json
    - vocab.txt
    - tf_model.h5

You need to pass the directory, not the model file alone; I mean, load your model using this piece of code:

from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained("./bert-fa-cls-base-uncased/")

Please help me.
I take an input sentence and want to predict its label with the saved model, but I think the preprocessing and padding of my sentence are done wrong, because I cannot predict the label.

from transformers import BertConfig, BertTokenizer
MODEL_NAME_OR_PATH = 'HooshvareLab/bert-fa-base-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME_OR_PATH)
sample_comment= "شعار ما هوش مصنوعی برای همه است"
max_length=32
tokens = tokenizer.tokenize(sample_comment, padding=True, max_length=42)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
from transformers import TFAutoModelForSequenceClassification
tf_model = TFAutoModelForSequenceClassification.from_pretrained("./pytorch_model.bin/")
predictions = tf_model.predict(token_ids)
print(predictions)

The whole process is as simple as you think! But before diving into it, we need to set some ground rules:

  1. The fine-tuned model is saved in a directory, in this case bert-fa-base-uncased-sentiment-snappfood.
  2. The directory contains these files: config.json, tf_model.h5, vocab.txt.

I'm going to demonstrate all the steps using one of our models, bert-fa-base-uncased-sentiment-snappfood. The procedure is as follows:

  1. Load the packages
  2. Load the config, tokenizer, and the model
  3. The inference

There is also a preliminary Step 0 to download the mentioned model; in your case you don't need this part.

Step 0

!pip install -qU transformers

!mkdir -p /content/bert-fa-base-uncased-sentiment-snappfood
!wget https://s3.amazonaws.com/models.huggingface.co/bert/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood/config.json -qO /content/bert-fa-base-uncased-sentiment-snappfood/config.json
!wget https://cdn.huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood/tf_model.h5 -qO /content/bert-fa-base-uncased-sentiment-snappfood/tf_model.h5
!wget https://cdn.huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood/vocab.txt -qO /content/bert-fa-base-uncased-sentiment-snappfood/vocab.txt

!ls /content/bert-fa-base-uncased-sentiment-snappfood

Output

config.json  tf_model.h5  vocab.txt

Step 1

from transformers import TFBertForSequenceClassification
from transformers import AutoConfig
from transformers import AutoTokenizer

import tensorflow as tf
import numpy as np

Step 2

config = AutoConfig.from_pretrained('/content/bert-fa-base-uncased-sentiment-snappfood/')
tokenizer = AutoTokenizer.from_pretrained('/content/bert-fa-base-uncased-sentiment-snappfood/')

model = TFBertForSequenceClassification.from_pretrained('/content/bert-fa-base-uncased-sentiment-snappfood/')
model.summary()

Output

All model checkpoint weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the model checkpoint at /content/bert-fa-base-uncased-sentiment-snappfood/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bert (TFBertMainLayer)       multiple                  162841344 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
=================================================================
Total params: 162,842,882
Trainable params: 162,842,882
Non-trainable params: 0
_________________________________________________________________

Step 3

prompt = 'این خوراک بسیار خوب است'

inputs = tokenizer.encode(prompt, return_tensors="tf", max_length=128, padding=True, truncation=True)
logits = model(inputs)[0]
outputs = tf.keras.backend.softmax(logits)
prediction = tf.argmax(outputs, axis=1)
prediction = prediction[0].numpy()
scores = outputs[0].numpy()

labels = config.id2label
print(scores)
print(labels[prediction])

Output

[0.9952093  0.00479068]
HAPPY
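
As a follow-up sketch (the sentences below are just examples), the same saved model can score a batch of sentences in one call by passing a list to the tokenizer:

from transformers import AutoConfig, AutoTokenizer, TFBertForSequenceClassification
import tensorflow as tf

model_dir = '/content/bert-fa-base-uncased-sentiment-snappfood/'
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = TFBertForSequenceClassification.from_pretrained(model_dir)

# Example sentences (placeholders); the tokenizer pads them to a common length
sentences = ['این خوراک بسیار خوب است', 'کیفیت غذا افتضاح بود']
inputs = tokenizer(sentences, return_tensors='tf', padding=True, truncation=True, max_length=128)

logits = model(inputs)[0]
probs = tf.keras.backend.softmax(logits)
preds = tf.argmax(probs, axis=1).numpy()

for sentence, pred in zip(sentences, preds):
    print(sentence, '->', config.id2label[int(pred)])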

Thank you very much for your help

Hello Mehrdad,
I need your help again.
The accuracy of the model on 170,000 samples is 89%.
What I think is not good is the validation loss: as the number of epochs increases, the validation loss also increases.

I used 45,300 test samples for the model's predictions.
Unfortunately, the predictions were not good, and the model recognizes many negative sentences as positive.

اصلاح طلبی باید راهبرد راهگشایی برای برونرفت حاکمیت ازین بن بست سیاسی چهل ساله که ریشه همه یا کمینه بیشتر مشکلات کنونی کشوراست ارائه کند وانهم تناقض وتنافر بزرگ حاکمیتی یعنی جمهوریت وولایت مطلقه است تا موضع وسمت وسوی خود را شفاف وبوضوح بیان نکند ازاصلاح طلبی فقط همان نامش را یدک میکشد,positive,political

The sentence's label is positive, but the model predicts political.

My model has 3 labels:

label2id: {'negative': 0, 'political': 1, 'positive': 2}

id2label: {0: 'negative', 1: 'political', 2: 'positive'}

How do you think I can improve the accuracy and loss of the model to get better predictions?
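
A hedged suggestion: a per-class report often shows which labels get confused with each other. Assuming test_texts, test_labels, and a saved 3-label model directory (all placeholders here), a sketch with scikit-learn could look like this:

from transformers import AutoConfig, AutoTokenizer, TFBertForSequenceClassification
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf

model_dir = './my-parsbert-cls/'  # placeholder path to the saved 3-label model
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = TFBertForSequenceClassification.from_pretrained(model_dir)

# test_texts (list of str) and test_labels (list of int, using your label2id) are placeholders
preds = []
for start in range(0, len(test_texts), 32):  # batch the 45,300 sentences to keep memory bounded
    batch = test_texts[start:start + 32]
    inputs = tokenizer(batch, return_tensors='tf', padding=True, truncation=True, max_length=128)
    logits = model(inputs)[0]
    preds.extend(tf.argmax(logits, axis=1).numpy().tolist())

print(confusion_matrix(test_labels, preds))
print(classification_report(test_labels, preds,
                            target_names=[config.id2label[i] for i in range(config.num_labels)]))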

Hello Mehrdad,
Can you help me again?
What parameters can I change to get better accuracy?
I changed these parameters to some extent:
MAX_LEN = 64
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 16
TEST_BATCH_SIZE = 16
EPOCHS = 10
EEVERY_EPOCH = 500
LEARNING_RATE = 2e-5
CLIP = 0.0

But my validation loss increased and the accuracy decreased.

How can I increase my accuracy to get better predictions?

Epoch 1/10
3260/3260 [==============================] - 1011s 310ms/step - loss: 0.3203 - accuracy: 0.8780 - val_loss: 0.2650 - val_accuracy: 0.9039
Epoch 2/10
3260/3260 [==============================] - 1012s 310ms/step - loss: 0.1934 - accuracy: 0.9294 - val_loss: 0.2776 - val_accuracy: 0.9107
Epoch 3/10
3260/3260 [==============================] - 1013s 311ms/step - loss: 0.1207 - accuracy: 0.9580 - val_loss: 0.3280 - val_accuracy: 0.8958
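
The log above shows the classic overfitting pattern: val_loss is lowest after the first epoch and rises afterwards while the training loss keeps falling. One standard option is to stop training early and keep the best weights. A sketch with a Keras callback, assuming model, train_dataset, and valid_dataset are the objects built in your training script:

import tensorflow as tf

# Stop when validation loss stops improving and roll back to the best epoch's weights
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=1,               # val_loss already rises after epoch 1, so little patience is needed
    restore_best_weights=True,
)

# model, train_dataset, and valid_dataset are placeholders for your own objects
model.fit(
    train_dataset,
    validation_data=valid_dataset,
    epochs=10,                # the callback will usually end training after 2-3 epochs
    callbacks=[early_stopping],
)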