kermitt2 / delft

a Deep Learning Framework for Text

Comparison of versions 0.2.6 and 0.3.0 with SciBERT

lfoppiano opened this issue

I've made several tests with SciBERT, trying to keep the same conditions between the two versions of DeLFT.

| # | delft version | run | architecture | batch size | max seq length | max epoch | F1 |
|---|---------------|-------|--------------|------------|----------------|-----------------|--------|
| 1 | 0.2.6 | 24142 | scibert | 6 | 512 | 50 | 0.8332 |
| 2 | 0.2.6 | 24063 | scibert | 6 | 512 | 50 | 0.8327 |
| 3 | 0.3.0 | 24141 | BERT | 20 | 512 | 60 + early stop | 0.8134 |
| 4 | 0.3.0 | 24138 | BERT_CRF | 20 | 512 | 60 + early stop | 0.8092 |
| 7 | 0.3.0 | 24136 | BERT_CRF | 20 | 512 | 60 + early stop | 0.8173 |
| 5 | 0.3.0 | 24146 | BERT_CRF | 20 | 512 | 15 | 0.8137 |
| 6 | 0.3.0 | 24145 | BERT | 20 | 512 | 15 | 0.8178 |
| 8 | 0.3.0 | 24147 | BERT_CRF | 6 | 512 | 50 | 0.8327 |
| 9 | 0.3.0 | 24148 | BERT | 6 | 512 | 50 | 0.8325 |

Runs 2 and 4 are repetitions, to make sure the results are consistent.

The dataset is the same, details here:

- 8167 train sequences
- 908 validation sequences
- 1009 evaluation sequences

I could try to reduce the batch size for DeLFT 0.3.0, but I doubt it would make any difference. A sketch of how the 0.3.0 runs are launched is below.
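For reference, a minimal sketch of how one of the 0.3.0 runs above could be launched through the Python wrapper. The constructor parameter names (`transformer_name`, `max_sequence_length`, `early_stop`, ...) are my assumptions about the 0.3.0 `Sequence` API and may differ in your checkout, and the toy data just stands in for the real corpus:

```python
# Sketch only: assumed DeLFT 0.3.0 sequence-labelling wrapper and parameter
# names; the tiny "corpus" below is a placeholder for the real training data.
from delft.sequenceLabelling import Sequence

# placeholder tokenized sequences + IOB labels standing in for the 8167/908 corpus
x_train = [["Gallium", "arsenide", "is", "a", "semiconductor", "."]]
y_train = [["B-<material>", "I-<material>", "O", "O", "O", "O"]]
x_valid, y_valid = x_train, y_train

model = Sequence(
    "my-scibert-tagger",                                # hypothetical model name
    architecture="BERT_CRF",
    transformer_name="allenai/scibert_scivocab_cased",  # SciBERT from Hugging Face
    batch_size=6,                                       # matches runs 8-9
    max_sequence_length=512,
    max_epoch=50,
    early_stop=False,                                   # run the full epoch budget
)
model.train(x_train, y_train, x_valid, y_valid)
model.save()
```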

Early stop was not supported by the BERT architecture in the previous DeLFT version, only by the RNN architectures. So you were likely doing 50 epochs?

I would change the max epoch as a first try.

The number of epochs for BERT-based models can normally be very low. I was getting my best results with TF1 using 3-5 epochs for NER; beyond that, accuracy was unchanged or decreasing. With TF2, I keep it at 5-10.
It might depend on the training size, I guess. What is the size of this training set?

With a higher number of epochs, you could also try to decrease the learning rate.

On my side, for reference, for the CoNLL and Grobid models (using SciBERT), all the models using BERT give slightly better results with the new version. The model for my largest training set ("software mention recognition", with 8M tokens) with SciBERT also shows a small improvement.
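As a sketch of the suggestion above (lower learning rate when raising max epoch), assuming the wrapper exposes a `learning_rate` constructor parameter; check `delft/sequenceLabelling/wrapper.py` for the exact name and its default before relying on it:

```python
# Sketch only: learning_rate is an assumed parameter of the 0.3.0 Sequence
# wrapper; verify the exact name and default in wrapper.py.
from delft.sequenceLabelling import Sequence

model = Sequence(
    "my-scibert-tagger",                                # hypothetical model name
    architecture="BERT",
    transformer_name="allenai/scibert_scivocab_cased",
    batch_size=20,
    max_sequence_length=512,
    max_epoch=15,         # BERT fine-tuning usually converges in few epochs
    learning_rate=1e-5,   # lower than the 2e-5 commonly used for BERT fine-tuning
)
```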

> Early stop was not supported by the BERT architecture in the previous DeLFT version, only by the RNN architectures. So you were likely doing 50 epochs?

Ah ok, I updated the two rows then.

> I would change the max epoch as a first try.

> The number of epochs for BERT-based models can normally be very low. I was getting my best results with TF1 using 3-5 epochs for NER; beyond that, accuracy was unchanged or decreasing. With TF2, I keep it at 5-10. It might depend on the training size, I guess. What is the size of this training set?

The training set is:

- 8167 train sequences
- 908 validation sequences
- 1009 evaluation sequences

I changed max epoch to 15 (see rows 5 and 6) and the scores improve a bit, but not quite to the level of rows 1 and 2. Should I reduce it even more, like to 5 or 10?

> With a higher number of epochs, you could also try to decrease the learning rate.

> On my side, for reference, for the CoNLL and Grobid models (using SciBERT), all the models using BERT give slightly better results with the new version. The model for my largest training set ("software mention recognition", with 8M tokens) with SciBERT also shows a small improvement.

> I changed max epoch to 15 (see rows 5 and 6) and the scores improve a bit, but not quite to the level of rows 1 and 2. Should I reduce it even more, like to 5 or 10?

I would try 5 to see, but I had good results with 15. Maybe also decrease the batch size to 6 to check. It's unexpected to see a lower score with CRF; normally it improves things a bit. It's possible to use BERT_ChainCRF as an implementation variant to double-check.

Not sure it's useful, but to be sure to use all the available training data with BERT, note that the early_stop parameter is true by default, so you have to set early_stop to false explicitly before training to make sure it's not used. Then both the train and validation sets will be used for training until max_epoch is reached (see the sketch below).
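A small sketch of the two behaviours described above, again assuming the 0.3.0 wrapper's `early_stop` / `patience` parameters (names to be checked against `wrapper.py`):

```python
# Sketch only: assumed 0.3.0 Sequence parameters (early_stop, patience).
from delft.sequenceLabelling import Sequence

common = dict(
    architecture="BERT_CRF",
    transformer_name="allenai/scibert_scivocab_cased",
    batch_size=20,
    max_sequence_length=512,
    max_epoch=60,
)

# early_stop=True: the validation split only drives the stopping criterion,
# so those sequences never contribute to the weight updates
with_stop = Sequence("run-early-stop", early_stop=True, patience=5, **common)

# early_stop=False: train + validation are both used for training and the
# model runs for the full max_epoch budget
without_stop = Sequence("run-no-early-stop", early_stop=False, **common)
```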

OK, it seems that the results are comparable 🎉, see runs 8 and 9.

After a lot of tries, with the new version I obtain the best results using early_stop=True for the architectures using BERT, even though part of the training data is used to check the stopping criterion.

With early_stop=True that was not the case for me; did you use any special parameters?

I tested again with the latest changes. With early_stop=True I get worse results, unfortunately.

Here is the comparison between early_stop=True and early_stop=False for DeLFT 0.3.0.

| # | run | architecture | transformer | batch size | max seq length | max epoch | early_stop | F1 |
|---|-------|-------------------|---------------|------------|----------------|-----------|------------|-------|
| 1 | 24304 | BERT_CRF | scibert_cased | 20 | 512 | 60 | False | 82.99 |
| 2 | 24311 | BERT_CRF | scibert_cased | 20 | 512 | 60 | True | 81.44 |
| 3 | 24305 | BERT | scibert_cased | 20 | 512 | 60 | False | 82.73 |
| 4 | 24312 | BERT | scibert_cased | 20 | 512 | 60 | True | 81.70 |
| 5 | 24307 | BERT_CRF_FEATURES | scibert_cased | 20 | 512 | 60 | False | 83.31 |
| 6 | | BERT_CRF_FEATURES | scibert_cased | 20 | 512 | 60 | True | |
| 7 | 24418 | BERT_CRF_FEATURES | scibert_cased | 20 | 512 | 10 | False | 81.29 |
| 8 | 24303 | BERT_CRF | matscibert | 20 | 512 | 60 | False | 82.88 |
| 9 | 24313 | BERT_CRF | matscibert | 20 | 512 | 60 | True | 81.52 |

Could it be that the ~1000 validation examples are making such a difference?
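For what it's worth, a quick back-of-the-envelope check of how much labelled data is withheld from weight updates when early_stop=True, using the dataset figures quoted earlier in the thread:

```python
# Figures from the dataset description above
train_seqs, valid_seqs = 8167, 908
print(f"validation share: {valid_seqs / (train_seqs + valid_seqs):.1%}")  # -> 10.0%
```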

I'm closing this for the moment, as I've managed to obtain the same results with DeLFT 0.3.0 as I had before 🎉