the pre-trained MLM performance
yyht opened this issue · comments
Hi. The corpus is too small for the pretraining stage. I think you need millions of sentences, at least one million.
It's easy to get raw data for the pretraining stage, as long as each line contains a document or one or more sentences.
It's also common sense to use a large corpus when training word embeddings; the same applies to pretraining a language model.
Let me know the result after pretraining the masked language model on a lot more data.
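For reference, the core masking step of MLM pretraining looks roughly like this (a minimal sketch over a plain token list, assuming BERT's usual ~15% masking rate; real BERT additionally replaces some chosen positions with random or unchanged tokens instead of `[MASK]`):

```python
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15  # BERT masks roughly 15% of input tokens

def mask_tokens(tokens, rng=random):
    """Return (masked_tokens, labels). labels holds the original token
    at masked positions and None everywhere else, so the loss is only
    computed where a token was actually masked."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

With only a small corpus, the model sees too few distinct masked contexts to learn useful representations, which is why the sentence count matters so much here.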
Hi, I tried your bert_model rather than bert_cnn_model. bert_model got about a 75% F1 score on the masked language model task, but when I used the pretrained bert_model to fine-tune on the classification task, it didn't work: the F1 score was still only about 10% after several epochs. Is something wrong with bert_model?
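One thing worth ruling out: an F1 stuck around 10% (roughly chance level for a ~10-class task) often means the pretrained weights were never actually restored, so the model fine-tunes from random initialization. A minimal, framework-agnostic check is to diff the variable names the fine-tuning graph expects against the names stored in the checkpoint (all names below are hypothetical, for illustration only):

```python
def unrestored_variables(model_vars, checkpoint_vars):
    """Return the model variables that have no matching name in the
    checkpoint; these stay randomly initialized during fine-tuning."""
    ckpt = set(checkpoint_vars)
    return sorted(v for v in model_vars if v not in ckpt)

# Hypothetical variable names for illustration only.
model = [
    "bert/embeddings/word_embeddings",
    "bert/encoder/layer_0/kernel",
    "classifier/output_weights",
]
ckpt = [
    "bert/embeddings/word_embeddings",
    "bert/encoder/layer_0/kernel",
]
print(unrestored_variables(model, ckpt))  # → ['classifier/output_weights']
```

If anything beyond the new classification head shows up as unrestored, the checkpoint path or variable scoping is the likely culprit rather than the model itself.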