Why does pre-training work?
guotong1988 opened this issue
Do you think it works mainly because BERT is bidirectional, and a CNN could serve the same function?
@brightmart Thank you!
No, it is not specific to BERT or CNNs.
For any model, regardless of architecture, the pre-training stage learns most of the model's parameters through a pre-training task, which is itself a kind of supervised learning (the labels are derived from the data). During fine-tuning, you therefore only need to learn a few parameters, such as those of the last layer, which acts as the classifier.
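To illustrate the point above, here is a minimal sketch in plain Python (no specific library assumed): the encoder's weights stand in for parameters learned during pre-training and are kept frozen, while fine-tuning updates only the last-layer classifier. The feature values and data are invented for the example, not taken from any real model.

```python
import math
import random

random.seed(0)

# Stand-in for a pre-trained encoder: a fixed (frozen) feature map.
# In a real setup these weights would come from the pre-training task.
FROZEN_W = [[0.5, -0.3], [0.1, 0.8]]

def encode(x):
    """Frozen encoder: never updated during fine-tuning."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in FROZEN_W]

# Fine-tuning learns only the last-layer classifier (w, b).
w, b = [0.0, 0.0], 0.0

def predict(x):
    """Logistic-regression head on top of the frozen features."""
    z = sum(wi * hi for wi, hi in zip(w, encode(x))) + b
    return 1 / (1 + math.exp(-z))  # sigmoid

# Tiny labeled data set for the downstream task (hypothetical).
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]

lr = 1.0
for _ in range(200):
    for x, y in data:
        h = encode(x)
        err = predict(x) - y          # gradient of log-loss w.r.t. z
        for i in range(2):
            w[i] -= lr * err * h[i]   # update classifier weights only
        b -= lr * err

print([round(predict(x)) for x, _ in data])  # [1, 0]
```

Only `w` and `b` ever change; the encoder's parameters stay exactly as pre-training left them, which is why fine-tuning needs so little labeled data.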