Using SentEval to evaluate the word vectors and hidden vectors of pretrained models for natural language processing.
To aggregate the hidden vectors of a sentence into a single vector, four methods are used: the vector at the first position (the [CLS] symbol), the vector at the last position (the [SEP] symbol), the average across the sentence, and the element-wise maximum across the sentence.
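The four aggregation methods can be sketched as follows. This is a minimal NumPy sketch, not the actual evaluation code; `pool_hidden_states` is a hypothetical helper whose output would be fed to SentEval's batcher.

```python
import numpy as np

def pool_hidden_states(hidden, method):
    """Collapse a (seq_len, dim) matrix of token vectors from one layer
    into a single sentence vector.

    `hidden` is assumed to include the [CLS] vector at position 0 and
    the [SEP] vector at the last position, as in BERT's input layout."""
    if method == "first":   # the [CLS] vector
        return hidden[0]
    if method == "last":    # the [SEP] vector
        return hidden[-1]
    if method == "mean":    # average across the sentence
        return hidden.mean(axis=0)
    if method == "max":     # element-wise maximum across the sentence
        return hidden.max(axis=0)
    raise ValueError(f"unknown method: {method}")
```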
There are 12 layers in the bert-base-uncased model, denoted as 1-12. The word vectors are also included, denoted as 0. Modified code from HuggingFace and the pretrained model from Google are used. The colors in the heatmaps are normalized to show the trend across layers.
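One way to normalize the heatmap colors is min-max scaling each task's scores across the 13 layers (word vectors plus 12 transformer layers); the exact normalization used for the figures is an assumption here.

```python
import numpy as np

def normalize_per_task(scores):
    """Min-max normalize a (n_layers, n_tasks) score matrix column-wise,
    so each task's colors show the trend across layers
    (0 = worst layer, 1 = best layer for that task).
    nan-aware, since the word embedding layer can produce NaNs."""
    lo = np.nanmin(scores, axis=0)
    hi = np.nanmax(scores, axis=0)
    return (scores - lo) / (hi - lo)
```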
SkipThought-LN < GloVe BoW < fastText BoW < InferSent < Char-phrase. The higher, the better, except for metrics with the suffix m, i.e., MSE.
Some NaNs are encountered at the word embedding layer (white cells). L4 and L5 are the best-performing layers, but the results are still not favorable.
Results are better than those of first. The best-performing layers spread from L4 to L8. Note that the last layer (L12) performs worse than the last layer under first.
Good performance for L1 and L2, much better than first and last. These are the bottom layers.
Good performance for L1 and L2, better than max. In general, better than fastText BoW and on par with InferSent. These are almost the bottom layers.
GloVe < fastText < SkipThought < InferSent
The higher, the better. Good performance, almost as good as SkipThought. Maybe the next sentence prediction task is essential for BERT?
The best layer (L9) is better than first.
Not as good as last.
Better than first and last, arguably better than SkipThought.
fastText BoW < NLI < SkipThought (except that SkipThought is really bad at WC) < AutoEncoder < NMT < Seq2Tree. These reflect linguistic properties.
Not good.
Worse.
Not good.
Best performing. But it still lags far behind NMT or Seq2Tree, and is on par with NLI pretraining of BiLSTM and GatedConvNet.
- BERT's downstream tasks are all classification tasks, so it is perhaps not surprising that its hidden layers, even without task-specific fine-tuning, are also very good at classification.
- BERT's [CLS] vector and [SEP] vector embody a lot of information, maybe too much. This may stem from the next sentence prediction task, which makes use of these two vectors and which also makes SkipThought successful at classification tasks. (How does masked language modelling function here?)
- BERT's hidden representation is not as good in Semantic Relatedness or Linguistic Properties as in Text Classification, similar to SkipThought or NLI pretraining. These aspects seem unimportant for classification tasks or extraction tasks; however, they may be important for generation.
- From early observations of BERT's attention distribution, it is also found that only in the bottom layers is attention spread over the whole sentence; in higher layers, tokens mostly attend to the [CLS] and [SEP] symbols. It is possible that only the bottom layers gather and combine semantics, as the semantic relatedness results suggest, while [CLS] and [SEP] become the actual sentence representations in higher layers. Task-specific fine-tuning may change this, but by how much? (bert-base-uncased seems to overfit too quickly during SQuAD fine-tuning.)
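The observation about attention collapsing onto [CLS] and [SEP] can be quantified with a simple statistic. This is a sketch under the assumption that one head's attention matrix is available from the model's attention outputs; `special_token_attention_mass` is a hypothetical helper.

```python
import numpy as np

def special_token_attention_mass(attn):
    """Given one head's attention matrix `attn` of shape
    (seq_len, seq_len), where row i is token i's attention distribution,
    return the average probability mass placed on [CLS] (position 0)
    and [SEP] (last position). Values near 1 mean attention has
    collapsed onto the special tokens."""
    mass = attn[:, 0] + attn[:, -1]
    return float(mass.mean())
```

Computing this per layer would make the bottom-vs-top contrast described above directly measurable.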
There are 24 layers in the bert-large-uncased model, denoted as 1-24. The word vectors are also included, denoted as 0. Modified code from HuggingFace and the pretrained model from Google are used. The colors in the heatmaps are normalized to show the trend across layers.
SkipThought-LN < GloVe BoW < fastText BoW < InferSent < Char-phrase
Some NaNs are encountered at the word embedding layer (white cells). L7-L9 are the best-performing layers, but the results are still not favorable.
Results are the worst. The best-performing layers spread over L6-L7 and L14-L15. Note that the last layer (L24) performs better than the last layer under first.
Good performance for L1-L3 and L9. These are the bottom layers.
Good performance for L1-L6, better than max. In general, better than InferSent. These are almost the bottom layers.
GloVe < fastText < SkipThought < InferSent
The higher, the better. Good performance. Really good at binary classification, better than InferSent. For others, as good as SkipThought.
Worse than first.
Better than last.
The pattern indicates that the most useful information is contained in first and last.
fastText BoW < NLI < SkipThought (except that SkipThought is really bad at WC) < AutoEncoder < NMT < Seq2Tree. These reflect linguistic properties.
Not good.
Worse.
Not good.
Best performing. But it still lags far behind NMT or Seq2Tree, and is on par with NLI pretraining of BiLSTM and GatedConvNet. Note that WC correlates positively with most downstream tasks.
- More layers do not learn better linguistic patterns or semantic relatedness per se. However, they do improve the performance of first or last methods.
- More layers do improve classification results, by about 2 absolute points. The result pattern is almost the same among first, last, max, and mean for each task across layers. In particular, considering the average performance across tasks, the best results are always achieved near the top layers.
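The claim about where the best layers sit can be checked by averaging each method's scores across tasks and taking the argmax over layers. A sketch, assuming a hypothetical score array with layers as rows:

```python
import numpy as np

def best_layer(scores):
    """`scores` has shape (n_layers, n_tasks); higher is better.
    Return the index of the layer with the best mean score across
    tasks (0 = word embeddings, 1..n = transformer layers).
    nan-aware, since some word-embedding cells are NaN."""
    return int(np.nanmean(scores, axis=1).argmax())
```

Running this once per aggregation method (first, last, max, mean) would show whether the best average layer is indeed near the top for each.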