n-waves / multifit

Code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning": https://arxiv.org/abs/1909.04761


Problems with reproducing zero-shot learning results

blazejdolicki opened this issue

I tried replicating the results for zero-shot learning on CLS, but my results don't match those from the paper. Since the script for predicting labels with LASER does not seem to be part of the MultiFiT repository, I trained LASER on the CLS dataset (only en and de books for now) by adapting the MLDoc script from the LASER repo to CLS. My fork of LASER with these adjustments is [here](https://github.com/blazejdolicki/LASER). For the time being I have only tested on books in German. After some hyperparameter tuning on the English training set, my best setup obtains 82.25% accuracy, compared to 84.15% from the MultiFiT paper. My hyperparams are:

n_epochs=200
lr=0.001
wd=0.0
nhid="10 8"
drop=0.2
seed=1
bsize=12

and I'm using the last 10% of the test set as validation.
When I tried to make them more similar to MultiFiT's (n_epochs=8, wd=0.001, bsize=18), the accuracy dropped to around 60%.
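
For concreteness, here is a minimal sketch of what I believe these hyperparameters correspond to, assuming an MLDoc-style classifier over precomputed LASER sentence embeddings (the class name, dimensions and training-loop details are my assumptions, not the exact script):

```python
import torch
import torch.nn as nn

# Hypothetical reconstruction of the classifier implied by the hyperparameters
# above: a small MLP over precomputed 1024-dim LASER sentence embeddings, in
# the style of the MLDoc classifier in the LASER repo.
class LaserMLP(nn.Module):
    def __init__(self, in_dim=1024, nhid=(10, 8), n_classes=2, drop=0.2):
        super().__init__()
        layers, prev = [], in_dim
        for h in nhid:  # nhid="10 8" -> hidden layers of 10 and 8 units
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(drop)]
            prev = h
        layers.append(nn.Linear(prev, n_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

torch.manual_seed(1)                                  # seed=1
model = LaserMLP()
opt = torch.optim.Adam(model.parameters(), lr=0.001,  # lr=0.001
                       weight_decay=0.0)              # wd=0.0
# ...train for n_epochs=200 with batch size 12 (bsize=12) on embeddings of
# the English training set, then predict labels for the German sets.
```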

Afterwards, I used the best (82.25% acc) LASER classifier (trained on the English training set) to predict labels for the German books. I then copied the test, training and unsupervised sets in the MultiFiT repo from the folder de-books into de-books-laser and replaced the ground-truth labels in the training set with the pseudolabels, as sketched below. After training the MultiFiT classifier on those pseudolabels, my validation accuracy isn't great but is at least similar, while my test set accuracy is as low as 70% (compared to 89.60% from the paper and here), as you can see in the attached logs.
Multifit CLS zero shot terrible results 15.04.2020.txt
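
For clarity, the label replacement step amounts to something like this (a minimal sketch, assuming the CLS csv files have a (label, summary, text) column layout; the helper name and paths are mine):

```python
import pandas as pd

# Overwrite the ground-truth labels of a CLS split with LASER predictions.
# File paths and the column layout are assumptions about the CLS data.
def replace_with_pseudolabels(csv_path, pseudolabels):
    df = pd.read_csv(csv_path, header=None, names=["label", "summary", "text"])
    assert len(pseudolabels) == len(df)
    df["label"] = pseudolabels
    df.to_csv(csv_path, header=False, index=False)

# e.g. replace_with_pseudolabels("de-books-laser/de.train.csv", laser_preds)
```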

I did expect some drop due to the issue explained in #63, but such a big difference shows that the unsupervised set size can't be the only factor deteriorating the results. Other possible reasons for the drop in performance that come to mind are:

  • I may have used different hyperparameters than you did for training LASER and predicting the pseudolabels.
  • I may have used a different train-dev split than you did for training LASER and predicting the pseudolabels.
  • Your script may have loaded the LASER model with the fastai library and trained the classifier with it, instead of plain PyTorch as in my setup.

My fork of multifit is here; I'm using the ulmfit-original-scripts branch.

I would really appreciate a reply :)

Hey Blazej, I updated the other issue with a solution. Can you let me know whether that fixed it, or you still cannot reproduce the results?

Thanks for your response. Using more data helped to some extent, but after some more digging I realized the real issue. The CLS dataset has three columns: label, summary and the actual review text. Initially, in zero-shot learning, I was discarding the summary column, thinking it was irrelevant. All that adding the summary does is increase the amount of data used for fine-tuning the LM, as in the sketch below. After I included the summary, to my surprise the classification test results jumped by ~15%! Without the "summary" column the LM had 60% (val) accuracy in the first epoch (out of 20), while with it the accuracy is 37%. I'm not sure why including summaries, which are usually shorter than the main text, makes such a difference. The LM training time per epoch also changed from 18 seconds to 2 minutes and 23 seconds.
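
In other words, the change amounts to something like this (a sketch, assuming the same (label, summary, text) layout and a file path of my own choosing):

```python
import pandas as pd

# Build the LM fine-tuning corpus from summary + review text instead of the
# review text alone. The column layout and the file path are assumptions.
df = pd.read_csv("de-books/de.unsup.csv", header=None,
                 names=["label", "summary", "text"])
# Before: lm_texts = df["text"].tolist()
lm_texts = (df["summary"].fillna("") + " " + df["text"].fillna("")).tolist()
```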

Currently my LASER results are still ~2% lower than those from the paper, and so are my zero-shot learning MultiFiT results, so it's presumably just a matter of differences between my implementation of CLS on LASER and yours. Do you have access to the script that you used to train LASER on CLS? It would be great to compare hyperparameters and check whether they are responsible for this difference.