DFKI-NLP / thermostat

Collection of NLP model explanations and accompanying analysis tools

MNLI and XNLI downstream model performance very low

nfelnlp opened this issue · comments

The IMDb and AG News accuracies (comparing true labels to predicted labels) are reasonably high.
However, almost all MNLI and XNLI models investigated so far show extremely low accuracy.

Here are some random configs that I investigated:
multi_nli-albert-occ : 6.48%
multi_nli-roberta-occ : 28.72%
multi_nli-xlnet-occ : 5.53%
multi_nli-bert-lime : 6.46%
xnli-bert-lig : 7.05%

Curiously, the ELECTRA models (submitted by a different person than all the other models) are not affected:
multi_nli-electra-lgxa : 88.7%
xnli-electra-lig : 88.3%
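
For reference, here is roughly how such a number can be computed; a minimal sketch that assumes the same thermostat attributes used in the snippet further below (true_label['index'] and predicted_label['index']):

```python
import thermostat

def accuracy(config_name):
    # Compare the true vs. the predicted label index for every instance in the subset.
    pairs = [(i.true_label['index'], i.predicted_label['index'])
             for i in thermostat.load(config_name)]
    return sum(t == p for t, p in pairs) / len(pairs)

print(f"{accuracy('multi_nli-albert-occ'):.2%}")  # reported above as 6.48%
```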

It's not clear to me yet if this is simply an issue stemming from an outdated label order as documented here:
huggingface/transformers#10203

However, the label orders mentioned there, both old and new, also differ from the label assignment used by datasets: https://huggingface.co/datasets/viewer/?dataset=xnli (multi_nli uses the same labels).

Note that the datasets version we use in thermostat is 1.5.0, so this might be a case of the downstream models predicting labels in a different order than the dataset provides.
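
For context, the two orders in question look roughly like this. The datasets order is what the viewer shows for xnli/multi_nli; the model-side order is an assumption on my part (it matches the legacy GLUE MNLI convention and the counts further below):

```python
# Label order provided by the datasets library (v1.5.0) for multi_nli / xnli:
DATASETS_LABELS = ["entailment", "neutral", "contradiction"]  # indices 0, 1, 2

# Label order the affected downstream models appear to predict in
# (legacy GLUE MNLI convention; an assumption, not confirmed for every checkpoint):
MODEL_LABELS = ["contradiction", "entailment", "neutral"]     # indices 0, 1, 2
```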

I then investigated how the predictions and true labels of one of the affected models (xnli-bert-lgxa) align:

```python
import thermostat
from collections import Counter

# Load the affected subset and collect (true label index, predicted label index) pairs.
bert_lgxa = thermostat.load("xnli-bert-lgxa")
true_pred_comp_bert_lgxa = [(b_i.true_label['index'], b_i.predicted_label['index']) for b_i in bert_lgxa]
Counter(true_pred_comp_bert_lgxa)
```

```
>>> Counter({(2, 0): 1488,
             (0, 1): 1345,
             (1, 0): 153,
             (1, 2): 1411,
             (0, 2): 231,
             (0, 0): 93,
             (1, 1): 107,
             (2, 2): 153,
             (2, 1): 29})
```

This leads me to assume that (2, 0), (0, 1) and (1, 2) are actually the correct predictions and (0, 0), (1, 1) and (2, 2) are the wrong ones. If we sum up the counts for (2, 0), (0, 1) and (1, 2) and divide by the sum of all counts (i.e. the size of the XNLI subset), we end up at 84.71%, which is a much more reasonable number in my opinion.
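
As a quick sanity check of that hypothesis, here is a sketch that continues the session above with a hypothetical remapping (model prediction index to datasets label index) derived from the three dominant pairs:

```python
# Hypothetical remap: model prediction index -> datasets label index,
# derived from the dominant (true, predicted) pairs (2, 0), (0, 1) and (1, 2).
REMAP = {0: 2, 1: 0, 2: 1}

counts = Counter(true_pred_comp_bert_lgxa)
corrected = sum(n for (true, pred), n in counts.items() if REMAP[pred] == true)
print(corrected / sum(counts.values()))  # ~0.8471, i.e. the 84.71% mentioned above
```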

At the very least, this means that all thermostat subsets concerning MNLI and XNLI need to be redone (editing the JSONL files and re-uploading them).
Hopefully, this only requires going through each JSONL and changing the values in one of two ways:

  1. Changing the true labels to the old standard. However, this would mean we no longer use the vanilla data from datasets.
  2. Changing the predicted labels as well as the logits (see the sketch after this list).
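
Very roughly, option 2 could look like the sketch below. The field names (predicted_label, logits) are hypothetical placeholders; the actual thermostat JSONL schema may differ and would have to be checked first.

```python
import json

# Hypothetical remap from the model's label order to the datasets order (see above).
REMAP = {0: 2, 1: 0, 2: 1}

def fix_jsonl_line(line):
    record = json.loads(line)
    # Remap the predicted label index to the datasets label order.
    record["predicted_label"] = REMAP[record["predicted_label"]]
    # Reorder the logits so that position i holds the score for datasets label i.
    logits = record["logits"]
    reordered = [None] * len(logits)
    for model_idx, datasets_idx in REMAP.items():
        reordered[datasets_idx] = logits[model_idx]
    record["logits"] = reordered
    return json.dumps(record)
```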

I'm pretty positive that we don't need to run the explanations again.


On a side note, I also considered that the encode_pair function (which is only used for MNLI and XNLI in thermostat) might not work correctly, but I couldn't find any reference implementation suggesting that the way the two text fields are ingested is wrong.
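
For comparison, this is the standard way a premise/hypothesis pair goes into a Hugging Face tokenizer; it is the reference behaviour I compared against, not necessarily what encode_pair does internally:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Passing the two text fields as separate arguments lets the tokenizer insert
# the model-specific separator tokens between premise and hypothesis.
encoded = tokenizer("A premise sentence.", "A hypothesis sentence.",
                    truncation=True, padding="max_length", max_length=128)
```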

A hotfix has been implemented by adding distinct values for label_classes in the config:

https://github.com/nfelnlp/thermostat/blob/30dac0b5a2c9ad2931e34d14ffb7b58e9f5200b4/src/thermostat/data/thermostat_configs.py#L81-L84

vs

https://github.com/nfelnlp/thermostat/blob/30dac0b5a2c9ad2931e34d14ffb7b58e9f5200b4/src/thermostat/data/thermostat_configs.py#L86-L89

and then changing the dataset label column and label_names:

https://github.com/nfelnlp/thermostat/blob/30dac0b5a2c9ad2931e34d14ffb7b58e9f5200b4/src/thermostat/data/dataset_utils.py#L111-L116
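
Illustratively (a hypothetical sketch, not the actual contents of thermostat_configs.py), the distinction boils down to keeping two different label orders around:

```python
# Hypothetical sketch of the distinction, not the actual config entries:
# label order as provided by datasets v1.5.0 for multi_nli / xnli
label_classes_datasets_order = ["entailment", "neutral", "contradiction"]
# label order the affected downstream models were fine-tuned with (legacy GLUE MNLI convention)
label_classes_model_order = ["contradiction", "entailment", "neutral"]
```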

I'll close this for now, but the right way to go would probably be to fix the label order when loading the dataset, before running the explanation job.