princeton-nlp / LM-BFF

[ACL 2021] LM-BFF: Better Few-shot Fine-tuning of Language Models https://arxiv.org/abs/2012.15723

Testing: New Data for GLUE Tasks

YashBit opened this issue · comments

Now, I can see that there was another issue similar to this. However, I am still not clear on how to deal with out-of-distribution (OOD) test data.

I want to train and validate on the original train.tsv and dev.tsv in the ORIGINAL folder, but test on an out-of-distribution dataset.

So, let's say I trained roberta-base on SST-2 and want to test it on IMDB. How should I go about it? Currently, I replace test.tsv in the ORIGINAL folder and generate the K-shot data. Then I run the commands given in the README on the repo page. However, the test eval accuracy is the same as with the original SST-2 test dataset. I don't know what is happening here. To reiterate:

My objective:

  1. Test IMDB on roberta-base (seed 42) trained on SST-2, while training and validating on the original data provided with the repo.

Action:

  1. Replace test.tsv in the ORIGINAL SST-2 folder with IMDB.

Observed Behaviour:

  1. The test eval accuracy is the same as the original, as if test.tsv had not been replaced.

Expected Behaviour:

  1. Same train and dev accuracy, but a different test accuracy.

Request:

  1. Please help :) We replaced the original test.tsv and then generated the K-shot data again, but there was no change.
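In case it helps with reproducing this setup, here is a sketch of writing the replacement test.tsv. The `sentence<TAB>label` header and 0/1 labels are assumptions about SST-2's file layout, not confirmed from the repo; check your original tsv files for the exact format.

```python
import csv

def write_sst2_style_tsv(examples, path):
    """Write (text, label) pairs in an assumed SST-2 tsv layout:
    a 'sentence<TAB>label' header, then one example per line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["sentence", "label"])
        for text, label in examples:
            # Flatten tabs/newlines so each example stays on one line.
            writer.writerow([text.replace("\t", " ").replace("\n", " "), label])

# e.g. write_sst2_style_tsv(imdb_pairs, "data/original/SST-2/test.tsv")
```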

Hi,

Make sure your cache files are deleted, or use a completely separate data directory/file naming from the original, or pass the cache-overwrite flag. The data loader will load existing cached torch files if --overwrite_cache is not set, which is the default.

Reference:

# Cache name distinguishes mode, task name, tokenizer, and length. So if you change anything beyond these elements, make sure to clear your cache.
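A minimal sketch for clearing the cache, assuming the cached feature files follow the usual `cached_*` naming convention inside the data directories (verify the actual file names in your data folders before deleting):

```python
import glob
import os

def clear_cached_features(data_dir):
    """Delete cached_* feature files under data_dir so the data
    loader re-tokenizes the (replaced) tsv files on the next run."""
    removed = []
    pattern = os.path.join(data_dir, "**", "cached_*")
    for path in glob.glob(pattern, recursive=True):
        os.remove(path)
        removed.append(path)
    return removed

# e.g. clear_cached_features("data/k-shot/SST-2")
```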

Ok, I will delete the cache directories @ajfisch. So should I replace the test.tsv files in the ORIGINAL folder for all tasks in a similar manner?

Yes. Either delete the existing cache files (the code will then regenerate the missing files), or save the alternate data to a new data directory (so the cache files are saved to and loaded from new_data_dir/<cache_file_name>). Both should work.

Hi,
Thanks again for the great work.

Today I actually encountered the same error as issue #7 when testing a model prompt-tuned on SST-2 directly on the IMDB movie review dataset, by replacing the dev.tsv in /original with the IMDB dataset, as mentioned in issue #14.

What I did:

  1. Prompt-tune a model checkpoint on SST-2 and save the model.
  2. Replace data/original/SST-2/dev.tsv with my own IMDB dataset, formatted correctly.
  3. Run tools/generate_k_shot.py again; data/k-shot/SST-2/test.tsv now contains IMDB.
  4. Load the model from step 1 with --no_train, --do_predict, --overwrite_cache, and the other necessary flags to zero-shot on the IMDB dataset. I also cleared the cache before running it.
    An error occurs:
    Traceback (most recent call last):
      File "run.py", line 628, in <module>
        main()
      File "run.py", line 466, in main
        if training_args.do_predict
      File "/home/yb1025/Research/ML_2/robustness/LM-BFF/src/dataset.py", line 465, in __init__
        verbose=True if _ == 0 else False,
      File "/home/yb1025/Research/ML_2/robustness/LM-BFF/src/dataset.py", line 585, in convert_fn
        other_sent_limit=self.args.other_sent_limit,
      File "/home/yb1025/Research/ML_2/robustness/LM-BFF/src/dataset.py", line 243, in tokenize_multipart_input
        mask_pos = [input_ids.index(tokenizer.mask_token_id)]
    ValueError: 50264 is not in list
    This "50264" is the same error as in issue #7
    Sorry for the inconvenience but do you happen to know what might went wrong?

Many thanks.
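For what it's worth, one plausible cause (a guess, not confirmed for this case): IMDB reviews are much longer than SST-2 sentences, so the template-filled input can exceed the maximum sequence length, and truncation drops the <mask> token; input_ids.index(...) then raises exactly this ValueError. A minimal sketch with made-up token ids:

```python
# Hypothetical illustration: the mask token sits after the input text
# in the template, so truncating a long input can cut it off.
MASK_TOKEN_ID = 50264  # RoBERTa's <mask> token id
max_seq_length = 8

# Pretend token ids for a long review, with the mask appended last.
input_ids = [101, 102, 103, 104, 105, 106, 107, 108, 109, MASK_TOKEN_ID]
input_ids = input_ids[:max_seq_length]  # truncation drops the mask

try:
    mask_pos = [input_ids.index(MASK_TOKEN_ID)]
except ValueError as exc:
    print(exc)  # 50264 is not in list
```

If this is the cause, raising the max sequence length or truncating the review text before filling the template should avoid it.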

Making another issue, since the new error is different.