tomhosking / hrq-vae

Hierarchical Sketch Induction for Paraphrase Generation (Hosking et al., ACL 2022)

Training/Dev/Test split: splitforgeneval vs. training-triples

guangsen-wang opened this issue · comments

Hi, thanks for sharing the wonderful project and data. I am trying to use the released data for training my own T5-based paraphrasing model. However, there are multiple sets of train/dev/test.jsonl files under different folders. For example,

for paralex:

  1. wikianswers-para-splitforgeneval
  2. training-triples/wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100/ (BTW, the name is also not the same as specified under the conf folder)

for qqp:

  1. qqp-splitforgeneval
  2. training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/

I also found what appear to be "overlaps" between the train and test sets under the same folder, for example:

grep 'Do astrology really work' qqp-splitforgeneval/test.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Do astrology really work?", "paras": ["Dose astrology really work?"]}

VS.

grep 'Dose astrology really work?' qqp-splitforgeneval/train.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Does Rashi prediction really work?", "paras": ["Dose astrology really work?", "Does astrology works?", "Do astrology really work?", "Does astrology really work, I mean the online astrology?"]}

My questions are:

  1. What is the relationship between qqp-splitforgeneval and training-triples?
  2. If I want to compare results with the paper, which set should I use, i.e. splitforgeneval or training-triples? (I do not need the "syn_input" utterances.)
  3. Is it safe to assume there are no overlaps among the train/dev/eval sets under the same folder? (And is it possible for a test "sem_input" to appear in train.jsonl under a different folder?)

Thanks and I appreciate your help.

Hi, thanks for your interest in our project! And thanks for noticing that the dataset name is different to the config - I will check that.

The four datasets are for different purposes:

  • qqp-clusters is the starting point - this contains the paraphrase clusters, determined by collecting paraphrase pairs from the original datasets and taking the transitive closure (so if (A,B) and (B,C) appear in the original training data, then (A,B,C) form a paraphrase cluster - see the sketch after this list). I used the original datasets to work out all the clusters, then split them into train/dev/test.
  • qqp_allqs is a 'flattened' version of all the sentences, for internal use.
  • training-triples/* is a resampled version of qqp-clusters, that pairs up sentences with paraphrases and exemplars, to use for training HRQ-VAE. You can ignore this, unless you want to retrain my model.
  • qqp-splitforgeneval is the same set of sentence clusters as qqp-clusters, but for each cluster I've chosen one sentence to use as input (sem_input) and used the rest as reference outputs (paras). So e.g. $\{x_0,x_1,x_2,x_3\}$ becomes $x_2 \to \{x_0,x_1,x_3\}$.
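
Roughly, the cluster construction works like this - a simplified union-find sketch of the transitive-closure step, not the actual preprocessing code:

```python
from collections import defaultdict

def build_clusters(pairs):
    """Group sentences into clusters given labelled paraphrase pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    clusters = defaultdict(set)
    for x in parent:
        clusters[find(x)].add(x)
    return list(clusters.values())

# (A,B) and (B,C) are paraphrase pairs, so A, B and C end up in one cluster.
print(build_clusters([("A", "B"), ("B", "C"), ("D", "E")]))
```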

So, tldr, to evaluate your model, you should use qqp-splitforgeneval, with the sem_input as input and the paras as references.
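
In Python, the evaluation loop might look something like this (a minimal sketch: the copy-the-input generate function is a placeholder for your own model, and the sacrebleu call is just one way to score against the multi-reference paras):

```python
import json
import sacrebleu  # pip install sacrebleu

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def generate(sem_input):
    return sem_input  # placeholder: swap in your paraphrase model here

data = load_jsonl("qqp-splitforgeneval/test.jsonl")
hyps = [generate(row["sem_input"]) for row in data]
refs = [row["paras"] for row in data]

# sacrebleu wants one reference stream per reference position, all the same
# length, so pad shorter lists by repeating their first reference (duplicate
# references don't change multi-reference BLEU).
max_refs = max(len(r) for r in refs)
ref_streams = [[r[i] if i < len(r) else r[0] for r in refs]
               for i in range(max_refs)]

print(sacrebleu.corpus_bleu(hyps, ref_streams).score)
```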

It should not be possible for the same sem_input to appear in both train and test. For Paralex, the clusters were created by comparing strings, so this should definitely not happen. For MSCOCO, I used the public train/dev/test splits, so again there should not be any duplication. For QQP, I used the question IDs to do the clustering, so if the same question appears twice with different IDs then it's possible for it to appear twice. But please let me know if you find many more examples and I can double check.

Thanks for the quick reply.

grep -f qqp-splitforgeneval/test_inputs.txt training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/train.jsonl

produces a large number of utterances that appear in both the test and train sets (again, qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100 is not the same name as the one in the config, which has N26).

Even under the same folder:

grep 'Would Muhammad Ali beat Bruce Lee?' qqp-splitforgeneval/test.jsonl
{"tgt": "Who would win a fight Bruce Lee or Muhammad Ali?", "syn_input": "Who would win a fight Bruce Lee or Muhammad Ali?", "sem_input": "Would Muhammad Ali beat Bruce Lee?", "paras": ["Who would win a fight Bruce Lee or Muhammad Ali?"]}
grep 'Would Muhammad Ali beat Bruce Lee?' qqp-splitforgeneval/train.jsonl
{"tgt": "Would Muhammad Ali beat Bruce Lee?", "syn_input": "Would Muhammad Ali beat Bruce Lee?", "sem_input": "Who would win in a fight, Bruce Lee or Muhammad Ali?", "paras": ["Who's better: Bruce Lee or Muhammad Ali?", "Would Muhammad Ali beat Bruce Lee?", "Who would win a fight Bruce Lee or Muhammad Ali?"]}

Am I missing something?

Thanks for bringing this to my attention - this shouldn't be happening! I will check the code that I used to construct the QQP splits and get back to you. The other two datasets look OK, though.

Thanks! For the Paralex dataset, there are also 236 utterances out of 27778 that appear in both wikianswers-para-splitforgeneval/test.jsonl and training-triples/wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100/train.jsonl, such as

grep 'what was the name of sacagawea parents' training-triples/wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100/train.jsonl

grep 'what was the name of sacagawea parents' wikianswers-para-splitforgeneval/test.jsonl
{"tgt": "what is sacagawea fathers name ?", "syn_input": "what is sacagawea fathers name ?", "sem_input": "what was the name of sacagawea parents ?", "paras": ["sacagawea fathers name ?", "where did sacagawea family live ?", "sacagawea major childhood events ?", "what is sacagawea dads name ?", "what is sacagawea fathers name ?"]}

For QQP, I used some pre-existing train/dev splits and further split dev into dev+test - it looks like unfortunately these splits had overlapping questions! I also can't find exactly where I sourced the splits from.

For Paralex, it's possible there was a bug in my code to build the clusters that meant some clusters were not combined despite having the same sentences. Thanks for drawing both of these to my attention!

Note that all the results reported in the paper used the same dataset splits - so the test scores are probably slightly higher than they should be (due to the train/test leak), but the leak will have affected all the models equally, so the overall conclusions are still valid.

If you want to create new splits for both datasets I'd be happy to retrain my model and report the updated results? I can also share the code I used to construct the Paralex clusters, if that would be useful?

Hi, thanks so much for the clarification, really appreciate it. Your clustering code for Paralex would definitely be helpful.

Just to be a bit more precise, the numbers of train/test leaks are: Paralex 236/27778, MSCOCO 48/5000, QQP 1642/5225. So the impact on Paralex and MSCOCO is probably negligible, but for QQP the results are heavily biased, since almost 1/3 of the test utterances appear in the training data. What I am planning to do is to remove all the test utterances from the training set and retrain my model:

  1. discard every line in training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/train.jsonl and dev.jsonl that contains a test utterance (as sem_input, tgt or syn_input);
  2. select all unique (sem_input, tgt) pairs as the new training data for my own model;
  3. evaluate on the original test set.

Is this a fair setup compared to your model training pipeline? Thanks
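
For concreteness, step 1 might look something like this (a sketch against the released file layout, filtering only on the test sem_inputs; extend the filter to the test paras/tgt if you want to be stricter):

```python
import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

test_inputs = {row["sem_input"]
               for row in load_jsonl("qqp-splitforgeneval/test.jsonl")}

triples_dir = "training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100"
kept, dropped = [], 0
for row in load_jsonl(f"{triples_dir}/train.jsonl"):  # repeat for dev.jsonl
    fields = {row["sem_input"], row["tgt"], row.get("syn_input", "")}
    if fields & test_inputs:  # drop any line touching a test utterance
        dropped += 1
        continue
    kept.append(row)

# Unique (sem_input, tgt) pairs become the new training data.
pairs = sorted({(row["sem_input"], row["tgt"]) for row in kept})
print(f"dropped {dropped} lines, kept {len(pairs)} unique training pairs")
```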

Yes, that sounds like a sensible approach. I have also added a 'deduped' version of the datasets here that you can use directly - I've removed any instances from the training data that overlap at all with dev or test. I'll also retrain my model on this dataset to check what impact the leak has.

I've remembered where the QQP splits came from originally - they're the splits provided by GLUE.

Thanks, really appreciate the effort. I will definitely try the 'deduped' qqp. Looking forward to your new results on this set as well.

The updated HRQ-VAE results on the deduped set are (BLEU/self-BLEU/iBLEU): 30.53/40.22/16.38. So it's true that it does take a performance hit, but these scores are still higher than all the other comparison systems (even when they're trained on the leaky split). So, I'm not concerned about the conclusions in the paper. But I agree it would be better to use the deduped splits going forward :)
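
(As a quick sanity check on the arithmetic: these numbers are consistent with iBLEU = α·BLEU − (1−α)·Self-BLEU at α = 0.8; the α value is inferred from the scores here rather than quoted from the paper.)

```python
alpha = 0.8
bleu, self_bleu = 30.53, 40.22
print(round(alpha * bleu - (1 - alpha) * self_bleu, 2))  # 16.38, as reported
```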

Hello, when I trained on the MSCOCO dataset the results were much worse - BLEU was only around 8. Why? This also happens for the BTMPG comparison model.

Hi @hahally, is your issue related to overlap between train/test splits or is it a different problem?

Thanks for the quick reply.

It may be a different problem.
I am trying to reproduce the experimental results, but BLEU is always low on the MSCOCO data, while it performs normally on the Quora data. I don't know why.

@hahally Please open a separate issue, and provide details of how you are running the model and performing evaluation. Thanks.