aub-mind / arabert

Pre-trained Transformers for Arabic Language Understanding and Generation (Arabic BERT, Arabic GPT2, Arabic ELECTRA)

Home Page: https://huggingface.co/aubmindlab

ARCD AraBERTv0.1 Results

YousefGh opened this issue

I think the reported results on ARCD using AraBERTv0.1 suffer from a data leakage problem. I replicated the same pipeline you have (using arcd_preprocessing.py) and got these results:

Results: {'exact': 31.623931623931625, 'f1': 67.4479996189414, ..}

These are very similar to the numbers reported by alyafeai in Replicate SQuAD results #30. After that, I looked at one of the issues, Question Answering training data #23, where you added this code snippet:

from SOQAL.data_helpers.data_split import train_test_split, combine_json_files

# Split arcd.json 50/50 into arcd-train.json and arcd-test.json,
# then merge Arabic-SQuAD.json with arcd-test.json into one fine-tuning file.
train_test_split("SOQAL/data/arcd.json", 0.5)
combine_json_files(["SOQAL/data/Arabic-SQuAD.json", "SOQAL/data/arcd-test.json"])
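For context, combining SQuAD-format files essentially concatenates their "data" arrays. Here is a minimal sketch of that operation, assuming the standard SQuAD 1.1 layout; it is an illustration only, and SOQAL's combine_json_files may differ in details such as the output path:

import json

def combine_squad_files(paths, output_path="turk_combined.json"):
    # Merge several SQuAD-format files by concatenating their article lists.
    combined = {"version": "1.1", "data": []}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            combined["data"].extend(json.load(f)["data"])
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(combined, f, ensure_ascii=False)

combine_squad_files(["SOQAL/data/Arabic-SQuAD.json", "SOQAL/data/arcd-test.json"])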

In the snippet from #23, you combined arcd-test.json with Arabic-SQuAD.json to produce turk_combined.json, and then ran arcd_preprocessing.py to get turk_combined_all_pre.json and arcd-test-pre.json like this:

python arcd_preprocessing.py \
    --input_file="/PATH_TO/arcd-test.json" \
    --output_file="arcd-test-pre.json" \
    --do_farasa_tokenization=True \
    --use_farasapy=True

python arcd_preprocessing.py \
    --input_file="/PATH_TO/turk_combined.json" \
    --output_file="turk_combined_all_pre.json" \
    --do_farasa_tokenization=True \
    --use_farasapy=True

The problem here is that you combined the test set arcd-test.json with Arabic-SQuAD.json to form the fine-tuning data, and then evaluated on arcd-test.json, i.e. on data the model had already seen during fine-tuning. Of course, this is speculation, since you might have mistakenly written arcd-test.json instead of arcd-train.json in the reply only and not in the actual code. So, to check, I deliberately leaked the test set into Arabic-SQuAD.json, exactly as the code snippet above does, and got:

Results: {'exact': 49.14529914529915, 'f1': 80.06334012841286, ..}

These are very similar to the reported results for AraBERTv0.1 on ARCD. Can you please check whether I'm missing something?
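For completeness, this kind of overlap is easy to detect by intersecting question ids between the combined fine-tuning file and the test file. A minimal sketch, assuming standard SQuAD 1.1-format JSON; the file paths mirror the ones above, and the helper below is illustrative rather than part of either repo:

import json

def question_ids(path):
    # Collect every question id from a SQuAD-format JSON file.
    with open(path, encoding="utf-8") as f:
        squad = json.load(f)
    return {qa["id"]
            for article in squad["data"]
            for paragraph in article["paragraphs"]
            for qa in paragraph["qas"]}

train_ids = question_ids("turk_combined.json")
test_ids = question_ids("SOQAL/data/arcd-test.json")
overlap = train_ids & test_ids
print(len(overlap), "of", len(test_ids), "test questions also appear in the fine-tuning set")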

Yes, it seems that I actually combined the test JSON instead of the training one by mistake.

Thank you for the notice. I will update the results in the table ASAP.