princeton-nlp / MABEL

EMNLP 2022: "MABEL: Attenuating Gender Bias using Textual Entailment Data" https://arxiv.org/abs/2210.14975

Ask for datasets help

yqw0710 opened this issue

Hi, I read your paper and found the experimental results very convincing. I understand that your training data came from MNLI and SNLI, but I couldn't find the specific preprocessing steps. Could you please share the preprocessing code? Thank you very much!

Hey!
Did you manage to find the preprocessing steps for generating the dataset? If so, could you please point me to them?
Thanks!

The dataset (as linked in the README) can be found here.

I have since graduated and no longer have access to the original preprocessing script, but I'd imagine that recreating it should be pretty simple. You can download the SNLI or MNLI dataset from HuggingFace and filter for the entailment pairs (the premise and hypothesis become orig_sent0 and orig_sent1, respectively). Then apply counterfactual data augmentation (CDA) using the 10 or so gender word pairs listed in Appendix A of the paper; the gender-flipped premise and hypothesis become aug_sent0 and aug_sent1, respectively. Finally, the "both" column is 1 if both the premise and the hypothesis contain gendered words that were flipped, and 0 otherwise (this affects the computation of one of the losses).
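For anyone trying to recreate this, the steps above could be sketched roughly as follows. This is a minimal reconstruction, not the original script. The word list is an illustrative placeholder (substitute the pairs from Appendix A of the paper). The helper names `flip` and `build_rows`, the handling of his/him/her, and the choice to drop pairs with no gendered words are all assumptions.

```python
import re

# Illustrative pairs only (assumption): replace with the list from Appendix A.
GENDER_PAIRS = [
    ("he", "she"), ("man", "woman"), ("men", "women"), ("boy", "girl"),
    ("boys", "girls"), ("father", "mother"), ("son", "daughter"),
    ("brother", "sister"), ("husband", "wife"),
]

# Build a two-way lookup so each word maps to its counterpart.
SWAP = {}
for a, b in GENDER_PAIRS:
    SWAP[a] = b
    SWAP[b] = a
# Simplification (assumption): "her" is ambiguous (his/him) and is always
# mapped to "his" here; a careful version would use part-of-speech tags.
SWAP.update({"his": "her", "him": "her", "her": "his"})

WORD = re.compile(r"[A-Za-z]+")


def flip(sentence):
    """Return (gender-flipped sentence, whether any word was flipped)."""
    changed = False

    def repl(match):
        nonlocal changed
        word = match.group(0)
        target = SWAP.get(word.lower())
        if target is None:
            return word
        changed = True
        # Preserve a leading capital, e.g. "He" -> "She".
        return target.capitalize() if word[0].isupper() else target

    return WORD.sub(repl, sentence), changed


def build_rows(examples, entailment_label=0):
    """Turn NLI examples (dicts with premise/hypothesis/label) into rows."""
    rows = []
    for ex in examples:
        if ex["label"] != entailment_label:
            continue  # keep entailment pairs only
        aug0, changed0 = flip(ex["premise"])
        aug1, changed1 = flip(ex["hypothesis"])
        if not (changed0 or changed1):
            continue  # assumption: pairs without gendered words are dropped
        rows.append({
            "orig_sent0": ex["premise"],
            "orig_sent1": ex["hypothesis"],
            "aug_sent0": aug0,
            "aug_sent1": aug1,
            # 1 only if both sentences had a gendered word flipped
            "both": int(changed0 and changed1),
        })
    return rows
```

With the HuggingFace `datasets` library installed, something like `build_rows(load_dataset("snli", split="train"))` should work, since that version of SNLI uses the columns `premise`, `hypothesis`, and `label`, with label 0 meaning entailment. Double-check the label mapping for whichever MNLI copy you download.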

Let me know if you have any other questions.