Reweighting synthetic examples
This repository is the code for our paper, "Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity" (COLING2022) [paper].
How to Begin
- Install required packages in
requirements.txt
. - Download preprocessed benchmark datasets (STSb, QQP, and MRPC) from this drive link.
- Prepare PAWS-QQP dataset following this repository, and locate it in
datasets/benchmarks/paws/
.
How to Reproduce
1. Data preparation
- Run
scripts/0_preprocessing.sh
script. This will prepare sentences (C_src) to make synthetic dataset, and split PAWS dataset into dev and test splits.
2. Synthetic dataset generation & Machine-written example identification
- Run
scripts/1_generation.sh
script to generate synthetic examples and train a discriminator model that identifies them. - A process to create synthetic dataset is same with the original DINO framework suggested by Schick et al. (2021).
3. Training and evaluating STS models
- Run
scripts/2_run_sts.sh
to train bi-encoder models for sentence similarity tasks. - The shell script is to reproduce all results in Table 2 (reweighting or not, ablation study).
4. Other baseline models
- Run
scripts/3_run_other_baselines.sh
to reprduce the results of other baseilne models in Table 6, such as GloVe, BERT, and USE.
Acknowledge
Codes to generate synthetic dataset are derieved from Schick et al. (2021)'s work. (Github)