Reweighting synthetic examples

This repository contains the code for our paper, "Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity" (COLING 2022) [paper].

How to Begin

Install the required packages listed in requirements.txt.
Download the preprocessed benchmark datasets (STSb, QQP, and MRPC) from this drive link.
Prepare the PAWS-QQP dataset following this repository, and place it in datasets/benchmarks/paws/.

How to Reproduce

1. Data preparation

Run the scripts/0_preprocessing.sh script. This prepares the source sentences (C_src) used to build the synthetic dataset and splits the PAWS dataset into dev and test splits.
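
For readers unfamiliar with this kind of split, here is a minimal sketch of a reproducible dev/test split. It is not taken from scripts/0_preprocessing.sh: the function name split_dev_test, the 50/50 ratio, and the seed are all assumptions made for illustration, and the script's actual procedure may differ.

```python
import random


def split_dev_test(examples, dev_ratio=0.5, seed=42):
    """Shuffle examples with a fixed seed and split them into (dev, test).

    dev_ratio and seed are hypothetical defaults, not values from the repository.
    """
    shuffled = list(examples)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * dev_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Calling split_dev_test(pairs) twice with the same seed returns the same two lists, so dev and test results stay comparable across runs.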

2. Synthetic dataset generation & Machine-written example identification

Run the scripts/1_generation.sh script to generate synthetic examples and train a discriminator model that identifies them.
The process for creating the synthetic dataset is the same as in the original DINO framework proposed by Schick et al. (2021).

3. Training and evaluating STS models

Run scripts/2_run_sts.sh to train bi-encoder models for sentence similarity tasks.
This shell script reproduces all results in Table 2 (with and without reweighting, plus the ablation study).
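
To give a rough idea of what reweighting a training loss looks like, here is a minimal sketch. It is not the repository's implementation. It assumes that each synthetic example carries a weight (for instance, a score from the discriminator in step 2) and that the weight scales a squared-error loss on the predicted similarity. The function name weighted_mse, the choice of loss, and the normalization are all assumptions; see the paper for the actual formulation.

```python
def weighted_mse(predictions, labels, weights):
    """Weighted mean squared error over a batch of similarity scores.

    weights: hypothetical per-example weights, e.g. a discriminator score
    for each synthetic example. This is an assumption for illustration.
    """
    if not (len(predictions) == len(labels) == len(weights)):
        raise ValueError("predictions, labels, and weights must have equal length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    # Each squared error is scaled by its example's weight, so
    # low-weight examples contribute less to the training signal.
    weighted_errors = (
        w * (p - y) ** 2 for p, y, w in zip(predictions, labels, weights)
    )
    return sum(weighted_errors) / total_weight
```

With uniform weights this reduces to the ordinary mean squared error, which corresponds to training without reweighting.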

4. Other baseline models

Run scripts/3_run_other_baselines.sh to reproduce the results of the other baseline models in Table 6, such as GloVe, BERT, and USE.

Acknowledgements

The code for generating the synthetic dataset is derived from the work of Schick et al. (2021). (Github)
