SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup Training
This repository contains the code and pre-trained models for our paper SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup Training
Quick Links
Overview
We propose SelfMix, a self-distillation-based robust training method built on pre-trained language models.
SelfMix uses a Gaussian Mixture Model (GMM) to select the samples that are likely to be mislabeled and erases their original labels. We then leverage semi-supervised learning to jointly train on a labeled set X (containing mostly clean samples) and an unlabeled set U (containing mostly noisy samples).
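To give a rough sense of the selection step, the sketch below fits a two-component GMM over per-sample training losses and splits the data by the posterior probability of the low-loss component. This is a minimal illustration rather than the official implementation: the function name, normalization, and threshold are our own choices; see the code and the paper for the exact procedure.

```python
# Minimal sketch of GMM-based sample selection (illustrative, not the official implementation).
# `losses` is a 1-D array of per-sample training losses from the current model; samples whose
# loss falls into the low-mean GMM component are treated as (mostly) clean.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_clean_noisy(losses, threshold=0.5):
    # Normalize losses to [0, 1] to stabilize the GMM fit.
    losses = np.asarray(losses, dtype=np.float64).reshape(-1, 1)
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)

    # Two components: one for clean samples (low loss), one for noisy samples (high loss).
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)

    # Posterior probability of the low-mean (clean) component for every sample.
    clean_component = int(np.argmin(gmm.means_.flatten()))
    prob_clean = gmm.predict_proba(losses)[:, clean_component]

    clean_idx = np.where(prob_clean >= threshold)[0]   # labeled set X (labels kept)
    noisy_idx = np.where(prob_clean < threshold)[0]    # unlabeled set U (labels erased)
    return clean_idx, noisy_idx
```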
Datasets
We conduct experiments on three text classification benchmarks of different types: Trec, AG-News, and IMDB.
Dataset | Classes | Type | Train | Test |
---|---|---|---|---|
Trec | 6 | Question-Type | 5452 | 500 |
IMDB | 2 | Sentiment Analysis | 45K | 5K |
AG-News | 4 | News Categorization | 120K | 7.6K |
Noise Sample Generation
We evaluate our strategy under the following two types of label noise:
- Asymmetric noise (Asym): Following Chen et al., we choose a certain proportion of samples and flip their labels to the corresponding class according to the asymmetric noise transition matrix.
- Instance-dependent Noise (IDN): Following Algan and Ulusoy, we train an LSTM classifier on a small set of the original training data and flip the original labels to the class with the highest prediction.
You can construct noisy datasets with the following command (e.g., Trec with 40% asymmetric noise); a simplified sketch of the flipping step follows the command:
python data/corrupt.py \
--src_data_path data/trec/train.csv \
--save_path data/trec/train_corrupted.csv \
--noise_type asym \
--noise_ratio 0.4
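Under the hood, asymmetric corruption amounts to picking a fixed fraction of samples and flipping each label to its target class under a class-to-class transition map. The sketch below is a simplified illustration, not the exact logic of `data/corrupt.py`, and the transition map shown is hypothetical.

```python
# Simplified illustration of asymmetric label flipping (not the exact logic of data/corrupt.py).
import random

def corrupt_asym(labels, transition, noise_ratio=0.4, seed=42):
    """Flip `noise_ratio` of the labels to the class given by the `transition` map."""
    rng = random.Random(seed)
    flip_idx = rng.sample(range(len(labels)), int(noise_ratio * len(labels)))
    corrupted = list(labels)
    for i in flip_idx:
        corrupted[i] = transition[corrupted[i]]
    return corrupted

# Hypothetical pairwise transition map for a 6-class dataset such as Trec.
transition = {0: 1, 1: 0, 2: 3, 3: 2, 4: 5, 5: 4}
noisy_labels = corrupt_asym([0, 1, 2, 3, 4, 5] * 10, transition, noise_ratio=0.4)
```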
You can generate an IDN dataset with the following procedure (see the sketch after this list):
- train an LSTM classifier on a small set of the original training data
- flip the original labels to the class with the highest prediction, following the code repo
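A minimal sketch of the flipping step, assuming you already have a small classifier (e.g., the LSTM above) trained on a subset of the original data. The selection rule shown here (flip the most confidently mispredicted samples up to the target noise ratio) is one simple way to realize the description above; the linked repo's exact rule may differ, and `predict_logits` is a placeholder.

```python
# Simplified illustration of instance-dependent noise (IDN) generation.
# `predict_logits` is a placeholder for the trained classifier's scoring function and should
# return an array of shape (num_samples, num_classes).
import numpy as np

def corrupt_idn(texts, labels, predict_logits, noise_ratio=0.4):
    logits = predict_logits(texts)
    preds = logits.argmax(axis=1)
    labels = np.asarray(labels)

    # Only samples where the classifier disagrees with the gold label can be flipped.
    candidates = np.where(preds != labels)[0]

    # Flip the most confidently mispredicted samples first, up to the target noise ratio.
    confidence = logits.max(axis=1)
    budget = min(int(noise_ratio * len(labels)), len(candidates))
    flip_idx = candidates[np.argsort(-confidence[candidates])][:budget]

    corrupted = labels.copy()
    corrupted[flip_idx] = preds[flip_idx]
    return corrupted
```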
Parameter Setting
We use the following hyperparameters for training SelfMix:
Data Settings | Trec/AG-News (Asym) | IMDB (Asym) | AG-News/IMDB (IDN) |
---|---|---|---|
`lambda_p` | 0.2 | 0.1 | 0.0 |
`lambda_r` | 0.3 | 0.5 | 0.3 |
`class_reg` | False | False | True |
Train
In this section, we describe how to train a SelfMix model using our code.
Requirements
First, run the following command to install the required dependencies.
pip install -r requirements.txt
Quick Start
We provide some demo configs in the folder `demo_config`. You can train directly with one of them:
python train.py demo_config/trec-bert-asym_train.json
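For reference, a demo config is a JSON file whose keys map onto the argument dataclasses described below. The sketch here is hypothetical: only `lambda_p`, `lambda_r`, and `class_reg` come from the table above, every other key is a placeholder, so check `demo_config/*.json` and `train.py` for the real field names.

```python
# Hypothetical sketch of the kind of fields a demo config might carry; only lambda_p,
# lambda_r, and class_reg are taken from the hyperparameter table above -- the remaining
# keys are placeholders. See demo_config/*.json and the dataclasses in train.py.
demo_config = {
    "model_name_or_path": "bert-base-uncased",        # placeholder
    "train_file": "data/trec/train_corrupted.csv",    # placeholder
    "eval_file": "data/trec/test.csv",                # placeholder
    "lambda_p": 0.2,
    "lambda_r": 0.3,
    "class_reg": False,
}
```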
Parameters
Details about the meaning of each parameter can be found in our paper and in the dataclasses `ModelArguments`, `DataTrainingArguments`, and `OurTrainingArguments` in `train.py`.
Evaluation
Similarly, you can run evaluation with the following command:
python evaluation.py demo_config/trec-bert_eval.json
Details about the parameters can be found in the dataclasses `ModelArguments` and `DataEvalArguments` in `evaluation.py`.
Baselines
We reimplement the following baselines on top of pre-trained language models for textual data. Our implementations of these methods are provided in the folder `baselines`.
Methods | Source Method Link |
---|---|
BERT-Base | Devlin et al. |
BERT+Co-Teaching | Han et al. |
BERT+Co-Teaching+ | Yu et al. |
BERT+SCE | Wang et al. |
BERT+ELR | Liu et al. |
BERT+CL | Northcutt et al. |
BERT+NM-Net | Garg et al. |
You can run the baseline scripts with the following command:
bash script/train_asym.sh
You should modify the file paths and parameters in `train_asym.sh` and `train_idn.sh` before running them.
Note that the baseline code has not yet been merged into our new code framework; we plan to do so soon.
Results
Here we list our results on the Trec dataset. More details about the experimental setup and results can be found in our paper.
Noise Settings | Accuracy (Best) | Accuracy (Last) |
---|---|---|
Trec(Asym-20%) | 96.32 | 94.12 |
Trec(Asym-40%) | 96.04 | 93.80 |