Few-shot NER

This codebase implements three baseline methods from the EMNLP paper "Few-Shot Named Entity Recognition: An Empirical Baseline Study".

Dependencies

Install the required packages with the following command:

$ pip3 install -r requirements.txt

Quickstart with the Noisy Supervised Pre-trained Checkpoint

Download the checkpoint of the models pre-trained on WiFine (Wikipedia) to src/pretrained_models/. To load the model pre-trained on WiFine and fine-tune it on the CoNLL-2003 dataset, run:

cd src
bash ./train_lc.sh

By default, this runs 10 rounds of experiments, each with a different 5-shot seed set, and enables self-training on the whole dataset.

Multiple Runs

To run multiple rounds of experiments on various few-shot seeds (e.g., 10 rounds), set

--train_text few_shot_5 --train_ner few_shot_5 --few_shot_sets 10

in the command. Here "few_shot_5" is the common file name shared by the seed files. The F1 score averaged over all rounds is printed at the end.
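As a rough illustration of how per-round scores can be summarized, the sketch below averages F1 over several few-shot rounds. The function name `summarize_f1` and the score values are hypothetical placeholders, not output from this codebase, and the standard deviation is an extra that the training scripts may not report.

```python
from statistics import mean, stdev


def summarize_f1(round_scores):
    """Return (mean, sample standard deviation) of per-round F1 scores."""
    if not round_scores:
        raise ValueError("need at least one round")
    avg = mean(round_scores)
    # stdev needs at least two values; report 0.0 for a single round.
    spread = stdev(round_scores) if len(round_scores) > 1 else 0.0
    return avg, spread


# Hypothetical F1 scores from 4 rounds (placeholders, not real results).
scores = [0.50, 0.54, 0.52, 0.56]
avg, spread = summarize_f1(scores)
print(f"mean F1 = {avg:.4f}, std = {spread:.4f}")
```

Reporting the spread next to the mean is useful in the few-shot setting, where results can vary noticeably from one seed set to another.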

If only one round is needed, set the complete file names of the training files instead:

--train_text train.words --train_ner train.ner

Enable Self-training

Specify the files for self-training with

--unsup_text train.words --unsup_ner train.ner

The labels in "unsup_ner" are not used for training. They are only used for an evaluation before self-training starts, which indicates how much improvement self-training could potentially bring.

To disable self-training, remove these two flags.
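The data flow described in this section can be sketched with a toy example. This is not the implementation in this codebase: `LookupTagger` and `token_accuracy` are hypothetical stand-ins for the real model and its F1 evaluation, and the sentences are made up. The variable names mirror the flags only for readability. The point is that the gold labels of the unlabeled pool score the teacher's predictions but never reach the student.

```python
from collections import Counter, defaultdict


class LookupTagger:
    """Toy tagger: predicts each token's most frequent training tag."""

    def __init__(self):
        self.best = {}

    def fit(self, sentences, tag_seqs):
        counts = defaultdict(Counter)
        for words, tags in zip(sentences, tag_seqs):
            for word, tag in zip(words, tags):
                counts[word][tag] += 1
        self.best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        return self

    def predict(self, sentences):
        # Unknown tokens fall back to the outside tag "O".
        return [[self.best.get(w, "O") for w in words] for words in sentences]


def token_accuracy(pred, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    pairs = [(p, g) for ps, gs in zip(pred, gold) for p, g in zip(ps, gs)]
    return sum(p == g for p, g in pairs) / len(pairs)


# Few-shot labeled data (toy sentences).
train_text = [["Paris", "is", "nice"], ["Alice", "left"]]
train_ner = [["LOC", "O", "O"], ["PER", "O"]]

# Unlabeled pool; its gold labels are held out for evaluation only.
unsup_text = [["Alice", "visited", "Paris"], ["Bob", "left"]]
unsup_ner = [["PER", "O", "LOC"], ["PER", "O"]]

# 1) Train a teacher on the few-shot data and pseudo-label the pool.
teacher = LookupTagger().fit(train_text, train_ner)
pseudo_ner = teacher.predict(unsup_text)

# 2) Evaluation before self-training: gold labels only score the teacher.
pre_acc = token_accuracy(pseudo_ner, unsup_ner)

# 3) Train a student on labeled plus pseudo-labeled data (never unsup_ner).
student = LookupTagger().fit(train_text + unsup_text, train_ner + pseudo_ner)
```

A lookup tagger cannot generalize, so the student here does not improve over the teacher; the sketch only shows which data enters each step.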

Use Your Own Pre-trained Model

If you want to load your own pre-trained model, set

--load_model True --load_model_name path/to/your/model

If you want to load the original pre-trained RoBERTa model (https://arxiv.org/abs/1907.11692), set

--load_model False

Use Prototype-based Methods

You can use prototype-based methods by running the following command:

bash ./train_proto.sh

In this script, you can also enable or disable multiple runs and customize the pre-trained model.
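For readers unfamiliar with the idea, the following is a minimal sketch of generic nearest-prototype classification, not the code that train_proto.sh runs. It assumes token embeddings are already available (here made-up 2-dimensional vectors; in practice they would come from the encoder) and uses Euclidean distance, which is an assumption. The names `build_prototypes` and `classify` are hypothetical.

```python
import math


def build_prototypes(embeddings, labels):
    """Average the support embeddings of each label into one prototype."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def classify(vec, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda lab: math.dist(vec, prototypes[lab]))


# Made-up support set: two tokens per entity type.
support_vecs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
support_tags = ["PER", "PER", "LOC", "LOC"]

prototypes = build_prototypes(support_vecs, support_tags)
predicted = classify([0.9, 0.3], prototypes)
print(predicted)
```

Because each prototype is just a mean over the support tokens of one label, new entity types can be added without training a new classification layer, which is what makes the approach attractive with only a few examples per type.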

Benchmark Datasets

In our paper, we report results on 10 benchmark datasets. For the public ones, we provide our few-shot seed sets and the whole dataset here. For the other datasets, which require a license for access, if you want the same few-shot seed sets, please first obtain the license for the whole dataset and then ask the first author for the sampled few-shot seeds.

Dataset	Domain	Included here
CoNLL	News	✔️
Onto	General	✖️
WikiGold	General	✔️
WNUT17	Social Media	✔️
MITMovie	Review	✔️
MITRestaurant	Review	✔️
SNIPS	Dialogue	✔️
ATIS	Dialogue	✔️
Multiwoz	Dialogue	✔️
i2b2	Medical	✖️
