# Sparse Neural Editor

This repo is the PyTorch implementation of the paper:

**Learning Sparse Prototypes for Text Generation**
Junxian He, Taylor Berg-Kirkpatrick, Graham Neubig
NeurIPS 2020
In this repo, we implement a generative model of text that generates sentences by editing non-parametric prototypes. The prototype support set is encouraged to be sparse during training to improve memory/time efficiency at test time.
## Dependencies

The code mainly requires PyTorch (>=1.4.0) and fairseq; we ran our experiments on the specific fairseq commit pinned below.

Install dependencies:
```bash
# install fairseq from a specific commit
git clone git@github.com:pytorch/fairseq.git fairseq_local
cd fairseq_local
git reset --hard b65a85b

# copy in a modified sequence_generator.py to use edit vectors
cp ../sparse_prototype/sequence_generator.py fairseq
pip install --editable ./
cd ..

# install additional dependencies
pip install -r requirements.txt
```
## Prepare Data
```bash
# download coco data
gdown https://drive.google.com/uc?id=1fMBZnMZz46qC0Im6y53MnDDQGRuwoC_M
# download yelp medium data
gdown https://drive.google.com/uc?id=1Bgk94NZeoexdCWF_WPMoIFPLRjJsbuBF
# download yelp large data
gdown https://drive.google.com/uc?id=1Z6wc4n5UBghwyNOo-C41vXEdNG5CE1Pa

mkdir datasets

# take the coco dataset as an example
tar -xvzf coco40k.tar.gz -C datasets

# binarize the dataset for fairseq
bash scripts/binarize_data.sh coco40k

# generate a mask file which is used to avoid selecting
# exactly the same example as prototype during training
python scripts/get_mask_ids.py coco40k
```
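The mask file exists so that an example never retrieves itself as its own prototype. A minimal sketch of the idea (illustrative only: `build_mask_ids`, the toy sentences, and the list-of-lists output are assumptions, and the real `scripts/get_mask_ids.py` may use a different format):

```python
# Illustrative sketch: for each training sentence, record the indices of
# prototype candidates whose text is identical, so the retrieval
# distribution can mask them out during training.
def build_mask_ids(sentences, prototypes):
    proto_index = {}
    for j, p in enumerate(prototypes):
        proto_index.setdefault(p, []).append(j)
    return [proto_index.get(s, []) for s in sentences]

train = ["a red bus", "two dogs play", "a red bus"]
protos = ["a red bus", "two dogs play", "a man surfs"]
print(build_mask_ids(train, protos))  # [[0], [1], [0]]
```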
## Training
We first pre-compute the sentence embeddings for all data examples offline and save them in memory-mapped files using `np.memmap`. During training/evaluation, a bilinear transformation is applied between these data embeddings and the prototype embeddings to obtain the retrieval distribution. Here we use BERT as the offline encoder:
```bash
# embeddings are saved into pretrained_sent_embeddings/[dataset name]
CUDA_VISIBLE_DEVICES=xx python scripts/precompute_bert.py [dataset name]
```
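To make the memmap-plus-bilinear-scoring idea concrete, here is a minimal NumPy sketch. All shapes, file names, and the random placeholder embeddings are assumptions; in the real code the embeddings come from BERT and `W` is a learned parameter:

```python
import numpy as np
import os, tempfile

# Placeholder sizes for illustration only.
num_examples, num_prototypes, dim = 6, 4, 8
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "sent_emb.mmap")

# Offline step: write sentence embeddings into a memory-mapped file,
# so training never has to hold the full matrix in RAM.
emb = np.memmap(path, dtype=np.float32, mode="w+", shape=(num_examples, dim))
emb[:] = rng.standard_normal((num_examples, dim)).astype(np.float32)
emb.flush()

# Training/eval step: reopen read-only and score examples against prototypes
# with a bilinear form s(x, t) = x^T W t, softmaxed over prototypes.
data_emb = np.memmap(path, dtype=np.float32, mode="r", shape=(num_examples, dim))
W = rng.standard_normal((dim, dim)).astype(np.float32)       # learned in practice
proto_emb = rng.standard_normal((num_prototypes, dim)).astype(np.float32)

scores = data_emb @ W @ proto_emb.T                          # (examples, prototypes)
scores = scores - scores.max(axis=1, keepdims=True)          # numerical stability
retrieval = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
```

Each row of `retrieval` is a distribution over prototypes for one data example.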
GloVe embeddings are used in the paper to initialize word embeddings:
```bash
wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
mkdir glove_embeddings
unzip glove.6B.zip -d glove_embeddings

# compress glove embeddings to generate a new embedding file
# that only contains the dictionary of the dataset
python scripts/compress_glove.py \
    --embed-path glove_embeddings/glove.6B.300d.txt \
    --dict-path data-bin/[dataset_name]/dict.txt \
    > glove_embeddings/[dataset_name]_glove.txt
```
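The compression step is essentially a vocabulary filter over the GloVe text file. A rough sketch of the idea (`compress_glove` and the toy rows below are illustrative; the real script parses fairseq's `dict.txt` format):

```python
# Illustrative sketch: keep only GloVe rows whose leading token (the word)
# appears in the dataset's dictionary, shrinking the embedding file.
def compress_glove(glove_lines, dict_words):
    vocab = set(dict_words)
    return [line for line in glove_lines if line.split(" ", 1)[0] in vocab]

glove = ["the 0.1 0.2", "cat 0.3 0.4", "zyzzyva 0.5 0.6"]
kept = compress_glove(glove, ["the", "cat", "dog"])
print(kept)  # ['the 0.1 0.2', 'cat 0.3 0.4']
```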
Train the model:
```bash
# train the sparse neural editor
# [GPUs] can be multiple ids to perform data-parallel training
# some hyperparameters can be specified (e.g. -a [alpha]); see
# details in the script
bash scripts/train.sh -g [GPUs] -d [dataset name]

# train the LM baseline
bash scripts/train.sh -g [GPUs] -c lm_baseline -d [dataset name]
```
## Evaluation

Compute perplexity (ppl):
```bash
# approximate importance-weighted ppl
bash scripts/train.sh -g [GPUs] -d [dataset name] -e iw -p [checkpoint directory]

# prototype pruning can be performed at eval time
# [prune num] is the number of prototypes kept
bash scripts/train.sh -g [GPUs] -d [dataset name] -u [prune num] -e iw -p [checkpoint directory]
```
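The actual pruning criterion lives in the training script; as an illustration only (this may differ from what `-u` really does), one plausible rule is to keep the `[prune num]` prototypes with the largest total retrieval mass and renormalize:

```python
import numpy as np

# Illustrative pruning rule: rank prototypes by total retrieval mass over
# the data, keep the top-`keep`, and renormalize each row to a distribution.
def prune(retrieval, keep):
    mass = retrieval.sum(axis=0)              # total mass per prototype
    kept = np.argsort(mass)[::-1][:keep]      # indices of the top-`keep`
    pruned = retrieval[:, kept]
    return kept, pruned / pruned.sum(axis=1, keepdims=True)

r = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1]])
kept, pr = prune(r, 2)
print(sorted(kept.tolist()))  # [0, 1]
```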
## Template-based Generation
See the notebook `generate_demo.ipynb` (mainly the `sample_from_cluster` function) for examples of loading the pretrained model and generating from given templates.
## Citation
```
@inproceedings{he2020learning,
  title={Learning Sparse Prototypes for Text Generation},
  author={He, Junxian and Berg-Kirkpatrick, Taylor and Neubig, Graham},
  booktitle={Proceedings of NeurIPS},
  year={2020}
}
```