# Sparse Neural Editor

This repo is the PyTorch implementation of the paper:

**Learning Sparse Prototypes for Text Generation**
Junxian He, Taylor Berg-Kirkpatrick, Graham Neubig
NeurIPS 2020
In this repo, we implement a generative model of text that generates sentences by editing non-parametric prototypes. The prototype support set is encouraged to be sparse during training to improve memory/time efficiency at test time.
## Dependencies

The code mainly requires PyTorch (>=1.4.0) and fairseq; we ran our experiments on the specific fairseq commit pinned below.

Install dependencies:
```bash
# install fairseq from a specific commit
git clone git@github.com:pytorch/fairseq.git fairseq_local
cd fairseq_local
git reset --hard b65a85b

# copy in a modified sequence_generator.py to use edit vectors
cp ../sparse_prototype/sequence_generator.py fairseq
pip install --editable ./
cd ..

# install additional dependencies
pip install -r requirements.txt
```
## Prepare Data
```bash
# download coco data
gdown https://drive.google.com/uc?id=1fMBZnMZz46qC0Im6y53MnDDQGRuwoC_M
# download yelp medium data
gdown https://drive.google.com/uc?id=1Bgk94NZeoexdCWF_WPMoIFPLRjJsbuBF
# download yelp large data
gdown https://drive.google.com/uc?id=1Z6wc4n5UBghwyNOo-C41vXEdNG5CE1Pa

mkdir datasets

# take the coco dataset as an example
tar -xvzf coco40k.tar.gz -C datasets

# binarize the dataset for fairseq
bash scripts/binarize_data.sh coco40k

# generate a mask file which is used to avoid selecting
# exactly the same example as prototype during training
python scripts/get_mask_ids.py coco40k
```
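The mask file exists so that an example never retrieves itself as its own prototype. A minimal sketch of the idea (illustrative only: `build_mask_ids`, the toy sentences, and the list-of-lists output are assumptions, and the real `scripts/get_mask_ids.py` may use a different format):

```python
# Illustrative sketch: for each training sentence, record the indices of
# prototype candidates whose text is identical, so the retrieval
# distribution can mask them out during training.
def build_mask_ids(sentences, prototypes):
    proto_index = {}
    for j, p in enumerate(prototypes):
        proto_index.setdefault(p, []).append(j)
    return [proto_index.get(s, []) for s in sentences]

train = ["a red bus", "two dogs play", "a red bus"]
protos = ["a red bus", "two dogs play", "a man surfs"]
print(build_mask_ids(train, protos))  # [[0], [1], [0]]
```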
## Training
We first pre-compute the sentence embeddings for all data examples offline and save them in memory-mapped files using `np.memmap`. During training/evaluation, a bilinear transformation is applied between these data embeddings and the prototype embeddings to obtain the retrieval distribution. Here we use BERT as the offline encoder:
```bash
# embeddings are saved into pretrained_sent_embeddings/[dataset name]
CUDA_VISIBLE_DEVICES=xx python scripts/precompute_bert.py [dataset name]
```
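To make the memmap-plus-bilinear-scoring idea concrete, here is a minimal NumPy sketch. All shapes, file names, and the random placeholder embeddings are assumptions; in the real code the embeddings come from BERT and `W` is a learned parameter:

```python
import numpy as np
import os, tempfile

# Placeholder sizes for illustration only.
num_examples, num_prototypes, dim = 6, 4, 8
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "sent_emb.mmap")

# Offline step: write sentence embeddings into a memory-mapped file,
# so training never has to hold the full matrix in RAM.
emb = np.memmap(path, dtype=np.float32, mode="w+", shape=(num_examples, dim))
emb[:] = rng.standard_normal((num_examples, dim)).astype(np.float32)
emb.flush()

# Training/eval step: reopen read-only and score examples against prototypes
# with a bilinear form s(x, t) = x^T W t, softmaxed over prototypes.
data_emb = np.memmap(path, dtype=np.float32, mode="r", shape=(num_examples, dim))
W = rng.standard_normal((dim, dim)).astype(np.float32)       # learned in practice
proto_emb = rng.standard_normal((num_prototypes, dim)).astype(np.float32)

scores = data_emb @ W @ proto_emb.T                          # (examples, prototypes)
scores = scores - scores.max(axis=1, keepdims=True)          # numerical stability
retrieval = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
```

Each row of `retrieval` is a distribution over prototypes for one data example.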
GloVe embeddings are used in the paper to initialize word embeddings:
```bash
wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
mkdir glove_embeddings
unzip glove.6B.zip -d glove_embeddings

# compress glove embeddings to generate a new embedding file
# that only contains the dictionary of the dataset
python scripts/compress_glove.py \
    --embed-path glove_embeddings/glove.6B.300d.txt \
    --dict-path data-bin/[dataset_name]/dict.txt \
    > glove_embeddings/[dataset_name]_glove.txt
```
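The compression step is essentially a vocabulary filter over the GloVe text file. A rough sketch of the idea (`compress_glove` and the toy rows below are illustrative; the real script parses fairseq's `dict.txt` format):

```python
# Illustrative sketch: keep only GloVe rows whose leading token (the word)
# appears in the dataset's dictionary, shrinking the embedding file.
def compress_glove(glove_lines, dict_words):
    vocab = set(dict_words)
    return [line for line in glove_lines if line.split(" ", 1)[0] in vocab]

glove = ["the 0.1 0.2", "cat 0.3 0.4", "zyzzyva 0.5 0.6"]
kept = compress_glove(glove, ["the", "cat", "dog"])
print(kept)  # ['the 0.1 0.2', 'cat 0.3 0.4']
```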
Train the model:
```bash
# train the sparse neural editor
# [GPUs] can be multiple ids to perform data-parallel training
# some hyperparameters can be specified (e.g. -a [alpha]); see
# details in the script
bash scripts/train.sh -g [GPUs] -d [dataset name]

# train the LM baseline
bash scripts/train.sh -g [GPUs] -c lm_baseline -d [dataset name]
```
## Evaluation

Compute perplexity (ppl):
```bash
# approximate importance-weighted ppl
bash scripts/train.sh -g [GPUs] -d [dataset name] -e iw -p [checkpoint directory]

# prototype pruning can be performed at eval time
# [prune num] is the number of prototypes kept
bash scripts/train.sh -g [GPUs] -d [dataset name] -u [prune num] -e iw -p [checkpoint directory]
```
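The actual pruning criterion lives in the training script; as an illustration only (this may differ from what `-u` really does), one plausible rule is to keep the `[prune num]` prototypes with the largest total retrieval mass and renormalize:

```python
import numpy as np

# Illustrative pruning rule: rank prototypes by total retrieval mass over
# the data, keep the top-`keep`, and renormalize each row to a distribution.
def prune(retrieval, keep):
    mass = retrieval.sum(axis=0)              # total mass per prototype
    kept = np.argsort(mass)[::-1][:keep]      # indices of the top-`keep`
    pruned = retrieval[:, kept]
    return kept, pruned / pruned.sum(axis=1, keepdims=True)

r = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1]])
kept, pr = prune(r, 2)
print(sorted(kept.tolist()))  # [0, 1]
```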
## Template-based Generation
See the notebook `generate_demo.ipynb` (mainly the `sample_from_cluster` function) for examples of loading the pretrained model and generating from given templates.
## Citation
```
@inproceedings{he2020learning,
  title={Learning Sparse Prototypes for Text Generation},
  author={He, Junxian and Berg-Kirkpatrick, Taylor and Neubig, Graham},
  booktitle={Proceedings of NeurIPS},
  year={2020}
}
```