DeepLearnXMU / embedding-transfer

Code for “Bridging Subword Gaps in Pretrain-Finetune Paradigm for Natural Language Generation” (ACL 2021)

Description of directories

  • poattention (modified from Fairseq): Training the Position-Aware Embedding Generator for seq2seq models.
  • use_poattention (modified from Fairseq): Generating embeddings for unseen tokens and fine-tuning the seq2seq model with the downstream vocabulary on the downstream task.
  • bert_poattention (modified from Transformers): Training the Position-Aware Embedding Generator for BERT-like models.
  • bert_use_poattention (modified from Fairseq): Generating embeddings for unseen tokens, converting the parameters of the BERT-like model to a seq2seq one, and fine-tuning the seq2seq model with the newly generated vocabulary on the downstream task.

How to run

For seq2seq pretrained models

poattention

  1. Preprocess the upstream and downstream data (refer to Fairseq for details); binarized data and vocabularies will be stored in data-bin. A sketch of the preprocessing command is given after this list.

  2. Move the seq2seq pretrained model (generated by Fairseq) to ./checkpoints and rename it as checkpoint_last.pt.

    cp path_to_pretrained_model ./checkpoints/checkpoint_last.pt

  3. Train the embedding generator

    pip install .; bash train.sh

  4. Stop training once the model converges.
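
As a sketch for step 1, a minimal fairseq-preprocess call looks like the following. The language codes, file prefixes, and destination directory are placeholders to adapt to your own corpora; run it once for the upstream data and once for the downstream data, changing --destdir accordingly.

    # Placeholders: binarize the data and build vocabularies into data-bin
    fairseq-preprocess \
        --source-lang src --target-lang tgt \
        --trainpref data/train --validpref data/valid --testpref data/test \
        --destdir data-bin/downstream \
        --workers 8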

use_poattention

  1. Preprocess the upstream and downstream data (refer to Fairseq for details, or see the example command above); binarized data and vocabularies will be stored in data-bin.

  2. Get the mapping between the upstream and downstream vocabularies (see the note after this list).

    python get_map_index.py

    Note: change the data name in get_map_index.py to match your dataset.

  3. Move the well-trained embedding generator checkpoint (generated by poattention) to ./checkpoints and rename it as checkpoint_last.pt.

    cp path_to_embedding_generator ./checkpoints/checkpoint_last.pt

  4. Generate embeddings for unseen tokens and fine-tune the downstream model with the downstream vocabulary.

    pip install .; bash train.sh
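
For step 2, get_map_index.py relies on the vocabularies produced in step 1. Assuming the data were binarized with fairseq-preprocess, the dictionary files follow the dict.<lang>.txt naming convention inside the data-bin directories; the paths and language codes below are placeholders you can use to locate them before changing the data name in get_map_index.py.

    # Placeholders: inspect the upstream and downstream Fairseq dictionaries
    head -n 5 data-bin/upstream/dict.src.txt data-bin/downstream/dict.src.txt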

For BERT-like pretrained models

bert_poattention

  1. Prepare the upstream data (plain text) at ./examples/language-modeling/data.

  2. Train the embedding generator (a sketch of the underlying command is given after this list):

    pip install .

    cd ./examples/language-modeling

    bash train_mlm.sh
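
As a rough sketch of what step 2 runs, assuming train_mlm.sh wraps the standard run_mlm.py script from the Transformers language-modeling example: the model name and paths below are placeholders, and the script in this repository may pass additional arguments for the embedding generator.

    # Placeholders: masked-language-model training over plain text
    python run_mlm.py \
        --model_name_or_path bert-base-uncased \
        --train_file data/train.txt \
        --do_train \
        --line_by_line \
        --output_dir ./checkpoints/mlm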

bert_use_poattention

  1. Preprocess the upstream and downstream data (refer to Fairseq for details); binarized data and vocabularies will be stored in data-bin.

    Note: sentences should be segmented with WordPiece; we suggest using bert-vocab-builder to build the vocabulary of the downstream data.

  2. Get the mapping between the upstream and downstream vocabularies.

    python get_map_index.py

    Note: change the data name in get_map_index.py to match your dataset.

  3. Generate embeddings for unseen tokens and fine-tune the downstream model with the downstream vocabulary.

    pip install path_to_bert_poattention

    pip install .; bash train.sh
