Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning
Sandeep Subramanian, Adam Trischler, Yoshua Bengio & Christopher Pal
ICLR 2018
GenSen is a technique to learn general purpose, fixed-length representations of sentences via multi-task training. These representations are useful for transfer and low-resource learning. For details, please refer to the ICLR paper.
We use a sentencepiece model to build the vocabulary and to tokenize text.
You can train the sentencepiece model on the datasets listed under the tasks key of config.json by running:
python train_sentencepiece.py
Note: you can change the sentencepiece model configuration under the vocab key of the provided config.json.
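As a quick sanity check before training, the two config.json sections mentioned above can be inspected from Python. This is a minimal sketch: only the tasks and vocab key names come from this README; the helper name load_sections is hypothetical, and nothing is assumed about what each section contains.

```python
import json


def load_sections(config_path):
    """Return the 'tasks' and 'vocab' sections of a GenSen-style config file.

    Only the two key names are taken from the README; the contents of
    each section are returned untouched.
    """
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)

    # Fail early if either section is absent, rather than midway through training.
    missing = [key for key in ("tasks", "vocab") if key not in config]
    if missing:
        raise KeyError(f"config is missing required key(s): {missing}")
    return config["tasks"], config["vocab"]
```

A call such as tasks, vocab = load_sections("config.json") would then give you both sections to review before running train_sentencepiece.py.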
Use tokenization.py to create a pre-training dataset from a dump of raw text. It has the following arguments:
--model_path: File defining the sentencepiece tokenizer.
--num_workers: If >1, parallelize across multiple processes (4 by default).
Usage example:
python tokenization.py \
--model_path=./tokenizer.model \
--num_workers=6
Note: this script reads the data files from the provided config.json.
To train a model from scratch, simply run train.py with an appropriate JSON config file. An example config is provided in config.json.
python train.py
Once you have a trained model, you can throw away all of the decoders and retain only the encoder used to compute sentence representations.
You can do this by running:
python extract_encoder.py -t <path_to_trained_model> -s <path_to_save_encoder>
Once you have done this, you can load this model just like any of the pre-trained models by passing the encoder and tokenizer folder as model_folder.
from encoder import GenSen
your_gensen = GenSen(
    model_folder='encoder_and_tokenizer_directory',
    trainable=True,  # (for fine-tuning the encoder)
    use_cuda=True
)
You can use our pre-trained models to extract the last hidden state or all hidden states of our multi-task GRU. Additionally, you can concatenate the output of multiple models to replicate the numbers in our paper.
from encoder import GenSen, GenSenEnsemble
sentences = [
    'hello world .',
    'the quick brown fox jumped over the lazy dog .',
    'this is a sentence .'
]
gensen_1 = GenSen(
    model_folder='/path/to/pretrained_model_1',
    trainable=True,  # (for fine-tuning the encoder)
    use_cuda=True
)
reps_h, reps_h_t = gensen_1.get_representation(
    sentences, pool='last', language='ar', normalize=True, add_start_end=True, return_numpy=True
)
print(reps_h.shape, reps_h_t.shape)
get_representation takes the following arguments:
sentences: a list of strings. If your strings are not pre-tokenized, set normalize=True to normalize and tokenize the text before computing representations.
pool: specifies the pooling operation used to get the sentence representation from the encoder hidden states (last or max).
language: specifies the language of the passed sentences, to choose the appropriate tokenization and normalization functions (ar, en, or bi). For bilingual sentences, bi uses langdetect to detect the language, which is slower than setting language beforehand.
normalize: whether to normalize sentences before encoding (recommended).
add_start_end: whether to add <s> and </s> tokens at the start and end of each sentence before encoding (recommended).
reps_h (batch_size x seq_len x dim_src) contains the hidden states for all words in all sentences (padded to the max length of sentences).
reps_h_t (batch_size x dim_src) contains only the last hidden state for all sentences in the minibatch.
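To make the two pool options concrete, here is a minimal pure-Python sketch of last and max pooling over padded hidden states. It is an illustration under assumptions, not GenSen's implementation: the function names and the explicit lengths argument are hypothetical, and the real encoder operates on torch tensors rather than nested lists.

```python
def pool_last(hidden_states, lengths):
    """Pick the hidden state at the last real (non-padded) step of each sentence."""
    return [sentence[n - 1] for sentence, n in zip(hidden_states, lengths)]


def pool_max(hidden_states, lengths):
    """Take the element-wise maximum over the real steps of each sentence."""
    return [
        [max(column) for column in zip(*sentence[:n])]
        for sentence, n in zip(hidden_states, lengths)
    ]


# Toy batch: 2 sentences, padded to seq_len=3, hidden size 2.
# The second sentence has only 2 real steps; its last row is padding.
hidden_states = [
    [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]],
    [[0.7, 0.1], [0.2, 0.8], [0.0, 0.0]],
]
lengths = [3, 2]

last = pool_last(hidden_states, lengths)   # [[0.3, 0.5], [0.2, 0.8]]
maxed = pool_max(hidden_states, lengths)   # [[0.4, 0.9], [0.7, 0.8]]
```

Both helpers ignore padded steps, so a short sentence is not summarized by its padding rows.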
A GenSen instance returns the output of a single model. You can concatenate the output of multiple models by creating a GenSenEnsemble instance from multiple GenSen instances, as follows:
gensen_2 = GenSen(
    model_folder='/path/to/pretrained_model_2',
    trainable=True,  # (for fine-tuning the encoder)
    use_cuda=True
)
gensen = GenSenEnsemble(gensen_1, gensen_2)
reps_h, reps_h_t = gensen.get_representation(
    sentences, pool='last', language='ar', normalize=True, add_start_end=True, return_numpy=True
)
reps_h (batch_size x seq_len x dim_src * 2) contains the hidden states for all words in all sentences (padded to the max length of sentences).
reps_h_t (batch_size x dim_src * 2) contains only the last hidden state for all sentences in the minibatch.
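The doubled feature dimension follows from concatenating each model's vector for the same sentence. The sketch below shows the idea with plain lists; concat_sentence_vectors is a hypothetical helper written for illustration and is not part of GenSenEnsemble.

```python
from itertools import chain


def concat_sentence_vectors(*model_outputs):
    """Concatenate per-sentence vectors from several models along the feature axis.

    Each element of model_outputs is a list of sentence vectors
    (batch_size x dim); all models must cover the same batch.
    """
    batch_sizes = {len(output) for output in model_outputs}
    if len(batch_sizes) != 1:
        raise ValueError("all models must encode the same number of sentences")
    return [list(chain.from_iterable(vectors)) for vectors in zip(*model_outputs)]


model_1 = [[1.0, 2.0], [3.0, 4.0]]   # batch_size=2, dim_src=2
model_2 = [[5.0, 6.0], [7.0, 8.0]]   # same batch, encoded by a second model
combined = concat_sentence_vectors(model_1, model_2)
# combined has shape batch_size x (dim_src * 2)
```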
The model will produce a fixed-length vector for each sentence as well as the hidden states corresponding to each word in every sentence (padded to max sentence length). You can also return a numpy array instead of a torch.FloatTensor by setting return_numpy=True.
If you have a specific domain for which you want to compute representations, you can call vocab_expansion on instances of the GenSen or GenSenEnsemble class: gensen.vocab_expansion(vocab, pretrained_embeddings_path), where vocab is a list of unique words in the new domain. This learns a linear mapping from the provided pretrained embeddings (which have a significantly larger vocabulary) to the space of GenSen's word vectors. For an example of how this is used in an actual setting, please refer to gensen_senteval.py.
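The linear mapping used for vocabulary expansion can be sketched as an ordinary least-squares fit over the words shared by both vocabularies. This is an illustration with NumPy under stated assumptions, not the code behind vocab_expansion: the function names, the dict-based embedding tables, and the absence of a bias term are all hypothetical choices.

```python
import numpy as np


def fit_linear_map(pretrained, model_vectors):
    """Fit W so that pretrained[word] @ W approximates model_vectors[word].

    Both arguments map word -> 1-D numpy array; only words present in
    both tables are used for the fit.
    """
    shared = sorted(set(pretrained) & set(model_vectors))
    if not shared:
        raise ValueError("no overlapping words to fit the mapping on")
    source = np.stack([pretrained[w] for w in shared])      # (n, d_pre)
    target = np.stack([model_vectors[w] for w in shared])   # (n, d_model)
    weights, *_ = np.linalg.lstsq(source, target, rcond=None)
    return weights                                          # (d_pre, d_model)


def expand_vocab(new_words, pretrained, model_vectors, weights):
    """Project pretrained vectors of unseen words into the model's space."""
    expanded = dict(model_vectors)
    for word in new_words:
        # Words already known to the model keep their trained vectors;
        # words absent from the pretrained table cannot be projected.
        if word not in expanded and word in pretrained:
            expanded[word] = pretrained[word] @ weights
    return expanded
```

The fit only needs the overlap between the two vocabularies, which is why a pretrained table with a much larger vocabulary is useful: every extra word it contains can be projected once the mapping is known.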
We use the SentEval toolkit to run most of our transfer learning experiments.
To replicate these numbers, clone the SentEval repository and follow its setup instructions. Once complete, copy gensen/downstream/senteval.py and gensen/encoder.py into its examples folder and run the following commands to reproduce different rows in Table 2 of our paper. Note: please set the path to the pretrained GloVe embeddings (glove.840B.300d.h5) and the model folder as appropriate.
We use the Arabic Language Understanding Evaluation (ALUE) benchmark to compare our pretrained encoders to other SOTA models (e.g. BERT).
Use gensen/downstream/alue_finetuning.py to run the ALUE benchmark. It has the following arguments:
--model_dir: Folder defining the model encoder and the sentencepiece tokenizer.
--pretrained_embeddings: (Optional) If not None, use the vocab expansion technique (None by default).
--finetuning_output: (Optional) If not None, folder path in which to save the fine-tuned models (None by default).
--tasks: (Optional) List of ALUE benchmark tasks (all tasks by default).
--seed: (Optional) Random seed value (90 by default).
--print_summary: (Optional) Whether to print the summary report (True by default).
--report_path: (Optional) If not None, file path to save the summary report as an Excel sheet (None by default).
Usage example:
python -m gensen.downstream.alue_finetuning \
--model_dir=<path_of_encoder_tokenizer> \
--pretrained_embeddings=<path_of_pretrained_embeddings> \
--finetuning_output=<path_to_save_finetuning_models> \
--tasks=Diagnostic,IDAT,Madar \
--seed=32 \
--print_summary=True \
--report_path=<path_to_save_report>
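For reference, the flags listed above could be declared with argparse roughly as follows. This is a sketch, not the parser in alue_finetuning.py: the flag names and defaults come from this README, while the comma-separated parsing of --tasks, the string-to-bool conversion, and the use of None to mean "all tasks" are assumptions.

```python
import argparse


def str_to_bool(value):
    """Interpret 'True'/'False'-style strings, as in --print_summary=True."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="ALUE fine-tuning (sketch)")
    parser.add_argument("--model_dir", required=True,
                        help="folder with the encoder and sentencepiece tokenizer")
    parser.add_argument("--pretrained_embeddings", default=None,
                        help="if set, apply vocab expansion with these embeddings")
    parser.add_argument("--finetuning_output", default=None,
                        help="if set, folder in which to save fine-tuned models")
    parser.add_argument("--tasks", default=None,
                        type=lambda text: text.split(","),
                        help="comma-separated task names; None means all tasks")
    parser.add_argument("--seed", type=int, default=90)
    parser.add_argument("--print_summary", type=str_to_bool, default=True)
    parser.add_argument("--report_path", default=None,
                        help="if set, file path for the summary report")
    return parser


# Parse a command line shaped like the usage example above.
args = build_parser().parse_args(
    ["--model_dir=./encoder", "--tasks=Diagnostic,IDAT,Madar", "--seed=32"]
)
```

Parsing --print_summary through an explicit converter avoids the common pitfall where type=bool treats any non-empty string, including "False", as true.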
We assess the bilingual capabilities of the pretrained GenSen encoder using two tasks that we have built: Bilingual Natural Language Inference and Bilingual Semantic Sentence Similarity.
@article{subramanian2018learning,
title={Learning general purpose distributed sentence representations via large scale multi-task learning},
author={Subramanian, Sandeep and Trischler, Adam and Bengio, Yoshua and Pal, Christopher J},
journal={arXiv preprint arXiv:1804.00079},
year={2018}
}