This repository is based on Lightning Transformers.
- Installation
- Related papers
- Use CT Loss in your work
- Test or interact with our checkpoints
- Training by yourself
- Test or interact with your trained model
This repository contains the official source code for the following papers:
[1] A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration
If you use this work, please cite our paper:
@article{jiang2022contrastive,
doi = {10.48550/ARXIV.2205.02517},
url = {https://arxiv.org/abs/2205.02517},
author = {Jiang, Shaojie and Zhang, Ruqing and Vakulenko, Svitlana and de Rijke, Maarten},
title = {A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration},
publisher = {arXiv},
year = {2022},
}
[2] Weakly Supervised Turn-level Engagingness Evaluator for Dialogues
@inproceedings{jiang2023weakly,
author = {Jiang, Shaojie and Vakulenko, Svitlana and de Rijke, Maarten},
title = {Weakly Supervised Turn-level Engagingness Evaluator for Dialogues},
year = {2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3576840.3578319},
doi = {10.1145/3576840.3578319},
keywords = {Conversation analysis, engagingness, user experience},
location = {Austin, TX, USA},
series = {CHIIR '23}
}
Clone and change your working directory to this repo's root directory:
git clone https://github.com/ShaojieJiang/lit-seq.git
cd lit-seq
pip install . # Tested with Python >= 3.7.0
This repo depends on our Python package `ct-loss`, a PyTorch loss function for reducing generative repetitions of auto-regressive language models. Using `ct-loss` in your own work is very simple; please take a look at the ct-loss repo.
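To give a feel for what the loss does, below is a minimal PyTorch sketch of a contrastive-token-style penalty: preceding target tokens are treated as negatives whose logits are pushed below the positive (gold-token) logit. This is only an illustration of the idea, not the `ct-loss` package's actual API, so please follow that repo's README for real usage.

```python
import torch
import torch.nn.functional as F


def contrastive_token_penalty(logits, targets, preced_m_negatives=60):
    """Illustrative contrastive-token-style penalty (not the official ct-loss API).

    logits:  [batch, seq_len, vocab] model outputs
    targets: [batch, seq_len] gold token ids
    For every position, the gold token is the positive, and the logits of up to
    `preced_m_negatives` preceding gold tokens are pushed below the positive logit.
    """
    batch, seq_len, _ = logits.shape
    # Logit assigned to the gold token at each position: [batch, seq_len]
    pos = logits.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    losses = []
    for t in range(1, seq_len):
        start = max(0, t - preced_m_negatives)
        neg_ids = targets[:, start:t]                # preceding gold tokens
        neg = logits[:, t, :].gather(-1, neg_ids)    # their logits at step t
        # softplus(x) = log(1 + exp(x)): penalise negatives whose logits
        # approach or exceed the positive logit.
        losses.append(F.softplus(neg - pos[:, t:t + 1]).mean())
    return torch.stack(losses).mean()
```

In training, a penalty of this kind is added on top of the usual cross-entropy loss.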
The pretrained checkpoints used in paper [1] are now on the Hugging Face Hub, so you can easily reproduce the results reported in our paper, or interact with our pretrained models.
Here is the notebook to interact with our models on Google Colab.
For reproducing the test results on your local server, or interacting with the GPT2-small model finetuned on Wikitext-103:
python lit.py --config-name lm backbone.pretrained_model_name_or_path=NeuralNotwork/gpt2-ct stage=[test | interact]
When interacting with a language model, you can get continuations of your input prefix.
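If you prefer to bypass this repo's CLI, the same checkpoint can presumably be loaded directly with Hugging Face transformers. The sketch below assumes the checkpoint works with the standard causal-LM auto classes; the prefix and decoding settings are arbitrary examples, not the ones used in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NeuralNotwork/gpt2-ct")
model = AutoModelForCausalLM.from_pretrained("NeuralNotwork/gpt2-ct")

prefix = "The history of natural language processing"
inputs = tokenizer(prefix, return_tensors="pt")
# Greedy decoding for simplicity; swap in your preferred sampling/beam settings.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```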
For the BlenderBot dialogue model:
python lit.py --config-name dialogue_multi backbone.pretrained_model_name_or_path=NeuralNotwork/blenderbot-400M-ct stage=[test | interact]
When interacting with a dialogue model, you can get responses to your input message.
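Similarly, here is a rough sketch for getting a response from the BlenderBot checkpoint outside the CLI, assuming it loads with the standard seq2seq auto classes; the message and decoding settings are just examples.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NeuralNotwork/blenderbot-400M-ct")
model = AutoModelForSeq2SeqLM.from_pretrained("NeuralNotwork/blenderbot-400M-ct")

message = "Hello, how are you doing today?"
inputs = tokenizer(message, return_tensors="pt")
reply_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))
```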
If you don't need the W&B logging, add log=False to the above commands.
You can also reproduce our training using the instructions below.
All the data downloading and preprocessing are taken care of automatically.
All default hyper-parameters for reproducing our results are already in their corresponding conf/*.yaml configuration files.
Simply run the following commands.
NOTE: Preprocessing big datasets such as Wikitext-103 and DSTC8-Reddit may take longer and require more CPU memory and CPU cores the first time. But thanks to Hugging Face Datasets, once the datasets are preprocessed and cached locally, subsequent runs need much less memory (25GB or less) and fewer CPU cores (usually two are enough), and the cached datasets are loaded almost instantly.
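lit.py takes care of downloading and preprocessing automatically, but if you want to warm the Hugging Face Datasets cache ahead of a training run (for example, on a machine with internet access), a sketch like the following should work. It assumes the underlying Hub dataset is `wikitext` with the `wikitext-103-raw-v1` config used below, and that lit.py uses the same cache directory.

```python
from datasets import load_dataset

# Downloads and caches Wikitext-103; subsequent loads (including, presumably,
# the ones triggered by lit.py with the same cache directory) reuse the cache.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
print(wikitext)
```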
python lit.py --config-name lm dataset.cfg.dataset_config_name=wikitext-103-raw-v1 [OPTIONS]
For customising the training, consider the following options:
optional arguments | values | explanation |
---|---|---|
task.cfg.ct_seq_len | Positive integer | Suggested to be 1/4 (rounded) of the cross-entropy sequence length (maximum training length). Defaults to 150 |
task.cfg.preced_m_negatives | Integer > -1 | -1 means using all preceding tokens as negatives, 0 uses none, and k > 0 uses the k preceding tokens. Suggested to be 1/8 of the cross-entropy sequence length (maximum training length). Defaults to 60 |
task.cfg.negative_method | ct, ul, nce, simctg | Which method to use for penalizing negative tokens. ct: contrastive token; ul: unlikelihood training; nce: noise-contrastive estimation; simctg: SimCTG (training objective only). Defaults to ct |
task.cfg.ul_seq | True, False | Whether to use sequence-level UL. Defaults to False |
task.cfg.simctg | True, False | Whether to use the SimCTG loss. Defaults to False |
training.lr | Float | Learning rate. Defaults to 1e-5 |
trainer.default_root_dir | Path to your checkpoint location | Defaults to ${HOME}/storage/trained/lit/${task.cfg.task_name}/${backbone.pretrained_model_name_or_path}_${dataset.cfg.pretrained_dataset_name} |
python lit.py --config-name dialogue_multi [OPTIONS]
For customising the training, consider these options:
optional arguments | values | explanation |
---|---|---|
task.cfg.ct_seq_len | Positive integer | Suggested to be 1/4 (rounded) of the cross-entropy sequence length (maximum training length). Defaults to 30 |
task.cfg.preced_m_negatives | Integer > -1 | -1 means using all preceding tokens as negatives, 0 uses none, and k > 0 uses the k preceding tokens. Suggested to be 1/8 of the cross-entropy sequence length (maximum training length). Defaults to 15 |
task.cfg.negative_method | ct, ul, nce, simctg | Which method to use for penalizing negative tokens. ct: contrastive token; ul: unlikelihood training; nce: noise-contrastive estimation; simctg: SimCTG (training objective only). Defaults to ct |
task.cfg.ul_seq | True, False | Whether to use sequence-level UL. Defaults to False |
task.cfg.simctg | True, False | Whether to use the SimCTG loss. Defaults to False |
training.lr | Float | Learning rate. Defaults to 1e-5 |
trainer.default_root_dir | Path to your checkpoint location | Defaults to ${HOME}/storage/trained/lit/${task.cfg.task_name}/${backbone.pretrained_model_name_or_path}_${dataset.cfg.pretrained_dataset_name} |
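The ct_seq_len and preced_m_negatives suggestions in the two tables above amount to a simple rule of thumb, sketched below. The 120-token maximum training length is only an illustrative value (it happens to reproduce the dialogue defaults of 30 and 15); use your own config's sequence length.

```python
def suggest_ct_hparams(max_train_len: int) -> dict:
    """Rule of thumb from the option tables: ct_seq_len ~ 1/4 and
    preced_m_negatives ~ 1/8 of the cross-entropy sequence length."""
    return {
        "task.cfg.ct_seq_len": round(max_train_len / 4),
        "task.cfg.preced_m_negatives": max_train_len // 8,
    }


# e.g. with an assumed maximum training length of 120 tokens:
print(suggest_ct_hparams(120))
# {'task.cfg.ct_seq_len': 30, 'task.cfg.preced_m_negatives': 15}
```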
To reproduce the training in work [2]:
python lit.py --config-name rdep_hier_multi dataset.cfg.history_size=3 trainer.default_root_dir='your_path_to_save_checkpoints'
To test or interact with the models trained by yourself:
python lit.py --config-name [lm | dialogue_multi] trainer.default_root_dir='your_path_to_saved_checkpoints' stage=[test | interact]
To test the trained evaluator on the FED dataset:
export DATASET=fed # or daily_dialog_engaging
python lit.py --config-name rdep_hier dataset.cfg.history_size=3 trainer.default_root_dir='your_path_to_save_checkpoints' stage=test log=False dataset=nlp/text_regression/${DATASET}
Please observe the Apache 2.0 license that is listed in this repository.
Coming soon. Tested on the following tasks:
- Language modeling
- Conversation