This model generates replies using DialoGPT-style language modeling, by concatenating all dialog turns in a conversation into one long text.
For example, the following conversation:
Person: Do you want the Aladeen news or the Aladeen news?
You: The Aladeen news?
Person: You're HIV-Aladeen.
You: 😮
will be transformed into the following format:
<s> Do you want the Aladeen news or the Aladeen news? </s> The Aladeen news? <s> You're HIV-Aladeen. </s> 😮
We introduce two special tokens, <s> and </s>, where <s> denotes the beginning of a reply by the other person and </s> the beginning of a reply by you.
Given that the training input is just a text sequence, it can be modeled using any causal language model and used to generate a reply based on the current context.
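The concatenation step above can be sketched as follows (a minimal illustration; `to_training_text` is a hypothetical helper, not the repo's actual code):

```python
def to_training_text(turns):
    """Concatenate alternating dialog turns into one training string.

    Turns are assumed to alternate, starting with the other person:
    their turns are prefixed with <s>, your replies with </s>.
    """
    marks = ["<s>", "</s>"]
    return " ".join(f"{marks[i % 2]} {turn}" for i, turn in enumerate(turns))

text = to_training_text([
    "Do you want the Aladeen news or the Aladeen news?",
    "The Aladeen news?",
    "You're HIV-Aladeen.",
    "😮",
])
```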
Formally, we concatenate all dialog turns within a dialogue session into a long text x_1, ..., x_N. We denote the source sentence (dialogue history) as S = x_1, ..., x_m, where x_m is the </s> token, and the target sentence (ground-truth response) as T = x_{m+1}, ..., x_N. The conditional probability P(T|S) can then be written as the product of a series of conditional probabilities:

p(T|S) = ∏_{n=m+1}^{N} p(x_n | x_1, ..., x_{n-1})
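As a toy numeric illustration of this factorization (the per-token probabilities below are made up, not produced by a real model):

```python
import math

# Toy conditional probabilities p(x_n | x_1, ..., x_{n-1}) for each
# target token x_{m+1}, ..., x_N; a real model would read these off
# its softmax output at each position.
target_token_probs = [0.5, 0.25, 0.5]

# p(T|S) is the product of the per-token conditionals. Summing logs
# instead of multiplying avoids underflow on long sequences.
log_p = sum(math.log(p) for p in target_token_probs)
p_t_given_s = math.exp(log_p)  # 0.5 * 0.25 * 0.5 = 0.0625
```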
Go to https://www.facebook.com/dyi/?referrer=yfi_settings to download an archive of your past data. Select the JSON format and low media quality for a smaller archive, since we don't need the media files anyway.
Uncheck everything except the "Messages" box, request your download, and wait a few days for your archive to become available.
Unzip your data and run the following command:
python preprocess.py --input_path /<path-to-your-data>/inbox --output_path ./data/convs.json
The output format should look like ./data/sample.json
Run the following command:
python train.py --output_dir=output --model_type=gpt2 --do_train --model_name_or_path "suicaokhoailang/gpt-neo-vi-comments-finetuned" --block_size 128 --per_device_train_batch_size=16 --per_device_eval_batch_size=36 --gradient_accumulation_steps=4 --save_total_limit=5 --learning_rate=2e-5 --num_train_epochs=5 --save_steps=500 --overwrite_output_dir --train_data_file=./data/convs.json --logging_steps 500 --seed 42069
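For reference, the per-device flags above combine as follows (assuming a single GPU):

```python
# Flags from the training command above (single-device assumption).
block_size = 128                   # tokens per training example
per_device_train_batch_size = 16   # examples per forward/backward pass
gradient_accumulation_steps = 4    # passes accumulated per optimizer step

# Gradients are accumulated over 4 passes, so each optimizer step
# effectively sees 16 * 4 = 64 examples, i.e. 64 * 128 = 8192 tokens.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
tokens_per_optimizer_step = effective_batch_size * block_size
```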
There are a few candidates for the pretrained Vietnamese model. Here I picked a GPT model that I fine-tuned from NlpHUST/gpt-neo-vi-small
on a dataset of 10M Facebook comments. You may also consider:
- https://huggingface.co/danghuy1999/gpt2-viwiki
- https://huggingface.co/imthanhlv/gpt2news
- https://huggingface.co/VietAI/gpt-neo-1.3B-vietnamese-news or https://huggingface.co/VietAI/gpt-j-6B-vietnamese-news (very large models)
Run the following command to start a conversation with your trained model:
python infer.py --model_name_or_path "NlpHUST/gpt-neo-vi-small" --checkpoint_path ./output/pytorch_model.bin
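At each turn, an inference loop like this has to rebuild the <s>/</s> context from recent history and cut the model's continuation at the next <s>. A sketch of those two steps (`build_prompt` and `extract_reply` are hypothetical names; infer.py may work differently):

```python
def build_prompt(history, max_turns=6):
    """Format recent dialog history so the model continues as 'you'.

    history alternates turns and ends with the other person's latest
    message; only the last max_turns are kept so the prompt stays
    within the model's block_size.
    """
    recent = history[-max_turns:]
    parts = []
    for i, turn in enumerate(recent):
        # Counting back from the end: the newest turn is the other
        # person's (<s>), the one before it is yours (</s>), etc.
        from_end = len(recent) - 1 - i
        mark = "<s>" if from_end % 2 == 0 else "</s>"
        parts.append(f"{mark} {turn}")
    # A trailing </s> tells the model that *your* reply comes next.
    return " ".join(parts) + " </s>"

def extract_reply(generated, prompt):
    """Keep only the model's reply, cutting at the next <s> token."""
    continuation = generated[len(prompt):]
    return continuation.split("<s>")[0].strip()
```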