suicao / yourself-simulator

Creating a chatbot from your facebook data with GPT


DIY chatbot from your Facebook data and pretrained language models

How this works

This model generates replies using DialoGPT-style language modeling: all dialog turns in a conversation are concatenated into one long text.

For example, the following conversation:

Person: Do you want the Aladeen news or the Aladeen news?

You: The Aladeen news?

Person: You're HIV-Aladeen.

You: 😮

Will be transformed to the following format:

<s> Do you want the Aladeen news or the Aladeen news? </s> The Aladeen news? <s> You're HIV-Aladeen. </s> 😮

We introduce two special tokens, <s> and </s>, where <s> denotes the beginning of a reply by the other person, and </s> the beginning of a reply by you.

Given that the training input is just a text sequence, it can be modeled using any causal language model and used to generate a reply based on the current context.
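A minimal sketch of this transformation (the function name and turn representation here are illustrative, not the repo's actual API):

```python
BOS_OTHER = "<s>"   # marks the start of the other person's turn
EOS_SELF = "</s>"   # marks the start of your reply

def conversation_to_text(turns):
    """Flatten (speaker, text) turns into one training string,
    prefixing the other person's turns with <s> and yours with </s>."""
    parts = []
    for speaker, text in turns:
        token = EOS_SELF if speaker == "you" else BOS_OTHER
        parts.append(f"{token} {text}")
    return " ".join(parts)

turns = [
    ("person", "Do you want the Aladeen news or the Aladeen news?"),
    ("you", "The Aladeen news?"),
    ("person", "You're HIV-Aladeen."),
    ("you", "😮"),
]
print(conversation_to_text(turns))
# <s> Do you want the Aladeen news or the Aladeen news? </s> The Aladeen news? <s> You're HIV-Aladeen. </s> 😮
```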

Formally, we concatenate all dialog turns within a dialogue session into one long text $x_1, \dots, x_N$ ($N$ is the sequence length). We denote the source sentence (dialogue history) as $S = x_1, \dots, x_m$, where $x_m$ is the </s> token, and the target sentence (ground-truth response) as $T = x_{m+1}, \dots, x_N$. The conditional probability of $T$ given $S$ can then be written as the product of a series of conditional probabilities:

$$p(T \mid S) = \prod_{n=m+1}^{N} p(x_n \mid x_1, \dots, x_{n-1})$$

Training

Prepare your data

Go to https://www.facebook.com/dyi/?referrer=yfi_settings to request an archive of your past data. Select the JSON format and low media quality for a smaller archive, since we don't need the media files anyway.

Uncheck everything but the "Messages" box, request your download, and wait a few days for your archive to become available.

Unzip your data and run the following command:

python preprocess.py --input_path /<path-to-your-data>/inbox --output_path ./data/convs.json

The output should match the format of ./data/sample.json.
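For intuition, here is a rough sketch of what a preprocessing step like this might do, assuming the standard Facebook export layout (one folder per thread under inbox/, each with message_*.json files whose messages are listed newest-first); the repo's preprocess.py is the source of truth:

```python
import json
from pathlib import Path

def fix_encoding(s):
    # Facebook's export mis-encodes UTF-8 as Latin-1; this round-trip
    # restores emoji and accented characters.
    return s.encode("latin-1").decode("utf-8")

def load_inbox(inbox_dir, your_name):
    """Collect conversations as lists of (speaker, text) turns."""
    convs = []
    for thread in Path(inbox_dir).iterdir():
        turns = []
        for f in sorted(thread.glob("message_*.json")):
            data = json.loads(f.read_text(encoding="utf-8"))
            # Messages are listed newest-first; restore chronological order.
            for msg in reversed(data.get("messages", [])):
                if "content" not in msg:
                    continue  # skip stickers, photos, unsent messages, etc.
                speaker = "you" if msg["sender_name"] == your_name else "person"
                turns.append((speaker, fix_encoding(msg["content"])))
        if turns:
            convs.append(turns)
    return convs
```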

Train the model

Run the following command:

python  train.py --output_dir=output --model_type=gpt2 --do_train --model_name_or_path "suicaokhoailang/gpt-neo-vi-comments-finetuned" --block_size 128 --per_device_train_batch_size=16 --per_device_eval_batch_size=36 --gradient_accumulation_steps=4  --save_total_limit=5 --learning_rate=2e-5 --num_train_epochs=5 --save_steps=500  --overwrite_output_dir  --train_data_file=./data/convs.json --logging_steps 500 --seed 42069
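The --block_size 128 flag means training examples are fixed-length windows of tokens. The usual run_clm-style grouping step behind a flag like this (sketched here on plain token-id lists; the real script operates on tokenizer output) looks like:

```python
def group_into_blocks(token_ids, block_size=128):
    """Concatenate tokenized conversations and split into fixed-size
    training blocks, dropping the ragged remainder."""
    flat = [t for seq in token_ids for t in seq]
    total = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, total, block_size)]

# Three tokenized conversations of varying length, block_size of 4:
blocks = group_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```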

There are a few candidates for the pretrained Vietnamese model; here I picked a version of GPT-Neo that I finetuned from NlpHUST/gpt-neo-vi-small on a dataset of 10M Facebook comments.

Inference

Run the following command to start a conversation with your trained model:

python infer.py  --model_name_or_path "NlpHUST/gpt-neo-vi-small" --checkpoint_path ./output/pytorch_model.bin
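Conceptually, an inference loop like this has to do two things with the special tokens: append each user message to the running context as "<s> ... </s>", and cut the model's continuation at the next <s> (where it would start hallucinating the other speaker's turn). A hedged sketch of that bookkeeping, with the model call stubbed out (function names and the character-based trimming are illustrative; a real script would trim by token count against block_size):

```python
def build_prompt(history, user_message, max_chars=512):
    """Append the new turn and trim old context from the left."""
    history.append(f"<s> {user_message} </s>")
    prompt = " ".join(history)
    return prompt[-max_chars:]

def extract_reply(generated, prompt):
    """Keep only what the model produced after the prompt, up to the
    next <s> (the start of the other speaker's imagined turn)."""
    continuation = generated[len(prompt):]
    return continuation.split("<s>")[0].strip()

history = []
prompt = build_prompt(history, "Do you want the Aladeen news?")
# Fake model output for illustration; in practice this would come
# from model.generate() on the tokenized prompt.
generated = prompt + " The Aladeen news? <s> next turn..."
print(extract_reply(generated, prompt))  # The Aladeen news?
```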
