GMFTBY / WhenToSpeak

The code for our paper *When to Talk: Chatbot Controls the Timing of Talking during Multi-turn Open-domain Dialogue Generation*

Home Page: https://arxiv.org/abs/1912.09879

WhenToTalk

The model decides when to speak during a multi-turn conversation, which makes the interaction more engaging.

Model architecture:

  1. GCN for predicting the timing of speaking
    • Dialogue-sequence: sequence of the dialogue history
    • User-sequence: sequence of each user's utterances
    • PMI: context relationship between utterances
  2. Seq2Seq / HRED for language generation
  3. Multi-head attention over the dialogue context (using the GCN hidden states; a minimal sketch follows this list)
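
A minimal sketch of the timing-decision path, assuming PyTorch and PyG. The class and parameter names (`ContextGCN`, `utter_dim`, `heads`) are illustrative, not the repo's actual code, and `edge_index` stands in for the dialogue-sequence, user-sequence, and PMI edges:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class ContextGCN(nn.Module):
    def __init__(self, utter_dim=512, hidden=512, heads=8):
        super().__init__()
        self.conv1 = GCNConv(utter_dim, hidden)   # graph context modeling
        self.conv2 = GCNConv(hidden, hidden)
        self.attn = nn.MultiheadAttention(hidden, heads)
        self.decide = nn.Linear(hidden, 1)        # speak / keep silent

    def forward(self, x, edge_index):
        # x: [T, utter_dim] utterance encodings, one node per utterance
        # edge_index: [2, E] dialogue-sequence / user-sequence / PMI edges
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        seq = h.unsqueeze(1)                      # [T, 1, hidden]
        ctx, _ = self.attn(seq, seq, seq)         # multi-head attention over GCN states
        speak = torch.sigmoid(self.decide(ctx[-1]))  # P(speak at the current turn)
        return ctx.squeeze(1), speak
```

The contextualized states `ctx` would feed the Seq2Seq/HRED decoder, while `speak` is the timing decision.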

Requirements

  1. PyTorch 1.2
  2. PyG (PyTorch Geometric)
  3. numpy
  4. tqdm
  5. nltk (word and sentence tokenization)
  6. BERTScore 0.2.1

Dataset

Format:

  1. The corpus folder contains subfolders, each named after the number of turns in its conversations.
  2. Each subfolder contains multiple files; each file holds one conversation.
  3. Each conversation file is in TSV format; each line has four fields (a parsing sketch follows this list):
    • time
    • poster
    • reader
    • utterance
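
A short sketch of reading one conversation file in this format; the field order follows the list above, and `load_conversation` is an illustrative helper, not part of the repo:

```python
import csv

def load_conversation(path):
    # each line: time <TAB> poster <TAB> reader <TAB> utterance
    turns = []
    with open(path, encoding='utf-8') as f:
        for time, poster, reader, utterance in csv.reader(f, delimiter='\t'):
            turns.append({'time': time, 'poster': poster,
                          'reader': reader, 'utterance': utterance})
    return turns
```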

Create the dataset

# arguments: dataset (ubuntu / cornell) and mode (cf / ncf); this creates the ubuntu-corpus folder
# ubuntu-corpus has one subfolder per mode (cf / ncf)
./data/run.sh ubuntu cf

Metrics

  1. Language generation: BLEU4, PPL, Distinct-1, Distinct-2 (a Distinct-n sketch follows this list)
  2. Talk timing: F1, Acc
  3. Human evaluation: engagingness
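
As a reference for the diversity metrics, a minimal Distinct-n sketch using the nltk tokenizer from the requirements; `distinct_n` is an illustrative helper, not the repo's implementation:

```python
from nltk import ngrams, word_tokenize

def distinct_n(responses, n):
    # ratio of unique n-grams to total n-grams across all generated responses
    grams = [g for r in responses for g in ngrams(word_tokenize(r), n)]
    return len(set(grams)) / max(len(grams), 1)
```

`distinct_n(hyps, 1)` and `distinct_n(hyps, 2)` give Dist-1 and Dist-2.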

Baselines

1. Traditional methods

  1. Seq2Seq
  2. HRED / HRED + CF

2. Graph ablation study

  1. w/o BERT embedding cosine similarity
  2. w/o User-sequence
  3. w/o Dialogue-sequence

How to use

Generate the context graph

# generate the graph information of the train/test/dev dataset
./run.sh graph cornell when2talk 0

Analyze the graph context coverage

# average context coverage in the graph: 0.7935 / 0.7949 / 0.7794 (train / test / dev)
./run.sh stat cornell 0 0

Generate the vocab of the dataset

./run.sh vocab ubuntu 0 0

Train the model (seq2seq / seq2seq-cf / hred / hred-cf)

# train the hred model on GPU 4
./run.sh train ubuntu hred 4

Translate the test set with a trained model

# translate the test set with the hred model on GPU 4
./run.sh translate ubuntu hred 4

Evaluate the translated results

# evaluate the model's translated results on GPU 4 (BERTScore needs the GPU)
./run.sh eval ubuntu hred 4

Generate the performance curve

./run.sh curve dailydialog hred-cf 0

Chat with the model

./run.sh chat dailydialog GatedGCN 0

Experiment Results

To do:
1. Add GatedGCN to all the graph-based methods
2. Add BiGRU to all the graph-based methods
3. Follow DialogueGCN to construct the graph:
    * complete graph within a window of size **p**
    * one long-range edge outside the window to capture long-distance context
    * user embeddings as nodes for processing
4. Layer analysis of the GatedGCN in this repo and of multi-turn modeling
  1. Methods

    • Seq2Seq: seq2seq with attention
    • HRED: hierarchical context modeling
    • HRED-CF: HRED with a classifier for talk timing
    • When2Talk: GCN context modeling first, RNN context modeling later
    • W2T_RNN_First: BiRNN context modeling first, GCN context modeling later
    • GCNRNN: combines the gated GCN context and the RNN context (?)
    • GatedGCN: combines the gated GCN context and the RNN context (a gated-fusion sketch follows this list)
      1. BiRNN for background modeling
      2. Gated GCN for context modeling
      3. The GCN embedding and the BiRNN embedding are combined into the final embedding
      4. Low-turn examples are trained without GCNConv (BiRNN only)
      5. Separating the decision module from the generation module works better
    • W2T_GCNRNN: RNN and GCN combined with an RNN (W2T_RNN_First + GCNRNN)
  2. Automatic evaluation

    • Compare the PPL, BLEU4, Distinct-1, and Distinct-2 scores of all the models.

      The proposed classification methods must be cascaded to compute BLEU4 and BERTScore (in the same output format as the traditional models' results).

      Dailydialog (left four columns) / Cornell (right four columns):

      | Model | BLEU | Dist-1 | Dist-2 | PPL | BLEU | Dist-1 | Dist-2 | PPL |
      |---|---|---|---|---|---|---|---|---|
      | Seq2Seq | 0.1038 | 0.0178 | 0.072 | 29.0640 | 0.0843 | 0.0052 | 0.0164 | 45.1504 |
      | HRED | 0.1175 | 0.0176 | 0.0571 | 29.7402 | 0.0823 | 0.0227 | 0.0524 | 39.9009 |
      | HRED-CF | 0.1268 | 0.0435 | 0.1567 | 29.0111 | 0.1132 | 0.0221 | 0.0691 | 38.5633 |
      | When2Talk | 0.1226 | 0.0211 | 0.0608 | 24.0131 | 0.0996 | 0.0036 | 0.0073 | 32.9503 |
      | W2T_RNN_First | 0.1244 | 0.0268 | 0.0787 | 24.5056 | 0.1118 | 0.0065 | 0.0147 | 33.754 |
      | GCNRNN | 0.1250 | 0.0214 | 0.0624 | 25.8213 | 0.1072 | 0.0077 | 0.0188 | 33.9572 |
      | W2T_GCNRNN | 0.1246 | 0.0152 | 0.0400 | 23.4434 | 0.1107 | 0.0063 | 0.0142 | 34.4256 |
      | GatedGCN | 0.1231 | 0.0423 | 0.1609 | 27.1615 | 0.1157 | 0.0261 | 0.0873 | 34.4256 |
    • F1 measures the accuracy of the speaking timing and is reported only for the classification methods (hred-cf, ...). The label statistics show that negative labels are about half as frequent as positive labels, so accuracy is reported alongside F1 rather than F1 alone. In this setting we care most about the precision component of F1 (a metric sketch follows this list).

      | Model | Dailydialog Acc | Dailydialog F1 | Cornell Acc | Cornell F1 |
      |---|---|---|---|---|
      | HRED-CF | 0.8272 | 0.8666 | 0.7708 | 0.8427 |
      | When2Talk | 0.7992 | 0.8507 | 0.7616 | 0.8388 |
      | W2T_RNN_First | 0.8144 | 0.8584 | 0.7481 | 0.8312 |
      | GCNRNN | 0.8176 | 0.8635 | 0.7598 | 0.8445 |
      | W2T_GCNRNN | 0.7565 | 0.8434 | 0.7853 | 0.8466 |
      | GatedGCN | 0.8226 | 0.8663 | 0.738 | 0.8181 |
  3. Human judgments (engagingness, ...)

    Invite volunteers to chat with these models (seq2seq, hred, seq2seq-cf, hred-cf) and score their performance on engagingness, fluency, ...

    • Dailydialog dataset

      | When2Talk vs. | win (%) | loss (%) | tie (%) | kappa |
      |---|---|---|---|---|
      | Seq2Seq | | | | |
      | HRED | | | | |
      | HRED-CF | | | | |

    • Cornell dataset

      | When2Talk vs. | win (%) | loss (%) | tie (%) | kappa |
      |---|---|---|---|---|
      | Seq2Seq | | | | |
      | HRED | | | | |
      | HRED-CF | | | | |
  4. Graph ablation study

    • F1 / Acc of predicting the speaking timing (hred-cf, ...)
    • BLEU4, BERTScore, Distinct-1, Distinct-2
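
The GatedGCN combination step (item 1 above: combine the GCN embedding and the BiRNN embedding into the final embedding) can be pictured as a learned gate. A speculative sketch; `GatedFusion` and its dimensions are assumptions, not the repo's actual module:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # mixes the BiGRU background state with the gated-GCN context state
    def __init__(self, hidden=512):
        super().__init__()
        self.gate = nn.Linear(2 * hidden, hidden)

    def forward(self, rnn_h, gcn_h):
        # rnn_h, gcn_h: [T, hidden] per-utterance states from the BiGRU / GCN
        g = torch.sigmoid(self.gate(torch.cat([rnn_h, gcn_h], dim=-1)))
        return g * gcn_h + (1 - g) * rnn_h   # element-wise gated combination
```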
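And for the timing metrics in item 2, a minimal sketch using scikit-learn (an extra dependency, not in the requirements list), where label 1 means "speak now":

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score

def timing_metrics(y_true, y_pred):
    # precision matters most here: a false positive means the bot
    # speaks at the wrong time, the costlier error in this setting
    return {'Acc': accuracy_score(y_true, y_pred),
            'F1': f1_score(y_true, y_pred),
            'P': precision_score(y_true, y_pred)}
```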

License

MIT License


Languages

Jupyter Notebook 57.2% · Python 40.1% · Shell 2.6%