Jeiyoon / dstc10

Track 5: Automatic Evaluation and Moderation of Open-domain Dialogue Systems

Members

Task Proposal and Track Website

Baselines

1) Deep AM-FM: Toolkit for Automatic Dialogue Evaluation (Zhang et al., Lecture Notes in Electrical Engineering, vol 704. Springer, Singapore. 2021)

  • Parameters (Fine-tuning AM on DSTC6)
--train_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/twitter_trial_data_train_jeiyoon.txt
--output_dir=./embedding_models/full_am
--model_type=bert
--model_name_or_path=bert-base-uncased
--do_train
--do_eval
--eval_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/DSTC_10_Track_5/Subtask_1/human_evaluation_data/human_evaluation_data/dstc6_eval.json
--overwrite_output_dir
--per_device_train_batch_size=4
--per_device_eval_batch_size=4
--block_size=512
--mlm
  • Parameters (Fine-tuning FM on DSTC6)
--train_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/twitter_trial_data_train_jeiyoon.txt
--output_dir=./language_models/full_fm
--model_type=gpt2
--model_name_or_path=gpt2
--do_train
--do_eval
--eval_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/DSTC_10_Track_5/Subtask_1/human_evaluation_data/human_evaluation_data/dstc6_eval.json
--overwrite_output_dir
--per_device_train_batch_size=1
--per_device_eval_batch_size=1
--block_size=512
  • Parameters (Fine-tuning AM on DSTC7)
--train_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/reddit_train_jeiyoon.txt
--output_dir=./dstc7_model/embedding_models/full_am
--model_type=bert
--model_name_or_path=bert-base-uncased
--do_train
--do_eval
--eval_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/DSTC_10_Track_5/Subtask_1/human_evaluation_data/human_evaluation_data/dstc7_eval.json
--overwrite_output_dir
--per_device_train_batch_size=4
--per_device_eval_batch_size=4
--block_size=512
--mlm
  • Parameters (Fine-tuning FM on DSTC7)
--train_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/reddit_train_jeiyoon.txt
--output_dir=./dstc7_model/language_models/full_fm
--model_type=gpt2
--model_name_or_path=gpt2
--do_train
--do_eval
--eval_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/DSTC_10_Track_5/Subtask_1/human_evaluation_data/human_evaluation_data/dstc7_eval.json
--overwrite_output_dir
--per_device_train_batch_size=1
--per_device_eval_batch_size=1
--block_size=512
  • Parameters (Fine-tuning AM on PersonaCHAT)
--train_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/persona_train_jeiyoon.txt
--output_dir=./persona_model/embedding_models/full_am
--model_type=bert
--model_name_or_path=bert-base-uncased
--do_train
--do_eval
--eval_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/DSTC_10_Track_5/Subtask_1/human_evaluation_data/human_evaluation_data/convai2-grade_eval.json
--overwrite_output_dir
--per_device_train_batch_size=4
--per_device_eval_batch_size=4
--block_size=512
--mlm
  • Parameters (Fine-tuning FM on PersonaCHAT)
--train_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/persona_train_jeiyoon.txt
--output_dir=./persona_model/language_models/full_fm
--model_type=gpt2
--model_name_or_path=gpt2
--do_train
--do_eval
--eval_data_file=/root/dstc10/dstc10_metric_track-main/baselines/deep_amfm/DSTC_10_Track_5/Subtask_1/human_evaluation_data/human_evaluation_data/convai2-grade_eval.json
--overwrite_output_dir
--per_device_train_batch_size=1
--per_device_eval_batch_size=1
--block_size=512
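
Once fine-tuning finishes, the resulting AM and FM checkpoints can be loaded back with Hugging Face Transformers before scoring. A minimal sketch, assuming the --output_dir values above (the DSTC7-tuned pair used by the scoring commands below); the baseline's own scripts may load the checkpoints differently:

from transformers import AutoModel, AutoTokenizer, GPT2LMHeadModel, GPT2Tokenizer

# AM: BERT fine-tuned with the masked-LM objective (--mlm) on the in-domain text
am_tokenizer = AutoTokenizer.from_pretrained("dstc7_model/embedding_models/full_am")
am_model = AutoModel.from_pretrained("dstc7_model/embedding_models/full_am")

# FM: GPT-2 fine-tuned with the causal-LM objective on the same text
fm_tokenizer = GPT2Tokenizer.from_pretrained("dstc7_model/language_models/full_fm")
fm_model = GPT2LMHeadModel.from_pretrained("dstc7_model/language_models/full_fm")

am_model.eval()
fm_model.eval()
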
  • Parameters (Compute Reference-based AM-FM Scores for Turn-level Dataset)
--dataset=dstc6
--device=cuda:0
--am_model_path=dstc7_model/embedding_models/full_am
--fm_model_path=dstc7_model/language_models/full_fm
[wr] DSTC6-Eval (D6) (Hori et al., 2017)
[wr] DSTC7-Eval (D7) (Galley et al., 2019)
[wr] DailyDialog-Eval (GD) (Gupta et al., 2019)
[wr] DailyDialog-Eval (ZD) (Zhao et al., 2020)
[wr] HUMOD (HU) (Merdivan et al., 2020)
[wr] PersonaChat-USR (UP) (Mehri & Eskenazi, 2020a)
[wr] PersonaChat-Eval (ZP) (Zhao et al., 2020)
[wr] TopicalChat-USR (TP) (Mehri & Eskenazi, 2020a)
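
In the reference-based setting, AM compares the system response against the human reference with the fine-tuned BERT, and FM scores the fluency of the response with the fine-tuned GPT-2. A rough sketch of one way to compute and combine the two scores, reusing am_model / fm_model from the loading sketch above; the mean pooling, the inverse-perplexity mapping, and the equal weighting are assumptions here, not necessarily the baseline's exact choices:

import torch
import torch.nn.functional as F

def am_score(response, reference):
    # Adequacy Metric: cosine similarity of mean-pooled BERT embeddings (pooling is an assumption)
    def embed(text):
        inputs = am_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            hidden = am_model(**inputs).last_hidden_state  # (1, seq_len, dim)
        return hidden.mean(dim=1)
    return F.cosine_similarity(embed(response), embed(reference)).item()

def fm_score(response):
    # Fluency Metric: length-normalized GPT-2 log-likelihood, mapped to (0, 1] via inverse perplexity
    ids = fm_tokenizer(response, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = fm_model(ids, labels=ids).loss  # mean per-token negative log-likelihood
    return torch.exp(-loss).item()

def am_fm_score(response, reference, lam=0.5):
    # Equal-weight combination of AM and FM assumed here
    return lam * am_score(response, reference) + (1 - lam) * fm_score(response)
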
  • Parameters (Compute Reference-free AM-FM Scores for Turn-level Dataset)
--dataset=fed-turn
--device=cuda:1
--am_model_path=dstc7_model/embedding_models/full_am
--fm_model_path=dstc7_model/language_models/full_fm
[wor] FED-Turn (FT) (Mehri & Eskenazi, 2020b)
[wor] ConvAI2-Eval (EC) (Huang et al., 2020)
[wor] Empathetic-Eval (EE) (Huang et al., 2020)
[wor] DailyDialog-Eval (ED) (Huang et al., 2020)
  • Parameters (Compute Reference-free AM-FM Scores for Dialogue-level Dataset)
--dataset=fed-dial
--device=cuda:2
--am_model_path=dstc7_model/embedding_models/full_am
--fm_model_path=dstc7_model/language_models/full_fm
[dial] FED-Conversation (FC) (Mehri & Eskenazi, 2020b)
[dial] Persona-Chatlog (PC) (See et al., 2019)
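
In the reference-free settings there is no ground-truth response to compare against. One common variant, assumed here rather than taken from the baseline code, is to compute AM against the dialogue context instead of a reference, keep FM unchanged, and obtain a dialogue-level score by averaging over all context-response pairs of a conversation (am_score / fm_score as in the sketch above):

def ref_free_am_fm(context, response, lam=0.5):
    # AM against the context instead of a reference (assumption), FM unchanged
    return lam * am_score(response, context) + (1 - lam) * fm_score(response)

def dialogue_am_fm(turns, lam=0.5):
    # turns: list of (context, response) pairs from a single conversation
    scores = [ref_free_am_fm(c, r, lam) for c, r in turns]
    return sum(scores) / len(scores)
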

2) D-score: Holistic Dialogue Evaluation without Reference (Zhang et al., IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2021)

Datasets

1) DSTC6 Customer Support Dataset & The Human Evaluation Dataset

https://github.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling

For any use of the dataset, please cite

@article{hori2017end,
  title={End-to-end conversation modeling track in DSTC6},
  author={Hori, Chiori and Hori, Takaaki},
  journal={arXiv preprint arXiv:1706.07440},
  year={2017}
}

2) DSTC7 Knowledge Grounding Dataset & The Human Evaluation Dataset

https://github.com/mgalley/DSTC7-End-to-End-Conversation-Modeling

For any use of the dataset, please cite

@inproceedings{Galley2019GroundedRG,
  title={Grounded Response Generation Task at DSTC7},
  author={Michel Galley and Chris Brockett and Xiang Gao and Jianfeng Gao and B. Dolan},
  booktitle = {Dialog System Technology Challenges (DSTC7)},
  year={2019}
}

3) PersonaCHAT

https://github.com/facebookresearch/ParlAI/tree/master/projects/convai2

For any use of the dataset, please cite

@inproceedings{zhang2018personalizing,
  title={Personalizing Dialogue Agents: I have a dog, do you have pets too?},
  author={Zhang, Saizheng and Dinan, Emily and Urbanek, Jack and Szlam, Arthur and Kiela, Douwe and Weston, Jason},
  booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={2204--2213},
  year={2018}
}
@article{dinan2019second,
  title={The second conversational intelligence challenge (convai2)},
  author={Dinan, Emily and Logacheva, Varvara and Malykh, Valentin and Miller, Alexander and Shuster, Kurt and Urbanek, Jack and Kiela, Douwe and Szlam, Arthur and Serban, Iulian and Lowe, Ryan and others},
  journal={arXiv preprint arXiv:1902.00098},
  year={2019}
}
@article{miller2017parlai,
  title={ParlAI: A Dialog Research Software Platform},
  author={Miller, Alexander H. and Feng, Will and Fisch, Adam and Lu, Jiasen and Batra, Dhruv and Bordes, Antoine and Parikh, Devi and Weston, Jason},
  journal={arXiv preprint arXiv:1705.06476},
  year={2017}
}

Deep Multi-Task and Meta-Learning

https://github.com/Jeiyoon/CS330-Stanford-Deep-Multi-Task-and-Meta-Learning

Research Note (Notion)

  • https://www.notion.so/DSTC10-Automatic-Evaluation-and-Moderation-of-Open-domain-Dialogue-Systems-dc455b5598c240e3b4a20c66b5a884af

  • System-level correlation ranks a group of dialogue systems: each system receives a single metric score, and correlation is computed between the list of system scores and the corresponding human-annotated scores. In practice, one can simply average the metric scores of all responses a system produces on the test set and treat that average as the system score (the same procedure is applied to the human scores).

  • Conversation-level correlation ranks a list of conversations. In the interactive human evaluation setting, the annotator gives a single rating to the entire conversation, so the automatic metric must also assign a single score to the entire conversation; correlation is then computed between the two groups of scores. A simple way to obtain a conversation-level metric score is to average the scores assigned to all context-response pairs within the conversation.

  • Turn-level correlation is the most fine-grained category and the common setup in static evaluation: given multiple context-response pairs, annotators rate the quality of each response, the metric assigns a score to each pair, and correlation is computed between the two groups of scores (see the sketch after this list).
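
A minimal sketch of the system-level vs. turn-level computation with hypothetical scores (system names and numbers are made up for illustration); conversation-level works the same way as system-level, just grouped by conversation instead of by system:

import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-response metric and human scores, grouped by dialogue system
metric_scores = {"sys_a": [0.61, 0.55, 0.70], "sys_b": [0.48, 0.52, 0.44], "sys_c": [0.66, 0.58, 0.63]}
human_scores = {"sys_a": [4.0, 3.5, 4.5], "sys_b": [3.0, 3.2, 2.8], "sys_c": [4.2, 3.8, 4.0]}

systems = sorted(metric_scores)

# System-level: one averaged score per system, then correlate the two lists of system scores
sys_metric = [np.mean(metric_scores[s]) for s in systems]
sys_human = [np.mean(human_scores[s]) for s in systems]
rho_sys, _ = spearmanr(sys_metric, sys_human)

# Turn-level: correlate the flat per-response scores directly
turn_metric = [x for s in systems for x in metric_scores[s]]
turn_human = [x for s in systems for x in human_scores[s]]
rho_turn, _ = spearmanr(turn_metric, turn_human)

print(f"system-level Spearman: {rho_sys:.3f}, turn-level Spearman: {rho_turn:.3f}")
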
