
Pchatbot: A Large-Scale Dataset for Personalized Chatbot

Introduction

We introduce Pchatbot, a large-scale conversation dataset dedicated to the development of personalized dialogue models. In this dataset, we assign anonymized user IDs and timestamps to conversations. Users' dialogue histories can be retrieved and used to build rich user profiles. With the dialogue histories available, we can move from personality-based models to personalized models.

Pchatbot has two subsets, named PchatbotW and PchatbotL, built from open-domain Weibo and judicial forums, respectively. Since each subset is very large, we divide it into 10 equal parts by number of users, named PchatbotW-i and PchatbotL-i.

The dataset paper was accepted to SIGIR 2021 (Resource Track). See the paper for more details.

Citation

If you use the dataset in your work, please cite:

@inproceedings{qian2021pchatbot,
     author = {Hongjin Qian and Xiaohe Li and Hanxun Zhong and Yu Guo and Yueyuan Ma and Yutao Zhu and Zhanliang Liu and Zhicheng Dou and Ji-Rong Wen}, 
     title = {Pchatbot: A Large-Scale Dataset for Personalized Chatbot}, 
     booktitle = {Proceedings of the {SIGIR} 2021}, 
     publisher = {{ACM}}, 
     year = {2021}, 
     url = {https://doi.org/10.1145/3404835.3463239}, 
     doi = {10.1145/3404835.3463239}}

The following papers use the Pchatbot dataset:

  1. One Chatbot Per Person: Creating Personalized Chatbots based on Implicit User Profiles (SIGIR 2021 Long Paper)
@inproceedings{DBLP:conf/sigir/madousigir21,
     author = {Zhengyi Ma and Zhicheng Dou and Yutao Zhu and Hanxun Zhong and Ji-Rong Wen}, 
     title = {One Chatbot Per Person: Creating Personalized Chatbots based on Implicit User Profiles}, 
     booktitle = {Proceedings of the {SIGIR} 2021}, 
     publisher = {{ACM}}, 
     year = {2021}, 
     url = {https://doi.org/10.1145/3404835.3462828}, 
     doi = {10.1145/3404835.3462828}}
  2. Learning Implicit User Profile for Personalized Retrieval-based Chatbot (CIKM 2021 Long Paper)
@inproceedings{qian2021impchat,
     author = {Hongjin Qian and Zhicheng Dou and Yutao Zhu and Yueyuan Ma and Ji-Rong Wen}, 
     title = {Learning Implicit User Profile for Personalized Retrieval-based Chatbot}, 
     booktitle = {Proceedings of the {CIKM} 2021}, 
     publisher = {{ACM}}, 
     year = {2021},
     url = {https://doi.org/10.1145/3459637.3482269},
     doi = {10.1145/3459637.3482269}}
  3. Less is More: Learning to Refine Dialogue History for Personalized Dialogue Generation (NAACL 2022 Long Paper)
@inproceedings{zhong-etal-2022-less,
    title = "Less is More: Learning to Refine Dialogue History for Personalized Dialogue Generation",
    author = "Zhong, Hanxun  and
      Dou, Zhicheng  and
      Zhu, Yutao  and
      Qian, Hongjin  and
      Wen, Ji-Rong",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.426",
    doi = "10.18653/v1/2022.naacl-main.426",
    pages = "5808--5820"}

Dataset Statistics

The detailed statistics of Pchatbot are as follows:

|                          | PchatbotW     | PchatbotL     | PchatbotW-1 | PchatbotL-1 |
| ------------------------ | ------------- | ------------- | ----------- | ----------- |
| #Posts                   | 5,319,596     | 20,145,956    | 3,597,407   | 4,662,911   |
| #Responses               | 139,448,339   | 59,427,457    | 13,992,870  | 5,523,160   |
| #Users in posts          | 772,002       | 5,203,345     | 417,294     | 1,107,989   |
| #Users in responses      | 23,408,367    | 203,636       | 2,340,837   | 20,364      |
| Avg. #responses per post | 26.214        | 2.950         | 3.890       | 1.184       |
| Max. #responses per post | 525           | 120           | 136         | 26          |
| #Words                   | 8,512,945,238 | 3,013,617,497 | 855,005,996 | 284,099,064 |
| Avg. #words per pair     | 61.047        | 51.014        | 61.103      | 51.438      |

We construct two standard datasets from Pchatbot, named PchatbotW-R and PchatbotW-G, for retrieval-based and generation-based tasks respectively. These datasets can be used directly in the corresponding dialogue tasks; we will release them later. Their statistics are shown in the following table:

|                               | PchatbotW-R | PchatbotW-G |
| ----------------------------- | ----------- | ----------- |
| Number of users               | 420,000     | 300,000     |
| Avg. history length           | 32.3        | 11.4        |
| Avg. length of post           | 24.9        | 22.9        |
| Avg. length of response       | 10.1        | 9.6         |
| Number of response candidates | 10          | -           |
| Number of training samples    | 3,000,000   | 2,707,880   |
| Number of validation samples  | 600,000     | 600,000     |
| Number of testing samples     | 600,000     | 600,000     |

To obtain statistics, run:

python src/statistics.py
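The statistics script in src/ is the reference implementation; purely as an illustration, the sketch below shows how a few of the reported numbers (#Posts, #Responses, average responses per post) could be recomputed from a single data file, assuming the tab-separated record format described in "Data Content and Format" below. The file name is a placeholder.

```python
# Rough sketch, not the official src/statistics.py.
from collections import defaultdict

def basic_statistics(path):
    responses_per_post = defaultdict(int)  # (post_user_id, post_timestamp, post) -> #responses
    n_pairs = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 6:
                continue  # skip malformed lines
            post, post_uid, post_ts, resp, resp_uid, resp_ts = fields
            responses_per_post[(post_uid, post_ts, post)] += 1
            n_pairs += 1
    n_posts = len(responses_per_post)
    print(f"#Posts: {n_posts}")
    print(f"#Responses: {n_pairs}")
    print(f"Avg. #responses per post: {n_pairs / max(n_posts, 1):.3f}")

# Placeholder file name; point this at an extracted PchatbotW-i / PchatbotL-i file.
basic_statistics("PchatbotW-1.txt")
```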

We will also release standard datasets for PchatbotL later.

Data Content and Format

Obtain the data

For now, we provide download links via Baidu Cloud, which may be slow outside mainland China. We will add a Google Drive link as soon as possible.

Pchatbot-L:

md5: 48bd7ab93f625ebdf34c7254ff27ac2a

Pchatbot-W:

md5: cd443951973f47f5614df298e6e416da

If you cannot access Baidu Cloud Disk, contact us and we will try to provide other options.

Please fill in the application form and send it to the contact email; we will then send you the download links and the password for Baidu Cloud Disk. Note that the application form should be signed by the person in charge of your research group. We will update the download password regularly.

Application Form

Pchatbot Files

The dataset is distributed as .tar.bz2 archives; you can decompress them as follows:

tar -jxvf xx.tar.bz2
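Before unpacking, you may want to check the download against the md5 checksums listed above. The following is a small sketch of doing both steps in Python; the archive file name is a placeholder.

```python
import hashlib
import tarfile

def md5sum(path, chunk_size=1 << 20):
    # Compute the md5 digest of a file in 1 MB chunks.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

archive = "PchatbotW.tar.bz2"                   # placeholder file name
expected = "cd443951973f47f5614df298e6e416da"   # md5 for Pchatbot-W listed above
assert md5sum(archive) == expected, "checksum mismatch, please re-download"

# Equivalent to `tar -jxvf xx.tar.bz2`
with tarfile.open(archive, "r:bz2") as tar:
    tar.extractall(".")
```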

Each record in the dataset has the following format:

Post \t Post_user_id \t Post_timestamp \t Response \t Response_user_id \t Response_timestamp \n

Posts and responses are word-segmented sentences, with tokens separated by spaces. We give several examples of the data in data/sample.txt.
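As a minimal sketch of how the anonymized user IDs and timestamps support personalization, the snippet below parses records and collects each responding user's dialogue history ordered by timestamp. It assumes timestamps compare correctly with simple sorting and uses the sample file shipped with the repository.

```python
from collections import defaultdict

def load_user_histories(path):
    # response_user_id -> list of (response_timestamp, post, response)
    histories = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 6:
                continue  # skip malformed lines
            post, post_uid, post_ts, resp, resp_uid, resp_ts = fields
            histories[resp_uid].append((resp_ts, post, resp))
    for uid in histories:
        histories[uid].sort()  # order each user's history by timestamp
    return histories

histories = load_user_histories("data/sample.txt")
some_user, history = next(iter(histories.items()))
print(some_user, len(history), "history pairs")
```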

We also give some examples of users' personalized information from PchatbotW.release_ver, as shown in the figure below. Due to space constraints, we only select 5 history records for each user in each example.

[Figure: two example users from PchatbotW, each shown with a post, the user's response, and 5 of the user's history post-response pairs (word-segmented Chinese text).]

Data Preprocessing

Instructions for data cleaning, preprocessing, aggregation, and dataset construction are in the ./src/ folder.

Baseline models

We provide results of baseline models on the PchatbotW-R and PchatbotW-G datasets. For evaluation details, please refer to our paper. We will continue to add results for other baseline models:

PchatbotW-R

| Model     | R10@1 | R10@2 | R10@5 | MRR   | nDCG  | Paper | Code |
| --------- | ----- | ----- | ----- | ----- | ----- | ----- | ---- |
| Conv-KNRM | 0.323 | 0.520 | 0.893 | 0.538 | 0.818 | Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search | https://github.com/yunhenk/Conv-KNRM |
| DAM       | 0.438 | 0.644 | 0.966 | 0.635 | 0.881 | Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network | https://github.com/baidu/Dialogue |
| IOI       | 0.442 | 0.651 | 0.969 | 0.639 | 0.890 | One Time of Interaction May Not Be Enough: Go Deep with an Interaction-over-Interaction Network for Response Selection in Dialogues | https://github.com/chongyangtao/IOI |
| RSM-DCK   | 0.428 | 0.627 | 0.947 | 0.623 | 0.858 | Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems | Provided by the author |
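For the exact evaluation protocol behind these numbers, please see the paper. As a rough illustration only, the sketch below shows the conventional way R10@k and MRR are computed for a retrieval task in which each test post has 10 candidates containing exactly one gold response.

```python
def recall_at_k(ranked_lists, k):
    """ranked_lists: per post, candidate labels (1 = gold) sorted by model score, best first."""
    hits = sum(1 for labels in ranked_lists if 1 in labels[:k])
    return hits / len(ranked_lists)

def mrr(ranked_lists):
    total = 0.0
    for labels in ranked_lists:
        rank = labels.index(1) + 1  # position of the gold response
        total += 1.0 / rank
    return total / len(ranked_lists)

# Toy usage: two posts, gold response ranked 1st and 3rd respectively.
ranked = [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]
print(recall_at_k(ranked, 1), recall_at_k(ranked, 2), mrr(ranked))
```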

PchatbotW-G

| Model      | BLEU-1 | ROUGE-L | Dist-1 | Dist-2 | P-F1  | Paper | Code |
| ---------- | ------ | ------- | ------ | ------ | ----- | ----- | ---- |
| Seq2Seq    | 4.889  | 7.594   | 0.229  | 3.404  | 0.771 | Sequence to Sequence Learning with Neural Networks | https://github.com/IBM/pytorch-seq2seq |
| SPEAKER    | 3.958  | 5.580   | 0.951  | 29.780 | 1.534 | A Persona-Based Neural Conversation Model | - |
| PERSONAWAE | 1.945  | 9.064   | 0.523  | 8.549  | 6.408 | Modeling Personalization in Continuous Space for Response Generation via Augmented Wasserstein Autoencoders | - |
| DialoGPT   | 5.038  | 7.358   | 13.995 | 52.674 | 3.562 | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation | https://github.com/microsoft/DialoGPT |
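The sketch below shows the conventional Dist-n computation: the number of distinct n-grams divided by the total number of n-grams over all generated responses. Refer to the paper for the exact scaling and normalization behind the values in the table above.

```python
def dist_n(responses, n):
    total, distinct = 0, set()
    for resp in responses:
        tokens = resp.split()  # generated responses are assumed whitespace-tokenized
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0

# Toy usage with two whitespace-tokenized generated responses.
generated = ["the cake is good", "the cake is sweet"]
print(dist_n(generated, 1), dist_n(generated, 2))
```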

License

This repository is licensed under the Apache-2.0 License.

The Pchatbot dataset is licensed under CC BY-NC 2.0.

FAQ
