silverriver / PersonalDilaog

Scripts for constructing the PersonalDialog dataset (https://arxiv.org/abs/1901.09672)

News

Part of the PersonalDialog dataset is now available through the Hugging Face datasets library: https://huggingface.co/datasets/silver/personal_dialog

from datasets import load_dataset

dataset = load_dataset("silver/personal_dialog")

Spider for Dialogs on Weibo

The code in this project was used to construct the PersonalDialog dataset introduced in the paper Personalized Dialogue Generation with Diversified Traits.

The codebase was forked from another repository on 2018-01-08 and modified from there. This repository was last modified on 2018-04-21.

Commits made to the original repository after January 2018 were NOT merged into this repository. However, you can still refer to the wiki of the original repository to set up your spider.

Major Improvements

  • Add code to crawl dialogs on Weibo.
  • Add support for using a proxy pool.
  • Fix some problems in the crawling process, such as the handling of emoticons and emoji.
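
To make the emoticon and emoji point concrete, here is a minimal sketch of one way crawled text could be normalized. It is not the code used in this repository: the function name split_emoticons, the shortcode length limit, and the Unicode ranges are all assumptions chosen for illustration. It relies on Weibo emoticons typically appearing in text as bracketed shortcodes such as [doge].

```python
import re

# Weibo emoticons usually show up in text as bracketed shortcodes, e.g. [doge].
# The 1-8 character limit is an assumption to avoid matching long bracketed text.
EMOTICON_RE = re.compile(r"\[[^\[\]\s]{1,8}\]")

# A deliberately coarse set of Unicode ranges that covers most emoji (assumption;
# it is not an exhaustive definition of what counts as an emoji).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, smileys, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector-16 and zero-width joiner
    "]+"
)


def split_emoticons(text):
    """Return (cleaned_text, emoticon_shortcodes) for one crawled utterance."""
    # Collect the shortcodes first so they can be stored separately if needed.
    emoticons = EMOTICON_RE.findall(text)
    # Drop shortcodes and emoji, then collapse the whitespace left behind.
    cleaned = EMOJI_RE.sub("", EMOTICON_RE.sub("", text))
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, emoticons
```

For example, split_emoticons("今天天气真好[哈哈] 😀 出去玩") returns ("今天天气真好 出去玩", ["[哈哈]"]). Whether emoticons should be dropped, kept, or mapped to tokens depends on how the dialogs will be used downstream; this sketch simply separates them from the text.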

Developers

Contact

  • Please contact zhengyinhe1 at 163 dot com for further assistance.

License: MIT License


Languages: Python 98.8%, Dockerfile 0.9%, Shell 0.4%