Shared Task on Short Text Conversation

This repository provides the datasets for the NTCIR STC (Short Text Conversation) Japanese subtask. For details of the shared task, refer to the STC homepage.

Provided datasets

taskdata: Data from which tweets are to be retrieved. The tweets were randomly sampled from all tweets posted in 2014 (1 million tweets; 0.5 million tweet pairs).
devset: Development data. For each input tweet, a baseline system retrieved tweets from taskdata. Each tweet pair is labeled by 10 annotators.

Tweets are provided as tweet IDs (id_str). Download the original texts using the Twitter API, or purchase the data by contacting the following address:

Email: info<at>nazuki-oto.com
Dataset name: NTCIR STC Japanese subtask data set
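
For the Twitter API route, the IDs have to be turned back into texts in batches. The following is a minimal sketch, not part of the official task tooling. It assumes the IDs have already been read into a list of strings, and `fetch_batch` is a hypothetical callable you supply that wraps whichever Twitter API client you use and returns a mapping from id_str to text. The default batch size of 100 is an assumption as well; adjust it to the limit of the endpoint you call. IDs that the API no longer returns (for example, deleted tweets) are collected separately.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def batched(ids: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` tweet IDs."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def hydrate(
    ids: Sequence[str],
    fetch_batch: Callable[[List[str]], Dict[str, str]],
    batch_size: int = 100,
) -> Tuple[Dict[str, str], List[str]]:
    """Look up tweet texts batch by batch.

    `fetch_batch` is a hypothetical, user-supplied function: it takes a
    list of id_str values and returns {id_str: text} for the tweets that
    are still available. IDs are kept as strings, matching id_str.
    """
    texts: Dict[str, str] = {}
    for batch in batched(ids, batch_size):
        texts.update(fetch_batch(batch))
    # Anything not returned is treated as unavailable (e.g. deleted).
    missing = [tweet_id for tweet_id in ids if tweet_id not in texts]
    return texts, missing
```

Keeping the API call behind `fetch_batch` lets the batching and bookkeeping be tested without network access, and the `missing` list can be compared against the list of deleted tweets mentioned in the change log.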

Schedule

Release of the Twitter data and start of registration: Nov 01, 2015
Release of the development data: ~~Nov 20, 2015~~ Nov 24, 2015
Registration deadline for the STC Japanese task: Jan 15, 2016
Release of the test data: Feb 15, 2016
Formal run deadline: Feb 22, 2016
Distribution of evaluation results: Mar 10, 2016
Paper draft deadline: Mar 20, 2016
(brief review of the draft papers)
Camera ready deadline: May 1, 2016
NTCIR-12 conference: Jun 7-10, 2016

Change log

Nov. 1, 2015: Twitter data (task data) released
Nov. 24, 2015:
- Task data fixed (some pairs were randomly removed so that the task data contains exactly one million tweets)
- Development data released
Feb. 8, 2016:
- The list of deleted tweets released
- The rules for formal run submission released
Feb. 15, 2016: Test data released
Jun. 27, 2016:
- Test data and annotated labels released
