Yonsei-ScriptCorpus

Dialogue events collection derived from scripts of movies and dramas; for training and testing dialogue / chatbot systems

Basic statistics of dataset (For now. It is increasing now.) 9,423 dialogue events 86,295 turns of speaking (a couple of utterances per turn) / 18,848 speakers
Description of dataset Conversation text data are in the two dataset folers (dataset_type1, 2), and their meta information are in the spreadsheet files (meta_character and meta_dialogueEvent). Both the character names and dialogue scenes are anonymised, so if necessary you need to link the texts and their meta information.

Dataset_type1: inclusive of context texts (지문) Dataset_type2: exclusive of context texts

All conversations are between two persons.
Sometimes one of them says nothing but shows non-linguistic or paralinguistic expression. and tags represent this type of expression. You can remove or keep them as you want.

Column names of text data files

Column names of meta information file (meta_character)

No: serial number
serialID: individual scene indicator (match with the text data)
speaker: person who said ("context" means context text) (match with the text data)
gender: the gender of the speaker
age: the age group of the speaker
occupation: the job of the speaker

Column names of meta information file (meta_dialogeEvent)

No: serial number
serialID: individual scene indicator (match with the text data)
scene_description: the (specific) places where the conversaion takes place
place: the place where the conversaion takes place (categorised)
purpose: for what the speakers converse (categorised). The mojority of the conversations has no specific purpose.
othe_purpose: for what the speakers converse

About

Dialogue events collection derived from scripts of movies and dramas; for training and testing dialogue / chatbot systems

Creative Commons Attribution Share Alike 4.0 International