amore-upf / ted-q

Repository for the TED-Q dataset: TED Talks and the Questions they Evoke.

TED-Q: TED Talks and the Questions they Evoke

While reading a TED-talk transcript, what are people wondering about?

This repository contains the dataset for our paper TED-Q: TED Talks and the Questions they Evoke.

!! NEW !!

More convenient representations of the same dataset, especially useful for investigating the alignment of questions and discourse relations, are available as TED-QDB.

Contents of this repository

See further below for an explanation of the structure of these .csv files.

  • TED-Q_elicitation.csv: Data from our elicitation phase: evoked questions and their (non-)answers.
  • TED-Q_comparison_raw.csv: Data from our comparison phase: how related the evoked questions are to each other -- individual annotators' judgments per question pair.
  • TED-Q_comparison_aggregated.csv: Data from our comparison phase: annotators' judgments aggregated per question pair (mean).
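
For convenience, a minimal loading sketch in Python (pandas), assuming the three .csv files are read from the root of this repository:

import pandas as pd

# Load the three TED-Q files (paths assume this repository's root as the working directory).
elicitation = pd.read_csv("TED-Q_elicitation.csv")
comparison_raw = pd.read_csv("TED-Q_comparison_raw.csv")
comparison_agg = pd.read_csv("TED-Q_comparison_aggregated.csv")

# Quick sanity check: how many annotations of each type were elicited?
print(elicitation["type"].value_counts())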

Download the source texts

TED-Q provides an additional layer of annotations to the existing TED-MDB dataset. The source texts are not included in the current repository; download them here:

https://github.com/MurathanKurfali/Ted-MDB-Annotations

Or here (forked):

https://github.com/amore-upf/Ted-MDB-Annotations


Attribution

If you use this resource, please cite our LREC paper:

@inproceedings{westera2019lrec,
  title     = {{TED-Q}: {TED} Talks and the Questions they Evoke},
  author    = {Westera, Matthijs and Mayol, Laia and Rohde, Hannah},
  booktitle = {Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020)},
  month     = {May},
  date      = {13-15},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association (ELRA)},
}

Please also consider citing the authors of the TED-MDB dataset, whose source texts we used:

@article{zeyrek2019ted,
  title={TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style},
  author={Zeyrek, Deniz and Mendes, Amalia and Grishina, Yulia and Kurfali, Murathan and Gibbon, Samuel and Ogrodniczuk, Maciej},
  journal={Language Resources and Evaluation},
  pages={1--38},
  year={2019},
  publisher={Springer}
}

@inproceedings{zeyrek2018multilingual,
  title={Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank.},
  author={Zeyrek, Deniz and Mendes, Amalia and Kurfali, Murathan},
  booktitle={LREC},
  year={2018}
}

Structure of the .csv files

TED-Q_elicitation.csv:

  • excerpt_number: the number of excerpts (up to 6) this annotator has seen, including the current one.

  • chunk_number: the number of chunks (up to 8) this annotator has seen within this excerpt, including the current one.

  • worker: a made-up name uniquely identifying the annotator.

  • type: the type of annotation: one of 'question', 'answer' ('answered' score >= 3), 'non-answer' ('answered' score <= 2), or 'evaluation' (meta-questions asked at the end of each fragment).

  • content (for 'question'/'answer' type annotations only): the question/answer as formulated by the annotator in their own words.

  • answered: for 'answer'/'non-answer' type annotations, the degree to which the current chunk answered the given question ('prior_question'); for 'question' type annotations, the maximal degree to which the question was answered.

  • highlight (for 'question'/'answer' type annotations only): the words selected by the annotator as either triggering the question or providing the answer.

  • prior_question (for 'answer'/'non-answer' type annotations only): annotation id of the question to which the current chunk provides a (non-)answer.

  • best_answer (for 'question' type annotations only): annotation id of its best answer.

  • coherence/naturalness/comment (for 'evaluation' type annotations only): after every fragment (around 8 chunks per fragment) we asked participants whether the text was coherent and natural (scales from 1 to 5), and provided an open text field for comments.

  • relatedness (for 'question' type annotations only): how related a question is, on average, to other questions elicited by the same chunk (according to aggregated verification data).

  • source: identifier of the source text, assuming the directory structure in the Ted-MDB-Annotations github repo (see URL above).

  • chunk_start/chunk_end: the start/end position (by number of characters) in the source text of the chunk (two sentences) presented to the annotator when eliciting the annotation.

  • highlight_start/highlight_end (for 'question'/'answer' type annotations only): the start/end position (by number of characters) in the source text of the phrase highlighted by the annotator (depending on 'type': the trigger of the question, or the part providing the answer).
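
As a rough illustration of how these character offsets relate to the source texts, here is a sketch; the path handling is an assumption (it treats 'source' as a path relative to a local clone of Ted-MDB-Annotations), so adapt it to the actual layout of your checkout:

import os

import pandas as pd

# Assumption: path to a local clone of the Ted-MDB-Annotations repository.
TED_MDB_ROOT = "Ted-MDB-Annotations"

elicitation = pd.read_csv("TED-Q_elicitation.csv")

def load_source_text(source_id: str) -> str:
    # Assumption: 'source' is a file path relative to the Ted-MDB-Annotations
    # directory structure; adjust if your layout differs.
    with open(os.path.join(TED_MDB_ROOT, source_id), encoding="utf-8") as f:
        return f.read()

# Recover the chunk and the highlighted trigger for one 'question' annotation.
row = elicitation[elicitation["type"] == "question"].iloc[0]
text = load_source_text(row["source"])
print("Chunk:    ", text[int(row["chunk_start"]):int(row["chunk_end"])])
print("Question: ", row["content"])
print("Trigger:  ", text[int(row["highlight_start"]):int(row["highlight_end"])])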

TED-Q_comparison_raw.csv

We asked annotators to judge how related two questions were given the context that evoked them.

  • workerid: anonymized identifier of the annotator

  • snippet: the text presented to annotators for judging question relatedness in context; roughly two sentences from the source text, including the chunk that evoked the questions being judged.

  • target_question: the target question

  • comparison_question: the question that annotators were asked to compare to the target question

  • relatedness: how related they judged the two questions to be, from 0 (not closely related) to 3 (equivalent)

  • target_question_id: annotation id of the target question (for linking to elicitation data)

  • comparison_question_id: annotation id of the comparison question (for linking to elicitation data)
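
To link these judgments back to the elicitation data, the *_question_id columns can be joined against the elicitation annotation ids. A sketch, under the assumption that the annotation id is the elicitation file's first (unnamed) column; adjust index_col if your copy names it differently:

import pandas as pd

# Assumption: the elicitation annotation id (the one that 'prior_question',
# 'best_answer' and the *_question_id columns refer to) is the first, unnamed column.
elicitation = pd.read_csv("TED-Q_elicitation.csv", index_col=0)
comparison_raw = pd.read_csv("TED-Q_comparison_raw.csv")

# Attach each target question's elicitation record to its raw relatedness judgments.
merged = comparison_raw.join(elicitation.add_prefix("target_"), on="target_question_id")
print(merged[["target_question", "comparison_question", "relatedness", "target_worker"]].head())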

TED-Q_comparison_aggregated.csv

We aggregated relatedness judgments by taking the mean, conflating target/comparison pairs in either order (making for ~6 judgments per pair):

  • question1_id / question2_id: annotation id of the questions (for linking to elicitation data).

  • snippet: the snippet of text against which question relatedness was judged (as above).

  • question1 / question2: the questions in plain text.

  • relatedness_mean: mean of the individual judgments for this pair.

  • relatedness_count: how many individual judgments for this pair.

  • relatedness_list: list containing the individual judgments for this pair.

  • relatedness_std: standard deviation among the individual judgments for this pair.
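
The aggregation can be approximately reproduced from the raw file. The sketch below mirrors the description above (pairs are made order-insensitive, then mean, count, list and standard deviation are computed per pair); details such as the exact standard-deviation convention may differ from the released file:

import pandas as pd

raw = pd.read_csv("TED-Q_comparison_raw.csv")

# Conflate target/comparison pairs in either order by sorting the two ids within each row.
pairs = [tuple(sorted(p)) for p in zip(raw["target_question_id"], raw["comparison_question_id"])]
raw["question1_id"] = [p[0] for p in pairs]
raw["question2_id"] = [p[1] for p in pairs]

# Aggregate the individual relatedness judgments per unordered question pair.
aggregated = (
    raw.groupby(["question1_id", "question2_id"])["relatedness"]
       .agg(relatedness_mean="mean",
            relatedness_count="count",
            relatedness_list=list,
            relatedness_std="std")
       .reset_index()
)
print(aggregated.head())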
