JILU1111 / argmin2016-unshared-task

Supplementary data for the Unshared Task at the 3rd Argument Mining workshop, ACL 2016

Unshared Task for the 3rd Workshop on Argument Mining, ACL 2016, Berlin

Contact person: Ivan Habernal, habernal@ukp.informatik.tu-darmstadt.de

http://www.ukp.tu-darmstadt.de/

http://www.tu-darmstadt.de/

Content

We provide four variants of the task across various registers. Each variant is split into three parts:

  • Development set: We encourage you to perform exploratory analysis, task definition, annotation experiments, etc. on this set.
  • Test set: This small set might serve as a benchmark for testing your annotation model (or even a computational model, if you go that far) and for reporting agreement measures (if applicable).
  • Crowdsourcing set: A somewhat larger set, intended for any crowdsourcing experiments you may plan.

Note that these splits are only a recommendation, not an obligation; you are free to use the entire dataset if your task requires it.

Variants

Variant A: Debate portals

  • Samples from the two-sided debate portal createdebate.com
    • Posts are identified by #id and are delimited by two empty lines
    • There are four types of posts: normal post, dispute of the previous post, support of the previous post, and clarification of the previous post
    • Within a post, an empty line is a paragraph break (see the parsing sketch below)
  • 8 devel files
  • 2 test files
  • 18 crowdsourcing files
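
A minimal sketch of how a Variant A file could be split into posts and paragraphs is given below. It is not code from this repository: the file path is hypothetical, and the assumption that the #id marker sits on the first line of each post follows only from the description above.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class VariantAPostSplitter {

        public static void main(String[] args) throws IOException {
            // Hypothetical path; point this at any Variant A file
            String content = new String(
                    Files.readAllBytes(Paths.get("data/Ad001.txt")),
                    StandardCharsets.UTF_8).replace("\r\n", "\n");

            // Posts are delimited by two empty lines, i.e. three consecutive newlines
            String[] posts = content.trim().split("\n{3,}");

            for (String post : posts) {
                // Assumption: the first line of each post carries the "#id" marker
                String header = post.split("\n", 2)[0];
                // Within a post, a single empty line is a paragraph break
                int paragraphs = post.split("\n{2}").length;
                System.out.println(header + " (" + paragraphs + " paragraph(s))");
            }
        }
    }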

Variant B: Debate transcript

  • Two speeches (opening and closing) by both the proponent and the opponent from Intelligence Squared debates
    • The full debates involve more participants, but complete transcripts would be too long for the purposes of the unshared task
  • 3 devel files
  • 2 test files
  • 5 crowdsourcing files

Variant C: Opinionated newswire article

  • Editorial articles from Room for Debate at The New York Times
    • Each article has a debate title and debate description
  • 8 devel files
  • 2 test files
  • 12 crowdsourcing files

Variant D: Discussion under opinionated articles

  • Discussions from Room for Debate
    • The IDs of the debates correspond to IDs of the articles from Variant C (for instance, Dd001.txt has a corresponding article Cd001.txt)
    • Each discussion starts with a debate title, debate description, and title of the corresponding article
  • 8 devel files
  • 2 test files
  • 12 crowdsourcing files

Data

Data are stored in plain text (UTF-8 encoding). The name of each file consists of the variant letter, the sub-set letter, and a number within the sub-set; for example, Bd002.txt is a Variant B file from the development set (d), number 2.
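
As an illustration, such a file name can be decomposed with a small regular expression. Note that only d (development) is named explicitly above; the letters used for the test and crowdsourcing sub-sets are left generic here rather than assumed.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class FileNameParser {

        // Variant letter (A-D), sub-set letter, running number, ".txt"
        private static final Pattern NAME = Pattern.compile("([A-D])([a-z])(\\d+)\\.txt");

        public static void main(String[] args) {
            Matcher m = NAME.matcher("Bd002.txt");
            if (m.matches()) {
                System.out.println("Variant: " + m.group(1));                   // B
                System.out.println("Sub-set: " + m.group(2));                   // d = development
                System.out.println("Number:  " + Integer.parseInt(m.group(3))); // 2
            }
        }
    }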

License

See LICENSE.txt or README.txt in the particular folders.

Reproducibility

data/links.txt contains the full list of URLs and their corresponding files. We used selenium-firefox-driver and JSoup for scraping the content of Room for Debate; see de.tudarmstadt.ukp.argumentation.data.roomfordebate.DataFetcher for details. Each file was also re-formatted to fit into lines of at most 80 characters; see src/bash/runFoldOnAllFiles.sh. Variants C and D were created semi-automatically, Variants A and B manually.
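
To give a rough idea of the JSoup part of the pipeline, the sketch below fetches a single page and prints its text. The URL, user agent, and extraction steps are illustrative assumptions only; the actual scraping logic, including the Selenium-driven part, lives in de.tudarmstadt.ukp.argumentation.data.roomfordebate.DataFetcher.

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class PageTextFetcher {

        public static void main(String[] args) throws IOException {
            // Any URL from data/links.txt could be substituted here (hypothetical example)
            String url = "http://www.nytimes.com/roomfordebate";

            // Static pages can be fetched with JSoup directly; the original pipeline
            // additionally used selenium-firefox-driver for dynamically loaded content
            Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .get();

            // Illustrative extraction only: print the page title and the visible text
            System.out.println(doc.title());
            System.out.println(doc.body().text());
        }
    }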


Languages

Java 99.2%, Shell 0.8%