Unshared Task for the 3rd Workshop on Argument Mining, ACL 2016, Berlin

This site contains supplementary data for the Unshared Task.
See the corresponding call for papers and visit the official workshop website.

Contact person: Ivan Habernal, habernal@ukp.informatik.tu-darmstadt.de

http://www.ukp.tu-darmstadt.de/

Content

We provide four variants of the task across various registers. Each variant is split into three parts:

Development set: We encourage you to perform exploratory analysis, task definition, annotation experiments, etc. on this set.
Test set: This small set might serve as a benchmark for testing your annotation model (or even a computational model, if you go that far) and for reporting agreement measures (if applicable).
Crowdsourcing set: A slightly larger set, intended for any crowdsourcing experiments you plan.

Note that these splits are only a recommendation, not an obligation; you are free to use the entire dataset if your task requires it.

Variants

Variant A: Debate portals

Samples from the two-sided debate portal createdebate.com
- Posts are identified by #id and are delimited by two empty lines
- There are four types of posts: normal post, dispute of previous post, support of previous post, and clarification of previous post
- An empty line is a paragraph break
8 devel files
2 test files
18 crowdsourcing files
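
The layout described above (posts separated by two empty lines, paragraphs separated by one empty line) can be read with a few lines of Python. This is only a minimal sketch: the function name split_posts and the sample text are made up for illustration, and the exact shape of the #id line is an assumption, so check it against the real files before relying on it.

```python
import re


def split_posts(text):
    """Split the text of a Variant A file into posts.

    Returns a list of posts; each post is a list of paragraphs.
    Assumes two empty lines between posts and one empty line
    between paragraphs, as described in this README.
    """
    text = text.replace("\r\n", "\n").strip("\n")
    posts = []
    # Two empty lines correspond to three (or more) consecutive newlines.
    for raw_post in re.split(r"\n{3,}", text):
        # A single empty line (two newlines) is a paragraph break.
        paragraphs = [p.strip() for p in re.split(r"\n{2}", raw_post) if p.strip()]
        if paragraphs:
            posts.append(paragraphs)
    return posts


# Hypothetical sample; the real "#id" lines may look different.
sample = "#1 first post\nline two\n\nsecond paragraph\n\n\n#2 reply\nbody\n"
posts = split_posts(sample)
```

Each post keeps its #id line as part of the first paragraph, so the post type can still be inspected afterwards.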

Variant B: Debate transcript

Two speeches (opening and closing) for both the proponent and the opponent from Intelligence Squared debates
- The full debates have more participants, but a complete transcript would be too long for the purposes of the unshared task
3 devel files
2 test files
5 crowdsourcing files

Variant C: Opinionated newswire article

Editorial articles from Room for Debate from The New York Times
- Each article has a debate title and debate description
8 devel files
2 test files
12 crowdsourcing files

Variant D: Discussion under opinionated articles

Discussions from Room for Debate
- The IDs of the debates correspond to IDs of the articles from Variant C (for instance, Dd001.txt has a corresponding article Cd001.txt)
- Each discussion starts with a debate title, debate description, and title of the corresponding article
8 devel files
2 test files
12 crowdsourcing files
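
Because Variant D IDs mirror Variant C IDs, the article for a given discussion can be looked up by swapping the leading letter. The helper below is a hypothetical sketch (corresponding_article is not part of this repository); it assumes the correspondence holds for every subset, which the README states only through the Dd001.txt / Cd001.txt example.

```python
def corresponding_article(discussion_name):
    """Map a Variant D file name to its Variant C article file name."""
    if not discussion_name.startswith("D"):
        raise ValueError(f"not a Variant D file name: {discussion_name!r}")
    # Only the variant letter changes; subset letter and number stay the same.
    return "C" + discussion_name[1:]


article = corresponding_article("Dd001.txt")
```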

Data

Data are stored in plain text format (UTF-8 encoding). Each file name consists of the variant letter, the subset letter, and the file's number within the subset. For example, Bd002.txt is file number 2 from the development set (d) of Variant B.
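
The naming scheme can be parsed mechanically. The following is a sketch under stated assumptions: FileInfo, NAME_RE, and parse_name are illustrative names, and the pattern accepts any lowercase subset letter because only d (development) is spelled out in this README.

```python
import re
from typing import NamedTuple


class FileInfo(NamedTuple):
    variant: str  # "A" to "D"
    subset: str   # subset letter, e.g. "d" for the development set
    number: int   # number within the subset


# Assumed pattern: variant letter, one lowercase subset letter, digits, ".txt".
NAME_RE = re.compile(r"^([A-D])([a-z])(\d+)\.txt$")


def parse_name(filename):
    """Split a data file name such as Bd002.txt into its three parts."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return FileInfo(match.group(1), match.group(2), int(match.group(3)))


info = parse_name("Bd002.txt")
```

Here info.variant is "B", info.subset is "d", and info.number is 2, matching the example in the text.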

License

See LICENSE.txt or README.txt in the particular folders.

Reproducibility

data/links.txt contains the full list of URLs and their corresponding files. We used selenium-firefox-driver and JSoup to scrape the content of Room for Debate; see de.tudarmstadt.ukp.argumentation.data.roomfordebate.DataFetcher for details. Each file was also reformatted so that lines are at most 80 characters long; see src/bash/runFoldOnAllFiles.sh. Variants C and D were created semi-automatically, Variants A and B manually.
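
To confirm that a file respects the 80-character line limit, a small check like the one below may help. It is not the repository's own script (that is src/bash/runFoldOnAllFiles.sh); overlong_lines is a hypothetical helper that only reports violations and does not re-fold anything.

```python
def overlong_lines(text, limit=80):
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


# Made-up sample: the second line is one character too long.
sample = "short line\n" + "x" * 81 + "\n" + "y" * 80
violations = overlong_lines(sample)
```

An empty result means every line fits within the limit.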
