Unshared Task for the 3rd Workshop on Argument Mining, ACL 2016, Berlin
- This site contains supplementary data for the Unshared Task
- See the corresponding call for papers and visit the official workshop website.
Contact person: Ivan Habernal, habernal@ukp.informatik.tu-darmstadt.de
http://www.ukp.tu-darmstadt.de/
Content
We provide four variants of the task across various registers. Each variant is split into three parts:
- Development set: We encourage you to perform exploratory analysis, task definition, annotation experiments, etc. on this set.
- Test set: This small set might serve as a benchmark for testing your annotation model (or even a computer model, if you go that far) and reporting agreement measures (if applicable).
- Crowdsourcing set: A somewhat larger set, in case you plan any crowdsourcing experiments.
Note that these splits are only a recommendation and are not obligatory; you are free to use the entire dataset if your task requires it.
Variants
Variant A: Debate portals
- Samples from the two-sided debate portal createdebate.com
- Posts are identified by #id and are delimited by two empty lines
- There are four types of posts: normal post, dispute of previous post, support of previous post, and clarification of previous post
- An empty line is a paragraph break
- 8 devel files
- 2 test files
- 18 crowdsourcing files
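The delimiter rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name split_posts is hypothetical, the sample text is invented, and the exact layout of the #id line is not specified here, so the sketch leaves it as part of the first paragraph of each post.

```python
import re


def split_posts(text: str) -> list[list[str]]:
    """Split the text of a Variant A file into posts, each a list of paragraphs.

    Assumes posts are separated by two empty lines and paragraphs by one
    empty line, as described above. The #id line is not parsed.
    """
    # Normalise line endings and blank out whitespace-only lines.
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+\n", "\n", text).strip("\n")

    posts = []
    # Two empty lines between posts means three or more consecutive newlines.
    for raw_post in re.split(r"\n{3,}", text):
        # Inside a post, a single empty line (two newlines) breaks paragraphs.
        paragraphs = [p.strip() for p in re.split(r"\n{2}", raw_post) if p.strip()]
        if paragraphs:
            posts.append(paragraphs)
    return posts


# Invented sample; real files may format the id line differently.
sample = "#101\nFirst paragraph.\n\nSecond paragraph.\n\n\n#102\nA reply."
posts = split_posts(sample)
```

Because the files are wrapped to short lines, a paragraph returned by this sketch may still contain single line breaks.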
Variant B: Debate transcript
- Two speeches (opening and closing) for both the proponent and the opponent from Intelligence Squared debates
- The full debates have more participants, but complete transcripts would be too long for the purposes of the unshared task
- 3 devel files
- 2 test files
- 5 crowdsourcing files
Variant C: Opinionated newswire article
- Editorial articles from Room for Debate in the N.Y. Times
- Each article has a debate title and debate description
- 8 devel files
- 2 test files
- 12 crowdsourcing files
Variant D: Discussion under opinionated articles
- Discussions from Room for Debate
- The IDs of the debates correspond to the IDs of the articles from Variant C (for instance, Dd001.txt has a corresponding article Cd001.txt)
- Each discussion starts with a debate title, debate description, and title of the corresponding article
- 8 devel files
- 2 test files
- 12 crowdsourcing files
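The correspondence between Variant D and Variant C file names can be expressed as a small helper. This is a sketch only: the function name article_for_discussion is hypothetical, and it assumes the mapping shown for Dd001.txt and Cd001.txt (same name, with the leading D replaced by C) holds for every file.

```python
def article_for_discussion(discussion_file: str) -> str:
    """Return the Variant C file name matching a Variant D file name.

    Assumption: only the leading variant letter differs between the two.
    """
    if not discussion_file.startswith("D"):
        raise ValueError(f"not a Variant D file name: {discussion_file!r}")
    return "C" + discussion_file[1:]


article = article_for_discussion("Dd001.txt")
```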
Data
Data are stored in plain text format (UTF-8 encoding). The name of each file consists of the variant name, the sub-set, and the number in the subset; for example, Bd002.txt is a B category file from the d (development) set with number 2.
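The naming scheme can be decoded with a short regular expression. The sketch below is illustrative rather than part of the dataset tooling: the names DataFileName and parse_name are hypothetical, and only the letter d (development) is documented here, so the other sub-set letters are assumed to be single lowercase letters and are returned unchanged.

```python
import re
from typing import NamedTuple


class DataFileName(NamedTuple):
    variant: str  # "A" to "D"
    subset: str   # single letter; "d" is the development set
    number: int   # running number within the sub-set


# Assumed pattern: variant letter, sub-set letter, digits, ".txt".
_NAME_PATTERN = re.compile(r"^([A-D])([a-z])(\d+)\.txt$")


def parse_name(file_name: str) -> DataFileName:
    """Split a data file name into variant, sub-set letter, and number."""
    match = _NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    variant, subset, number = match.groups()
    return DataFileName(variant, subset, int(number))


parsed = parse_name("Bd002.txt")
```

Returning the sub-set letter as-is, instead of mapping it to a label, avoids guessing the letters used for the test and crowdsourcing sets.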
License
See LICENSE.txt or README.txt in the particular folders.
Reproducibility
data/links.txt contains the full list of URLs and their corresponding files. We used selenium-firefox-driver and JSoup for scraping the content of Room for Debate. See de.tudarmstadt.ukp.argumentation.data.roomfordebate.DataFetcher for details.
Each file was also wrapped so that lines are at most 80 characters long; see src/bash/runFoldOnAllFiles.sh.
Variants C and D were created semi-automatically; Variants A and B were created manually.