dwadden / dygiepp

Span-based system for named entity, relation, and event extraction.

Questions about dataset preprocessing

wasiahmad opened this issue · comments

In the documentation, there are two dataset preprocessing steps: one for entities and relations, and a second for events. The first uses Stanford CoreNLP, but the second uses Spacy. Can you please explain the difference? I see that the relation labels differ between these preprocessing steps (for example, "ORG-AFF.Membership" vs. "GEN-AFF"), and their offset values differ too. There are other differences as well. It would be helpful if you could provide some details.

Since ACE05 is a benchmark dataset, I assume the token, entity, relation, and event annotations are already there. Then why do you need the CoreNLP or Spacy libraries?

Good questions. As far as I know, the ACE dataset in its original release does not actually split the data into tokens or sentences. It does provide spans for entities, relations, and events, but those annotations are also ambiguous in places, as described in DATA.md. In general, people do their own preprocessing. Part of the motivation for this code release was to offer a standardized way of preprocessing the ACE data for others to use.
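To illustrate why a tokenizer is needed even though the spans are annotated, here is a minimal sketch of aligning a span given as character offsets to token indices. It is not the preprocessing code from this repo: the regex tokenizer is a stand-in for CoreNLP or Spacy, and the function names, example sentence, and offsets are made up for illustration.

```python
import re


def tokenize_with_offsets(text):
    """Stand-in tokenizer (assumption): words and single punctuation marks,
    each returned as (token, start_char, end_char) with an exclusive end."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(tokens, start_char, end_char):
    """Map an end-exclusive character span to inclusive token indices.

    Returns None when the span does not line up with token boundaries.
    """
    start_tok = next(
        (i for i, (_tok, s, _e) in enumerate(tokens) if s == start_char), None)
    end_tok = next(
        (i for i, (_tok, _s, e) in enumerate(tokens) if e == end_char), None)
    if start_tok is None or end_tok is None or end_tok < start_tok:
        return None
    return (start_tok, end_tok)


text = "Barack Obama visited Paris."
tokens = tokenize_with_offsets(text)

# Hypothetical entity annotation covering the characters of "Barack Obama".
token_span = char_span_to_token_span(tokens, 0, 12)

# A span that ends in the middle of a token cannot be aligned.
misaligned = char_span_to_token_span(tokens, 0, 10)
```

Because the token indices depend entirely on where the tokenizer places its boundaries, two pipelines built on different tokenizers can yield different offsets for the same annotation.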

Historically, the community has used different train / dev / test splits when doing ACE relation extraction vs. ACE event extraction. For the details, see section 3 of the dygiepp paper. For relation extraction we use the split from Miwa and Bansal and for event extraction we use the split from Yang and Mitchell.

For the relation split, I use an adapted version of the preprocessing code from the Miwa and Bansal codebase. This code relies on Stanford CoreNLP.

For the event split, I used code from this paper. The code itself is not publicly released, but the author shared code with me and I adapted it and made it public. This code relies on Spacy.

Thank you for clarifying everything. I assume CoreNLP and Spacy are used for tokenization and sentence splitting.

Opening this issue to ask a question regarding event extraction data. As noted in the dataset description here, from ACE05 we have entities, relations, and coreference clusters.

Where, then, is the ground truth for event triggers, event arguments, and their role labels? Can you please explain this?

Added to DATA.md. Thanks for the question.
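For a rough idea of how such annotations can be read, here is a small sketch. It assumes a per-sentence `events` field in which each event is a list that starts with `[trigger_token, event_type]` and is followed by `[start_token, end_token, role]` arguments, with token indices counted from the start of the document. That layout, the example document, and the `iter_events` helper are assumptions for illustration; DATA.md is the authoritative description.

```python
import json

# Hypothetical one-sentence document; the field layout is an assumption.
line = json.dumps({
    "doc_key": "example_doc",
    "sentences": [["The", "troops", "attacked", "the", "city", "."]],
    "events": [[[[2, "Conflict.Attack"],
                 [0, 1, "Attacker"],
                 [3, 4, "Target"]]]],
})


def iter_events(doc):
    """Yield one dict per event with its trigger word and argument strings."""
    sent_start = 0  # document-level index of the current sentence's first token
    for tokens, events in zip(doc["sentences"], doc["events"]):
        for trigger, *arguments in events:
            trig_tok, event_type = trigger
            yield {
                "type": event_type,
                "trigger": tokens[trig_tok - sent_start],
                "arguments": [
                    # End indices are assumed inclusive, hence the +1.
                    (role, " ".join(tokens[s - sent_start:e - sent_start + 1]))
                    for s, e, role in arguments
                ],
            }
        sent_start += len(tokens)


events = list(iter_events(json.loads(line)))
```

With the example above, `events` holds a single `Conflict.Attack` event whose trigger is "attacked" and whose arguments are "The troops" (Attacker) and "the city" (Target).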