ncg-task / training-data

Training data for the NLPContributionGraph Shared Task 11 at SemEval-2021

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

training-data

Training Data for the NLPContributionGraph Shared Task 11 at SemEval-2021

The repository is organized as follows:

README.md                            
[task-name-folder]/                                # natural_language_inference, paraphrase_generation, question_answering, relation_extraction, topic_models
    ├── [article-counter-folder]/                  # ranges between 0 to 100 since we annotated varying numbers of articles per task
    │   ├── [articlename].pdf                      # scholarly article pdf
    │   ├── [articlename]-Grobid-out.txt           # plaintext output from the [Grobid parser](https://github.com/kermitt2/grobid)
    │   ├── [articlename]-Stanza-out.txt           # plaintext preprocessed output from [Stanza](https://github.com/stanfordnlp/stanza)
    │   ├── sentences.txt                          # annotated Contribution sentences in the file
    │   ├── entities.txt                           # annotated entities in the Contribution sentences
    │   └── info-units/                            # the folder containing information units in JSON format
    │   │   └── research-problem.json              # `research problem` mandatory information unit in json format
    │   │   └── model.json                         # `model` information unit in json format; in some articles it is called `approach`
    │   │   └── ...                                # there are 12 information units in all and each article may be annotated by 3 or 6
    │   └── triples/                               # the folder containing information unit triples one per line
    │   │   └── research-problem.txt               # `research problem` triples (one research problem statement per line)
    │   │   └── model.txt                          # `model` triples (one statement per line)
    │   │   └── ...                                # there are 12 information units in all and each article may be annotated by 3 or 6
    │   └── ...                                    # there are between 1 to 100 articles annotated for each task, so this repeats for the remaining annotated articles
    └── ...                                        # there are 24 tasks selected overall, so this repeats 23 more times

Data Element Counts

Total Papers Annotated: 237

Tasks info-units sentences entities total triples total unique triples subject predicate object
1 natural_language_inference 427 2168 12657 7969 7330 3171 1251 5242
2 negation_scope_resolution 4 28 163 94 94 50 42 80
3 paraphrase_generation 9 44 293 177 175 99 77 160
4 part-of-speech_tagging 36 144 804 501 479 249 156 401
5 passage_re-ranking 8 32 214 126 123 63 66 103
6 phrase_grounding 5 29 172 102 102 58 53 94
7 prosody_prediction 5 31 172 105 103 58 43 97
8 query_wellformedness 5 11 54 35 35 22 25 33
9 question_answering 30 194 1059 665 640 332 203 547
10 question_generation 7 34 133 87 87 45 44 74
11 question_similarity 4 16 82 51 51 30 26 49
12 relation_extraction 69 346 1923 1154 1084 552 372 922
13 sarcasm_detection 10 40 225 138 136 77 73 116
14 semantic_parsing 12 60 275 183 180 91 74 157
15 semantic_role_labeling 22 100 545 338 318 163 137 288
16 sentence_classification 15 85 513 300 297 167 134 273
17 sentence_compression 19 77 426 260 248 138 104 223
18 sentiment_analysis 240 1275 7452 4517 4086 1864 940 2967
19 smile_recognition 3 17 85 54 54 29 34 49
20 temporal_information_extraction 8 26 152 94 93 58 62 85
21 text-to-speech_synthesis 13 69 316 197 192 103 98 174
22 text_generation 24 129 704 431 420 222 165 351
23 text_summarization 70 347 1777 1077 1010 513 346 825
24 topic_models 5 18 75 48 48 30 28 48

Note

For system training, participants are encouraged to merge the 50 files additionally from the trial-data release.

About

Training data for the NLPContributionGraph Shared Task 11 at SemEval-2021