JuqiangJ / CENMS

CENMS-Dataset

CENMS: the first Chinese Emergency News Multi-document Summarization Dataset

Introduction

Multi-document summarization (MDS) is a fundamental natural language processing application. However, MDS suffers from a lack of datasets with credible reference summaries for multiple topic-related documents, and models trained in one domain do not generalize well to others.

The main causes are the high cost of writing reference summaries, which requires analyzing multiple documents with domain knowledge, and the difficulty of collecting related documents with low redundancy. To this end, we propose an automatic method for constructing MDS datasets with domain-aware strategies, and build the CENMS dataset using emergency news as an example.

The result is CENMS, the first Chinese Emergency News Multi-document Summarization dataset.

Properties

CENMS contains more than 20K summary clusters and covers four major categories: natural disasters, accidents, public health, and social security. These four categories are divided into 39 sub-topics, ranging from COVID-19 to earthquakes.

Sub-topic number
Blizzard 30
Other Disasters 7
Earthquake 1560
Drought 20
Flood 673
Fog 149
Forest Fire 16
Ice storm 35
Landslide 549
Mudslide 367
Rainstorm 51
Sand Storm 78
Thunderstroke 120
Tornado 129
Tsunami 15
Typhoon 1764
Air Crash 125
Collapse 121
Explosion 1374
Fire Crash 2768
Gas-leak 6
Nuclear Leak 3
Shipwreck 9
Traffic Accident 274
COVID-19 8209
Dengue 33
Avian Influenza 127
Ebola 107
MERS 55
African Swine fever 224
HIV 46
Food-poisoning 10
Zika Virus 20
Pandemic 864
Arson 34
Other Crimes 156
Terrorist 100
Drugs 859
Fraud 59
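
As a quick sanity check on the table above, the following sketch sums the per-sub-topic numbers and groups them into the four major categories. The counts are copied from the table; the category boundaries are an assumption inferred from the row order, since this README does not state which sub-topic belongs to which category. The names `COUNTS`, `CATEGORY_LAST_ROW`, and `totals_by_category` are illustrative only.

```python
# Per-sub-topic numbers copied from the table above (row order preserved).
COUNTS = {
    "Blizzard": 30, "Other Disasters": 7, "Earthquake": 1560, "Drought": 20,
    "Flood": 673, "Fog": 149, "Forest Fire": 16, "Ice storm": 35,
    "Landslide": 549, "Mudslide": 367, "Rainstorm": 51, "Sand Storm": 78,
    "Thunderstroke": 120, "Tornado": 129, "Tsunami": 15, "Typhoon": 1764,
    "Air Crash": 125, "Collapse": 121, "Explosion": 1374, "Fire Crash": 2768,
    "Gas-leak": 6, "Nuclear Leak": 3, "Shipwreck": 9, "Traffic Accident": 274,
    "COVID-19": 8209, "Dengue": 33, "Avian Influenza": 127, "Ebola": 107,
    "MERS": 55, "African Swine fever": 224, "HIV": 46, "Food-poisoning": 10,
    "Zika Virus": 20, "Pandemic": 864,
    "Arson": 34, "Other Crimes": 156, "Terrorist": 100, "Drugs": 859,
    "Fraud": 59,
}

# ASSUMPTION: each category ends at the named row. This grouping is inferred
# from the order of the table and is not stated in the README.
CATEGORY_LAST_ROW = [
    ("natural disaster", "Typhoon"),
    ("accident", "Traffic Accident"),
    ("public health", "Pandemic"),
    ("social security", "Fraud"),
]


def totals_by_category(counts, boundaries):
    """Sum counts per category, walking the rows in table order."""
    totals = {}
    remaining = iter(boundaries)
    category, last_row = next(remaining)
    for sub_topic, n in counts.items():
        totals[category] = totals.get(category, 0) + n
        if sub_topic == last_row:
            # Move on to the next category, if any rows are left.
            category, last_row = next(remaining, (None, None))
    return totals


if __name__ == "__main__":
    print("sub-topics:", len(COUNTS))
    print("total:", sum(COUNTS.values()))
    for name, total in totals_by_category(COUNTS, CATEGORY_LAST_ROW).items():
        print(f"{name}: {total}")
```

Under this assumed grouping, public health is by far the largest category, driven mostly by the COVID-19 sub-topic.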

Samples

A few samples from the dataset are provided in samples.csv.
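
A minimal sketch for inspecting samples.csv with the Python standard library is shown here. It prints the header row and the first few records without assuming any column names, since the file's schema is not described in this README. The `utf-8-sig` encoding is an assumption; a different encoding may be needed if the file was saved otherwise. The helper name `preview_csv` is hypothetical.

```python
import csv
import os


def preview_csv(path, n=3, encoding="utf-8-sig"):
    """Return (header, first n rows) of a CSV file.

    ASSUMPTION: the file is UTF-8 encoded (with or without a BOM) and its
    first row is a header. Neither is confirmed by the README.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])
        # zip stops after n rows without reading the rest of the file.
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows


# Only runs when samples.csv is present in the working directory.
if os.path.exists("samples.csv"):
    header, rows = preview_csv("samples.csv")
    print(header)
    for row in rows:
        print(row)
```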

Download

The corpus is split into three parts: training, validation, and test sets. If you need CENMS for further research, please send a request by e-mail to adamlau90@hotmail.com.
