NYTK-NerKor

The home repository of the NYTK-NerKor corpus, a Hungarian gold standard named entity annotated corpus containing 1 million tokens.

License and usage

The corpus creation was funded by the Research Centre for Linguistics (Nyelvtudományi Kutatóközpont, NYTK). The project leaders were Eszter Simon and Noémi Vadász.

The corpus is available under the CC-BY-SA 4.0 license. If you use this corpus, please cite this GitHub repository with its URL (we do not have a published paper yet).

Data

Corpus files are under the 'data' folder, which has two subfolders. The 'genres' subfolder contains the data files grouped by genre (fiction, legal, news, web, wikipedia). The 'train-devel-test' subfolder contains symlinks to the original data files.

A subcorpus of about 200,000 tokens (224,192 according to the token table below) carries gold standard morphological annotation in addition to the NE labels.

The proportion of train, devel and test sets is around 80%-10%-10%. All sets provide a balanced selection from all genres and sources. The morphologically annotated subcorpus is also represented in all sets in a balanced way. For exact numbers, see the train-devel-test table below.

The fiction subcorpus contains i) novels from MEK (Hungarian Electronic Library) and Project Gutenberg; and ii) subtitles from OpenSubtitles.

The legal texts come from EU sources: they are a selection from the EU Constitution, documents from the European Economic and Social Committee, DGT-Acquis and JRC-Acquis.

The sources of the news subcorpus are the Press Release Database of the European Commission, Global Voices and the NewsCrawl Corpus.

Web texts contain a selection from the Hungarian Webcorpus 2.0.

Wikipedia texts are from the Hungarian Wikipedia. :)

Token numbers

genre	morph/no-morph	file	sentence	token
fiction	morph	0	0	0
	no-morph	122	24535	203216
	sum	122	24535	203216
legal	morph	0	0	0
	no-morph	39	7632	202195
	sum	39	7632	202195
news	morph	35	477	9178
	no-morph	47	9280	204478
	sum	82	9757	213656
web	morph	398	10886	188250
	no-morph	0	0	0
	sum	398	10886	188250
wikipedia	morph	85	1618	26764
	no-morph	72	13096	194033
	sum	157	14714	220797
altogether	morph	518	12981	224192
	no-morph	280	54543	803922
	sum	798	67524	1028114

NE labels and density

genre	morph/no-morph	PER	LOC	ORG	MISC	NE total	NE density (NE per token)
fiction	morph	0	0	0	0	0
	no-morph	5224	1042	217	287	6770	0.03331430596
	sum	5224	1042	217	287	6770	0.03331430596
legal	morph	0	0	0	0	0
	no-morph	255	1302	6840	1871	10268	0.0507826603
	sum	255	1302	6840	1871	10268	0.0507826603
news	morph	220	168	183	63	634	0.06907823055
	no-morph	4368	2161	5111	3636	15276	0.07470730348
	sum	4588	2329	5294	3699	15910	0.07446549594
web	morph	2826	1343	1788	2434	8391	0.04457370518
	no-morph	0	0	0	0	0
	sum	2826	1343	1788	2434	8391	0.04457370518
wikipedia	morph	571	400	203	324	1498	0.05597070692
	no-morph	8321	8714	5159	3929	26123	0.1346317379
	sum	8892	9114	5362	4253	27621	0.1250968084
altogether	morph	3617	1911	2174	2821	10523	0.04693744647
	no-morph	18168	13219	17327	9723	58437	0.07268988782
	sum	21785	15130	19501	12544	68960	0.06707427386
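
The density figures are consistent with dividing the NE count of a row by the token count of the same row in the token table. The following is a minimal sketch of that check, using only the 'sum' rows of the two tables above. The names TOKENS, NES and ne_density are illustrative and not part of the repository, and returning 0.0 for an empty subcorpus is an assumed convention.

```python
# Token and NE counts copied from the "sum" rows of the tables above.
TOKENS = {"fiction": 203216, "legal": 202195, "news": 213656,
          "web": 188250, "wikipedia": 220797}
NES = {"fiction": 6770, "legal": 10268, "news": 15910,
       "web": 8391, "wikipedia": 27621}


def ne_density(ne_count, token_count):
    """NE annotations per token; 0.0 for an empty subcorpus (assumed convention)."""
    return ne_count / token_count if token_count else 0.0


# Per-genre densities and the corpus-wide density.
densities = {genre: ne_density(NES[genre], TOKENS[genre]) for genre in TOKENS}
overall = ne_density(sum(NES.values()), sum(TOKENS.values()))

for genre, density in densities.items():
    print(f"{genre}\t{density:.4f}")
print(f"altogether\t{overall:.4f}")
```

By this measure the wikipedia subcorpus is the densest and fiction the sparsest.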

Train-devel-test sets

genre	morph/no-morph	train	devel	test
fiction	morph	0	0	0
	no-morph	161505	20884	20827
	sum	161505	20884	20827
legal	morph	0	0	0
	no-morph	157710	22552	21933
	sum	157710	22552	21933
news	morph	7314	935	929
	no-morph	163780	19848	20850
	sum	171094	20783	21779
web	morph	150762	18724	18764
	no-morph	0	0	0
	sum	150762	18724	18764
wikipedia	morph	21331	2679	2754
	no-morph	154574	20074	19385
	sum	175905	22753	22139
altogether	morph	179407	22338	22447
	no-morph	637569	83358	82995
	sum	816976	105696	105442
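
The stated split of roughly 80%-10%-10% can be checked against the 'altogether / sum' row of the table above. This is only a small arithmetic sketch; SPLIT_TOKENS and shares are illustrative names.

```python
# Token counts copied from the "altogether / sum" row of the table above.
SPLIT_TOKENS = {"train": 816976, "devel": 105696, "test": 105442}

total = sum(SPLIT_TOKENS.values())
# Share of each set in the whole corpus, as a fraction between 0 and 1.
shares = {name: count / total for name, count in SPLIT_TOKENS.items()}

for name, share in shares.items():
    print(f"{name}\t{share:.1%}")
```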

Data format

The data files are in the CoNLL-U Plus format and use the standard .conllup file extension. The first line of each file is: # global.columns = FORM LEMMA UPOS XPOS FEATS CONLL:NER, where:

FORM: the token itself;

LEMMA: the lemma of the token;

UPOS: UD POS tags;

XPOS: full morphological annotation (POS + morphosyntactic features) provided by emMorph;

FEATS: UD morphosyntactic features;

CONLL:NER: NE annotation.
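
Reading such a file could look like the sketch below. It assumes the usual CoNLL-U conventions: tab-separated token lines, blank lines between sentences, and comment lines starting with #. The function name read_conllup and the SAMPLE text are made up for illustration; the sample is not taken from the corpus, and its XPOS and FEATS cells are left as _ placeholders.

```python
def read_conllup(lines):
    """Yield sentences as lists of {column name: value} dicts."""
    columns = None
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Column names are space-separated after the equals sign.
            columns = line.split("=", 1)[1].split()
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#") and "\t" not in line:
            # Other comment lines; a "#" token line would contain tabs.
            continue
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Invented sample, not corpus data.
SAMPLE = (
    "# global.columns = FORM LEMMA UPOS XPOS FEATS CONLL:NER\n"
    "Budapest\tBudapest\tPROPN\t_\t_\tB-LOC\n"
    "szép\tszép\tADJ\t_\t_\tO\n"
    ".\t.\tPUNCT\t_\t_\tO\n"
    "\n"
    "Kovács\tKovács\tPROPN\t_\t_\tB-PER\n"
    "Anna\tAnna\tPROPN\t_\t_\tI-PER\n"
    "alszik\talszik\tVERB\t_\t_\tO\n"
)

sentences = list(read_conllup(SAMPLE.splitlines()))
for sentence in sentences:
    print([(token["FORM"], token["CONLL:NER"]) for token in sentence])
```

For a real file, an open file handle (opened with encoding="utf-8") can be passed instead of SAMPLE.splitlines().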

The NE annotation follows the CoNLL-2002 labelling standard. The four NE categories are: PER, LOC, MISC, ORG. The tags are in the IOB2 format: a B- prefix marks the first token of an NE phrase and an I- prefix marks any non-initial token. Tokens that are not part of a name are marked with an O label.
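
To illustrate the IOB2 scheme, the sketch below groups a sequence of tags into entity spans. The function name iob2_spans is hypothetical. Treating an I- tag with no matching B- as the start of a new span is an assumed, lenient policy and is not something the corpus documentation prescribes.

```python
def iob2_spans(tags):
    """Return (label, start, end) spans from IOB2 tags; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        # Close the open span unless this tag continues it.
        if label is not None and not (prefix == "I" and kind == label):
            spans.append((label, start, i))
            label, start = None, None
        if prefix == "B":
            label, start = kind, i
        elif prefix == "I" and label is None:
            # Stray I- tag: start a new span (assumed lenient policy).
            label, start = kind, i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tags = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG"]
print(iob2_spans(tags))
```

Here two adjacent B- tags yield two separate entities, which is the case the B- prefix exists to distinguish.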

Guidelines

The annotation guidelines, the WebAnno guidelines and the annotation scheme are available in the Guidelines folder (in Hungarian only).
