ju-bezdek / conll2003-sk-ner

Slovak translation of the original conll2003 dataset

annotations_creators: machine-generated, expert-generated
language_creators: found
languages: sk
licenses: unknown
multilinguality: monolingual
pretty_name: conll-2003-sk-ner
size_categories: 10K<n<100K
source_datasets: extended|conll2003
task_categories: structure-prediction
task_ids: named-entity-recognition, part-of-speech-tagging

Dataset Card for conll2003-sk-ner

Dataset Description

This is a translated version of the original CoNLL-2003 dataset, translated from English to Slovak with Google Translate. Annotation was done mostly automatically with word-matching scripts. Records where some tags could not be matched (10%) were annotated manually. Unlike the original CoNLL-2003 dataset, this one contains only NER tags.

Supported Tasks and Leaderboards

Named entity recognition (NER)

labels:

  • 0: O
  • 1: B-PER
  • 2: I-PER
  • 3: B-ORG
  • 4: I-ORG
  • 5: B-LOC
  • 6: I-LOC
  • 7: B-MISC
  • 8: I-MISC
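
The label table above maps directly onto a pair of lookup dictionaries. The following is a minimal sketch, not code from this repository: the helper names `decode_tags` and `extract_entities` are hypothetical, and the Slovak example sentence is made up for illustration. It shows how integer tag ids can be decoded to BIO labels and grouped into entity spans.

```python
# Label ids exactly as listed in the dataset card.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_tags(tag_ids):
    """Convert integer tag ids to their BIO label strings."""
    return [ID2LABEL[i] for i in tag_ids]


def extract_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_type, text) pairs (hypothetical helper)."""
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, decode_tags(tag_ids)):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            # Close any open span, then open a new one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-"):
            # Continuation of the open span.
            current_tokens.append(token)
        else:
            # "O" tag: close any open span.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Made-up example sentence, not taken from the dataset.
tokens = ["Peter", "Novák", "žije", "v", "Bratislave", "."]
tag_ids = [1, 2, 0, 0, 5, 0]
print(extract_entities(tokens, tag_ids))
```

An `I-` tag that does not continue a span of the same type is treated as the start of a new span here; that is a lenient design choice for noisy, automatically projected tags, not something the card specifies.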

Languages

sk

Dataset Structure

Data Splits

train, test, val
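
A hedged sketch of how the splits might be loaded and inspected with the Hugging Face `datasets` library. The Hub id below is assumed to match the repository name, and the exact split keys (for example `val` versus `validation`) are not confirmed by this card, so the code reports whatever keys it finds rather than hardcoding them. `load_splits` and `split_sizes` are hypothetical helper names.

```python
# Assumed Hub id, taken from the repository name; verify before use.
DATASET_ID = "ju-bezdek/conll2003-sk-ner"


def load_splits(dataset_id=DATASET_ID):
    """Download the dataset; requires the third-party `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the module loads without it

    return load_dataset(dataset_id)


def split_sizes(splits):
    """Return {split_name: number_of_records} for any mapping of split name to records."""
    return {name: len(rows) for name, rows in splits.items()}


if __name__ == "__main__":
    splits = load_splits()
    print(split_sizes(splits))
```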

Dataset Creation

Source Data

https://huggingface.co/datasets/conll2003

Annotations

Annotation process

  • Machine Translation
  • Machine pairing of tags using reverse translation and hardcoded rules (including phrase regex matching, etc.)
  • Manual annotation of records that couldn't be automatically matched
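
The general idea behind the steps above can be illustrated with a deliberately simplified sketch. This is not the author's script: it only does exact, case-insensitive phrase matching and omits the reverse-translation and regex rules the card mentions. The function name `project_entities` and the example sentence are hypothetical. Entities that cannot be located in the translated sentence are returned separately, which mirrors the records that were passed on to manual annotation.

```python
def project_entities(target_tokens, source_entities):
    """Project (phrase, type) entities from the source sentence onto translated tokens.

    Returns (tags, unmatched): BIO tags for the target tokens, plus the entities
    that could not be found and would need manual annotation.
    """
    tags = ["O"] * len(target_tokens)
    unmatched = []
    lowered = [tok.lower() for tok in target_tokens]
    for phrase, ent_type in source_entities:
        words = phrase.lower().split()
        if not words:
            continue
        n = len(words)
        # First position where the phrase matches and no tag has been assigned yet.
        start = next(
            (
                i
                for i in range(len(lowered) - n + 1)
                if lowered[i:i + n] == words and all(t == "O" for t in tags[i:i + n])
            ),
            None,
        )
        if start is None:
            unmatched.append((phrase, ent_type))
            continue
        tags[start] = f"B-{ent_type}"
        for j in range(start + 1, start + n):
            tags[j] = f"I-{ent_type}"
    return tags, unmatched


# Made-up example: the person name survives translation, the city name does not.
tokens_sk = ["Peter", "Blackburn", "navštívil", "Brusel", "."]
entities_en = [("Peter Blackburn", "PER"), ("Brussels", "LOC")]
tags, unmatched = project_entities(tokens_sk, entities_en)
print(tags)
print(unmatched)
```

In this example "Brussels" is translated to "Brusel", so exact matching fails and the entity lands in `unmatched`; that kind of case is what the reverse-translation and rule-based steps, and finally manual annotation, would have to handle.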
