ialfina / ner-dataset-modified-dee

A dataset for Indonesian Named Entity Recognizer

Dataset for Building Indonesian NER

(Dataset for building a Named Entity Recognizer (NER) for the Indonesian language)

This repository contains the resources of a project named Modified DBpedia Entities Expansion (MDEE) (Alfina et al., 2017).
We share:

  • Three NER datasets used in the experiments explained in the paper (in the main folder), each consisting of 20,000 sentences, along with the gold standard.
  • Revised versions of the three NER datasets in the main folder (in the "revised-20k" folder).
  • The original names in Indonesian DBpedia (in the "original-dbpedia" folder).
  • Two versions of the expanded DBpedia explained in the paper (in the "expanded-dbpedia" folder): MDEE and MDEE_Gazetteer.
  • A dataset of 48,957 sentences named SINGGALANG (in the "singgalang" folder). We used the MDEE_Gazetteer expanded DBpedia to label this dataset.

The NER Dataset

The datasets conform to the Stanford-NER dataset format.

Four named entity classes are used:

  • "Person" for person names
  • "Place" for place names
  • "Organisation" for organization names
  • "O" for others
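
As an illustration of the format and labels above, here is a minimal Python sketch that reads such a file and counts the labels. It assumes one "token<TAB>label" pair per line with a blank line between sentences, which is the usual Stanford-NER training layout; check this against the actual files before relying on it. The function names and the sample sentences are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def read_sentences(lines):
    """Group 'token<TAB>label' lines into sentences.

    Assumption: a blank line marks a sentence boundary.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: close the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")[:2]
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def label_counts(sentences):
    """Count how many tokens carry each label."""
    return Counter(label for sent in sentences for _, label in sent)


# Made-up sample in the assumed layout (not from the dataset).
sample = [
    "Joko\tPerson\n",
    "Widodo\tPerson\n",
    "lahir\tO\n",
    "di\tO\n",
    "Surakarta\tPlace\n",
    ".\tO\n",
    "\n",
    "Universitas\tOrganisation\n",
    "Indonesia\tOrganisation\n",
    "\n",
]

sentences = read_sentences(sample)
counts = label_counts(sentences)
print(len(sentences), dict(counts))
```

To run it on a real file, pass an open file object instead of the sample list, for example read_sentences(open("20k-mdee.txt", encoding="utf-8")).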


List of datasets in the main folder:

  1. Dataset created using the original DEE (Alfina et al., 2016), file name: 20k-dee.txt, with properties file: 20k-dee.prop
  2. Dataset created using Modified DEE (Alfina et al., 2017), file name: 20k-mdee.txt, with properties file: 20k-mdee.prop
  3. Dataset created using Modified DEE plus gazetteer (Alfina et al., 2017), file name: 20k-mdee-gazz.txt, with properties file: 20k-mdee-gazz.prop
  4. A gold standard created by Luthfi et al. (2014)

Each version of the NER dataset consists of 20,000 sentences from Indonesian-language Wikipedia articles that were labeled automatically.

The SINGGALANG dataset

We provide a new NER dataset in this repository, named SINGGALANG. The specifications of this dataset are:

  • The number of sentences: 48,957
  • Generated using the MDEE_Gazetteer expanded DBpedia (the best of the three expanded DBpedia versions)

References

The dataset may be used for free, but if you publish a paper or other publication that uses the dataset, please cite these publications:

How to create an NER model using this dataset?

We suggest using the Stanford NER library.
The steps to create an NER model with the Stanford NER library are as follows:

  1. Download Stanford-NER.

  2. Download the dataset and its properties file (the file with the .prop extension).

  3. Use the Stanford NER classifier to train the model.
    For example:
    java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop 20k-mdee.prop

    I recommend increasing the Java heap size so that training does not run out of memory. Add an option such as "-Xmx1024m" to the command, for example:

    java -Xmx1024m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop 20k-mdee.prop

    If this still doesn't work, increase the value, for example to "-Xmx8000m". This works for me :)

    Let's say this step creates an NER model file named "idner-model-20k-mdee.ser.gz".

  4. Create or use a testing dataset. Let's say the file name is "testing.txt".

  5. Evaluate the NER model using the Stanford NER library.
    For example:
    java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier idner-model-20k-mdee.ser.gz -testFile testing.txt
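
To make the evaluation step more concrete, the following Python sketch computes token-level precision, recall, and F1 per class from the classifier output. It assumes the -testFile run prints one token per line as three tab-separated columns (word, gold label, predicted label); verify this against your own output. Stanford NER prints its own summary scores, so this is only a simplified illustration and not a replacement for them. The function names and sample rows are hypothetical.

```python
def parse_output(lines):
    """Return (gold, guess) pairs from 'word<TAB>gold<TAB>guess' lines.

    Assumption: three tab-separated columns; other lines are skipped.
    """
    rows = []
    for raw in lines:
        parts = raw.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # blank line or anything not in the assumed layout
        rows.append((parts[1], parts[2]))
    return rows


def token_scores(rows, labels=("Person", "Place", "Organisation")):
    """Token-level (precision, recall, F1) for each entity label."""
    scores = {}
    for label in labels:
        tp = sum(1 for gold, guess in rows if gold == label and guess == label)
        fp = sum(1 for gold, guess in rows if gold != label and guess == label)
        fn = sum(1 for gold, guess in rows if gold == label and guess != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Made-up classifier output in the assumed layout (not real results).
output = [
    "Joko\tPerson\tPerson\n",
    "Widodo\tPerson\tO\n",
    "lahir\tO\tO\n",
    "di\tO\tO\n",
    "Surakarta\tPlace\tPlace\n",
    "\n",
    "Bandung\tO\tPlace\n",
]

rows = parse_output(output)
scores = token_scores(rows)
for label, (p, r, f1) in scores.items():
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

One way to use it is to redirect the standard output of the evaluation command to a file and pass that file's lines to parse_output.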
