Word Representations for Named Entity Recognition

This repository contains the code introduced by Ratinov et al. (2009) and subsequently used by Turian et al. (2010) and Stenetorp et al. (2012), with some minor patches applied and instructions for replicating the results. It is also slightly slimmer than the original distribution, because duplicate copies of the same code and resources have been stripped out.

Word Representations

The word representations used by Turian et al. (2010) are available from here.

The word representations used by Stenetorp et al. (2012) are available here in a format that is suitable for the NER task.

Replicating the Experiments

With the data and code provided on this page you can replicate the experiments conducted for Stenetorp et al. (2012). In case you have problems please let us know.

Setup

After cloning this repository, you will have to download the word representations and the corpora that were used for evaluation and unpack them into their respective folders.

Please download and unpack the archive containing formatted versions of the AnEM, Disease and GeneTag corpora from here. Organize them in the following way in your local data folder.

data
├── ...
├── GoldData
│   ├── AnEM
│   │   ├── Dev
│   │   ├── Test 
│   │   └── Train
│   ├── Disease
│   │   ├── Dev
│   │   ├── Test
│   │   └── Train
│   └── GeneTag
│       ├── Dev
│       ├── Test 
│       └── Train
└── ...

Next, download the NER-Experiments folder provided by Turian et al. (2010) from here and copy the folders "data/WordEmbedding" and "data/BrownHierarchicalWordClusters" to your data folder.
Move the current content of your "data/BrownHierarchicalWordClusters" folder to a new subfolder called "data/BrownHierarchicalWordClusters/outDomain".
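The move described above can also be scripted. The following is a minimal Python sketch and not part of the original distribution; the function name `move_into_subfolder` and its `keep` parameter are hypothetical, and only the folder names come from the instructions above.

```python
import os
import shutil


def move_into_subfolder(folder, subfolder="outDomain", keep=("inDomain", "outDomain")):
    """Move every entry of `folder` into `folder/subfolder`.

    Entries named in `keep` stay where they are, so running the function
    a second time does not nest the target folder inside itself.
    Returns the sorted names of the entries that were moved.
    """
    target = os.path.join(folder, subfolder)
    os.makedirs(target, exist_ok=True)
    moved = []
    for entry in sorted(os.listdir(folder)):
        if entry == subfolder or entry in keep:
            continue
        shutil.move(os.path.join(folder, entry), os.path.join(target, entry))
        moved.append(entry)
    return moved
```

Calling `move_into_subfolder("data/BrownHierarchicalWordClusters")` from the directory that contains your data folder would then place the downloaded cluster files under "outDomain".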

Finally, download the representations introduced by Stenetorp et al. (2012) from here and unpack the Brown clusters to "data/BrownHierarchicalWordClusters/inDomain" and the distributed word representations to "data/WordEmbedding".

In the end you should have the following structure in your data folder.

data
├── BrownHierarchicalWordClusters
│   ├── inDomain
│   │   ├── c1000.txt
│   │   ├── c100.txt
│   │   ├── c150.txt
│   │   ├── c320.txt
│   │   └── c500.txt
│   └── outDomain
│       ├── brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.lowerCase.txt
│       ├── brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.txt
│       ├── brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.upperCase.txt
│       ├── brown-rcv1.clean.tokenized-CoNLL03.txt-c100-freq1.txt
│       ├── brown-rcv1.clean.tokenized-CoNLL03.txt-c3200-freq1.txt
│       └── brown-rcv1.clean.tokenized-CoNLL03.txt-c320-freq1.txt
├── GoldData
│   └── ...
├── Models
└── WordEmbedding
    ├── david.txt
    ├── google-phrasal-clusers.txt
    ├── hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt
    ├── hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt
    ├── model-1750000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=200.txt
    ├── model-2030000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=100.txt
    ├── model-2280000000.LEARNING_RATE=1e-08.EMBEDDING_LEARNING_RATE=1e-07.EMBEDDING_SIZE=25.txt
    └── model-2280000000.LEARNING_RATE=1e-08.EMBEDDING_LEARNING_RATE=1e-07.EMBEDDING_SIZE=50.txt
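Before starting an experiment it can help to check that the folders shown above are in place. The sketch below is a hypothetical helper, not part of the repository: the names `expected_paths` and `missing_paths` are made up for illustration, and the check only tests that each path exists, since the listing above does not say whether Dev, Test and Train are files or directories.

```python
import os

CORPORA = ("AnEM", "Disease", "GeneTag")
SPLITS = ("Dev", "Test", "Train")


def expected_paths():
    """Relative paths that the layout above shows inside the data folder."""
    paths = [
        os.path.join("BrownHierarchicalWordClusters", "inDomain"),
        os.path.join("BrownHierarchicalWordClusters", "outDomain"),
        "Models",
        "WordEmbedding",
    ]
    for corpus in CORPORA:
        for split in SPLITS:
            paths.append(os.path.join("GoldData", corpus, split))
    return paths


def missing_paths(data_dir):
    """Return the expected paths that do not exist under `data_dir`."""
    return [
        path
        for path in expected_paths()
        if not os.path.exists(os.path.join(data_dir, path))
    ]
```

An empty list from `missing_paths("data")` means every folder from the listing was found; individual representation files are not checked.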

Running an experiment

Each experiment has one corresponding config file that specifies its parameters. The config files are located in the "config" directory and named according to the following scheme:

{CorpusName}-[(bio)|(news)]+-domain-{wordreptype-info}+.config

The names therefore start with a corpus identifier, followed by one or more occurrences of "bio" or "news", followed by "-domain-" and one or more word representation identifiers. Each "bio"/"news" corresponds to one wordreptype-info and indicates whether that word representation was induced on bio or news data. The first "bio"/"news" corresponds to the first wordreptype-info, the second to the second wordreptype-info, and so on.

Corpus identifiers:
    AnEM = AnEM
    Disease = NCBID
    GeneTag = BC2GM

Example:

GeneTag-bio-news-domain-clarkne-hlbl.config

This means that the experiment will be conducted on the GeneTag (BC2GM) corpus, using ClarkNE representations induced on bio data in combination with HLBL representations induced on newswire data.
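The naming scheme can be read mechanically. The following Python sketch is an illustration only and not part of the repository: `parse_config_name` is a hypothetical helper, and it assumes that a representation identifier may itself contain hyphens (as in "brown-c1000"), so it pairs domains with identifiers only when the split is unambiguous.

```python
import re

# "{CorpusName}-[(bio)|(news)]+-domain-{wordreptype-info}+.config"
CONFIG_NAME = re.compile(
    r"^(?P<corpus>[^-]+)"
    r"-(?P<domains>(?:bio|news)(?:-(?:bio|news))*)"
    r"-domain-(?P<reps>.+)\.config$"
)


def parse_config_name(name):
    """Split a config file name into corpus, domains and representations.

    "pairs" is a list of (domain, representation) tuples, or None when the
    hyphens in the representation part cannot be assigned unambiguously.
    """
    match = CONFIG_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised config name: " + name)
    domains = match.group("domains").split("-")
    reps = match.group("reps")
    parts = reps.split("-")
    if len(domains) == 1:
        rep_list = [reps]          # a single representation keeps its hyphens
    elif len(parts) == len(domains):
        rep_list = parts           # one hyphen-free identifier per domain
    else:
        rep_list = None            # ambiguous, leave the raw string to the caller
    return {
        "corpus": match.group("corpus"),
        "domains": domains,
        "representations": reps,
        "pairs": list(zip(domains, rep_list)) if rep_list is not None else None,
    }
```

For the example above, `parse_config_name("GeneTag-bio-news-domain-clarkne-hlbl.config")["pairs"]` pairs "bio" with "clarkne" and "news" with "hlbl".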

All parameters available in the config files are identical to the ones from Turian et al. (2010), with two exceptions. The newly introduced parameter "pathToTrainDevTest" specifies the directory where your corpus Train/Dev/Test split is available. The "isUppercaseWordEmbeddings" parameter specifies, for each word representation file used, whether it contains uppercased words only.

Before you can run your first experiment, you will have to run

./cleanCompile

from within your smbm-ner-experiments folder.

Now you can start experiments by using modified versions of the example command below. Note that the command below must be executed from within your smbm-ner-experiments folder.

Example program call:

nohup nice java -Xmx4000m -classpath LBJ2.jar:LBJ2Library.jar:bin:stanford-ner.jar:stanford-ner.src.jar:lucene-core-2.4.1.jar \
                ExperimentsSMBM/PerformExperimentGivenConfig  \
                ../config/AnEM-bio-domain-brown-c1000.config \
                > smbm.contra.result.AnEM-bio-domain-brown-c1000.txt  &

You may need to allow more than 4000m of memory (the -Xmx option), especially for the Google-Phrase-Cluster runs.
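When running several configurations it may be convenient to generate the command line instead of editing it by hand. The sketch below is a hypothetical helper, not something shipped with the code: `build_command` and `as_shell_line` are illustrative names, the classpath, main class, config path and output file pattern are copied from the example call above, and the ":" classpath separator assumes a Unix-like system.

```python
import shlex

# Classpath entries as they appear in the example program call.
CLASSPATH = ":".join([
    "LBJ2.jar",
    "LBJ2Library.jar",
    "bin",
    "stanford-ner.jar",
    "stanford-ner.src.jar",
    "lucene-core-2.4.1.jar",
])


def build_command(config_name, heap_mb=4000):
    """Return (argv, output_file) for one experiment run."""
    suffix = ".config"
    stem = config_name[:-len(suffix)] if config_name.endswith(suffix) else config_name
    argv = [
        "java",
        f"-Xmx{heap_mb}m",
        "-classpath",
        CLASSPATH,
        "ExperimentsSMBM/PerformExperimentGivenConfig",
        f"../config/{stem}.config",
    ]
    return argv, f"smbm.contra.result.{stem}.txt"


def as_shell_line(config_name, heap_mb=4000):
    """Format the run as a background shell command like the example call."""
    argv, output_file = build_command(config_name, heap_mb)
    quoted = " ".join(shlex.quote(arg) for arg in argv)
    return f"nohup nice {quoted} > {shlex.quote(output_file)} &"
```

Passing a larger `heap_mb` (for instance 8000, an arbitrary value chosen here for illustration) changes only the -Xmx argument; the resulting line still has to be executed from within your smbm-ner-experiments folder.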

Citing

If you use the code without any word representations, please cite:

@InProceedings{ratinov2009design,
  author    = {Ratinov, Lev  and  Roth, Dan},
  title     = {Design Challenges and Misconceptions
      in Named Entity Recognition},
  booktitle = {Proceedings of the Thirteenth Conference
      on Computational Natural Language Learning (CoNLL-2009)},
  month     = {June},
  year      = {2009},
  address   = {Boulder, Colorado},
  publisher = {Association for Computational Linguistics},
  pages     = {147--155},
}

If you use the code together with the Turian et al. (2010) word representations, please cite:

@InProceedings{turian2010word,
  author    = {Turian, Joseph  and  Ratinov, Lev-Arie
      and  Bengio, Yoshua},
  title     = {Word Representations: A Simple and General Method
      for Semi-Supervised Learning},
  booktitle = {Proceedings of the 48th Annual Meeting of the Association
      for Computational Linguistics},
  month     = {July},
  year      = {2010},
  address   = {Uppsala, Sweden},
  publisher = {Association for Computational Linguistics},
  pages     = {384--394},
}

If you use the code together with the Stenetorp et al. (2012) word representations, please cite:

@inproceedings{stenetorp2012size,
    author      = {Stenetorp, Pontus and Soyer, Hubert and Pyysalo, Sampo
        and Ananiadou, Sophia and Chikayama, Takashi},
    title       = {Size (and Domain) Matters: Evaluating Semantic Word
        Space Representations for Biomedical Text},
    year        = {2012},
    booktitle   = {Proceedings of the 5th International Symposium on
        Semantic Mining in Biomedicine},
}
