nefujiangping / entity_recognition

Entity recognition code for "2019 Datagrand Cup: Text Information Extraction Challenge"

Models for Entity Recognition

Several entity recognition models for the 2019 Datagrand Cup: Text Information Extraction Challenge.


Components of Entity Recognition

Word Embedding

  • Static Word Embedding: word2vec, GloVe
  • Contextualized Word Representation: ELMo (_elmo), refer to Sec.

Sentence Representation

  • BiLSTM
  • DGCNN

Inference

  • sequence labeling
    • CRF
    • softmax
  • predict start/end index of entities (_pointer)
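The two output styles listed above (per-token sequence labeling versus predicting start/end indices) describe the same entities in different forms. The following is a minimal sketch of converting between them, not the repo's own code: the function names, the entity type names "a" and "b", and the per-type 0/1 pointer layout are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start index, inclusive end index, entity type)


def bioes_to_spans(tags: List[str]) -> List[Span]:
    """Decode a BIOES tag sequence into entity spans; malformed runs are dropped."""
    spans: List[Span] = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, etype = None, None
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "S":                      # single-token entity
            spans.append((i, i, label))
            start, etype = None, None
        elif prefix == "B":                    # open a new entity
            start, etype = i, label
        elif prefix == "I":                    # must continue an open entity
            if start is None or label != etype:
                start, etype = None, None
        elif prefix == "E":                    # close the open entity
            if start is not None and label == etype:
                spans.append((start, i, label))
            start, etype = None, None
    return spans


def spans_to_pointers(
    spans: List[Span], length: int, types: List[str]
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Build per-type 0/1 start and end vectors (one possible pointer target)."""
    starts = {t: [0] * length for t in types}
    ends = {t: [0] * length for t in types}
    for s, e, t in spans:
        starts[t][s] = 1
        ends[t][e] = 1
    return starts, ends


if __name__ == "__main__":
    tags = ["B-a", "I-a", "E-a", "O", "S-b"]   # hypothetical entity types
    spans = bioes_to_spans(tags)
    print(spans)                               # [(0, 2, 'a'), (4, 4, 'b')]
    print(spans_to_pointers(spans, len(tags), ["a", "b"]))
```

A sequence labeling model would be trained on the tag sequence directly, while a pointer model would be trained on targets like the start/end vectors.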


Combining the three components described above gives 12 possible models in all (2 word embeddings × 2 sentence encoders × 3 output types: CRF, softmax, pointer). However, this repo implements only the following 6 models:

  • Static Word Embedding × (BiLSTM, DGCNN) × (CRF, softmax)
  • (Static Word Embedding, ELMo) × BiLSTM × pointer

The other models can be implemented by adding or modifying a small amount of code.
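To make the count concrete, the combinations can be enumerated directly. This is only an illustrative sketch: the lowercase names are labels chosen here, not identifiers from the repo.

```python
from itertools import product

# Labels are illustrative, not names used by the repo.
EMBEDDINGS = ["static", "elmo"]
ENCODERS = ["bilstm", "dgcnn"]
OUTPUTS = ["crf", "softmax", "pointer"]

# Every (embedding, encoder, output) combination: 2 * 2 * 3 = 12.
ALL_MODELS = list(product(EMBEDDINGS, ENCODERS, OUTPUTS))

# The two implemented groups: 1 * 2 * 2 = 4 and 2 * 1 * 1 = 2.
IMPLEMENTED = set(product(["static"], ENCODERS, ["crf", "softmax"])) | set(
    product(EMBEDDINGS, ["bilstm"], ["pointer"])
)

# Combinations that would still need to be written.
MISSING = [m for m in ALL_MODELS if m not in IMPLEMENTED]

if __name__ == "__main__":
    print(len(ALL_MODELS), len(IMPLEMENTED), len(MISSING))  # 12 6 6
    for model in MISSING:
        print(" x ".join(model))
```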

How to run

  1. Prepare data:
    1. download the official competition data to the data folder
    2. get sequence tagging train/dev/test data: bin/
    3. prepare vocab, tag
      • vocab: word vocabulary, one word per line, in the format word word_count
      • tag: BIOES NER tag list, one tag per line (O on the first line)
    4. follow step 2 or 3 below
      • 2 is for models using static word embeddings
      • 3 is for the model using ELMo
  2. Run model with static word embedding, here take word2vec as an example:
    1. train word2vec: bin/
    2. modify
    3. run python [bilstm/dgcnn] [softmax/crf] or python (remember to change config.model_name before each new run, or the old model will be overwritten)
  3. Or run the model with ELMo embeddings. ELMo is not run on the fly: the contextualized sentence representation of each train/dev/test sentence is dumped to file first, then loaded for train/dev/test:
    1. follow the instructions described here to get contextualized sentence representations for the train_full/dev/test data from pre-trained ELMo weights
    2. modify
    3. run python
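The vocab and tag files from step 1 can be produced in a few lines. The sketch below only illustrates the two formats described above (word word_count per line; BIOES tags with O first). The function names, the sample tokens, the entity types, and the choice to sort the vocabulary by frequency are assumptions, not the repo's actual preprocessing.

```python
from collections import Counter
from typing import Iterable, List


def build_vocab_lines(sentences: Iterable[List[str]]) -> List[str]:
    """Return 'word word_count' lines, most frequent word first (ordering is an assumption)."""
    counts = Counter(word for sent in sentences for word in sent)
    return [f"{word} {count}" for word, count in counts.most_common()]


def build_tag_lines(entity_types: Iterable[str]) -> List[str]:
    """Return the BIOES tag list with 'O' on the first line."""
    tags = ["O"]
    for etype in entity_types:
        tags.extend(f"{prefix}-{etype}" for prefix in ("B", "I", "E", "S"))
    return tags


if __name__ == "__main__":
    # Hypothetical tokenized sentences and entity types.
    sentences = [["w1", "w2", "w1"], ["w3", "w1"]]
    print("\n".join(build_vocab_lines(sentences)))
    print("\n".join(build_tag_lines(["a", "b"])))
```

Writing each returned list to a file with one entry per line gives files in the shape the steps above expect.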

How to train a pure token-level ELMo from scratch?

  • Just follow the official instructions described here.
  • Some notes:
    • to train a token-level language model, modify bin/
      from vocab = load_vocab(args.vocab_file, 50)
      to vocab = load_vocab(args.vocab_file, None)
    • modify n_train_tokens
    • remove char_cnn in options
    • modify lstm.dim/lstm.projection_dim as you wish.
    • for reference: with n_gpus=2, n_train_tokens=94114921, lstm['dim']=2048, projection_dim=256, n_epochs=10, training took about 17 hours on 2 GTX 1080 Ti GPUs.
  • After finishing the last step of the instructions, you can refer to the script to dump the dynamic sentence representations of your own dataset.
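The option changes in the notes above can be sketched as a small helper. This is an illustration under assumptions: it treats the training options as a plain dict with the keys named in the notes (char_cnn, lstm, n_train_tokens, n_epochs), the base values shown are placeholders rather than the official defaults, and the helper names are hypothetical. It also assumes n_train_tokens is the number of whitespace-separated tokens in the training corpus.

```python
import copy
from typing import Any, Dict, Iterable


def count_tokens(lines: Iterable[str]) -> int:
    """Count whitespace-separated tokens, e.g. to obtain n_train_tokens."""
    return sum(len(line.split()) for line in lines)


def to_token_level(
    options: Dict[str, Any],
    n_train_tokens: int,
    lstm_dim: int,
    projection_dim: int,
    n_epochs: int,
) -> Dict[str, Any]:
    """Return a copy of the options adapted for a pure token-level model."""
    opts = copy.deepcopy(options)        # leave the caller's dict untouched
    opts.pop("char_cnn", None)           # no character CNN for token-level input
    opts["n_train_tokens"] = n_train_tokens
    opts["lstm"]["dim"] = lstm_dim
    opts["lstm"]["projection_dim"] = projection_dim
    opts["n_epochs"] = n_epochs
    return opts


if __name__ == "__main__":
    # Placeholder base options; the real dict lives in the training script.
    base = {
        "char_cnn": {"placeholder": True},
        "lstm": {"dim": 4096, "projection_dim": 512},
        "n_epochs": 1,
        "n_train_tokens": 0,
    }
    # Values taken from the settings quoted in the notes above.
    print(to_token_level(base, 94114921, 2048, 256, 10))
```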


