nefujiangping / entity_recognition

Entity recognition codes for "2019 Datagrand Cup: Text Information Extraction Challenge"


Models for Entity Recognition

Some Entity Recognition models for 2019 Datagrand Cup: Text Information Extraction Challenge.

Requirements

Components of Entity Recognition

Word Embedding

  • Static Word Embedding: word2vec, GloVe
  • Contextualized Word Representation: ELMo (_elmo); see the section "How to train a pure token-level ELMo from scratch?" below
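Because the competition corpus is anonymized (every token is an integer ID), static embeddings have to be trained on the corpus itself; that is what bin/train_w2v.py does in step 2.1 of "How to run". Below is a minimal gensim sketch of the idea, assuming one sentence of _-separated token IDs per corpus line; the paths and hyperparameters are placeholders, not the repo's actual settings.

```python
# Minimal word2vec training sketch (placeholder paths/parameters),
# assuming one sentence of '_'-separated token IDs per corpus line.
from gensim.models import Word2Vec


def read_corpus(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line.split("_")


sentences = list(read_corpus("data/corpus.txt"))  # placeholder path
model = Word2Vec(
    sentences,
    vector_size=256,  # "size=" in gensim < 4.0
    window=5,
    min_count=2,
    sg=1,             # skip-gram
    workers=4,
)
model.wv.save_word2vec_format("data/w2v_256d.txt", binary=False)
```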

Sentence Representation

  • BiLSTM
  • DGCNN

Inference

  • sequence labeling (sequence_labeling.py)
    • CRF
    • softmax
  • predict the start/end indices of entities (_pointer)
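The two inference styles produce different training targets for the same sentence. The toy example below (hypothetical token IDs, a single entity type "a") shows a BIOES tag sequence next to the equivalent start/end pointer targets; it only illustrates the difference and is not the repo's exact data format.

```python
# Toy sentence of anonymized token IDs with one entity of type "a"
# covering tokens 2..4 (0-based, inclusive). Values are hypothetical.
tokens = ["17", "3052", "88", "991", "4", "760"]
start, end = 2, 4

# 1) Sequence-labeling target: one BIOES tag per token.
bioes = ["O"] * len(tokens)
if start == end:
    bioes[start] = "S-a"
else:
    bioes[start] = "B-a"
    bioes[end] = "E-a"
    for i in range(start + 1, end):
        bioes[i] = "I-a"

# 2) Pointer target: 0/1 vectors marking entity starts and ends.
start_vec = [int(i == start) for i in range(len(tokens))]
end_vec = [int(i == end) for i in range(len(tokens))]

print(list(zip(tokens, bioes)))
print("start:", start_vec)
print("end:  ", end_vec)
```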

Note

According to the three components described above, there are 12 possible models in all (2 embeddings × 2 encoders × 3 inference heads). However, this repo only implements the following 6:

  • Static Word Embedding × (BiLSTM, DGCNN) × (CRF, softmax): sequence_labeling.py
  • (Static Word Embedding, ELMo) × BiLSTM × pointer: bilstm_pointer.py and bilstm_pointer_elmo.py

The remaining combinations can be implemented with a few additional code changes.
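For reference, the full 2 × 2 × 3 grid and the scripts covering the implemented subset can be enumerated directly; the mapping below just restates the two bullet points above.

```python
# Enumerate the 12 embedding x encoder x inference combinations and
# mark the 6 that this repo implements (per the Note above).
from itertools import product

embeddings = ["static", "elmo"]
encoders = ["bilstm", "dgcnn"]
heads = ["crf", "softmax", "pointer"]

implemented = {
    ("static", "bilstm", "crf"): "sequence_labeling.py",
    ("static", "bilstm", "softmax"): "sequence_labeling.py",
    ("static", "dgcnn", "crf"): "sequence_labeling.py",
    ("static", "dgcnn", "softmax"): "sequence_labeling.py",
    ("static", "bilstm", "pointer"): "bilstm_pointer.py",
    ("elmo", "bilstm", "pointer"): "bilstm_pointer_elmo.py",
}

for combo in product(embeddings, encoders, heads):
    print(combo, "->", implemented.get(combo, "not implemented"))
```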

How to run

  1. Prepare data:
    1. download the official competition data to the data folder
    2. get sequence-tagging train/dev/test data: bin/trans_data.py (a simplified conversion sketch is given after this list)
    3. prepare vocab and tag files
      • vocab: word vocabulary, one word per line, each line in word word_count format
      • tag: BIOES NER tag list, one tag per line (with O on the first line)
    4. follow step 2 or 3 below
      • step 2 is for models using static word embeddings
      • step 3 is for the model using ELMo
  2. Run a model with static word embeddings, taking word2vec as an example:
    1. train word2vec: bin/train_w2v.py (see the word2vec sketch in the Word Embedding section above)
    2. modify config.py
    3. run python sequence_labeling.py [bilstm/dgcnn] [softmax/crf] or python bilstm_pointer.py (remember to modify config.model_name before a new run, or the old model will be overwritten)
  3. Or run the model with ELMo embeddings (the contextualized representation of every train/dev/test sentence is dumped to a file first and then loaded for training/evaluation; ELMo is not run on the fly):
    1. follow the instructions described here to get contextualized sentence representations for the train_full/dev/test data from the pre-trained ELMo weights
    2. modify config.py
    3. run python bilstm_pointer_elmo.py
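For step 1.2, the sketch below shows roughly what the conversion to sequence-tagging format looks like. It assumes the raw DataGrand format in which each line holds chunks separated by two spaces, each chunk being _-joined token IDs followed by /a, /b, /c (entity types) or /o (non-entity); treat it as an approximation of bin/trans_data.py, not a drop-in replacement.

```python
def to_bioes(line):
    """Convert one raw line such as '21_9_103/a  5_68/o  7/c' into
    (tokens, tags) with BIOES tags, under the format assumptions above."""
    tokens, tags = [], []
    for chunk in line.strip().split("  "):
        if not chunk:
            continue
        text, label = chunk.rsplit("/", 1)
        words = text.split("_")
        if label == "o":
            tags.extend(["O"] * len(words))
        elif len(words) == 1:
            tags.append("S-" + label)
        else:
            tags.extend(["B-" + label]
                        + ["I-" + label] * (len(words) - 2)
                        + ["E-" + label])
        tokens.extend(words)
    return tokens, tags


if __name__ == "__main__":
    tokens, tags = to_bioes("21_9_103/a  5_68/o  7/c")
    # one "token tag" pair per line, blank line between sentences
    print("\n".join(w + " " + t for w, t in zip(tokens, tags)))
```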

How to train a pure token-level ELMo from scratch?

  • Just follow the official instructions described here.
  • Some notes:
    • to train a token-level language model, modify bin/train_elmo.py:
      change vocab = load_vocab(args.vocab_file, 50)
      to vocab = load_vocab(args.vocab_file, None)
      (the second argument is the maximum word length used by the character CNN; None switches to plain token-level input)
    • set n_train_tokens to the number of tokens in your training corpus
    • remove char_cnn from the options
    • modify lstm.dim / lstm.projection_dim as you wish
    • with n_gpus=2, n_train_tokens=94114921, lstm['dim']=2048, projection_dim=256 and n_epochs=10, training took about 17 hours on two GTX 1080 Ti GPUs
  • After finishing the last step of the instructions, you can refer to the script dump_token_level_bilm_embeddings.py to dump contextualized sentence representations for your own dataset (a simplified sketch of this flow is given below).
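The referenced dump_token_level_bilm_embeddings.py is not reproduced here, but the general flow follows bilm-tf's token-level usage example: build a TokenBatcher over your vocab, run the pre-trained BidirectionalLanguageModel, and write each sentence's representation to an HDF5 file. The sketch below assumes the bilm-tf package (TensorFlow 1.x) and placeholder file paths; check it against the repo script and bilm-tf's usage_token.py before relying on it.

```python
# Sketch of dumping token-level ELMo sentence representations to HDF5,
# modeled on bilm-tf's usage_token.py. Paths and sentences are placeholders.
import h5py
import tensorflow as tf
from bilm import TokenBatcher, BidirectionalLanguageModel, weight_layers

vocab_file = "vocab.txt"                       # token vocabulary
options_file = "options.json"                  # from ELMo training
weight_file = "weights.hdf5"                   # from ELMo training
token_embedding_file = "vocab_embedding.hdf5"  # from dump_token_embeddings

batcher = TokenBatcher(vocab_file)
context_token_ids = tf.placeholder('int32', shape=(None, None))

bilm = BidirectionalLanguageModel(
    options_file,
    weight_file,
    use_character_inputs=False,
    embedding_weight_file=token_embedding_file,
)
context_embeddings_op = bilm(context_token_ids)
elmo_output = weight_layers('output', context_embeddings_op, l2_coef=0.0)

sentences = [["31", "5", "774"], ["88", "3052"]]  # token-ID sentences

with tf.Session() as sess, h5py.File("elmo_sentences.hdf5", "w") as out:
    sess.run(tf.global_variables_initializer())
    for i, sent in enumerate(sentences):
        ids = batcher.batch_sentences([sent])
        rep = sess.run(elmo_output['weighted_op'],
                       feed_dict={context_token_ids: ids})
        out.create_dataset(str(i), data=rep[0])
```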

References

