An implementation of an LSTM+CRF model for sequence labeling tasks. Based on TensorFlow (>= r1.1), it supports multiple architectures such as LSTM+CRF, BiLSTM+CRF, and a combination of a character-level CNN with BiLSTM+CRF.
Other RNN+CRF architectures, such as a traditional feature-based architecture, will be added later.
Because this project uses the TensorFlow API, it requires TensorFlow and some other Python modules:
- TensorFlow (>= r1.1)
These can be easily installed with pip.
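For example, to install TensorFlow with pip (choose the build that matches your environment):
pip install "tensorflow>=1.1"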
The data format is basically consistent with the CRF++ toolkit. Generally speaking, training and test files must consist of multiple tokens, and each token consists of multiple (but a fixed number of) columns. Each token must be represented on one line, with its columns separated by white space (spaces or tab characters). A sequence of tokens forms a sentence. (So far, this program only supports data with three columns.)
An empty line marks the boundary between sentences.
Here's an example of such a file (data for Chinese NER):
...
感 O
动 O
了 O
李 B-PER.NAM
开 I-PER.NAM
复 I-PER.NAM
感 O
动 O
回 O
复 O
支 O
持 O
...
For this part, you can read the README file under the lib directory, which is a submodule named NeuralTextProcess.
The file template
specifies the feature templates used in context-based feature extraction. The second line (fields)
indicates the field name for each column of a token, and the templates
describe how to extract features.
For example, the basic template is:
# Fields(column), w,y&F are reserved names
w y
# templates.
w:-2
w:-1
w: 0
w: 1
w: 2
This means each token has only two columns of data, 'w' and 'y'. The field 'y'
should always be in the last column.
Note that the fields
w
y
&F
are reserved, because the program uses them to represent the word, the label and the word's features. Each token becomes a dict like '{'w': '李', 'y': 'B-PER.NAM', 'F': ['w[-2]=动', 'w[-1]=了', ...]}'.
The templates above describe a classical context feature template:
- C(n), n = -2, -1, 0, 1, 2
where 'C(n)' is the value of token['w'] at relative position n.
If your tokens have more than two columns, you may need to change the fields and templates, depending on how you want to do the extraction.
In this project, I disabled the feature suffix when extracting words in a context window.
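To make the template mechanics concrete, here is a minimal sketch of context-window extraction, assuming the feature-string format shown in the dict example above (extract_features is a hypothetical helper, not the project's actual code):

```python
def extract_features(tokens, index, positions=(-2, -1, 0, 1, 2)):
    """Collect context-window features for the token at `index`.

    `tokens` is a sentence: a list of dicts like {'w': '李', 'y': 'B-PER.NAM'}.
    Returns strings such as 'w[-1]=了', matching the 'F' list above.
    """
    features = []
    for n in positions:
        i = index + n
        # Positions outside the sentence are skipped here; a real
        # implementation might emit padding symbols instead.
        if 0 <= i < len(tokens):
            features.append('w[%d]=%s' % (n, tokens[i]['w']))
    return features

sentence = [{'w': '动', 'y': 'O'}, {'w': '了', 'y': 'O'},
            {'w': '李', 'y': 'B-PER.NAM'}]
token = dict(sentence[2], F=extract_features(sentence, 2))
# token == {'w': '李', 'y': 'B-PER.NAM', 'F': ['w[-2]=动', 'w[-1]=了', 'w[0]=李']}
```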
This program supports pretrained embeddings as input. When running the program, you should provide an embedding text file (in the word2vec tool's standard output format) via the corresponding argument.
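The word2vec text format begins with a header line 'vocab_size dim', followed by one word and its vector per line. A minimal loader sketch for that format, assuming Python 3 (an illustration, not the project's actual loading code):

```python
import numpy as np

def load_word2vec_text(path):
    """Read embeddings from a word2vec-format text file into a dict."""
    with open(path, encoding='utf-8') as f:
        vocab_size, dim = map(int, f.readline().split())
        vectors = {}
        for line in f:
            parts = line.rstrip().split(' ')
            # parts[0] is the word; the rest are the vector components.
            vectors[parts[0]] = np.asarray(parts[1:1 + dim], dtype=np.float32)
    return vectors, dim
```

For this program, the file is passed on the command line, e.g. (the file path here is hypothetical):
python main.py --emb_file embeddings/vec.txt --emb_dim 100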
In the env_settings.py file, there are some environment settings, such as the output dir:
# These are some IO files' dirs
# you need to change BASE_DIR on your own PC
BASE_DIR = r'project dir/'
MODEL_DIR = BASE_DIR + r'models/'
DATA_DIR = BASE_DIR + r'data/'
EMB_DIR = BASE_DIR + r'embeddings/'
OUTPUT_DIR = BASE_DIR + r'export/'
LOG_DIR = BASE_DIR + r'Summary/'
If you don't have those dirs in your project dir, just run python env_settings.py, and they will be created automatically.
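What the script does amounts to creating the missing directories; a minimal sketch of the idea, assuming Python 3 (the actual env_settings.py may differ):

```python
import os

BASE_DIR = r'project dir/'  # change this to your own project dir
DIRS = ('models/', 'data/', 'embeddings/', 'export/', 'Summary/')

if __name__ == '__main__':
    for d in DIRS:
        # Create each default dir; exist_ok avoids errors on reruns.
        os.makedirs(BASE_DIR + d, exist_ok=True)
```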
Just run ./main.py, or specify some arguments if you need to, like this:
python main.py --lr 0.005 --fine_tuning False --l2_reg 0.0002
Then the model will run with lr=0.005, no fine-tuning, l2_reg=0.0002, and all other arguments left at their defaults. Using -h
will print all help information. Some arguments are not usable yet, but I will fix that as soon as possible.
python main.py -h
usage: main.py [-h] [--train_data TRAIN_DATA] [--test_data TEST_DATA]
[--valid_data VALID_DATA] [--log_dir LOG_DIR]
[--model_dir MODEL_DIR] [--model MODEL]
[--restore_model RESTORE_MODEL] [--emb_file EMB_FILE]
[--emb_dim EMB_DIM] [--output_dir OUTPUT_DIR]
[--only_test [ONLY_TEST]] [--noonly_test] [--lr LR]
[--dropout DROPOUT] [--fine_tuning [FINE_TUNING]]
[--nofine_tuning] [--eval_test [EVAL_TEST]] [--noeval_test]
[--max_len MAX_LEN] [--nb_classes NB_CLASSES]
[--hidden_dim HIDDEN_DIM] [--batch_size BATCH_SIZE]
[--train_steps TRAIN_STEPS] [--display_step DISPLAY_STEP]
[--l2_reg L2_REG] [--log [LOG]] [--nolog] [--template TEMPLATE]
optional arguments:
-h, --help show this help message and exit
--train_data TRAIN_DATA
Training data file
--test_data TEST_DATA
Test data file
--valid_data VALID_DATA
Validation data file
--log_dir LOG_DIR The log dir
--model_dir MODEL_DIR
Models dir
--model MODEL Model type: LSTM/BLSTM/CNNBLSTM
--restore_model RESTORE_MODEL
Path of the model to be restored
--emb_file EMB_FILE Embeddings file
--emb_dim EMB_DIM embedding size
--output_dir OUTPUT_DIR
Output dir
--only_test [ONLY_TEST]
Only do the test
--noonly_test
--lr LR learning rate
--dropout DROPOUT Dropout rate of input layer
--fine_tuning [FINE_TUNING]
Whether fine-tuning the embeddings
--nofine_tuning
--eval_test [EVAL_TEST]
Whether evaluate the test data.
--noeval_test
--max_len MAX_LEN max num of tokens per query
--nb_classes NB_CLASSES
Tagset size
--hidden_dim HIDDEN_DIM
hidden unit number
--batch_size BATCH_SIZE
num example per mini batch
--train_steps TRAIN_STEPS
training steps
--display_step DISPLAY_STEP
number of test display step
--l2_reg L2_REG L2 regularization weight
--log [LOG] Whether to record the TensorBoard log.
--nolog
--template TEMPLATE Feature templates
There are three types of model that can be chosen with the '--model' argument:
- LSTM + CRF
- BiLSTM + CRF
- CNN + BiLSTM + CRF
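For example, to train the character-level CNN + BiLSTM + CRF variant (model names as shown in the help output above):
python main.py --model CNNBLSTM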
If you set 'only_test' to True or 'train_steps' to 0, the program will only run the test process.
In that case, you must give a specific path via 'restore_model'.
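For example (the checkpoint path here is hypothetical):
python main.py --only_test True --restore_model models/blstm_model.ckpt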
- 2017-11-04 ver 0.2.3
- Hybrid feature architecture for LSTM and the corresponding tagger's Python script.
- 2017-10-31 ver 0.2.2
- Update Neural Text Process lib 0.2.0
- Compatibility modifications in the main file.
- 2017-10-20 ver 0.2.1
- Fix: Non-suffix for template in 'only test' process.
- Fix: Now using correct dicts for embedding lookup table.
- Fix: A bug of batch generator 'batch_index'.
- 2017-09-12 ver 0.2.0
- Update: process lib 0.1.2
- Removed 'keras_src', completed the refactoring of the code hierarchy.
- Added env_settings.py to make sure all default dirs exist.
- Support restoring a model from file.
- Support model selection.
- 2017-07-06 ver 0.1.3
- Added a new method, 'accuracy', which is used to calculate the number of correct labels.
- Arguments 'emb_type' & 'emb_dir' are now deprecated.
- New argument 'emb_file'
- 2017-04-11 ver 0.1.2
- Rewrite neural_tagger class method: loss.
- Added a new tagger based on Bi-LSTM + CNNs, where a CNN is used to extract bigram features.
- 2017-04-08 ver 0.1.1
- Rewrote the classes lstm-ner & bi-lstm-ner.
- 2017-03-03 ver 0.1.0
- Used TensorFlow to implement the LSTM-NER model.
- Basic functions finished.
- 2017-02-26 ver 0.0.3
- lstm_ner basically completed.
- Viterbi decoding algorithm and sequence labeling.
- Pretreatment completed.
- 2017-02-21 ver 0.0.2
- Basic structure of the project
- Added 3 module files: features, pretreatment and constant
- Pretreatment's create-dictionary function completed
- 2017-02-20 ver 0.0.1
- Initialization of this project.
- README file
- Some util functions and the basic structure of the project