wjn1996 / Mathematical-Knowledge-Entity-Recognition

This project performs mathematical knowledge entity recognition. The model is mainly a BiLSTM+CRF with Chinese word embeddings. It is the first step toward a Mathematical Knowledge Graph (Math-KG).


Mathematical-Knowledge-Entity-Recognition

1.Introduction

This project performs mathematical knowledge entity recognition. The model is mainly a BiLSTM+CRF with Chinese word embeddings. To the best of our knowledge, this is the first project for a Mathematical Knowledge Graph (Math-KG).

2.Copyright Notice

Leader: WangJianing
Email:lygwjn@126.com
CSDN:https://blog.csdn.net/qq_36426650

3.Project Overview

This code contains 6 files:
main.py: the entry point, which you can run with python;
model.py: contains our models;
eval.py: contains the procedure that runs the perl code;
data.py: mainly preprocesses the dataset;
run_main.py: mainly generates the pkl file of the vocabulary;
utils.py: contains the decoder for labels.

4.Notes

We define two kinds of entities: "KNOW" and "PRIN". "KNOW" represents objective mathematical knowledge, while "PRIN" denotes an abstract mathematical theorem or method. "B" marks the first character of an entity, and "I" marks its other characters. Non-entities are marked as "O".
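The tagging scheme above can be illustrated with a minimal sketch. The tag strings such as "B-KNOW" and "I-PRIN", the helper name `spans_to_bio`, and the example sentence with its annotation are assumptions made for illustration; the dataset's actual label format may differ.

```python
def spans_to_bio(text, spans):
    """Turn character-level entity spans into one tag per character.

    spans: list of (start, end, entity_type), with `end` exclusive.
    The "B-TYPE" / "I-TYPE" string format is an assumption.
    """
    tags = ["O"] * len(text)                      # non-entities default to "O"
    for start, end, entity_type in spans:
        tags[start] = "B-" + entity_type          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_type          # other characters of the entity
    return tags


# Hypothetical example, not taken from the dataset:
# "勾股定理" (a theorem) as PRIN, "直角三角形" (a concept) as KNOW.
sentence = "用勾股定理求直角三角形的边长"
spans = [(1, 5, "PRIN"), (6, 11, "KNOW")]
tags = spans_to_bio(sentence, spans)

for ch, tag in zip(sentence, tags):
    print(ch, tag)
```

Each character receives exactly one tag, so the tag sequence always has the same length as the sentence.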

5.Usage Details

After downloading this project, follow these steps to run our program.

(1) First, download the mathematical NLP datasets: https://blog.csdn.net/qq_36426650/article/details/87719204. This webpage also has detailed notes on how to use the datasets. The page is in Chinese; if you do not read Chinese, you can have your browser translate it into the language you prefer.

The datasets cover two subjects: junior middle school mathematics and high school mathematics. Each dataset consists of two files, the training set "ner_train_data" and the test set "ner_test_data".

You should also download the Chinese word embeddings. We have pretrained 300-dimensional word embeddings on Wikipedia: word2vec, glove and gwe.

(2) Second, create a new directory to store the dataset.

(3) Open the file run_main.py and edit the value of the variable "file". Then generate a word2id.pkl file by running:

python3 run_main.py "<dataset directory>"

For example, if you create a folder named "math" and put the datasets in it, the command is "python3 run_main.py math".

After that, you will get a new file named "word2id.pkl".
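For readers unfamiliar with this kind of vocabulary file, here is a rough sketch of what building and pickling a word2id mapping can look like. It is not the code in run_main.py: the character-level vocabulary, the "<PAD>" and "<UNK>" entries, and the helper names `build_word2id` and `encode` are all assumptions.

```python
import os
import pickle
import tempfile
from collections import Counter


def build_word2id(sentences):
    """Map each character to an integer id, most frequent first.

    Ids 0 and 1 are reserved for padding and unknown characters
    (an assumption; the real file may use other conventions).
    """
    counts = Counter(ch for sent in sentences for ch in sent)
    word2id = {"<PAD>": 0, "<UNK>": 1}
    for ch, _ in counts.most_common():
        word2id[ch] = len(word2id)
    return word2id


def encode(sentence, word2id):
    """Convert a sentence to ids, falling back to <UNK> for unseen characters."""
    return [word2id.get(ch, word2id["<UNK>"]) for ch in sentence]


# Hypothetical two-sentence corpus.
sentences = ["三角形的面积", "三角形的周长"]
word2id = build_word2id(sentences)

# Round-trip through a temporary directory so that the sketch
# never overwrites a real word2id.pkl.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "word2id.pkl")
    with open(path, "wb") as f:
        pickle.dump(word2id, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(encode("三角函数", restored))
```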

(4) If you want to train the model yourself, run:

python3 main.py --train_data=<trainset file name> --CRF=<True or False> --embedding_type=<word embedding type> --mode="train"

For example:

python3 main.py --train_data="highmath_data" --CRF=True --embedding_type="glove" --mode="train"

The model will be stored in new files.

Of course, you can change some hyper-parameters in file main.py.
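As a rough illustration of how flags like these can be parsed, the sketch below uses argparse. It is not the parser in main.py: the defaults, the `str2bool` helper and the `build_parser` function are assumptions; only the flag names come from the commands in this README.

```python
import argparse


def str2bool(value):
    """Parse "True"/"False" explicitly.

    Passing type=bool to argparse would treat any non-empty string,
    including "False", as True.
    """
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected True or False, got %r" % value)


def build_parser():
    # All defaults below are assumed for illustration.
    parser = argparse.ArgumentParser(description="BiLSTM+CRF entity recognition (sketch)")
    parser.add_argument("--train_data", type=str, default=None)
    parser.add_argument("--test_data", type=str, default=None)
    parser.add_argument("--CRF", type=str2bool, default=True)
    parser.add_argument("--embedding_type", type=str, default="glove")
    parser.add_argument("--mode", choices=["train", "test", "demo"], default="train")
    parser.add_argument("--demo_model", type=str, default=None)
    return parser


# Equivalent to the training command shown above.
args = build_parser().parse_args(
    ["--train_data=highmath_data", "--CRF=True", "--embedding_type=glove", "--mode=train"]
)
print(args.train_data, args.CRF, args.embedding_type, args.mode)
```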

(5) After training, you can evaluate the model by precision, recall and F1 (PRF1) by running main.py in test mode:

python3 main.py --test_data=<testset file name> --CRF=<True or False> --embedding_type=<word embedding type> --mode="test"

For example:

python3 main.py --test_data="highmath_data" --CRF=True --embedding_type="glove" --mode="test"
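To make the PRF1 numbers concrete, the following sketch computes entity-level precision, recall and F1 in pure Python. It is not the perl code that eval.py runs; the exact-span matching rule, the "B-TYPE" / "I-TYPE" tag format, and the function names are assumptions.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) spans in a tag sequence; end is exclusive."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):       # trailing "O" closes the last entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))               # current entity ends here
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]                  # a new entity begins
    return spans


def prf1(gold_tags, pred_tags):
    """Entity-level scores: a prediction counts only if span and type both match."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical tags: the predicted KNOW entity is cut two characters short.
gold = ["O", "B-PRIN", "I-PRIN", "I-PRIN", "I-PRIN", "O",
        "B-KNOW", "I-KNOW", "I-KNOW", "I-KNOW", "I-KNOW", "O"]
pred = ["O", "B-PRIN", "I-PRIN", "I-PRIN", "I-PRIN", "O",
        "B-KNOW", "I-KNOW", "I-KNOW", "O", "O", "O"]
print(prf1(gold, pred))
```

In this example one of the two predicted entities matches a gold entity exactly, so precision, recall and F1 are all 0.5.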

(6) We also provide a demo in which you feed a single sentence to the model. Run:

python3 main.py --mode="demo" --demo_model=<model directory>

You can then type a sentence at the command line, and the model returns all entities with their tags.
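The last step of such a demo, turning per-character tags back into entities, might look like the sketch below. It is not the decoder in utils.py: the function name `decode_entities`, the "B-TYPE" / "I-TYPE" tag format, and the example sentence are assumptions.

```python
def decode_entities(sentence, tags):
    """Collect (entity_text, entity_type) pairs from per-character tags."""
    entities = []
    chars, etype = [], None

    def flush():
        # Store the entity being built, if any.
        if chars:
            entities.append(("".join(chars), etype))

    for ch, tag in zip(sentence, tags):
        if tag.startswith("B-"):
            flush()
            chars, etype = [ch], tag[2:]           # start a new entity
        elif tag.startswith("I-") and etype == tag[2:]:
            chars.append(ch)                       # extend the current entity
        else:
            flush()                                # "O" or a stray "I-" tag ends it
            chars, etype = [], None
    flush()                                        # entity that runs to the end
    return entities


# Hypothetical model output for one input sentence.
sentence = "用勾股定理求直角三角形的边长"
tags = ["O", "B-PRIN", "I-PRIN", "I-PRIN", "I-PRIN", "O",
        "B-KNOW", "I-KNOW", "I-KNOW", "I-KNOW", "I-KNOW",
        "O", "O", "O"]
print(decode_entities(sentence, tags))
```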

6.Result Demonstration

The results of our pretrained model are shown below:

[Figure: results of the pretrained model]

The output of our demo is shown below:

[Figure: demo output]


License:Apache License 2.0


Languages

Language:Python 100.0%