Word-Segmentation

Chinese Word Segmentation with FMM, BMM, BiMM, MMSeg, HMM, BiLSTM-CRF and BERT-CRF.

Corpus

Methods

Rule based:
- Forward Maximum Matching Method (FMM)
- Backward Maximum Matching Method (BMM)
- Bi-directional Maximum Matching Method (BiMM)
- Maximum Matching Segment (MMSeg)
ML based:
- Hidden Markov Model (HMM)
DL based:
- BiLSTM-CRF
- BERT-CRF
Tools:
- jieba
- HanLP
- thulac
- LTP
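
As an illustration of the rule-based methods listed above, here is a minimal sketch of forward, backward, and bi-directional maximum matching. It is not taken from this repository's code: the function names, the `max_len` parameter, the toy vocabulary, and the tie-breaking rule in `bimm` (fewer words, then fewer single-character words, then the backward result) are assumptions based on the common textbook formulation.

```python
def fmm(text, vocab, max_len=4):
    """Forward maximum matching: scan left to right, taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def bmm(text, vocab, max_len=4):
    """Backward maximum matching: scan right to left, taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # words were collected from the end, so reverse them


def bimm(text, vocab, max_len=4):
    """Bi-directional matching: run both directions and keep the 'better' result (assumed heuristic)."""
    forward, backward = fmm(text, vocab, max_len), bmm(text, vocab, max_len)
    if len(forward) != len(backward):
        return forward if len(forward) < len(backward) else backward
    singles = lambda ws: sum(len(w) == 1 for w in ws)
    return forward if singles(forward) < singles(backward) else backward


# Toy vocabulary for illustration only.
vocab = {"研究", "研究生", "生命", "的", "起源"}
print("/".join(fmm("研究生命的起源", vocab)))
print("/".join(bmm("研究生命的起源", vocab)))
print("/".join(bimm("研究生命的起源", vocab)))
```

On this toy input the forward pass greedily takes 研究生 and strands 命 as a single character, while the backward pass recovers 研究/生命; the bi-directional heuristic then prefers the backward result because it contains fewer single-character words.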

Results

Method	PKU P	PKU R	PKU F1	PKU Time(s)	MSR P	MSR R	MSR F1	MSR Time(s)
FMM	80.17	78.14	79.14	0.767	79.82	80.75	80.28	0.831
BMM	80.46	78.44	79.44	0.764	79.86	80.79	80.32	0.794
Bi-MM	80.59	78.49	79.53	1.537	80.08	80.94	80.51	1.626
MMSeg	80.57	78.47	79.51	23.108	80.01	80.87	80.44	10.818
HMM	79.22	77.22	78.21	1.258	75.69	77.79	76.73	1.308
BiLSTM-CRF*	92.92	92.24	92.58	11.651	96.74	96.47	96.61	10.630
BERT-CRF $^{\dagger*}$	97.10	96.10	96.60	23.334	98.33	98.10	98.21	20.495
jieba	85.26	78.66	81.83	0.490	81.51	80.92	81.22	0.541
HanLP	86.82	81.20	83.92	1.747	82.72	80.81	81.75	1.635
thulac	92.24	92.33	92.28	3.533	83.28	87.75	85.45	3.853
LTP*	95.55	93.83	94.68	16.040	86.63	89.66	88.12	31.100

* The model used an RTX 3090 GPU to accelerate inference.

$^\dagger$ BERT-CRF was trained for only 20 epochs; further training may improve its performance.

See the models directory for logs.
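
The F1 values in the table are consistent with the harmonic mean of P and R (for example, FMM on PKU: 2 × 80.17 × 78.14 / (80.17 + 78.14) ≈ 79.14). The sketch below shows one common way to compute word-level precision, recall, and F1 for segmentation, assuming a predicted word counts as correct only when both of its boundaries match the gold segmentation. The helper names `to_spans` and `prf` are hypothetical, and the repository's own evaluation script may differ in details.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 for one sentence (exact span match)."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Sentence2 from the Cases table: gold segmentation vs. the jieba output.
gold = "我/也/想/过过/过儿/过过/的/生活".split("/")
pred = "我/也/想/过/过/过儿/过过/的/生活".split("/")
p, r, f = prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")
```

For corpus-level scores, the correct, predicted, and gold word counts would normally be summed over all sentences before dividing, rather than averaging per-sentence scores.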

Requirements

python==3.7
torch==1.12
transformers==4.24
See requirements.txt.

Run

Prepare data

cd data
wget http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip
unzip icwb2-data.zip

cd icwb2-data
mkdir msr_processed pku_processed
cp training/msr_training.utf8 msr_processed/training.utf8
cp gold/msr_test_gold.utf8 msr_processed/test.utf8

cp training/pku_training.utf8 pku_processed/training.utf8
cp gold/pku_test_gold.utf8 pku_processed/test.utf8

Word dict

cd data
wget https://raw.githubusercontent.com/fxsjy/jieba/master/jieba/dict.txt
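
For reference, jieba's dict.txt stores one entry per line as a word, a frequency, and a part-of-speech tag separated by spaces. A minimal sketch of turning it into a vocabulary set for the dictionary-based methods follows; the function names are hypothetical, and the repository may load the file differently.

```python
def parse_dict_lines(lines):
    """Collect the first whitespace-separated field (the word) of each non-empty line."""
    vocab = set()
    for line in lines:
        fields = line.split()
        if fields:
            vocab.add(fields[0])
    return vocab


def load_vocab(path="data/dict.txt"):
    """Read the downloaded dictionary file; the default path assumes the commands above were run."""
    with open(path, encoding="utf-8") as fh:
        return parse_dict_lines(fh)


# Small in-memory example in the same "word freq tag" layout.
sample = ["研究 100 v\n", "研究生 50 n\n", "\n"]
print(sorted(parse_dict_lines(sample)))
```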

Run model

Evaluate FMM, BMM, BiMM, MMSeg, and HMM, plus the jieba, thulac, LTP, and HanLP tools:

bash scripts/eval.sh

Train and evaluate BiLSTM-CRF:

bash scripts/train_bilstm_crf.sh

Train and evaluate BERT-CRF:

bash scripts/train_bert_crf.sh

Web UI

Back-end

Based on Flask:

CUDA_VISIBLE_DEVICES=0 FLASK_APP=app.py flask run

Front-end

Based on create-react-app:

npm install
npm install react-scripts
NODE_OPTIONS=--openssl-legacy-provider npm start

Cases

Methods	Sentence1	Sentence2
FMM	玲/英/思前想后/，/对/哥哥/说/：/“/我/不忍心/扔下/穆/大爷/不管/啊/！/”/	我/也/想/过过/过儿/过过/的/生活
BMM	玲/英/思前想后/，/对/哥哥/说/：/“/我/不忍心/扔下/穆/大爷/不管/啊/！/”/	我/也/想/过过/过儿/过过/的/生活
Bi-MM	玲/英/思前想后/，/对/哥哥/说/：/“/我/不忍心/扔下/穆/大爷/不管/啊/！/”/	我/也/想/过过/过儿/过过/的/生活
MMSeg	玲/英/思前想后/，/对/哥哥/说/：/“/我/不忍心/扔下/穆/大爷/不管/啊/！/”/	我/也/想/过过/过儿/过过/的/生活
HMM	玲/英思前/想/后/，/对/哥哥/说/：“/我/不/忍心/扔/下穆大爷/不管/啊/！”	我/也/想/过过/过儿/过过/的/生活
BiLSTM-CRF	玲/英思/前/想/后/，/对/哥哥/说/：/“/我/不忍/心/扔/下/穆/大爷/不管/啊/！/”	我/也/想/过/过/过儿/过/过/的/生活
BERT-CRF	玲英/思前想后/，/对/哥哥/说/：/“/我/不/忍心/扔下/穆/大爷/不管/啊/！/”	我/也/想/过/过/过/儿/过/过/的/生活
jieba	玲英/思前想后/，/对/哥哥/说/：/“/我/不忍心/扔下/穆/大爷/不管/啊/！/”/	我/也/想/过/过/过儿/过过/的/生活
thulac	玲/英思/前想后/，/对/哥哥/说/：/“/我/不/忍心/扔下/穆/大爷/不/管/啊/！/”	我/也/想/过/过/过儿/过/过/的/生活
LTP	玲英思前想后/，/对/哥哥/说/：/“/我/不/忍心/扔下/穆/大爷/不/管/啊/！/”	我/也/想/过/过/过/儿/过/过/的/生活
HanLP	玲/英/思前想后/，/对/哥哥/说/：/“/我/不忍心/扔下/穆大爷/不管/啊/！/”/	我/也/想/过过/过/儿/过过/的/生活
Gold	玲英/思前想后/，/对/哥哥/说/：/“/我/不/忍心/扔下/穆/大爷/不管/啊/！/”	我/也/想/过过/过儿/过过/的/生活

References

MIT License
