hinofafa / Bert-Chinese-Text-Classification-Wandb


BERT/ERNIE Chinese Text Classification in PyTorch

License: MIT

Chinese text classification using BERT (Bidirectional Encoder Representations from Transformers), BERT variants, and ERNIE (Enhanced Representation through Knowledge Integration), implemented in PyTorch, with training monitored by WandB (Weights & Biases).

Monitoring

[screenshot: WandB training dashboard]

Weights & Biases is a machine learning platform that helps developers build better models faster. Use W&B's lightweight, interoperable tools to track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results, spot regressions, and share findings with colleagues. For details, see https://wandb.ai.

Platform

Machine: 1× GTX 1650 Max-Q with 4 GB VRAM; keep batch size × padding size <= 512 (e.g. batch size 16 with padding size 32, or batch size 4 with padding size 128). If your GPU has more than 4 GB of VRAM, consider enlarging the padding size and batch size (keeping both powers of 2). Average training time: 25-45 minutes.

Environment

Python 3.7

Libraries (install via pip or conda; a combined install command follows the list):

  • pytorch 1.1
  • tqdm
  • sklearn
  • tensorboardX
  • boto3, wandb (model monitoring)
  • matplotlib (EDA)
  • pytorch_pretrained_bert (the pretrained-model loading code is already bundled in this repository, so installing this package is not required)
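
A one-shot install of the list above (package names are assumptions on my part; note that sklearn's pip package name is scikit-learn, and torch should match your CUDA setup):

# install all dependencies at once; pin versions as needed
pip install torch tqdm scikit-learn tensorboardX boto3 wandb matplotlib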

Chinese Dataset

CLUE (Chinese GLUE) public challenge

119 classes: 打車 地圖導航 免費WIFI 租車 同城服務 快遞物流 婚慶 家政 公共交通 政務 社區服務 薅羊毛 魔幻 仙俠 卡牌 飛行空戰 射擊遊戲 休閒益智 動作類 體育競技 棋牌中心 經營養成 策略 MOBA 輔助工具 約會社交 即時通訊 工作社交 論壇圈子 婚戀社交 情侶社交 社交工具 生活社交 微博博客 新聞 漫畫 小說 技術 教輔 問答交流 搞笑 雜誌 百科 影視娛樂 求職 兼職 視頻 短視頻 音樂 直播 電台 K歌 成人 中小學 職考 公務員 英語 視頻教育 高等教育 成人教育 藝術 語言(非英語) 旅游資訊 綜合預定 民航 鐵路 酒店 行程管理 民宿短租 出國 工具 親子兒童 母嬰 駕校 違章 汽車諮詢 汽車交易 日常養車 行車輔助 租房 買房 裝修家居 電子產品 問診掛號 養生保健 醫療服務 減肥瘦身 美妝美業 菜譜 餐飲店 體育咨訊 運動健身 支付 保險 股票 借貸 理財 彩票 記賬 銀行 美顏 影像剪輯 攝影修圖 相機 繪畫 二手 電商 團購 外賣 電影票務 社區超市 購物諮詢 筆記 辦公 日程管理 女性 經營 收款 其他

Dataset Segmentation:

| Split | Rows |
|-------|------|
| Train | 12k  |
| Valid | 2.5k |
| Test  | 2.5k |

Use Your Own Dataset

  • Format your Chinese dataset to the specified format.
  • txt file format, one sample per row: {text}{tab}{label id}
  • Code is provided to transform a JSON file into the text format (a minimal sketch follows this list).
    • json file format, one record per row: {"label":"11","label_des":"class","sentence":"xxx"}
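
For reference, a minimal sketch of this JSON-to-txt conversion (the repository's own script is json_to_txt.py; the function and file names below are illustrative):

import json

def convert_json_to_txt(json_path, txt_path):
    # read JSON-lines records and write the tab-separated training format
    with open(json_path, encoding="utf-8") as fin, \
         open(txt_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            # each output row: {text}{tab}{label id}
            fout.write(f"{record['sentence']}\t{record['label']}\n")

convert_json_to_txt("THUCNews/data/train.json", "THUCNews/data/train.txt")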

Result

| Model     | Accuracy | Padding size (words per sentence) | Remarks     |
|-----------|----------|-----------------------------------|-------------|
| bert      | 53.83%   | 32                                |             |
| bert      | 55.18%   | 128                               |             |
| bert_CNN  | 58.25%   | 32                                | BERT + CNN  |
| bert_CNN  | 58.21%   | 64                                | BERT + CNN  |
| bert_RCNN | 55.91%   | 32                                | BERT + RCNN |
| ERNIE     | 58.95%   | 32                                | ERNIE       |
| ERNIE     | 59.25%   | 64                                | ERNIE       |

Earlier tests on TextCNN, TextRNN, TextRNN+Attention, TextRCNN, DPCNN and Transformer all performed worse than the BERT family and ERNIE models. BERT models with extra components (e.g. a CNN or RCNN head) are slightly better than the plain BERT model, and ERNIE performs best of all the models tested.

Model Introduction

BERT

[figure: BERT architecture]

ERNIE

[figure: ERNIE architecture]

Pretrained Language Models

BERT pretrained files go in the /bert_pretrain directory and ERNIE files in /ERNIE_pretrain. Both directories must contain the following files (see the layout sketch after the list):

  • pytorch_model.bin
  • bert_config.json
  • vocab.txt
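
For reference, each pretrain directory should end up laid out like this (shown for /bert_pretrain; /ERNIE_pretrain is analogous):

bert_pretrain/
├── pytorch_model.bin
├── bert_config.json
└── vocab.txt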

Pretrained model download sites:

bert_Chinese:
  • model: https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz
  • vocab: https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt
  • alternative: https://huggingface.co/ckiplab/bert-base-chinese/tree/main
  • spare model: https://pan.baidu.com/s/1qSAD5gwClq7xlgzl_4W3Pw

ERNIE_Chinese:
  • model: http://image.nghuyong.top/ERNIE.zip
  • spare model: http://pan.nghuyong.top/#/s/y7Uz (inside ERNIE_pretrain, rename config.json to bert_config.json)

After decompressing the downloads, make sure all required files are present in the corresponding pretrain directories.

Instructions

This repository is built around JSON input files. All JSON files are expected to live in the THUCNews/data directory.

# json_to_txt.py generates train.txt, dev.txt, test.txt and test2.txt from the JSON files in THUCNews/data
python json_to_txt.py --datafile THUCNews/data

# model training requires a train set (train.txt), a validation set (dev.txt) and a labelled test set (note: test.txt and dev.txt are identical)

# model inference requires the separate test set (test2.txt)

Execution (after downloading the pretrained models)

# Train:
## model options: bert, bert_CNN, bert_RNN, bert_RCNN, bert_DPCNN, ERNIE
# bert
python run.py --model bert

# bert + other extensions
python run.py --model bert_CNN

# ERNIE
python run.py --model ERNIE

# choose one of the models for inference (a trained model.ckpt must exist under THUCNews/saved_dict)
## model options: bert, bert_CNN, bert_RNN, bert_RCNN, bert_DPCNN, ERNIE
python pred.py --model ERNIE
# generates test_pred.json as the inference result

WandB

pip install wandb
wandb login
# paste the API key from the wandb login page into the command prompt

# Before training, you can rename your experiment by editing model_name (line 12) in models/bert_xxx.py; that name is recorded as the experiment name on your WandB page.

# Open your WandB project page to monitor the training process
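
As an illustration, the logging pattern behind this integration looks like the sketch below (project, run and metric names here are assumptions, not the repo's exact code):

import wandb

# illustrative only: project, run and metric names are assumed
wandb.init(project="bert-chinese-text-classification", name="bert_demo")
for step, loss in enumerate([0.9, 0.6, 0.4]):
    wandb.log({"train/loss": loss, "step": step})  # shows up live on the WandB dashboard
wandb.finish()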

Parameters

All models live under the /models directory, together with their hyperparameters and model architectures.
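
For orientation, each model file typically exposes a config object along these lines (attribute names and values here are illustrative assumptions, not the repo's exact settings):

class Config:
    # illustrative hyperparameter container in the style of models/*.py
    def __init__(self):
        self.model_name = "bert"            # experiment name reported to WandB
        self.num_classes = 119              # number of dataset classes
        self.num_epochs = 3
        self.batch_size = 16
        self.pad_size = 32                  # words per sentence; see the Result table
        self.learning_rate = 5e-5
        self.bert_path = "./bert_pretrain"  # pytorch_model.bin, bert_config.json, vocab.txt
        self.hidden_size = 768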

TBC

  • Add model explanation part using SHAP values
  • Display SHAP plots on WandB

Paper

[1] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", https://arxiv.org/abs/1810.04805

[2] Sun et al., "ERNIE: Enhanced Representation through Knowledge Integration", https://arxiv.org/abs/1904.09223
