XZJIsme / didide-BERT

保护地得,从我做起 (Protect 地 and 得, starting with me)


Because 的 (de), 地 (de), and 得 (de) are homophones in Mandarin Chinese, many people interchange them when typing, most often writing 的 in place of 得 and 地. At best this hurts readability; at worst it causes real ambiguity. That not everyone pays strict attention to correct usage is understandable, but it is also a pity, because using them correctly is not that difficult.

Sir, this way

Didide is a BERT-based model for classifying 的, 地, and 得, fine-tuned from a pretrained BERT checkpoint. The training data is generated from the Chinese Wikipedia corpus.

Get started now

conda create -n didide python=3.10 -y
conda activate didide
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch -y
pip install transformers datasets

Convert a TensorFlow pre-trained BERT model to PyTorch

conda install tensorflow -y
mkdir BERT-trained
cd BERT-trained
wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
unzip chinese_L-12_H-768_A-12.zip
cd ..
python bert_converter.py
# Of course, you can also use the default Chinese BERT model provided by Hugging Face, which may work even better.
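
For reference, this conversion step is typically a thin wrapper around the checkpoint converter that ships with transformers; the sketch below shows the idea. It assumes the standard file names inside Google's checkpoint zip and is not necessarily identical to the repo's bert_converter.py.

# A minimal sketch of the TF -> PyTorch conversion (assumptions noted above);
# bert_converter.py may differ.
from transformers.models.bert.convert_bert_original_tf_checkpoint_to_pytorch import (
    convert_tf_checkpoint_to_pytorch,
)

convert_tf_checkpoint_to_pytorch(
    tf_checkpoint_path="BERT-trained/chinese_L-12_H-768_A-12/bert_model.ckpt",
    bert_config_file="BERT-trained/chinese_L-12_H-768_A-12/bert_config.json",
    pytorch_dump_path="BERT-trained/chinese_L-12_H-768_A-12/pytorch_model.bin",
)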

Generate DidideData

# python dataset_generator.py
"""
Here we use wiki_zh from https://github.com/brightmart/nlp_chinese_corpus
A sample of processed wiki_zh data is in data/samples, named wiki_zh_mini.pkl,
which contains a list generated by the script.
"""
import glob
import os

files = []
# for file in glob.glob("yourcorpus/**", recursive=True):
# ↑ If you use a customized corpus, some modifications are needed; see the (easy-to-read) script for details.
for file in glob.glob("data/wiki_zh/**", recursive=True):
    if os.path.isdir(file):
        continue
    files.append(file)
# Total samples:  311569
# 的: 122867, 地: 129195, 得: 59507
"""
To get a better class distribution, the number of 的 samples is reduced.
"""

Train the model

python model_training.py
# With a good hyperparameter setting, only a few epochs are needed to reach about 96% accuracy on the test set.
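
As the final section notes, the model is simply BERT with an MLP head on top. A minimal sketch of such an architecture (the class name, head sizes, and [MASK]-position formulation are my assumptions, not the repo's exact code):

import torch
import torch.nn as nn
from transformers import BertModel

class DidideModel(nn.Module):
    """BERT encoder plus an MLP classifier over the [MASK] position (3 classes)."""

    def __init__(self, bert_name="bert-base-chinese", num_labels=3):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask, mask_positions):
        states = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Pick the hidden state at each sample's [MASK] position, then classify it.
        masked_states = states[torch.arange(states.size(0)), mask_positions]
        return self.mlp(masked_states)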

Trained models

百度网盘 (Baidu Netdisk)

Give it a try

python playground.py "我觉的我烦的有点难过,因为我得培根忘记吃了" "didide_model.pt"
# the output would be: 我觉得我烦得有点难过,因为我的培根忘记吃了
python playground.py "我觉的我烦的有点难过,因为我得培根忘记吃了,而且这种东西得营养一般般,但是好吃的哟!我天天早上开心的享受它的味道,开心的受不鸟哩!我咔咔的吃,吃的要满嘴流油 ,哈哈哈,痛快放肆的吃" "didide_model.pt"
# the output would be: 我觉得我烦得有点难过,因为我的培根忘记吃了,而且这种东西的营养一般般,但是好吃的哟!我天天早上开心地享受它的味道,开心得受不鸟哩!我咔咔地吃,吃得要满嘴流油,哈哈哈,痛快放肆地吃
python playground.py "我要飛的更高,測試一下繁體預測的對不對,分類的還不錯" "didide_model.pt"
# the output would be: 我要飛得更高,測試一下繁體預測的對不對,分類得還不錯
# It also works well with traditional Chinese, since 的/地/得 are written the same in both scripts and therefore map to the same input IDs.
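
For the curious, the playground loop amounts to masking each 的/地/得 in turn and substituting the predicted character. A minimal sketch, assuming the hypothetical DidideModel interface from the training sketch above (playground.py itself may differ):

import torch
from transformers import BertTokenizer

def correct(text, model, tokenizer):
    """Replace each 的/地/得 with the model's prediction, left to right."""
    chars = list(text)
    model.eval()
    with torch.no_grad():
        for i, ch in enumerate(chars):
            if ch in "的地得":
                masked = "".join(chars[:i]) + "[MASK]" + "".join(chars[i + 1:])
                enc = tokenizer(masked, return_tensors="pt")
                mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
                logits = model(enc["input_ids"], enc["attention_mask"], mask_pos)
                chars[i] = "的地得"[logits.argmax(-1).item()]
    return "".join(chars)

# tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# print(correct("我咔咔的吃", model, tokenizer))  # -> 我咔咔地吃 (expected)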

Final thoughts

This model is really just BERT with an MLP on top. A few lessons learned: first, the data distribution affects training performance, which is why the share of 地 and 得 samples was increased; second, once the dataset became much larger, training for two days and nights did indeed yield correspondingly better results. Transformer-family models really are a case of "scale works miracles"!

This project is for learning and exchange only. If you run into a problem, please open an issue! If you have good suggestions, feel free to open an issue too!

Also, contributions to this project are welcome! A trained model has been released. Take a look and have fun.

ToDo

  • Add a trained model
  • Generate the test set the same way the playground processes input, rather than the way the training data is generated, so the model's robustness can be evaluated more realistically.
  • Provide a lightweight model via quantization.
  • Write a detailed introduction in Chinese. If you're a **, say hi!
  • Yes, nobody cares about this repo... (sobs)

