ustcmike / modern_chinese_nlp

(WIP) My humble contribution to the democratization of Chinese NLP technology

Modern (Deep) Chinese NLP Models

My humble contribution to the democratization of Chinese NLP technology (currently based on the fast.ai library).

[WIP] This project is still in a very early development stage. Things might change dramatically in the near future.

2019-04-25 Update

This project has lost its purpose since BERT released multilingual and Chinese versions of its pretrained models. Readers are advised to check out those models and other similar projects (e.g. the Universal Sentence Encoder).

Development Notes

This codebase is under a major overhaul. Previously it depended heavily on fast.ai v0.7, which does a lot of things besides NLP. fast.ai v0.7 has now been replaced with a lightweight, general-purpose PyTorch helper bot and an NLP package, dekisugi, within this library. I'm considering adding dependencies on other well-maintained and modularized libraries (e.g. torchtext, AllenNLP) to reduce the future maintenance workload.

Currently only the LSTM models have been both migrated and tested. The QRNN models have been migrated but not tested, and the Transformer models have not been migrated yet.

The old fast.ai notebooks and model code can be found under the legacy folder. This repo also has a fastai_based branch that still uses fast.ai v0.7.

Workflow and Scripts

Wikipedia

Tokenization:

  • scripts/wiki_tokenize_json.py: Character-level and word-level (using THULAC) tokenization of the Wikipedia JSON dump file (see the sketch after this list).
  • scripts/wiki_sp_tokenize_json.py: SentencePiece tokenization of the Wikipedia JSON dump file.
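
The two approaches differ only in the unit of segmentation. Below is a minimal sketch (not the actual script) of character-level and THULAC word-level tokenization, assuming the article text has already been extracted from the JSON dump; file handling and the real script's options are omitted.

```python
# Minimal sketch of character-level vs. word-level tokenization for Chinese
# text. Assumes the `thulac` package is installed; paths and options of the
# real script are omitted.
import thulac

# seg_only=True disables POS tagging, so cut() returns plain segments.
segmenter = thulac.thulac(seg_only=True)

def char_tokenize(text: str) -> list:
    """Character-level: every (non-space) character is a token."""
    return [ch for ch in text if not ch.isspace()]

def word_tokenize(text: str) -> list:
    """Word-level: segment with THULAC."""
    # cut(..., text=True) returns a space-separated string of segments.
    return segmenter.cut(text, text=True).split()

sample = "自然语言处理很有趣"
print(char_tokenize(sample))  # ['自', '然', '语', '言', ...]
print(word_tokenize(sample))  # e.g. ['自然', '语言', '处理', '很', '有趣']
```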

Language Model:

  • scripts/language_model/train_rnn_language_model.py: Train an LSTM language model.
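
The training objective is standard next-token prediction on the tokenized Wikipedia text. Below is a minimal sketch of this kind of model in plain PyTorch; the vocabulary size, layer sizes, and dropout are placeholders rather than the script's actual hyperparameters.

```python
# Minimal sketch of an LSTM language model; sizes are illustrative only.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 300,
                 hidden_dim: int = 1024, num_layers: int = 3, dropout: float = 0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers,
                            dropout=dropout, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, seq_len) token ids
        emb = self.embedding(tokens)
        output, hidden = self.lstm(emb, hidden)
        # Project every time step back onto the vocabulary.
        return self.decoder(output), hidden

# Training objective: predict the next token at every position.
model = LSTMLanguageModel(vocab_size=32000)
criterion = nn.CrossEntropyLoss()
batch = torch.randint(0, 32000, (8, 64))   # fake batch of token ids
logits, _ = model(batch[:, :-1])           # inputs
loss = criterion(logits.reshape(-1, 32000),
                 batch[:, 1:].reshape(-1)) # targets shifted by one position
```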

Douban Sentiment Corpus

Tokenization:

  • scripts/douban_sp_preprocess.py: SentencePiece tokenization of the Douban corpus.
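
SentencePiece works directly on raw text, so no character- or word-level pre-tokenization is needed. Below is a minimal sketch of training and applying a SentencePiece model; the file names, vocabulary size, and model type are placeholders, not the values used by the script.

```python
# Minimal sketch of SentencePiece subword tokenization on a plain-text corpus.
import sentencepiece as spm

# Train a unigram subword model on the raw (untokenized) corpus.
spm.SentencePieceTrainer.train(
    input="douban_reviews.txt",   # placeholder corpus file
    model_prefix="douban_sp",
    vocab_size=8000,
    character_coverage=0.9995,    # high coverage is recommended for Chinese
    model_type="unigram",
)

# Load the trained model and encode a review into subword pieces / ids.
sp = spm.SentencePieceProcessor(model_file="douban_sp.model")
print(sp.encode("这部电影太好看了", out_type=str))  # subword pieces
print(sp.encode("这部电影太好看了"))                 # integer ids
```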

Language Model fine-tuning:

  • scripts/douban_pretrain_lm.py
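
The idea is to start from the Wikipedia language model's weights and continue training on the Douban review text so the model adapts to the movie-review domain. Below is a minimal sketch of that step, reusing the LSTMLanguageModel class from the sketch above; the checkpoint path and learning rate are placeholders, not artifacts of this repo's scripts.

```python
# Minimal sketch of language model fine-tuning: load the Wikipedia-pretrained
# weights and keep training on the Douban corpus. "wiki_lm.pth" is a
# placeholder checkpoint path.
import torch

model = LSTMLanguageModel(vocab_size=32000)
model.load_state_dict(torch.load("wiki_lm.pth", map_location="cpu"))

# A smaller learning rate helps preserve the pretrained weights while the
# model adapts to movie-review language.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# ... then run the same next-token-prediction training loop on Douban text.
```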

Sentiment classification model:

  • scripts/douban_sentiment.py
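
The classifier reuses the fine-tuned language model's embedding and LSTM layers as an encoder and adds a small classification head for the rating labels. Below is a minimal sketch, again building on the LSTMLanguageModel sketch above; the number of classes and the pooling choice are illustrative.

```python
# Minimal sketch of the sentiment model: pretrained encoder + linear head.
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    def __init__(self, encoder_lm, num_classes: int = 5, hidden_dim: int = 1024):
        super().__init__()
        # Reuse the fine-tuned language model's embedding and LSTM layers.
        self.embedding = encoder_lm.embedding
        self.lstm = encoder_lm.lstm
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        emb = self.embedding(tokens)
        output, _ = self.lstm(emb)
        # Use the last time step's hidden state as the document representation.
        return self.head(output[:, -1, :])
```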

Blog post(s)

Using the codebase based on fast.ai v0.7:

  1. [Preview] Developing Modern Chinese NLP Models - briefly describes the dataset preparation process and some preliminary results.
  2. [NLP] Four Ways to Tokenize Chinese Documents

Using dekisugi:

  1. [NLP] Using SentencePiece without Pretokenization

Environment

Pre-built Docker image: docker pull ceshine/pytorch-fastai. Alternatively, build the image yourself with the accompanying Dockerfile.

Available models

Sub-Word-Level Models

  • Language Model (Wikipedia Articles): next character prediction (see the sketch after this list).
  • Sentiment Analysis (Douban Movie Reviews): movie rating prediction.
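
To make the language model item above concrete, here is a minimal greedy-decoding sketch of next-token prediction, reusing the LSTMLanguageModel and SentencePiece processor from the sketches above (both illustrative placeholders, not this repo's actual code).

```python
# Minimal sketch of next-token prediction: feed a prefix, pick the most
# probable next subword, append it, and repeat.
import torch

@torch.no_grad()
def generate(model, sp, prefix: str, steps: int = 20) -> str:
    model.eval()
    ids = sp.encode(prefix)                    # subword ids of the prefix
    for _ in range(steps):
        logits, _ = model(torch.tensor([ids]))
        next_id = int(logits[0, -1].argmax())  # greedy choice
        ids.append(next_id)
    return sp.decode(ids)
```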

License: MIT License


Languages

  • Jupyter Notebook: 89.7%
  • Python: 10.3%
  • Dockerfile: 0.0%