akirakubo / bert-japanese-aozora

Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy

Japanese BERT trained on Aozora Bunko and Wikipedia

This is a repository of Japanese BERT trained on Aozora Bunko and Wikipedia.

Features

  • We provide models trained on Aozora Bunko, using works written in both contemporary and classical Japanese kana spelling.
  • Models trained on Aozora Bunko and Wikipedia are also available.
  • We trained models by applying different pre-tokenization methods (MeCab with UniDic and SudachiPy).
  • All models are trained with the same configuration as bert-japanese, except for tokenization: bert-japanese uses a SentencePiece unigram language model without pre-tokenization.
  • We provide models with 2M training steps.

Pretrained models

If you want to use models with 🤗 Transformers, see Converting Tensorflow Checkpoints.
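For reference, the snippet below is a minimal conversion sketch using load_tf_weights_in_bert from Transformers; the file names are placeholders for whichever checkpoint and config you download, and TensorFlow must be installed.

    # Minimal sketch: convert a released TensorFlow checkpoint to a PyTorch
    # model usable with Transformers. File names are placeholders.
    import torch
    from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

    config = BertConfig.from_json_file("bert_config.json")   # config shipped with the checkpoint
    model = BertForPreTraining(config)
    load_tf_weights_in_bert(model, config, "model.ckpt")      # TF checkpoint prefix (placeholder)
    torch.save(model.state_dict(), "pytorch_model.bin")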

When you use the models, you have to pre-tokenize your datasets with the same morphological analyzer and dictionary that were used for pretraining.
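For example, here is a minimal SudachiPy sketch in split mode A (see Details of pretraining below); the installed sudachidict_core should match the SudachiDict_core-20191224 release used for pretraining.

    # Minimal sketch: pre-tokenize one sentence with SudachiPy in split mode A,
    # as done for the SudachiPy-based models. Assumes sudachipy and a matching
    # sudachidict_core are installed.
    from sudachipy import dictionary, tokenizer

    tok = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.A
    text = "吾輩は猫である。"
    print(" ".join(m.surface() for m in tok.tokenize(text, mode)))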

When you run fine-tuning tasks, you may need to modify the official BERT code or the Transformers code. BERT日本語Pretrainedモデル - KUROHASHI-KAWAHARA LAB will help you out.

BERT-base

After pre-tokenization, texts are segmented into subwords with subword-nmt. The final vocabulary size is 32k.

Trained on Aozora Bunko (6M sentences)

Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603

Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603

Trained on Aozora Bunko (6M) and Japanese Wikipedia (1.5M)

Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603

Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603

Trained on Aozora Bunko (6M) and Japanese Wikipedia (3M)

Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603

Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603

Details of corpora

  • Aozora Bunko: Git repository as of 2019-04-21
    • git clone https://github.com/aozorabunko/aozorabunko and git checkout 1e3295f447ff9b82f60f4133636a73cf8998aeee.
    • We removed text files with 作品著作権フラグ (work copyright flag) = あり (yes) listed in index_pages/list_person_all_extended_utf8.zip; see the filtering sketch after this list.
  • Wikipedia (Japanese): XML dump as of 2018-12-20
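The copyright filtering above can be sketched roughly as follows; the CSV file name inside the zip is an assumption, while the column label is taken from the index.

    # Minimal sketch: keep only index rows whose 作品著作権フラグ (work copyright
    # flag) is not あり. The CSV name inside the zip is an assumption; check the
    # downloaded index before running.
    import csv
    import io
    import zipfile

    with zipfile.ZipFile("index_pages/list_person_all_extended_utf8.zip") as zf:
        with io.TextIOWrapper(zf.open("list_person_all_extended_utf8.csv"),
                              encoding="utf-8-sig") as f:  # utf-8-sig also reads plain UTF-8
            keep = [row for row in csv.DictReader(f) if row["作品著作権フラグ"] != "あり"]

    print(f"{len(keep)} works kept")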

Details of pretraining

Pre-tokenization

For each document, we identify the kana spelling and then pre-tokenize it with a morphological analyzer and the dictionary associated with that spelling: unidic-cwj or SudachiDict-core for contemporary kana spelling, and unidic-qkana for classical kana spelling.
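A minimal sketch of this dispatch with mecab-python3; the dictionary paths are placeholders for local installs of unidic-cwj-2.3.0 and UniDic-qkana_1603.

    # Minimal sketch: choose the MeCab dictionary by kana spelling and
    # pre-tokenize one line. Dictionary paths are placeholders; requires
    # mecab-python3.
    import MeCab

    TAGGERS = {
        "新仮名": MeCab.Tagger("-Owakati -d /path/to/unidic-cwj-2.3.0"),
        "旧仮名": MeCab.Tagger("-Owakati -d /path/to/unidic-qkana_1603"),
    }

    def pretokenize(line: str, spelling: str) -> str:
        """Return the line as space-separated short unit words."""
        return TAGGERS[spelling].parse(line).strip()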

In SudachiPy, we use split mode A ($ sudachipy -m A -a file) because it is equivalent to UniDic's short unit word (SUW), and unidic-cwj and unidic-qkana provide only SUW segmentation.

After pre-tokenization, we concatenate the Aozora Bunko texts with randomly sampled Wikipedia texts (or use Aozora Bunko alone) and build the vocabulary with subword-nmt.
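A rough sketch with subword-nmt's Python API; file names are placeholders, and the exact invocation used for the released vocabularies may differ.

    # Rough sketch: learn roughly 32k BPE merge operations on the pre-tokenized
    # corpus and segment one pre-tokenized line. File names are placeholders.
    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    with open("corpus.pretok.txt", encoding="utf-8") as infile, \
         open("bpe.codes", "w", encoding="utf-8") as outfile:
        learn_bpe(infile, outfile, num_symbols=32000)

    with open("bpe.codes", encoding="utf-8") as codes:
        bpe = BPE(codes)
    print(bpe.process_line("吾輩 は 猫 で ある 。"))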

Identifying kana spelling

Wikipedia

We assume that contemporary kana spelling is used.

Aozora Bunko

index_pages/list_person_all_extended_utf8.zip has a 文字遣い種別 (orthography type) column that records both the kanji form (旧字 old-form or 新字 new-form) and the kana spelling (旧仮名 classical or 新仮名 contemporary). We use only the kana spelling information.
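A minimal sketch of reading the kana spelling out of that column; the example values are typical 文字遣い種別 entries.

    # Minimal sketch: derive the kana spelling from a 文字遣い種別 value such as
    # 新字新仮名 or 新字旧仮名; the kanji part (旧字/新字) is ignored.
    def kana_spelling(moji_zukai: str) -> str:
        return "旧仮名" if "旧仮名" in moji_zukai else "新仮名"

    assert kana_spelling("新字旧仮名") == "旧仮名"
    assert kana_spelling("新字新仮名") == "新仮名"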

About

Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy

License:Apache License 2.0