awesome-bert-japanese

📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information

Pre-trained BERT models for Japanese differ in how they segment sentences into words and how they split words into subwords, and the vocabulary used for subword segmentation can itself be constructed in several ways.

A list of pre-trained BERT models for Japanese. Japanese is a complicated language: it has no word boundaries and uses many kinds of characters, so word segmentation is required before tokenizing words into subwords. This repository summarizes the publicly available pre-trained BERT models for Japanese in a table, by word segmentation algorithm, subword tokenization algorithm, and the algorithm used to construct the vocabulary for subword tokenization.
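
As a concrete illustration of this two-stage pipeline, the sketch below loads the Tohoku University (a) model from the table via Hugging Face transformers, whose BertJapaneseTokenizer runs MeCab word segmentation followed by WordPiece. This is a minimal sketch, assuming the transformers, fugashi, and ipadic packages are installed; cl-tohoku/bert-base-japanese is the Hugging Face ID of that model.

```python
# Minimal sketch, assuming transformers, fugashi, and ipadic are installed.
# BertJapaneseTokenizer chains the two stages from the table below:
#   stage 1: MeCab (mecab-ipadic) segments the sentence into words
#   stage 2: WordPiece splits each word into subwords
from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

text = "日本語は単語境界を持たない言語です。"
print(tokenizer.tokenize(text))  # e.g. ['日本語', 'は', '単語', '境界', ...]
```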

Model

| Model | Sentence -> Words | Word -> Subword | Algorithm for constructing vocabulary used in subword tokenization |
|---|---|---|---|
| Google (Multilingual BERT) | Whitespace | WordPiece | BPE? |
| Kikuta | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
| Hotto Link Inc. | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
| Kyoto University | Juman++ (JUMANDIC?) | WordPiece | subword-nmt (BPE) |
| Stockmark Inc. (a) | MeCab (mecab-ipadic-neologd) | -- | -- |
| Tohoku University (a) | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
| Tohoku University (b) | MeCab (mecab-ipadic) | Character | Sentencepiece (model_type=character) |
| NICT (a) | MeCab (mecab-jumandic) | WordPiece | subword-nmt (BPE) |
| NICT (b) | MeCab (mecab-jumandic) | -- | -- |
| akirakubo (a) | MeCab (unidic-cwj) for Wikipedia and Aozora Bunko in modern kana (新仮名) + MeCab (unidic_qkana) for Aozora Bunko in historical kana (旧仮名) | WordPiece | subword-nmt (BPE) |
| akirakubo (b) | SudachiPy (SudachiDict_core, A mode) for Wikipedia and Aozora Bunko in modern kana (新仮名) + MeCab (unidic_qkana) for Aozora Bunko in historical kana (旧仮名) | WordPiece | subword-nmt (BPE) |
| The University of Tokyo | MeCab (mecab-ipadic-neologd + user dic (J-MeDic)) | WordPiece | ? (BPE) |
| Laboro.AI Inc. | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
| Bandai Namco Research Inc. | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
| Retrieva, Inc. | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
| Waseda University | Juman++ (JUMANDIC) | WordPiece | Sentencepiece (model_type=unigram) |
| LINE Corp. | MeCab (mecab-unidic) | WordPiece | Sentencepiece (model_type=bpe) |
| Stockmark Inc. (b) | MeCab (mecab-ipadic-neologd) | WordPiece | Sentencepiece (model_type=?) |
  • NICT: National Institute of Information and Communications Technology
  • without word segmentation: the sentence is split directly into subwords, without segmenting it into words first (see the sketch after this list)
  • For the models by Tohoku University, MeCab + mecab-ipadic-neologd is used for sentence segmentation (thanks @ikuyamada san!)
  • For the models by akirakubo, documents in Aozora Bunko are classified into two categories based on the type of kana spelling they use (thanks @kkadowa san and @akirakubo san!)
  • For DistilBERT (by Bandai Namco Research Inc.), the same word segmentation and vocabulary construction algorithm are used for both the teacher and student models.
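
For the "without word segmentation" rows (Kikuta, Hotto Link Inc., Laboro.AI Inc.), SentencePiece is trained directly on raw text with model_type=unigram. Below is a minimal sketch of that setting, assuming the sentencepiece package is installed; the corpus path corpus.txt and the vocabulary size are placeholders, not values from any of the listed models.

```python
# Minimal sketch, assuming the sentencepiece package is installed.
# No MeCab step in front: raw sentences go straight into subword training.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder: raw Japanese text, one sentence per line
    model_prefix="ja_unigram",   # writes ja_unigram.model / ja_unigram.vocab
    vocab_size=32000,            # placeholder vocabulary size
    model_type="unigram",        # the setting used by Kikuta, Hotto Link, Laboro.AI
)

sp = spm.SentencePieceProcessor(model_file="ja_unigram.model")
print(sp.encode("日本語の文をそのままサブワードに分割します", out_type=str))
```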

