lounlee / fasttext_jpn_model_neologd

fastText pre-trained model for Japanese, tokenized with mecab-ipadic-NEologd

Pre-trained model for fastText

Background

While working on Japanese NLP tasks with fastText and MeCab, I found that fastText.py (the fasttext 0.8.3 package on PyPI) had not been updated to match the current fastText (checked on 2017-09-12), so the pre-trained models provided in the fastText repository did not work with it. The pre-trained models by Kyubyong did work with fastText.py (0.8.3), since they were trained with an older version of fastText, but they appeared to be tokenized with the default MeCab dictionary. So I trained a fastText model on Japanese Wikipedia data (2017-08-20 dump) tokenized with mecab-ipadic-NEologd (2017-09-07).
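For reference, tokenization with the NEologd dictionary looks roughly like this. It is a minimal sketch, assuming the mecab-python3 bindings and an NEologd install at the path shown; the dictionary path and the sample sentence are illustrative assumptions, not part of this repo.

    import MeCab

    # Assumed NEologd install location; check yours with `mecab-config --dicdir`.
    NEOLOGD = "/usr/lib/mecab/dic/mecab-ipadic-neologd"

    # -Owakati makes MeCab emit space-separated surface forms (wakati-gaki),
    # which is the input format fastText expects.
    tagger = MeCab.Tagger("-Owakati -d " + NEOLOGD)

    def tokenize(text):
        """Return the list of tokens for a Japanese sentence."""
        return tagger.parse(text).strip().split()

    # NEologd keeps neologisms such as ペンパイナッポーアッポーペン and 恋ダンス
    # as single tokens, where the default IPA dictionary splits them apart.
    print(tokenize("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。"))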

Environment

How to train your own model

  1. Download the Japanese Wikipedia database backup dumps.
  2. Extract the running text into the data/ folder.
  3. Set up the environment.
  4. Run $ python build_corpus.py --lcode=ja --max_corpus_size=1000000000. Adjust max_corpus_size as needed.
  5. Run $ python ft.py to write the fastText word vectors to the data/ folder. Adjust min_count as needed in ft.py (see the sketch after this list).
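In outline, the training step does something like the following. This is a sketch rather than the repo's actual ft.py: it assumes the old fastText.py API (the fasttext 0.8.3 package on PyPI mentioned above), and the corpus filename and hyperparameter values are illustrative.

    import fasttext

    # Train skip-gram vectors on the tokenized corpus produced by
    # build_corpus.py (the filename below is an assumed example).
    model = fasttext.skipgram(
        "data/ja.txt",   # space-separated tokens, one document per line
        "data/ja",       # output prefix: writes data/ja.bin and data/ja.vec
        dim=300,         # vector size, matching the released model
        min_count=5,     # the knob step 5 says to adjust in ft.py
    )

    print(model["日本"][:10])  # a slice of one word's 300-dim vector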

Download pre-trained model

Click here (Vector size: 300, Vocabulary size: 92056)
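Loading it with fastText.py (0.8.3) then looks roughly like this; ja.bin is an assumed filename, since the actual name inside the download may differ.

    import fasttext

    # load_model reads the .bin file produced by training.
    model = fasttext.load_model("ja.bin")

    print(len(model.words))   # vocabulary size: 92056 for this model
    print(model["東京"][:5])   # first five dimensions of the vector for 東京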

License

MIT License