lounlee / fasttext_jpn_model_neologd

fastText pre-trained model for Japanese, tokenized with mecab-ipadic-NEologd

Pre-trained model for fastText

Background

While working on Japanese NLP tasks with fastText and MeCab, I found that fastText.py (the fasttext 0.8.3 package on PyPI) had not been updated to match the current fastText (checked on 2017-09-12), so the pre-trained models provided in the fastText repository did not work with it. The pre-trained models by Kyubyong did work with fastText.py (0.8.3), since they were trained with an older version of fastText, but they appeared to be tokenized with the default MeCab dictionary. So I trained a fastText model on Japanese Wikipedia data (2017-08-20 dump) tokenized with mecab-ipadic-NEologd (2017-09-07).
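For reference, tokenization with the NEologd dictionary looks roughly like this. It is a minimal sketch, assuming the mecab-python3 bindings and an NEologd install at the path shown; the dictionary path and the sample sentence are illustrative assumptions, not part of this repo.

    import MeCab

    # Assumed NEologd install location; check yours with `mecab-config --dicdir`.
    NEOLOGD = "/usr/lib/mecab/dic/mecab-ipadic-neologd"

    # -Owakati makes MeCab emit space-separated surface forms (wakati-gaki),
    # which is the input format fastText expects.
    tagger = MeCab.Tagger("-Owakati -d " + NEOLOGD)

    def tokenize(text):
        """Return the list of tokens for a Japanese sentence."""
        return tagger.parse(text).strip().split()

    # NEologd keeps neologisms such as ペンパイナッポーアッポーペン and 恋ダンス
    # as single tokens, where the default IPA dictionary splits them apart.
    print(tokenize("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。"))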

Environment

How to train your own model

  1. Download the Japanese Wikipedia database backup dumps.
  2. Extract the running text into the data/ folder.
  3. Set up the environment.
  4. Run $ python build_corpus.py --lcode=ja --max_corpus_size=1000000000. Adjust max_corpus_size as needed.
  5. Run $ python ft.py to write the fastText word vectors to the data/ folder. Adjust min_count as needed in ft.py (see the sketch after this list).
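In outline, the training step does something like the following. This is a sketch rather than the repo's actual ft.py: it assumes the old fastText.py API (the fasttext 0.8.3 package on PyPI mentioned above), and the corpus filename and hyperparameter values are illustrative.

    import fasttext

    # Train skip-gram vectors on the tokenized corpus produced by
    # build_corpus.py (the filename below is an assumed example).
    model = fasttext.skipgram(
        "data/ja.txt",   # space-separated tokens, one document per line
        "data/ja",       # output prefix: writes data/ja.bin and data/ja.vec
        dim=300,         # vector size, matching the released model
        min_count=5,     # the knob step 5 says to adjust in ft.py
    )

    print(model["日本"][:10])  # a slice of one word's 300-dim vector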

Download pre-trained model

Click here (Vector size: 300, Vocabulary size: 92056)
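Loading it with fastText.py (0.8.3) then looks roughly like this; ja.bin is an assumed filename, since the actual name inside the download may differ.

    import fasttext

    # load_model reads the .bin file produced by training.
    model = fasttext.load_model("ja.bin")

    print(len(model.words))   # vocabulary size: 92056 for this model
    print(model["東京"][:5])   # first five dimensions of the vector for 東京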

License

MIT License