- Compatible with fastText.py (fasttext 0.8.3 on PyPI)
- Tokenized with mecab-ipadic-NEologd
While working on a Japanese NLP task with fastText and MeCab, I found that fastText.py (fasttext 0.8.3 on PyPI) had not kept up with the current fastText (checked on 2017-09-12), so the pre-trained models provided in the fastText repository did not work with it. The pre-trained models by Kyubyong did work with fastText.py 0.8.3, because they were trained with an older version of fastText, but they appear to have been tokenized with the default MeCab dictionary. I therefore trained a fastText model on Japanese Wikipedia data (2017-08-20 dump) tokenized with mecab-ipadic-NEologd (2017-09-07).
- nltk >= 1.11.1
- regex >= 2016.6.24
- lxml >= 3.3.3
- numpy >= 1.11.2
- MeCab and the MeCab Python binding
- mecab-ipadic-NEologd
- fastText.py
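To illustrate how MeCab and mecab-ipadic-NEologd fit together, here is a minimal tokenization sketch. It is not taken from this repository's scripts. The helper names `make_tagger` and `tokenize` are made up for the example, and the dictionary directory depends on where NEologd was installed on your machine.

```python
def make_tagger(dicdir=None):
    """Create a MeCab tagger that emits space-separated tokens (wakati mode).

    `dicdir` is the directory of the dictionary to use, e.g. the
    mecab-ipadic-NEologd install directory (installation-dependent).
    """
    import MeCab  # imported lazily so tokenize() can be used with any tagger-like object

    args = "-Owakati"
    if dicdir:
        args += " -d " + dicdir
    return MeCab.Tagger(args)


def tokenize(text, tagger):
    """Return the list of surface tokens the tagger produces for `text`."""
    # In wakati mode, parse() returns the tokens joined by spaces, ending with a newline.
    return tagger.parse(text).split()


# Example (needs MeCab and NEologd; the path is a placeholder):
# tagger = make_tagger("/path/to/mecab-ipadic-neologd")
# print(tokenize("東京都に行く", tagger))
```

Passing the tagger in explicitly keeps the dictionary choice in one place, which makes it easy to compare the default dictionary with NEologd on the same text.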
- Download the Japanese Wikipedia database backup dumps.
- Extract the running text to the `data/` folder.
- Set up the environment.
- Run `python build_corpus.py --lcode=ja --max_corpus_size=1000000000`. Adjust `max_corpus_size` as needed.
- Run `python ft.py` to get the fastText word vectors in the `data/` folder. Adjust `min_count` in `ft.py` as needed.
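For readers who want to see roughly what the training step involves, the sketch below calls the fastText.py 0.8.3 API directly. It is an assumption-laden illustration, not the contents of `ft.py`: the function name `train_vectors`, the file paths, and the choice of the skip-gram model are all hypothetical, and the default `dim=300` simply mirrors the vector size of the published model.

```python
def train_vectors(corpus_path, output_prefix, min_count=5, dim=300):
    """Train word vectors on a whitespace-tokenized corpus with fastText.py.

    Assumes the `fasttext` 0.8.3 package from PyPI. The library writes
    `<output_prefix>.bin` and `<output_prefix>.vec` and returns the model.
    """
    import fasttext  # imported lazily; only needed when training actually runs

    # Words seen fewer than `min_count` times are dropped from the vocabulary.
    return fasttext.skipgram(corpus_path, output_prefix, min_count=min_count, dim=dim)


# Example (paths are placeholders):
# model = train_vectors("data/corpus.txt", "data/model", min_count=10)
# print(len(model.words))
```

Raising `min_count` shrinks the vocabulary and the output files, at the cost of losing vectors for rare words.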
Pre-trained model (Vector size: 300, Vocabulary size: 92056)
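If you only need the vectors and not fastText.py itself, the text `.vec` format that fastText writes can be read with the standard library alone. The sketch below assumes you have such a file: the first line holds the vocabulary size and vector size (for this model that would be `92056 300`), and each following line holds a word and its components. The function names and the tiny sample data are made up for illustration.

```python
import io
import math


def load_vec(stream):
    """Parse fastText's text format: a 'vocab_size dim' header, then one word per line."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not have exactly one word plus `dim` numbers
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Tiny made-up sample standing in for a real .vec file.
sample = io.StringIO("3 2\n東京 1.0 0.0 \n大阪 0.8 0.6 \n猫 0.0 1.0 \n")
vocab_size, dim, vectors = load_vec(sample)
similarity = cosine(vectors["東京"], vectors["大阪"])
```

For a real file, open it with `encoding="utf-8"` and pass the file object to `load_vec` in place of the sample.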