switch to text8 dataset w/ frequency data
proppy opened this issue · comments
Johan Euphrosine commented
There are a lot of obscure words in the current edict2 dict; we should consider migrating to a corpus with frequency information so that we only yield common words.
See:
https://github.com/Hironsan/ja.text8
We could filter the dataset using the noun list from https://packages.debian.org/jessie/misc/mecab-ipadic and compute a frequency list.
We should consider excluding words with only one syllable, as those will occur more frequently and don't make for interesting combinations in shiritori.
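A rough sketch of what that filtering could look like, not a finished implementation. It assumes the corpus has already been split into tokens and that the noun list has been loaded into a dict mapping each noun to its kana reading. `common_nouns`, `readings`, `min_count` and `min_reading_len` are hypothetical names, and the length of the reading is only a crude proxy for syllable count.

```python
from collections import Counter


def common_nouns(tokens, readings, min_count=2, min_reading_len=2):
    """Return (word, count) pairs for frequent nouns, most common first.

    tokens:   iterable of corpus tokens (assumed pre-tokenized)
    readings: dict mapping noun -> kana reading (assumed built from the noun list)
    """
    # Count only tokens that appear in the noun list.
    counts = Counter(t for t in tokens if t in readings)
    return [
        (word, n)
        for word, n in counts.most_common()
        # Drop rare words and words whose reading is a single kana.
        if n >= min_count and len(readings[word]) >= min_reading_len
    ]


# Toy data standing in for the real corpus and noun list.
tokens = "猫 が 木 の 上 で 猫 を 見 た りんご と 木 と りんご".split()
readings = {"猫": "ネコ", "木": "キ", "りんご": "リンゴ", "上": "ウエ"}
result = common_nouns(tokens, readings)
```

Here 木 is dropped because its reading is a single kana, and 上 is dropped because it only occurs once; the thresholds would need tuning against the real corpus.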