actions-on-google / actionssdk-shiritori-ja-nodejs

Shiritori AoG sample game

Home Page:https://assistant.google.com/services/a/uid/00000064f48a4a82

switch to text8 dataset w/ frequency data

proppy opened this issue

There are a lot of obscure words in the current edict2 dict. We should consider migrating to a corpus with frequency information so that we only yield common words.

See:
https://github.com/Hironsan/ja.text8

We could filter the dataset using the noun list from https://packages.debian.org/jessie/misc/mecab-ipadic and compute a frequency list.
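
A rough sketch of that filtering step, in Python for brevity. It assumes the corpus is already tokenized into whitespace-separated words and that the noun list has been loaded into a set of surface forms; the function names `noun_frequencies` and `common_words`, the `min_count` threshold, and the tiny sample data are all made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set


def noun_frequencies(tokens: Iterable[str], nouns: Set[str]) -> Counter:
    """Count how often each known noun occurs in the token stream."""
    return Counter(tok for tok in tokens if tok in nouns)


def common_words(freq: Counter, min_count: int) -> List[str]:
    """Return words seen at least min_count times, most frequent first."""
    return [word for word, count in freq.most_common() if count >= min_count]


# Hypothetical sample data standing in for the corpus and the noun list.
corpus = "猫 は 魚 を 食べる 猫 は 寝る 犬 は 走る 猫"
nouns = {"猫", "魚", "犬"}

freq = noun_frequencies(corpus.split(), nouns)
frequent = common_words(freq, min_count=2)
```

The threshold would need tuning against the real corpus; a rank cutoff (top N nouns) might work just as well as a raw count.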

We should consider excluding words with only one syllable, as those will come up more frequently and don't make for interesting combinations in shiritori.
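
One possible way to do the exclusion, sketched in Python. This assumes each word comes with a kana reading and approximates "one syllable" by counting morae, where a small kana such as ゃ merges with the kana before it. The names `mora_count` and `drop_single_mora` are hypothetical, and the approximation may not match every edge case.

```python
from typing import Iterable, List

# Small kana that combine with the preceding kana into a single mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def mora_count(reading: str) -> int:
    """Approximate the number of morae in a kana reading."""
    return sum(1 for ch in reading if ch not in SMALL_KANA)


def drop_single_mora(readings: Iterable[str]) -> List[str]:
    """Keep only readings longer than one mora."""
    return [r for r in readings if mora_count(r) > 1]


# Hypothetical sample readings: め and ちゃ are one mora each.
kept = drop_single_mora(["め", "ねこ", "ちゃ", "りんご"])
```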