Distribute compiled dictionaries from JumanDic
kampersanda opened this issue
In v0.3.1, compiled dictionaries from JumanDic have not been distributed because the lexicon file is in an unexpected CSV format.
More precisely, the compile command fails with the following error message:
Error: InvalidFormat(InvalidFormatError { arg: "lex.csv", msg: "A csv row of lexicon must have five items at least, \"\\n\"" })
We need to modify the code to compile this file.
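As a rough diagnostic aid, the sketch below lists the rows of a lexicon file that have fewer than five comma-separated items, which is the condition named in the error message. It is not Vibrato's parser: `short_rows` is a hypothetical helper, and it assumes a naive split on commas that ignores quoted fields.

```rust
/// Hypothetical diagnostic (not part of Vibrato): return the 1-based line
/// numbers of rows that have fewer than `min_fields` comma-separated items.
/// Assumption: fields are split naively on ',' and quoting is ignored.
fn short_rows(csv: &str, min_fields: usize) -> Vec<usize> {
    csv.lines()
        .enumerate()
        // Keep only the rows whose field count is below the threshold.
        .filter(|(_, line)| line.split(',').count() < min_fields)
        // Convert the 0-based index into a 1-based line number.
        .map(|(i, _)| i + 1)
        .collect()
}
```

Calling `short_rows(&text, 5)` on the contents of a lexicon file would point at the lines to inspect first, including empty ones.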
@kampersanda The following commands work:
sudo apt install mecab-jumandic-utf8
cargo run --release -p compile -- \
    -l <(cat /usr/share/mecab/dic/juman/*.csv | sed 's/\xe3\x81,/,/g') \
    -m /usr/share/mecab/dic/juman/matrix.def \
    -u /usr/share/mecab/dic/juman/unk.def \
    -c /usr/share/mecab/dic/juman/char.def \
    -o juman.dic
AuxV.csv contains invalid UTF-8 sequences (the bytes \xe3\x81 immediately followed by a comma), so the above command replaces each of them with a plain comma.
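For reference, the same byte-level cleanup can be sketched in Rust. The bytes \xe3\x81 are the first two bytes of a three-byte UTF-8 sequence (many hiragana start this way), so when a comma follows directly, the character is truncated and the input is not valid UTF-8. The helper names `replace_bytes` and `strip_truncated_prefix` are hypothetical and are not part of Vibrato; this is only a minimal illustration of what the sed expression does.

```rust
/// Replace every occurrence of the byte pattern `needle` in `input`
/// with `replacement`, scanning from left to right.
fn replace_bytes(input: &[u8], needle: &[u8], replacement: &[u8]) -> Vec<u8> {
    // An empty needle would match at every position; return the input unchanged.
    if needle.is_empty() {
        return input.to_vec();
    }
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if input[i..].starts_with(needle) {
            // Match found: emit the replacement and skip past the needle.
            out.extend_from_slice(replacement);
            i += needle.len();
        } else {
            // No match here: copy the byte through unchanged.
            out.push(input[i]);
            i += 1;
        }
    }
    out
}

/// Hypothetical helper mirroring `sed 's/\xe3\x81,/,/g'`: drop the truncated
/// two-byte prefix 0xE3 0x81 wherever it directly precedes a comma.
fn strip_truncated_prefix(input: &[u8]) -> Vec<u8> {
    replace_bytes(input, b"\xe3\x81,", b",")
}
```

Because the pattern includes the trailing comma, complete three-byte characters that begin with \xe3\x81 are left untouched. Running `std::str::from_utf8` on the result is one way to check that the cleaned data is valid UTF-8.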
@vbkaisetsu Thanks for the report! I did not know about mecab-jumandic-utf8. I'll add the JumanDic version to Release 0.3.1.
@vbkaisetsu I released a dictionary compiled from mecab-jumandic-utf8 on https://github.com/daac-tools/vibrato/releases/tag/v0.3.1. Thank you for your cooperation!