daac-tools / vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

Home Page: https://docs.rs/vibrato

Distribute compiled dictionaries from JumanDic

kampersanda opened this issue:

In v0.3.1, compiled dictionaries from JumanDic have not been distributed because the lexicon file is in an unexpected CSV format.
More precisely, the compile command fails with the following error message:

Error: InvalidFormat(InvalidFormatError { arg: "lex.csv", msg: "A csv row of lexicon must have five items at least, \"\\n\"" })

We need to modify the code to compile this file.

@kampersanda The following commands work:

sudo apt install mecab-jumandic-utf8
cargo run --release -p compile -- \
    -l <(cat /usr/share/mecab/dic/juman/*.csv | sed 's/\xe3\x81,/,/g') \
    -m /usr/share/mecab/dic/juman/matrix.def \
    -u /usr/share/mecab/dic/juman/unk.def \
    -c /usr/share/mecab/dic/juman/char.def \
    -o juman.dic

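AuxV.csv contains the invalid UTF-8 sequence \xe3\x81 (a three-byte character truncated to two bytes) immediately before a comma, so the sed expression above replaces each \xe3\x81, with a plain ,.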
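
For anyone who would rather do the cleanup in Rust than in sed, here is a minimal sketch of the same byte-level replacement. It is not code from vibrato: `replace_bytes` and `strip_truncated_bytes` are hypothetical helper names, and only the standard library is used.

```rust
/// Replace every occurrence of the byte sequence `needle` in `input`
/// with `replacement`. Works on raw bytes, so the input does not have
/// to be valid UTF-8.
fn replace_bytes(input: &[u8], needle: &[u8], replacement: &[u8]) -> Vec<u8> {
    assert!(!needle.is_empty(), "needle must not be empty");
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if input[i..].starts_with(needle) {
            // Matched the needle: emit the replacement and skip past it.
            out.extend_from_slice(replacement);
            i += needle.len();
        } else {
            // No match at this position: copy one byte through unchanged.
            out.push(input[i]);
            i += 1;
        }
    }
    out
}

/// Hypothetical helper mirroring the sed expression `s/\xe3\x81,/,/g`:
/// drops the two stray bytes that sit directly before a comma.
fn strip_truncated_bytes(input: &[u8]) -> Vec<u8> {
    replace_bytes(input, b"\xe3\x81,", b",")
}
```

Calling `strip_truncated_bytes` on the raw bytes of a lexicon file and then passing the result to `std::str::from_utf8` is one way to confirm that the cleaned data decodes. Complete characters such as あ (bytes \xe3\x81\x82) are left untouched, because the pattern only matches when the comma directly follows \xe3\x81.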

@vbkaisetsu Thanks for the report! I did not know about mecab-jumandic-utf8. I'll add the JumanDic version to release v0.3.1.

@vbkaisetsu I released a dictionary compiled from mecab-jumandic-utf8 on https://github.com/daac-tools/vibrato/releases/tag/v0.3.1. Thank you for your cooperation!