When using a user dictionary, how to split kanji
paulm17 opened this issue
I'm using a user dictionary with this entry:
朝顔,朝 顔,あさ かお,あさ かお
I'm trying to split 朝顔 into 朝 and 顔, so that they come out as two separate entries.
How do I achieve this?
Thanks
Until now, there was no way to get at a user dictionary entry from a token :p
In v2.9.0, UserExtra() was added to get user dictionary information.
Sample code:
package main

import (
	"fmt"

	"github.com/ikawaha/kagome-dict/dict"
	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	udict, err := dict.NewUserDict("user_dict.txt")
	if err != nil {
		panic(err)
	}
	t, err := tokenizer.New(ipa.Dict(), tokenizer.UserDict(udict), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	tokens := t.Analyze("朝顔が咲く", tokenizer.Extended)
	for _, v := range tokens {
		fmt.Printf("%s:\t%s", v.Surface, v.Features())
		if extra := v.UserExtra(); extra != nil {
			fmt.Printf("\t extra: tokens %+v, readings %+v", extra.Tokens, extra.Readings)
		}
		fmt.Println()
	}
}
Output:
朝顔: [あさ かお 朝/顔 あさ/かお] extra: tokens [朝 顔], readings [あさ かお]
が: [助詞 格助詞 一般 * * * が ガ ガ]
咲く: [動詞 自立 * * 五段・カ行イ音便 基本形 咲く サク サク]
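To get 朝 and 顔 as two separate entries, one option is to pair up the token and reading slices yourself. The following is only a sketch: `piece` and `split` are hypothetical names made up for this example, not part of kagome, and the input is copied from the output above rather than read from a tokenizer.

```go
package main

import "fmt"

// piece is a hypothetical type for this sketch: one sub-token of a
// user dictionary entry together with its reading.
type piece struct {
	Surface string
	Reading string
}

// split pairs each sub-token with the reading at the same index.
// The arguments are expected to have the shape of extra.Tokens and
// extra.Readings; a sub-token without a matching reading gets an
// empty Reading.
func split(tokens, readings []string) []piece {
	out := make([]piece, 0, len(tokens))
	for i, s := range tokens {
		p := piece{Surface: s}
		if i < len(readings) {
			p.Reading = readings[i]
		}
		out = append(out, p)
	}
	return out
}

func main() {
	// Values copied from the output above. With kagome they would come
	// from extra := v.UserExtra(), as extra.Tokens and extra.Readings.
	for _, p := range split([]string{"朝", "顔"}, []string{"あさ", "かお"}) {
		fmt.Printf("%s\t%s\n", p.Surface, p.Reading)
	}
}
```

Inside the token loop you would call `split(extra.Tokens, extra.Readings)` whenever `v.UserExtra()` is not nil, and keep the token as it is otherwise.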
Thank you for making the change! I really appreciate it. 🚀
I can confirm that it works! 🔥
Funnily enough, it was working when concatenating kanji but not for my use case, even though I was using code similar to yours.
Quick follow-up: what's the difference between
tokens := t.Tokenize(kanji), which is what I was using before, and
tokens := t.Analyze(kanji, tokenizer.Extended), which is what you have above?
Thanks!
kagome has three segmentation modes:
- Normal: Regular segmentation
- Search: Use a heuristic to do additional segmentation useful for search
- Extended: Similar to search mode, but also splits unknown words into uni-grams
See https://github.com/ikawaha/kagome#segmentation-mode-for-search
t.Tokenize(s) is an alias of t.Analyze(s, tokenizer.Normal).
I'm sorry for the confusion caused by the use of tokenizer.Extended in the sample code above. Please choose the mode that best suits your environment (Normal or Search mode is recommended).
Will do. Thanks again!