When using a user dictionary, how to split kanji

Question

paulm17 opened this issue 2 years ago · comments

I'm using a user dictionary with the following entry:

朝顔,朝 顔,あさ かお,あさ かお

I'm trying to split 朝顔 into 朝 and 顔, so that they come out as two separate entries.

How do I achieve this?

Thanks

ikawaha · Answer 1 · Mon Oct 31 2022 20:36:11 GMT+0800 (China Standard Time)

Until now, there was no way to retrieve a user dictionary entry from a token :p
In v2.9.0, UserExtra() was added to get user dictionary information.

Sample code:

package main

import (
	"fmt"

	"github.com/ikawaha/kagome-dict/dict"
	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	udict, err := dict.NewUserDict("user_dict.txt")
	if err != nil {
		panic(err)
	}
	t, err := tokenizer.New(ipa.Dict(), tokenizer.UserDict(udict), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	tokens := t.Analyze("朝顔が咲く", tokenizer.Extended)
	for _, v := range tokens {
		fmt.Printf("%s:\t%s", v.Surface, v.Features())
		if extra := v.UserExtra(); extra != nil {
			fmt.Printf("\t extra: tokens %+v, readings %+v", extra.Tokens, extra.Readings)
		}
		fmt.Println()
	}
}

Output:

朝顔:	[あさ かお 朝/顔 あさ/かお]	 extra: tokens [朝 顔], readings [あさ かお]
が:	[助詞 格助詞 一般 * * * が ガ ガ]
咲く:	[動詞 自立 * * 五段・カ行イ音便 基本形 咲く サク サク]
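
For reference, the entry from the question has four comma-separated columns: the surface form, the space-separated tokens, the space-separated readings, and a part-of-speech label (here set to "あさ かお", which is why that string appears first in the Features output above). The following is a minimal stdlib-only sketch of that column layout. It is not kagome's own parser; the names userEntry and parseUserEntry are hypothetical, and the check that tokens and readings have equal length is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// userEntry is a hypothetical struct for illustration only; it is not a kagome type.
type userEntry struct {
	Surface  string   // column 1: the text as it appears in the input
	Tokens   []string // column 2: space-separated segmentation
	Readings []string // column 3: space-separated readings, one per token
	POS      string   // column 4: part-of-speech label
}

// parseUserEntry splits one line of the form
// "surface,token token,reading reading,pos" into its four columns.
func parseUserEntry(line string) (userEntry, error) {
	cols := strings.Split(strings.TrimSpace(line), ",")
	if len(cols) != 4 {
		return userEntry{}, fmt.Errorf("want 4 columns, got %d: %q", len(cols), line)
	}
	e := userEntry{
		Surface:  cols[0],
		Tokens:   strings.Fields(cols[1]),
		Readings: strings.Fields(cols[2]),
		POS:      cols[3],
	}
	// Assumption of this sketch: every token needs exactly one reading.
	if len(e.Tokens) != len(e.Readings) {
		return userEntry{}, fmt.Errorf("%d tokens but %d readings", len(e.Tokens), len(e.Readings))
	}
	return e, nil
}

func main() {
	e, err := parseUserEntry("朝顔,朝 顔,あさ かお,あさ かお")
	if err != nil {
		panic(err)
	}
	fmt.Printf("surface %s, tokens %q, readings %q, pos %q\n",
		e.Surface, e.Tokens, e.Readings, e.POS)
}
```

The Tokens and Readings slices here correspond to what extra.Tokens and extra.Readings return in the sample code above.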

Paul · Answer 2 · Mon Oct 31 2022 20:59:22 GMT+0800 (China Standard Time)

Thank you for making the change! I really appreciate it. 🚀

I can confirm that it works! 🔥

Funnily enough, it was already working when concatenating kanji, just not for my use case, even though I was using code similar to yours.

Quick follow-up. What's the difference between

tokens := t.Tokenize(kanji) (which is what I was using before) and

tokens := t.Analyze(kanji, tokenizer.Extended) (which is what you have above)?

Thanks!

ikawaha · Answer 3 · Mon Oct 31 2022 23:23:10 GMT+0800 (China Standard Time)

kagome has several segmentation modes:

Normal: Regular segmentation
Search: Use a heuristic to do additional segmentation useful for search
Extended: Similar to Search mode, but also splits unknown words into uni-grams

See https://github.com/ikawaha/kagome#segmentation-mode-for-search

t.Tokenize(s) is an alias of t.Analyze(s, tokenizer.Normal).

I'm sorry for the confusion caused by the use of tokenizer.Extended in the sample code above. Please choose the mode that best suits your environment (Normal or Search mode is recommended).

Paul · Answer 4 · Mon Oct 31 2022 23:50:14 GMT+0800 (China Standard Time)

Will do. Thanks again!