ikawaha / kagome

Self-contained Japanese Morphological Analyzer written in pure Go

When using a user dictionary, how to split kanji

paulm17 opened this issue

commented

I'm using a user dictionary with this entry:

朝顔,朝 顔,あさ かお,あさ かお

I'm trying to split 朝顔 into 朝 and 顔 so that they come out as two separate entries.

How do I achieve this?

Thanks
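For context, a user dictionary line is, as far as I understand the format, four comma-separated fields: surface form, space-separated segmentation, space-separated readings, and part of speech. The sketch below parses the entry from the question with the standard library only. The `userEntry` type and `parseUserDictLine` function are hypothetical names for illustration and are not part of kagome. Note that in the entry above the fourth field repeats the reading, so that string ends up in the part-of-speech slot.

```go
package main

import (
	"fmt"
	"strings"
)

// userEntry is a hypothetical struct for this sketch; it is not a kagome type.
type userEntry struct {
	Surface  string   // the text to match, e.g. 朝顔
	Tokens   []string // the segmentation, e.g. [朝 顔]
	Readings []string // one reading per segment, e.g. [あさ かお]
	POS      string   // fourth field (assumed to be the part of speech)
}

// parseUserDictLine splits one line into the four assumed fields and checks
// that the segmentation and the readings have the same number of parts.
func parseUserDictLine(line string) (userEntry, error) {
	fields := strings.Split(line, ",")
	if len(fields) != 4 {
		return userEntry{}, fmt.Errorf("want 4 comma-separated fields, got %d", len(fields))
	}
	tokens := strings.Fields(fields[1])
	readings := strings.Fields(fields[2])
	if len(tokens) == 0 || len(tokens) != len(readings) {
		return userEntry{}, fmt.Errorf("segmentation has %d parts but readings has %d",
			len(tokens), len(readings))
	}
	return userEntry{Surface: fields[0], Tokens: tokens, Readings: readings, POS: fields[3]}, nil
}

func main() {
	e, err := parseUserDictLine("朝顔,朝 顔,あさ かお,あさ かお")
	if err != nil {
		panic(err)
	}
	fmt.Printf("surface=%s tokens=%v readings=%v pos=%q\n", e.Surface, e.Tokens, e.Readings, e.POS)
}
```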

Until now, there was no way to get the user dictionary entry back from a token :p
In v2.9.0, UserExtra() was added to get user dictionary information.

Sample code:

package main

import (
	"fmt"

	"github.com/ikawaha/kagome-dict/dict"
	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// Load the user dictionary from a file.
	udict, err := dict.NewUserDict("user_dict.txt")
	if err != nil {
		panic(err)
	}
	// Build a tokenizer on the IPA dictionary with the user dictionary attached.
	t, err := tokenizer.New(ipa.Dict(), tokenizer.UserDict(udict), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	tokens := t.Analyze("朝顔が咲く", tokenizer.Extended)
	for _, v := range tokens {
		fmt.Printf("%s:\t%s", v.Surface, v.Features())
		// UserExtra() is non-nil only for tokens that matched a user dictionary entry.
		if extra := v.UserExtra(); extra != nil {
			fmt.Printf("\t extra: tokens %+v, readings %+v", extra.Tokens, extra.Readings)
		}
		fmt.Println()
	}
}

Output:

朝顔:	[あさ かお 朝/顔 あさ/かお]	 extra: tokens [朝 顔], readings [あさ かお]
が:	[助詞 格助詞 一般 * * * が ガ ガ]
咲く:	[動詞 自立 * * 五段・カ行イ音便 基本形 咲く サク サク]
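
To get 朝 and 顔 as two separate entries, the Tokens and Readings returned by UserExtra() can be paired up one by one. The following is a minimal standard-library sketch of that step, not part of kagome: the `piece` type and the `expand` function are hypothetical names, and the inputs in `main` are copied from the output above rather than produced by the tokenizer.

```go
package main

import "fmt"

// piece is a hypothetical output type for this sketch; it is not a kagome type.
type piece struct {
	Surface string
	Reading string
}

// expand returns one piece per user dictionary segment. When there are no
// segments, or the segment and reading counts differ, it falls back to a
// single piece holding the original surface and no reading.
func expand(surface string, tokens, readings []string) []piece {
	if len(tokens) == 0 || len(tokens) != len(readings) {
		return []piece{{Surface: surface}}
	}
	out := make([]piece, 0, len(tokens))
	for i, tok := range tokens {
		out = append(out, piece{Surface: tok, Reading: readings[i]})
	}
	return out
}

func main() {
	// A token that matched the user dictionary entry (values from the output above).
	for _, p := range expand("朝顔", []string{"朝", "顔"}, []string{"あさ", "かお"}) {
		fmt.Printf("%s\t%s\n", p.Surface, p.Reading)
	}
	// A token without user dictionary information stays as it is.
	for _, p := range expand("が", nil, nil) {
		fmt.Printf("%s\t%s\n", p.Surface, p.Reading)
	}
}
```

Inside the loop of the sample code, this would be called as expand(v.Surface, extra.Tokens, extra.Readings) when v.UserExtra() returns a non-nil value, and with nil slices otherwise.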

commented

Thank you for making the change! I really appreciate it. 🚀

I can confirm that it works! 🔥

Funnily enough, it was working when concatenating kanji but not for my use case, as I was using code similar to yours.

Quick follow-up: what's the difference between

tokens := t.Tokenize(kanji), which is what I was using before, and

tokens := t.Analyze(kanji, tokenizer.Extended), which is what you have above?

Thanks!

kagome has three segmentation modes:

  • Normal: Regular segmentation
  • Search: Uses a heuristic to do additional segmentation that is useful for search
  • Extended: Similar to Search mode, but also splits unknown words into uni-grams

See https://github.com/ikawaha/kagome#segmentation-mode-for-search

t.Tokenize(s) is an alias of t.Analyze(s, tokenizer.Normal).

I'm sorry for the confusion caused by the use of tokenizer.Extended in the sample code above. Please choose the mode that best suits your environment (Normal or Search mode is recommended).

commented

Will do. Thanks again!