ikawaha / kagome

Self-contained Japanese Morphological Analyzer written in pure Go

[Support] Is there an English version of the docs?

CaptainDario opened this issue · comments

Thank you for this great project!

I really like this project and would like to understand its capabilities better. Therefore I am wondering whether an English version of the docs is available?

There is no English documentation available.

Could you explain to me how to use the different segmentation modes?
I do not understand where I would need to add it to the example.

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// tokenize
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

Could you also tell me what are the pros/cons of using the different dictionaries?

Thank you very much!

OK, I figured the segmentation modes out myself.
I am using tokenizer.Analyze()

@CaptainDario

Could you explain to me how to use the different segmentation modes?
I do not understand where I would need to add it to the example.

As you may know, most Asian texts are not word-separated. "Wakati" roughly means "to divide" in Japanese, so wakati segmentation divides text into word tokens. Imagine the following:

  • Wakati("thistextwritingissomewhatsimilartotheasianstyle.") --> this text writing is somewhat similar to the asian style.

Tokenizer.Wakati() simply divides the text into space-separated words. It is typically used to create metadata for full-text search, e.g. FTS5 in SQLite3.

Tokenizer.Tokenize() is similar to Wakati(), but each segmented chunk carries more information (part of speech, readings, and so on). It is mostly used for grammatical analysis, text linting, etc.

Could you also tell me what are the pros/cons of using the different dictionaries?

To do the wakati segmentation, a word dictionary is needed to recognize proper names, nouns, and so on.

The main difference between dictionaries is the number of words they contain. The default built-in dictionary covers most of the important proper names, nouns, verbs, etc.

The "pro" of using a larger dictionary is, therefore, that it can separate words more accurately. Imagine the following:

  • mr.mcintoshandmr.mcnamara --> Mr. Mc into sh and Mr. Mc namara or Mr. McIntosh and Mr. McNamara

And the "cons" would be higher memory usage and slower speed. I hope this helps. 🤞

@KEINOS Thank you very much!
Maybe this is obvious and one is expected to know it, but I think it would be nice to include something like your comment in the README.

@CaptainDario Indeed. There is nothing better than better documentation!

@ikawaha, if the above explanation is OK, I would like to open a PR somewhere. Where should I write it? In the Wiki, maybe?