ikawaha / kagome

Self-contained Japanese Morphological Analyzer written in pure Go

[Support] Is there an English version of the docs?

CaptainDario opened this issue · comments

Thank you for this great project!

I really like this project and would like to understand its capabilities better. Therefore I am wondering whether an English version of the docs is available?

There is no English documentation available.

Could you explain to me how to use the different segmentation modes?
I do not understand where I would need to add it to the example.

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// tokenize
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

Could you also tell me what are the pros/cons of using the different dictionaries?

Thank you very much!

OK, I figured the segmentation modes out myself.
I am using tokenizer.Analyze()

@CaptainDario

Could you explain to me how to use the different segmentation modes?
I do not understand where I would need to add it to the example.

As you may know, most Asian texts are not word-separated. "Wakati" roughly means "to divide" in Japanese, so wakati segmentation divides text into word tokens. Imagine the following:

  • Wakati("thistextwritingissomewhatsimilartotheasianstyle.") --> this text writing is somewhat similar to the asian style.

Tokenizer.Wakati() simply divides the text into space-separated words. It is typically used to create metadata for full-text search, e.g. FTS5 in SQLite3.

Tokenizer.Tokenize() is similar to Wakati(), but each segmented chunk carries more information (part of speech, readings, and so on). It is mostly used for grammatical analysis, text linting, etc.

Could you also tell me what are the pros/cons of using the different dictionaries?

To do the wakati segmentation, a word dictionary is needed to recognize proper names, nouns, and so on.

The main difference between dictionaries is the number of words they contain. The default built-in dictionary covers most of the important proper names, nouns, verbs, etc.

The "pro" of using a larger dictionary is, therefore, that it can separate words more accurately. Imagine the following:

  • mr.mcintoshandmr.mcnamara --> Mr. Mc into sh and Mr. Mc namara or Mr. McIntosh and Mr. McNamara

And the "cons" would be higher memory usage and slower speed. I hope this helps. 🤞

@KEINOS Thank you very much!
Maybe this is obvious and one is expected to know it, but I think it would be nice to include something like your comment in the README.

@CaptainDario Indeed. There is nothing better than better documentation!

@ikawaha, if the above explanation is OK, I would like to open a PR somewhere. Where should I write it? In the Wiki, maybe?