[Support] Is there an English version of the docs?
CaptainDario opened this issue · comments
Thank you for this great project!
I really like this project and would like to understand its capabilities better. Therefore, I am wondering whether an English version of the docs is available?
There is no English documentation available.
Could you explain to me how to use the different segmentation modes?
I do not understand where I would need to add it to the example.
package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)
	// tokenize
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}
Could you also tell me what are the pros/cons of using the different dictionaries?
Thank you very much!
OK, I figured out the segmentation modes myself. I am using tokenizer.Analyze().
Could you explain to me how to use the different segmentation modes?
I do not understand where I would need to add it to the example.
As you may know, most Asian texts are not word-separated. The word "wakati" means "word divide" in Japanese. Thus, wakati helps to divide the text into word tokens. Imagine the following:

Wakati("thistextwritingissomewhatsimilartotheasianstyle.")
--> this text writing is somewhat similar to the asian style.
Tokenizer.Wakati() simply divides the text into space-separated words. It is typically used to create metadata for full-text search, e.g. FTS5 in SQLite3.
Tokenizer.Tokenize() is similar to Wakati(), but each segmented chunk carries more information. It is mostly used for grammar analysis, text linting, and the like.
Could you also tell me what are the pros/cons of using the different dictionaries?
In order to do the wakati segmentation, a word dictionary is needed to determine the proper names, nouns, etc. of a text.
The difference between dictionaries is simply the number of words. The default built-in dictionary supports most of the important proper names, nouns, verbs, etc.
The "pros" of using different dictionaries are, therefore, that they can separate words more accurately. Imagine the following:
mr.mcintoshandmr.mcnamara
--> Mr. Mc into sh and Mr. Mc namara
or: Mr. McIntosh and Mr. McNamara
And the "cons" would be higher memory usage and slower speed. I hope this helps. 🤞
@KEINOS Thank you very much!
Maybe this is obvious and one is expected to know it, but I think it would be nice to include something like your comment in the README.
@CaptainDario Indeed. There is nothing better than better documentation!
@ikawaha, if the above explanation is ok, I would like to PR somewhere, where should I write? In the Wiki, maybe?