qinwf / jiebaR

Chinese text segmentation with R. (Documentation has been updated 🎉: https://qinwenfeng.com/jiebaR/)


Nature (part-of-speech) tagging of already-tokenized words

yipcma opened this issue

Hi @qinwf ,

First off, thanks a lot for the wonderful package :)

I'd love to know if there's a way to nature-tag words that have already been tokenized (say, in a vector).

Currently, when I run the tagger, it breaks down my already-tokenized vector of words. My use case is to tag the natures of the words in my user dictionary so that they get the right natures instead of the ones I assigned when creating the dictionary.
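
Here is a minimal sketch of what I mean, using a jiebaR tag worker (the input strings are just examples, and the comments describe the behaviour I am seeing):

library(jiebaR)

# A tag worker runs the segmenter on its input before tagging, so elements of
# an already-tokenized vector can get broken down again.
tagger <- worker("tag")
tagging("这是北京", tagger)   # segments the raw sentence first, then tags it

# What I am after instead: attach a nature tag to each element of an
# already-tokenized vector such as c("这", "是", "北京"), without re-segmenting.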

Thanks again, and I look forward to your insights.

Cheers,

Andrew

qinwf commented

Hi, Andrew.

I just added a vector_tag function; you can install the package from GitHub to use this new function.
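
For reference, installing the development version from GitHub would typically look something like this (this assumes the devtools package; the repository path follows the qinwf/jiebaR name above):

# Install the development version that contains vector_tag()
install.packages("devtools")               # if devtools is not already installed
devtools::install_github("qinwf/jiebaR")
library(jiebaR)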

> cc = worker()
> vector_tag(c("这","是","北京"),cc)
     r      v     ns 
  "这"   "是" "北京" 

For now, the tagging process in this package is very simple: it just reads the dictionary and finds the one and only tag recorded for each word, so the tags are not very accurate.
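
As a conceptual sketch (illustrative data only, not the package's actual internals), the lookup amounts to mapping each word to the single tag stored for it in the dictionary:

# One-tag-per-word dictionary lookup, with a toy in-memory dictionary.
tag_lookup <- c("这" = "r", "是" = "v", "北京" = "ns")
words <- c("这", "是", "北京")
setNames(words, tag_lookup[words])   # same shape as the vector_tag() output above
#      r      v     ns 
#   "这"   "是" "北京" 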

There is also a THULACR package, which has not yet been published to a public repository. If you want to use it, I can add you to the private repo. However, the THULACR package will not be able to tag a vector of words; it will only be able to tag a sentence.

@qinwf I can't wait to test the new package, THULACR. It's said that THULAC is much better than cppjieba in precision. Here is the reference link.

@qinwf This is very exciting. I'd love to try THULACR out. Please add me to the repo :) And thanks again for the wonderful work on NLP support in R.