qinwf / jiebaR

Chinese text segmentation with R. (Documentation has been updated 🎉: https://qinwenfeng.com/jiebaR/)


Nature (part-of-speech) tagging of already-tokenized words

yipcma opened this issue

Hi @qinwf ,

First off, thanks a lot for the wonderful package :)

I'd love to know if there's a way to nature-tag words that have already been tokenized (say, in a vector).

Currently, when I run the tagger, it breaks down my already-tokenized vector of words. My use case is to tag the natures of the words in my user dictionary so that they get the right natures instead of the ones I assigned when creating the dictionary.
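
Here is a minimal sketch of what I mean, using a jiebaR tag worker (the input strings are just examples, and the comments describe the behaviour I am seeing):

library(jiebaR)

# A tag worker runs the segmenter on its input before tagging, so elements of
# an already-tokenized vector can get broken down again.
tagger <- worker("tag")
tagging("这是北京", tagger)   # segments the raw sentence first, then tags it

# What I am after instead: attach a nature tag to each element of an
# already-tokenized vector such as c("这", "是", "北京"), without re-segmenting.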

Thanks again, and I look forward to your insights.

Cheers,

Andrew

qinwf commented

Hi, Andrew.

I just added a vector_tag function; you can install the package from GitHub to use this new function.
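
For reference, installing the development version from GitHub would typically look something like this (this assumes the devtools package; the repository path follows the qinwf/jiebaR name above):

# Install the development version that contains vector_tag()
install.packages("devtools")               # if devtools is not already installed
devtools::install_github("qinwf/jiebaR")
library(jiebaR)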

> cc = worker()
> vector_tag(c("这","是","北京"),cc)
     r      v     ns 
  "这"   "是" "北京" 

For now, the tagging process in this package is very simple: it just reads the dictionary and finds the one and only tag recorded for each word, so the tags are not very accurate.
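
As a conceptual sketch (illustrative data only, not the package's actual internals), the lookup amounts to mapping each word to the single tag stored for it in the dictionary:

# One-tag-per-word dictionary lookup, with a toy in-memory dictionary.
tag_lookup <- c("这" = "r", "是" = "v", "北京" = "ns")
words <- c("这", "是", "北京")
setNames(words, tag_lookup[words])   # same shape as the vector_tag() output above
#      r      v     ns 
#   "这"   "是" "北京" 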

There is also a THULACR package, which has not yet been published to a public repository. If you want to use it, I can add you to the private repo. However, the THULACR package will not be able to tag a vector of words; it will only be able to tag a sentence.

@qinwf I can't wait to test the new package, THULACR. It's said that THULAC is much better than cppjieba in precision. Here is the reference link.

@qinwf This is very exciting. I'd love to try THULACR out. Please add me to the repo :) And thanks again for the wonderful work on NLP support in R.