trinker / tagger

Part of speech (POS) tagger

tag_pos splitting tokens inappropriately

Auburngrads opened this issue

@trinker just started using tagger and really like it so far - thanks for your effort.

I noticed in trying to add parts of speech to an existing data.frame of words that the number of pos elements being output was greater than the number of words being input. So far I've traced it down to two things.

  1. Acronyms with three or more letters - separated by periods. For example, U.S.C. (United States Code) gets split into U.S. and C.
  2. Words with apostrophes get split at the apostrophe.

Looking at the package documentation, I haven't found a way to change this behavior. Is this a bug or am I missing something?

Do you have a reproducible example? I don't see either of these issues with either engine.

I get:

 > tag_pos(' For example, U.S.C. (United States Code) gets split didn\'t')
[1] "For/IN example/NN ,/, U.S.C./NNP (/-LRB- United/NNP States/NNP Code/NNP )/-RRB- gets/VBZ split/NN did/VBD n't/RB"
> 
> tag_pos(' For example, U.S.C. (United States Code) gets split didn\'t',
+     engine = 'coreNLP')
[1] "For/IN example/NN ,/, U.S.C./NNP -LRB-/-LRB- United/NNP States/NNP Code/NNP -RRB-/-RRB- gets/VBZ split/NN did/VBD n't/RB"

Okay, I understand that splitting at the apostrophe is done on purpose. For the acronym issue, here's a reprex of what I'm seeing. You'll note that in my case, I'm tagging after the sentence has been split and the words are stored as a vector.

If engine='coreNLP' is used, U.S.C. is tagged appropriately - otherwise it's split as U.S. and C.

pacman::p_load(stringi)

sentence <- "For example, U.S.C. (United States Code) gets split didn't"
unigrams <- unlist(stringi::stri_extract_all_words(sentence))

tagger::tag_pos(unigrams)
#> [1] "For/IN"         "example/NN"     "U.S./NNP C/NNP" "United/NNP"    
#> [5] "States/NNS"     "Code/NNP"       "gets/VBZ"       "split/NN"      
#> [9] "did/VBD n't/RB"

tagger::tag_pos(unigrams, engine = 'coreNLP')
#> [1] "For/IN"         "example/NN"     "U.S.C/NN"       "United/NNP"    
#> [5] "States/NNPS"    "Code/NNP"       "gets/VBZ"       "split/NN"      
#> [9] "did/VBD n't/RB"
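
A quick way to see which inputs were split is to compare element lengths. The sketch below assumes tag_pos returns a list with one vector of tokens per input element, named by tag. The `tagged` object is a hand-built stand-in copied from the default-engine output above, not a real call to tagger, and the third input is assumed to be `U.S.C` (without the trailing period), as the coreNLP output suggests.

```r
# Inputs as assumed from the reprex above (hand-typed, not re-extracted)
unigrams <- c("For", "example", "U.S.C", "United", "States", "Code",
              "gets", "split", "didn't")

# Stand-in for the default-engine result shown above (hand-built, not produced
# by tagger); assumes one vector of tokens per input element, named by tag
tagged <- list(
    c(IN = "For"), c(NN = "example"), c(NNP = "U.S.", NNP = "C"),
    c(NNP = "United"), c(NNS = "States"), c(NNP = "Code"),
    c(VBZ = "gets"), c(NN = "split"), c(VBD = "did", RB = "n't")
)

# Number of tokens the tagger produced for each input element
n_tokens <- lengths(tagged)

# Inputs that were split into more than one token
split_inputs <- unigrams[n_tokens > 1]
split_inputs

# Total mismatch between words in and tags out
sum(n_tokens) - length(unigrams)
```

With a real result, replacing the stand-in with the object returned by tag_pos should point at the same kind of offending inputs, provided the list structure assumed here holds.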

Gotcha... It appears then that you are not using the tagger on sentences, so it is being used in a way that is inconsistent with tag_pos's design. The argument text.var states:

text.var The text string variable.

You are using it on tokens, not text. I'm closing this as it falls under using the program outside of its intended use. I suspect you're trying to make the output into something that fits the tidytext package's framework. If that's the case, you can tidy the output after tagging as follows:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(c(
    "trinker/termco", 
    "trinker/coreNLPsetup",        
    "trinker/tagger"
))

pacman::p_load(textshape)


sentence <- c(
    "For example, U.S.C. (United States Code) gets split didn't",
    "Another sentence. And still one more."
)


out <- tag_pos(sentence)

textshape::tidy_list(out, 'element', 'token', 'pos')

##     element   pos    token
##  1:       1    IN      For
##  2:       1    NN  example
##  3:       1     ,        ,
##  4:       1   NNP   U.S.C.
##  5:       1 -LRB-        (
##  6:       1   NNP   United
##  7:       1   NNP   States
##  8:       1   NNP     Code
##  9:       1 -RRB-        )
## 10:       1   VBZ     gets
## 11:       1    NN    split
## 12:       1   VBD      did
## 13:       1    RB      n't
## 14:       2    DT  Another
## 15:       2    NN sentence
## 16:       2     .        .
## 17:       2    CC      And
## 18:       2    RB    still
## 19:       2    CD      one
## 20:       2   JJR     more
## 21:       2     .        .
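
If the words already live in a data.frame, one option is to paste them back into one string per sentence before tagging, so that tag_pos receives text rather than tokens. This is a minimal base R sketch, assuming a hypothetical data.frame `words` with columns `sentence_id` and `word`; those names are illustrative and not part of tagger.

```r
# Hypothetical token table; the column names are assumptions for illustration
words <- data.frame(
    sentence_id = c(1, 1, 2, 2, 2, 2),
    word = c("Another", "sentence", "And", "still", "one", "more"),
    stringsAsFactors = FALSE
)

# Collapse the tokens of each sentence back into a single string
sentences <- vapply(
    split(words$word, words$sentence_id),
    paste, character(1), collapse = " "
)
sentences <- unname(sentences)

# The rebuilt strings can then be passed to the tagger as text, e.g.
# out <- tagger::tag_pos(sentences)
```

Punctuation dropped during word extraction (for example the final period of U.S.C.) is not restored by this step, so tagging the original sentences is preferable whenever they are still available.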

If this doesn't solve the problem feel free to reopen the issue.