Parsing capitalized acronyms
charlescearl opened this issue
I came across this issue earlier.
import spacy.en

NLP = spacy.en.English()
for sent in NLP(u"I like learning about CIA. Don't you?").sents:
    print(sent)
=> I like learning about CIA. Don't you?
That is, "CIA." is one token, so the text is not split into two sentences.
But with the acronym in lowercase:
for sent in NLP(u"I like learning about cia. Don't you?").sents:
    print(sent)
=>
I like learning about cia.
Don't you?
It seems that the workaround for this in entity parsing is to allow capitalized entities to keep attached punctuation when they occur at the end of a sentence.
for tok in NLP(u"I like learning about CIA. Don't you?"):
    # Print each token with its entity label, or a placeholder if it has none
    print("{} {}".format(tok.text, tok.ent_type_ or "Not an entity"))
=>
I Not an entity
like Not an entity
learning Not an entity
about Not an entity
CIA. ORG
Do Not an entity
n't Not an entity
you Not an entity
? Not an entity
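If downstream code needs the bare acronym, one possible post-processing step is to peel the trailing period off tokens that look like all-caps acronyms. This is only a sketch in plain Python, independent of spaCy: the helper name split_acronym_period and its regex are hypothetical, and it assumes an acronym is two or more uppercase ASCII letters followed by a single period.

```python
import re

# Hypothetical helper, not part of spaCy. Assumes an acronym is two or more
# uppercase ASCII letters followed by exactly one period.
_ACRONYM_WITH_PERIOD = re.compile(r"^([A-Z]{2,})\.$")


def split_acronym_period(token_text):
    """Return [acronym, "."] if token_text looks like "CIA.", else [token_text]."""
    match = _ACRONYM_WITH_PERIOD.match(token_text)
    if match:
        return [match.group(1), "."]
    return [token_text]


# Token texts copied from the output above.
tokens = ["I", "like", "learning", "about", "CIA.", "Do", "n't", "you", "?"]
cleaned = [piece for tok in tokens for piece in split_acronym_period(tok)]
print(cleaned)
# → ['I', 'like', 'learning', 'about', 'CIA', '.', 'Do', "n't", 'you', '?']
```

This only repairs the token text; it does not restore the missing sentence boundary, and it would also split any other all-caps token that legitimately ends in a period.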