explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

Parsing capitalized acronyms

charlescearl opened this issue · comments

I came across this issue earlier.

# Legacy spaCy API (spacy.en), as used at the time of this report
import spacy.en

NLP = spacy.en.English()
for sent in NLP(u"I like learning about CIA. Don't you?").sents:
    print(sent)

=> I like learning about CIA. Don't you?

That is, "CIA." is treated as a single token, so the text is not split into two sentences.

But with the acronym in lowercase, the sentences are split as expected:

for sent in NLP(u"I like learning about cia. Don't you?").sents:
    print(sent)

=>
I like learning about cia.
Don't you?

It seems that entity parsing works around this by allowing a capitalized entity to carry attached punctuation when it falls at the end of a sentence:

for tok in NLP(u"I like learning about CIA. Don't you?"):
    # Fall back to a placeholder when the token has no entity label
    print("{} {}".format(tok.text, tok.ent_type_ or "Not an entity"))

=>

I Not an entity
like Not an entity
learning Not an entity
about Not an entity
CIA. ORG
Do Not an entity
n't Not an entity
you Not an entity
? Not an entity
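
Until the tokenizer handles this case, one possible stopgap is to post-process the token texts and detach the trailing period from all-caps tokens. The sketch below is only an illustration: split_trailing_period and its regular expression are hypothetical and not part of spaCy, and it works on plain strings rather than on spaCy Token objects.

```python
import re

# Hypothetical pattern, not part of spaCy: an all-caps token of two or
# more letters followed by exactly one trailing period, e.g. "CIA."
_ACRONYM_WITH_PERIOD = re.compile(r"^([A-Z]{2,})\.$")


def split_trailing_period(tokens):
    """Return a new list of token texts where "CIA." becomes "CIA", "."."""
    fixed = []
    for text in tokens:
        match = _ACRONYM_WITH_PERIOD.match(text)
        if match:
            # Emit the acronym and the period as two separate tokens
            fixed.extend([match.group(1), "."])
        else:
            fixed.append(text)
    return fixed


# Token texts copied from the output above
tokens = ["I", "like", "learning", "about", "CIA.", "Do", "n't", "you", "?"]
print(split_trailing_period(tokens))
```

Note that this heuristic would also split an uppercase abbreviation such as "INC." in the middle of a sentence, so it is probably only reasonable when the period really is sentence-final.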