explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

Times such as "7pm" tokenized wrong

matthayes opened this issue

There appears to be a bug in how times are tokenized for English.

import spacy

nlp = spacy.load("en")
doc = nlp("We're meeting at 7pm.")

# Print each token with its part-of-speech tag and lemma
for token in doc:
    print(token, token.pos_, token.lemma_)

This produces:

We PRON -PRON-
're VERB 're
meeting VERB meet
at ADP at
IS_TITLE PROPN is_title
pm NOUN pm
. PUNCT .

Instead of IS_TITLE PROPN is_title I was expecting 7 NUM 7, which is what you get if you use "7 pm" instead (with a space in between). I see that TOKENIZER_EXCEPTIONS includes a number of exceptions to handle this type of case, so I'm confused about why it doesn't work. Also, it seems that the "7" should be preserved instead of being replaced with IS_TITLE.

Your Environment

  • Operating System: Mac OSX 10.11.6
  • Python Version Used: 3.5.2
  • spaCy Version Used: 1.5.0
  • Environment Information: English data version appears to be 1.1.0 given that I see the path spacy/data/en-1.1.0 under site-packages.

It appears that the number in the time is somehow being mapped to the ith element from IDS in attrs.pyx:

IDS = {
    "": NULL_ATTR,
    "IS_ALPHA": IS_ALPHA,
    "IS_ASCII": IS_ASCII,
    "IS_DIGIT": IS_DIGIT,
    "IS_LOWER": IS_LOWER,
    "IS_PUNCT": IS_PUNCT,
    "IS_SPACE": IS_SPACE,
    "IS_TITLE": IS_TITLE,
    "IS_UPPER": IS_UPPER,

For example, "8am" becomes IS_UPPER.
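The pattern can be checked against the IDS excerpt above. The following is only a minimal sketch of the hypothesis, not spaCy's actual lookup code: it assumes the attribute IDs are consecutive integers in the order the names are listed, and uses a plain list (the hypothetical ATTR_NAMES) to stand in for that mapping.

```python
# Names in the order they appear in the IDS excerpt. The list index stands in
# for the integer attribute ID (an assumption made for illustration only).
ATTR_NAMES = [
    "",          # 0: NULL_ATTR
    "IS_ALPHA",  # 1
    "IS_ASCII",  # 2
    "IS_DIGIT",  # 3
    "IS_LOWER",  # 4
    "IS_PUNCT",  # 5
    "IS_SPACE",  # 6
    "IS_TITLE",  # 7
    "IS_UPPER",  # 8
]


def name_for_id(attr_id):
    """Return the attribute name found at the given integer position."""
    return ATTR_NAMES[attr_id]


# The hour from "7pm" lands on IS_TITLE and the hour from "8am" on IS_UPPER,
# matching the tokens observed in the output.
print(name_for_id(7))
print(name_for_id(8))
```

Under that assumption, the hour taken as an integer lines up exactly with the names that show up in place of the digits.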

I think the issue is in language_data.py. The hour here is passed as an integer for ORTH and should be converted to a string first. I'm assuming that when it is a number it is treated as a lookup into IDS.

        exc["%dam" % hour] = [
            {ORTH: hour},
            {ORTH: "am", LEMMA: "a.m."}
        ]
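A possible shape for the fix, as a sketch only: format the hour once and use the resulting string for both the exception key and the ORTH value. The ORTH and LEMMA constants below are plain-string stand-ins so the snippet runs without spaCy, and the helper name build_time_exceptions, the 1 to 12 hour range, and the matching "pm" entries are assumptions rather than the actual code in language_data.py.

```python
# Stand-ins for spaCy's ORTH and LEMMA symbols so the sketch is self-contained.
ORTH = "orth"
LEMMA = "lemma"


def build_time_exceptions():
    """Build tokenizer exceptions for times such as "7pm" and "8am"."""
    exc = {}
    for hour in range(1, 13):  # assumed range of hours
        hour_text = "%d" % hour  # convert to a string once, then reuse it
        exc[hour_text + "am"] = [
            {ORTH: hour_text},  # a string, not the integer hour
            {ORTH: "am", LEMMA: "a.m."},
        ]
        exc[hour_text + "pm"] = [
            {ORTH: hour_text},
            {ORTH: "pm", LEMMA: "p.m."},
        ]
    return exc


exceptions = build_time_exceptions()
print(exceptions["7pm"])
```

With the hour stored as text, the first token of "7pm" keeps its original "7" instead of being interpreted as an attribute ID.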

When I add this special case to override the existing rule it works:

from spacy.symbols import ORTH, LEMMA, POS

nlp.tokenizer.add_special_case(
    '7pm',
    [
        {
            ORTH: '7',
            LEMMA: '7',
            POS: 'NUM'
        },
        {
            ORTH: 'pm',
            LEMMA: 'p.m.',
            POS: 'NOUN'
        }
    ])

Thanks, your analysis is definitely correct. Fixing.

Issue fixed and regression test passes! The fix should be included in the next release (coming later today).

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.