Times such as "7pm" tokenized wrong

Question

matthayes opened this issue 8 years ago

There appears to be a bug in how times are tokenized for English.

import spacy

nlp = spacy.load("en")
doc = nlp("We're meeting at 7pm.")

for token in doc:
    print(token, token.pos_, token.lemma_)

This produces:

We PRON -PRON-
're VERB 're
meeting VERB meet
at ADP at
IS_TITLE PROPN is_title
pm NOUN pm
. PUNCT .

Instead of IS_TITLE PROPN is_title I was expecting 7 NUM 7, which is what you get if you use "7 pm" instead (with a space in between). I see that TOKENIZER_EXCEPTIONS includes a number of exceptions to handle this type of case, so I'm confused about why it doesn't work. Also, it seems that the "7" should be preserved instead of being replaced with IS_TITLE.

Your Environment

Operating System: Mac OSX 10.11.6
Python Version Used: 3.5.2
spaCy Version Used: 1.5.0
Environment Information: English data version appears to be 1.1.0 given that I see the path spacy/data/en-1.1.0 under site-packages.

Matthew Hayes · Answer 1 · Thu Jan 12 2017 12:53:23 GMT+0800 (China Standard Time)

It appears that the number in the time is somehow being mapped to the ith element from IDS in attrs.pyx:

IDS = {
    "": NULL_ATTR,
    "IS_ALPHA": IS_ALPHA,
    "IS_ASCII": IS_ASCII,
    "IS_DIGIT": IS_DIGIT,
    "IS_LOWER": IS_LOWER,
    "IS_PUNCT": IS_PUNCT,
    "IS_SPACE": IS_SPACE,
    "IS_TITLE": IS_TITLE,
    "IS_UPPER": IS_UPPER,

For example, "8am" becomes IS_UPPER.
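To make the suspected mapping concrete, here is a minimal sketch in plain Python. It does not use spaCy; ATTR_NAMES and resolve_orth are hypothetical stand-ins, and the sketch assumes the numeric IDs follow the order of the excerpt above, which is at least consistent with "7pm" yielding IS_TITLE and "8am" yielding IS_UPPER.

```python
# Attribute names in the order they appear in the IDS excerpt above.
# Assumption: each name's numeric ID equals its position in this list.
ATTR_NAMES = [
    "",          # NULL_ATTR
    "IS_ALPHA",
    "IS_ASCII",
    "IS_DIGIT",
    "IS_LOWER",
    "IS_PUNCT",
    "IS_SPACE",
    "IS_TITLE",
    "IS_UPPER",
]


def resolve_orth(value):
    """Hypothetical stand-in for how a token's text might be resolved.

    An integer is treated as an ID and looked up by position;
    a string is kept as the literal token text.
    """
    if isinstance(value, int):
        return ATTR_NAMES[value]
    return value


as_int_7 = resolve_orth(7)    # integer hour, looked up as an ID
as_int_8 = resolve_orth(8)
as_str_7 = resolve_orth("7")  # string hour, kept as text
```

Under that assumption, an integer 7 resolves to the name at position 7 and an integer 8 to the name at position 8, while the string "7" passes through unchanged.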

Matthew Hayes · Answer 2 · Thu Jan 12 2017 13:13:59 GMT+0800 (China Standard Time)

I think the issue is in language_data.py. The hour here should be converted to a string. I'm assuming when it is a number it becomes a lookup into IDS.

        exc["%dam" % hour] = [
            {ORTH: hour},
            {ORTH: "am", LEMMA: "a.m."}
        ]

When I add this special case to override the existing rule it works:

from spacy.symbols import ORTH, LEMMA, POS

nlp.tokenizer.add_special_case(
    '7pm',
    [
        {
            ORTH: '7',
            LEMMA: '7',
            POS: 'NUM'
        },
        {
            ORTH: 'pm',
            LEMMA: 'p.m.',
            POS: 'NOUN'
        }
    ])
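For reference, the string conversion suggested in this comment might look roughly like the sketch below. This is not the actual patch that went into spaCy. ORTH and LEMMA are plain-string stand-ins so the snippet runs without spaCy installed, build_time_exceptions is a hypothetical helper, and the hour range 1 to 12 is an assumption, since the loop bounds are not shown in the excerpt.

```python
# Plain-string stand-ins for spaCy's attribute symbols (assumption:
# used here only so the sketch runs without spaCy installed).
ORTH, LEMMA = "orth", "lemma"


def build_time_exceptions(hours=range(1, 13)):
    """Build tokenizer exceptions for times such as "7am" and "7pm".

    The hour is formatted to a string once and reused, so the ORTH
    value is always text and never an integer.
    """
    exc = {}
    for hour in hours:
        hour_text = "%d" % hour  # the key change: a string, not an int
        exc[hour_text + "am"] = [
            {ORTH: hour_text},
            {ORTH: "am", LEMMA: "a.m."},
        ]
        exc[hour_text + "pm"] = [
            {ORTH: hour_text},
            {ORTH: "pm", LEMMA: "p.m."},
        ]
    return exc


exceptions = build_time_exceptions()
```

With the hour stored as text, the first sub-token of exceptions["7pm"] carries "7" as its ORTH value, matching what the special-case workaround above sets by hand.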

Matthew Honnibal · Answer 3 · Thu Jan 12 2017 17:50:59 GMT+0800 (China Standard Time)

Thanks, your analysis is definitely correct. Fixing.

Ines Montani · Answer 4 · Thu Jan 12 2017 18:50:11 GMT+0800 (China Standard Time)

Issue fixed and regression test passes! The fix should be included in the next release (coming later today).

lock · Answer 5 · Wed May 09 2018 12:38:57 GMT+0800 (China Standard Time)

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.