Tokenizer.add_special_case raises KeyError

Question

soldni opened this issue 8 years ago

The usage example provided in the documentation for Tokenizer.add_special_case raises a KeyError.

Steps to reproduce:

import spacy
from spacy.symbols import ORTH, LEMMA, POS

nlp = spacy.load('en')

nlp.tokenizer.add_special_case(u'gimme',
    [
        {
            ORTH: u'gim',
            LEMMA: u'give',
            POS: u'VERB'},
        {
            ORTH: u'me' }])

# Traceback (most recent call last):
#   File "test.py", line 13, in <module>
#     ORTH: u'me' }])
#   File "spacy/tokenizer.pyx", line 377, in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)
#   File "spacy/vocab.pyx", line 340, in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7879)
# KeyError: 'F'

Environment

Operating System: Ubuntu 16.04 / macOS 10.12.1
Python Version Used: CPython 3.5.2
spaCy Version Used: 1.2.0
Environment Information: n/a

Luca Soldaini · Answer 1 · Thu Nov 24 2016 02:16:03 GMT+0800 (China Standard Time)

A bit of follow-up: I went through the definition of spacy.vocab.Vocab.make_fused_token, and the code seems to expect the string keys 'F' (for ORTH), 'L' (for LEMMA), and 'pos' (for POS). However, the constants in spacy.symbols are integers: ORTH equals 65, LEMMA equals 73, and POS equals 74.

I am not sure whether the keys expected by make_fused_token are intentionally different from the values defined in spacy.symbols.
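
To make the mismatch concrete, here is a minimal sketch in plain Python that does not import spaCy. The function make_fused_token_sketch is a hypothetical stand-in, not the real Vocab.make_fused_token; it only assumes, as described above, that the lookup uses the string keys 'F', 'L', and 'pos'. The integer values are the ones reported in this thread for spaCy 1.2.0.

```python
# Integer IDs reported in this thread for spaCy 1.2.0 (not imported from spaCy).
ORTH, LEMMA, POS = 65, 73, 74


def make_fused_token_sketch(token_attrs):
    """Hypothetical stand-in: reads string keys, as the real code appears to."""
    fused = []
    for attrs in token_attrs:
        fused.append({
            "orth": attrs["F"],        # fails if the dict is keyed by the integer ORTH
            "lemma": attrs.get("L"),   # optional attribute
            "pos": attrs.get("pos"),   # optional attribute
        })
    return fused


# Dicts keyed by the integer symbols, as in the documented usage example.
documented = [{ORTH: "gim", LEMMA: "give", POS: "VERB"}, {ORTH: "me"}]

try:
    make_fused_token_sketch(documented)
    error = None
except KeyError as exc:
    error = exc  # the string key 'F' is not present in an int-keyed dict

# Dicts keyed by the strings pass this lookup step.
string_keyed = [{"F": "gim", "L": "give", "pos": "VERB"}, {"F": "me"}]
fused = make_fused_token_sketch(string_keyed)
```

In this sketch the integer-keyed dicts fail on the first lookup with a KeyError for 'F', which matches the traceback in the original report. The sketch only covers the key lookup; it says nothing about later steps inside spaCy.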

EDIT: Even after replacing the keys in the token_attrs argument with the strings described above, I still get an error:

import spacy

nlp = spacy.load('en')

nlp.tokenizer.add_special_case('gimme',
    [
        {
            'F': 'gim',
            'L': 'give',
            'pos': 'VERB'},
        {
            'F': 'me' }])

# Traceback (most recent call last):
#  File "test.py", line 13, in <module>
#    'F': 'me' }])
#  File "spacy/tokenizer.pyx", line 377, in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)
#  File "spacy/vocab.pyx", line 342, in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7907)
#  File "spacy/morphology.pyx", line 39, in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:3919)
# KeyError: 97

Matthew Honnibal · Answer 2 · Thu Nov 24 2016 07:17:56 GMT+0800 (China Standard Time)

Thanks for this.

The docs have gotten ahead of the library here. The current behaviour is pretty inconsistent, so I wrote up the intended usage but haven't had time to fix the code yet. I will definitely have this resolved in the next release, which should be up this week.
