explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

Tokenizer.add_special_case raises KeyError

soldni opened this issue

The usage example provided in the documentation for Tokenizer.add_special_case raises a KeyError.

Steps to reproduce:

import spacy
from spacy.symbols import ORTH, LEMMA, POS

nlp = spacy.load('en')

nlp.tokenizer.add_special_case(u'gimme',
    [
        {
            ORTH: u'gim',
            LEMMA: u'give',
            POS: u'VERB'},
        {
            ORTH: u'me' }])

# Traceback (most recent call last):
#   File "test.py", line 13, in <module>
#     ORTH: u'me' }])
#   File "spacy/tokenizer.pyx", line 377, in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)
#  File "spacy/vocab.pyx", line 340, in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7879)
# KeyError: 'F'

Environment

  • Operating System: Ubuntu 16.04 / macOS 10.12.1
  • Python Version Used: CPython 3.5.2
  • spaCy Version Used: 1.2.0
  • Environment Information: n/a

A bit of follow-up: I went through the definition of spacy.vocab.Vocab.make_fused_token, and it seems the code expects the attribute keys to be the strings 'F' (for the orth), 'pos' (for the POS), and 'L' (for the lemma); however, the constants exported by spacy.symbols are integers: ORTH equals 65, POS equals 74, and LEMMA equals 73.

I am not sure whether the keys expected by make_fused_token are intentionally different from the constants defined in spacy.symbols.
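
For reference, the mismatch is easy to see by printing the constants; on spaCy 1.2.0 I get the integer IDs below (a quick check, values may differ between versions):

from spacy.symbols import ORTH, LEMMA, POS

# spacy.symbols exports integer attribute IDs, not the string keys
# ('F', 'L', 'pos') that make_fused_token appears to look for.
print(ORTH, LEMMA, POS)  # 65 73 74 on spaCy 1.2.0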

EDIT: Even after replacing the keys in the token_attrs argument with the strings described above, I still encounter an error:

import spacy

nlp = spacy.load('en')

nlp.tokenizer.add_special_case('gimme',
    [
        {
            'F': 'gim',
            'L': 'give',
            'pos': 'VERB'},
        {
            'F': 'me' }])

# Traceback (most recent call last):
#   File "test.py", line 13, in <module>
#     'F': 'me' }])
#   File "spacy/tokenizer.pyx", line 377, in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)
#   File "spacy/vocab.pyx", line 342, in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7907)
#   File "spacy/morphology.pyx", line 39, in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:3919)
# KeyError: 97
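
My guess, untested: the second failure is in Morphology.assign_tag, so 'VERB' is presumably being looked up in the English tag map, which is keyed by fine-grained PTB tags such as 'VB' rather than coarse labels. If that is right, something like the following sketch might get past the KeyError, though I have not verified it:

import spacy

nlp = spacy.load('en')

# Speculative workaround: use a fine-grained PTB tag ('VB') instead of
# the coarse 'VERB' label, on the guess that assign_tag consults the
# English tag map, which uses PTB tags as keys.
nlp.tokenizer.add_special_case('gimme', [
    {'F': 'gim', 'L': 'give', 'pos': 'VB'},
    {'F': 'me'},
])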

Thanks for this.

The docs have gotten ahead of the library here: the current behaviour is pretty inconsistent, so I wrote up the intended usage, but I haven't had time to fix the implementation yet. This will definitely be resolved in the next release, which should be up this week.
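
For reference, the intended usage is the documented form above; once the fix is in, a call like this should work (the expected tokenization below is what the docs describe, not something verified against a released build yet):

import spacy
from spacy.symbols import ORTH, LEMMA, POS

nlp = spacy.load('en')

# The documented special case: 'gimme' splits into 'gim' + 'me',
# with the lemma and coarse POS attached to the first piece.
nlp.tokenizer.add_special_case(u'gimme', [
    {ORTH: u'gim', LEMMA: u'give', POS: u'VERB'},
    {ORTH: u'me'},
])

doc = nlp(u'gimme that')
print([t.text for t in doc])  # expected: ['gim', 'me', 'that']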


This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.