Tokenizer.add_special_case raises KeyError
soldni opened this issue
The usage example provided in the documentation for Tokenizer.add_special_case raises a KeyError.
Steps to reproduce:
import spacy
from spacy.symbols import ORTH, LEMMA, POS
nlp = spacy.load('en')
nlp.tokenizer.add_special_case(u'gimme',
    [
        {
            ORTH: u'gim',
            LEMMA: u'give',
            POS: u'VERB'},
        {
            ORTH: u'me' }])
# Traceback (most recent call last):
# File "test.py", line 13, in <module>
# ORTH: u'me' }])
# File "spacy/tokenizer.pyx", line 377, in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)
# File "spacy/vocab.pyx", line 340, in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7879)
# KeyError: 'F'
Environment
- Operating System: Ubuntu 16.04 / macOS 10.12.1
- Python Version Used: CPython 3.5.2
- spaCy Version Used: 1.2.0
- Environment Information: n/a
A bit of follow up: I was going through the definition for spacy.vocab.Vocab.make_fused_token, and it seems that code expects ORTH to be equal to 'F', POS to be equal to 'pos', and LEMMA to be equal to 'L'; however, ORTH equals 65, POS equals 74, and LEMMA equals 73.
I am not sure if the values expected by make_fused_token are intentionally different from those defined in spacy.symbols.
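To make the mismatch concrete, here is a minimal sketch that runs without spaCy. It assumes the integer IDs reported above (65, 73, 74) and the string keys that make_fused_token appears to read ('F', 'L', 'pos'). The names ID_TO_STRING_KEY and to_string_keys are hypothetical and are not part of spaCy. The sketch only shows why the first lookup fails and is not a workaround.

```python
# Integer symbol IDs as reported in this issue for spaCy 1.2.0.
# They are illustrative constants here, not a stable part of the spaCy API.
ORTH, LEMMA, POS = 65, 73, 74

# Hypothetical mapping from the integer IDs to the string keys that
# make_fused_token appears to expect, based on the reading above.
ID_TO_STRING_KEY = {ORTH: "F", LEMMA: "L", POS: "pos"}


def to_string_keys(token_attrs):
    """Return a copy of token_attrs with integer keys renamed to string keys."""
    return [
        {ID_TO_STRING_KEY.get(key, key): value for key, value in attrs.items()}
        for attrs in token_attrs
    ]


# The attributes as written in the documented usage example.
documented = [{ORTH: "gim", LEMMA: "give", POS: "VERB"}, {ORTH: "me"}]

# A dict keyed by integers has no entry for the string 'F', so a lookup
# by 'F' raises KeyError, just like the first traceback.
try:
    documented[0]["F"]
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]

converted = to_string_keys(documented)
print(missing_key)
print(converted)
```

Renaming the keys this way only removes the first KeyError. As the edit below shows, the call still fails further down in spaCy 1.2.0.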
EDIT: Even after replacing the keys in the token_attrs argument as described above, I still encounter an error:
import spacy
nlp = spacy.load('en')
nlp.tokenizer.add_special_case('gimme',
    [
        {
            'F': 'gim',
            'L': 'give',
            'pos': 'VERB'},
        {
            'F': 'me' }])
# Traceback (most recent call last):
# File "test.py", line 13, in <module>
# 'F': 'me' }])
# File "spacy/tokenizer.pyx", line 377, in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)
# File "spacy/vocab.pyx", line 342, in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7907)
# File "spacy/morphology.pyx", line 39, in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:3919)
# KeyError: 97
Thanks for this.
The docs have gotten ahead of the library here — the current/old behaviour is pretty inconsistent, so I wrote up the intended usage, but haven't had time to fix it yet. Will definitely have this resolved in the next release, which should be up this week.