explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

KeyError when adding special tokens

fmfn opened this issue

I am trying to run the example of adding special tokens to the tokenizer and am getting the following KeyError:

<ipython-input-3-3c4362d5406f> in <module>()
      8             POS: u'VERB'},
      9         {
---> 10             ORTH: u'me'}])
     11 assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']
     12 assert [w.lemma_ for w in nlp(u'gimme that')] == [u'give', u'-PRON-', u'that']

/Users/<user>/venvs/general/lib/python3.5/site-packages/spacy/tokenizer.pyx in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)()

/Users/<user>/venvs/general/lib/python3.5/site-packages/spacy/vocab.pyx in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7879)()

KeyError: 'F'

The code used is the following:

import spacy
from spacy.attrs import ORTH, POS, LEMMA

nlp = spacy.load("en", parser=False)

# Default tokenization keeps "gimme" as a single token.
assert [w.text for w in nlp(u'gimme that')] == [u'gimme', u'that']

# Register a special case so "gimme" is split into "gim" + "me",
# with a lemma and POS on the first piece.
nlp.tokenizer.add_special_case(u'gimme',
    [
        {
            ORTH: u'gim',
            LEMMA: u'give',
            POS: u'VERB'},
        {
            ORTH: u'me'}])
assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']
assert [w.lemma_ for w in nlp(u'gimme that')] == [u'give', u'-PRON-', u'that']

Am I missing something here?

System info:

  • macOS
  • Python 3.5.2
  • spaCy 1.2.0

Sorry about this — the docs got a bit ahead of the code here. The docs describe how the feature should work, and will work shortly (I'll probably fix it over the weekend).

At the moment you can use the key "F" instead of ORTH, "L" instead of LEMMA, and "pos" instead of POS.

Nice!

I got it to work by passing 'F' and working backwards after tracing the make_fused_token method, but "L" and "pos" were much harder to find.

Thanks for the lightning-fast reply and the superb work.

After changing it to:

# Using the string keys suggested above instead of the attribute IDs.
nlp.tokenizer.add_special_case(
    u'gimme',
    [
        {
            "F": u'gim',
            "L": u'give',
            "pos": u'VERB'
        },
        {
            "F": u'me',
        }
    ]
)

I get:

KeyError                                  Traceback (most recent call last)
<ipython-input-6-df7b9eb25a34> in <module>()
      8         },
      9         {
---> 10             "F": u'me',
     11         }
     12     ]

/Users/<>/venvs/general/lib/python3.5/site-packages/spacy/tokenizer.pyx in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)()

/Users/<>/venvs/general/lib/python3.5/site-packages/spacy/vocab.pyx in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7907)()

/Users/<>/venvs/general/lib/python3.5/site-packages/spacy/morphology.pyx in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:3919)()

KeyError: 97

This should now be fixed on master. Thanks for your patience.
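For anyone landing here later, here is a minimal sketch of the documented usage from the original report, which the fix on master is meant to restore as written. It assumes a spaCy build that already includes the fix and the same "en" model used above:

import spacy
from spacy.attrs import ORTH, LEMMA, POS

nlp = spacy.load("en", parser=False)

# With the fix, the attribute IDs from spacy.attrs should be accepted
# directly, as the docs describe, instead of the "F"/"L"/"pos" string keys.
nlp.tokenizer.add_special_case(u'gimme', [
    {ORTH: u'gim', LEMMA: u'give', POS: u'VERB'},
    {ORTH: u'me'},
])

assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']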


This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.