explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

KeyError when adding special tokens

fmfn opened this issue

I am trying to run the example of adding special tokens to the tokenizer and am getting the following KeyError:

<ipython-input-3-3c4362d5406f> in <module>()
      8             POS: u'VERB'},
      9         {
---> 10             ORTH: u'me'}])
     11 assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']
     12 assert [w.lemma_ for w in nlp(u'gimme that')] == [u'give', u'-PRON-', u'that']

/Users/<user>/venvs/general/lib/python3.5/site-packages/spacy/tokenizer.pyx in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)()

/Users/<user>/venvs/general/lib/python3.5/site-packages/spacy/vocab.pyx in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7879)()

KeyError: 'F'

The code used is the following:

import spacy
from spacy.attrs import ORTH, POS, LEMMA

nlp = spacy.load("en", parser=False)

# Default tokenization keeps "gimme" as a single token.
assert [w.text for w in nlp(u'gimme that')] == [u'gimme', u'that']

# Register a special case so "gimme" is split into "gim" + "me",
# with a lemma and POS on the first piece.
nlp.tokenizer.add_special_case(u'gimme',
    [
        {
            ORTH: u'gim',
            LEMMA: u'give',
            POS: u'VERB'},
        {
            ORTH: u'me'}])
assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']
assert [w.lemma_ for w in nlp(u'gimme that')] == [u'give', u'-PRON-', u'that']

Am I missing something here?

System info:

  • macOS
  • Python 3.5.2
  • spaCy 1.2.0

Sorry about this — the docs got a bit ahead of the code here. The docs describe how the feature should work, and will work shortly (I'll probably fix it over the weekend).

At the moment you can use the key "F" instead of ORTH, "L" instead of LEMMA, and "pos" instead of POS.

Nice!

I got it to work by passing 'F' and working backwards after tracing the make_fused_token method, but "L" and "pos" were much harder to find.

Thanks for the lightning-fast reply and the superb work.

After changing it to:

# Using the string keys suggested above instead of the attribute IDs.
nlp.tokenizer.add_special_case(
    u'gimme',
    [
        {
            "F": u'gim',
            "L": u'give',
            "pos": u'VERB'
        },
        {
            "F": u'me',
        }
    ]
)

I get:

KeyError                                  Traceback (most recent call last)
<ipython-input-6-df7b9eb25a34> in <module>()
      8         },
      9         {
---> 10             "F": u'me',
     11         }
     12     ]

/Users/<>/venvs/general/lib/python3.5/site-packages/spacy/tokenizer.pyx in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)()

/Users/<>/venvs/general/lib/python3.5/site-packages/spacy/vocab.pyx in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7907)()

/Users/<>/venvs/general/lib/python3.5/site-packages/spacy/morphology.pyx in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:3919)()

KeyError: 97

This should now be fixed on master. Thanks for your patience.
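For anyone landing here later, here is a minimal sketch of the documented usage from the original report, which the fix on master is meant to restore as written. It assumes a spaCy build that already includes the fix and the same "en" model used above:

import spacy
from spacy.attrs import ORTH, LEMMA, POS

nlp = spacy.load("en", parser=False)

# With the fix, the attribute IDs from spacy.attrs should be accepted
# directly, as the docs describe, instead of the "F"/"L"/"pos" string keys.
nlp.tokenizer.add_special_case(u'gimme', [
    {ORTH: u'gim', LEMMA: u'give', POS: u'VERB'},
    {ORTH: u'me'},
])

assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']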


This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.