explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io


Tokenization issues

mfelice opened this issue · comments

Tokenization seems incorrect in a number of cases:

  1. Tokens incorrectly include punctuation at the beginning or in the middle. Punctuation at the end seems to be handled correctly, though. E.g.

Hello,world is currently kept as one token but should be Hello , world
.,;:hello:!.world is currently kept as one token but should be . , ; : hello : ! . world

  2. The dot seems to cause particular problems at the beginning of a token:

.Hello world. gives .Hello world . (but should be . Hello world .).

I suppose dots are preserved as part of a token in case they make up an acronym, but they should not be allowed at the beginning. Basically, no punctuation should be allowed at the beginning, middle or end, except hyphens/dashes/en-dashes in the middle for compounds (as pointed out in #302) and dots for acronyms (in the middle or end).

  3. Related to the above and following up from #325, there should be some disambiguation to determine whether a dot is a full stop or part of an acronym/abbreviation when it appears at the end. Maybe check if the token already has some other dot (a rough sketch of this check follows this list)? E.g.

a.m. > a.m.
CIA. > CIA .
K.G.B. > K.G.B.
.A. > . A .
.AB. > . AB .
.AB.C > . AB . C
.AB.C. > . AB.C.

Something like E.ON (the energy supplier) would cause trouble, but it would be a rare exception (in fact, it should be E·ON).

  4. Related to 2) and #302, you should allow any number of hyphens/dashes/en-dashes in tokens.

next-of-kin is currently next - of-kin
three-year-old is currently three - year-old
jack-in-the-box is currently jack - in-the-box

But they should each be one token. The third case is particularly interesting, as it produces a token with more than one hyphen (in-the-box): the tokenizer apparently splits only on the first hyphen.

  5. The word cannot is currently tokenized as can not. Strict grammarians would say there is a difference between these two forms, so cannot should not be tokenized as can not. I understand spaCy might not want to make this distinction, in which case I wonder how I can force the tokenizer to keep cannot as one word without modifying any files. Ideally, I'd like to add this exception dynamically while/after loading spaCy.
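Re the disambiguation proposed in 3., the trailing-dot check could be prototyped in plain Python before touching spaCy at all. This is only a sketch of the proposed heuristic (the function name is made up, and it deliberately ignores the leading-dot cases above), not spaCy's actual rule:

def split_final_dot(token):
    # Proposed heuristic: a trailing dot stays attached only if the
    # token already contains another dot (acronyms like K.G.B., a.m.).
    if token.endswith(".") and "." in token[:-1]:
        return [token]             # acronym/abbreviation: keep intact
    if token.endswith("."):
        return [token[:-1], "."]   # plain full stop: split it off
    return [token]

assert split_final_dot("a.m.") == ["a.m."]
assert split_final_dot("CIA.") == ["CIA", "."]
assert split_final_dot("K.G.B.") == ["K.G.B."]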

Thank you.

Thanks, am thinking these through.

Currently the tokenizer is fairly conservative in segmentation: it tends to under-segment rather than over-segment. I think we should switch to over-segmenting more often, and then use the .merge() function to merge numeric entities, dates, emails, URLs etc. back into single tokens.
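For anyone reading this later: in current spaCy versions the merge-back step looks roughly like this, using doc.retokenize() (the successor of the .merge() method mentioned above). The example text and the hand-picked span are illustrative only:

import spacy

nlp = spacy.blank("en")  # bare English tokenizer, no model needed
doc = nlp("Call me at 9 a.m. on 1 January 2017")

# Merge the over-segmented date back into a single token. The span is
# picked by hand here; a real pipeline would find it with the Matcher
# or an entity recognizer.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[6:9])  # the span "1 January 2017"

print([t.text for t in doc])
# ['Call', 'me', 'at', '9', 'a.m.', 'on', '1 January 2017']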

This sort of change takes some experimentation, though. It's at least partly an empirical question, because it's not easy to intuit what cases are common. I'll keep this ticket open and update when I've had a chance to experiment.

Are you aware of any quick fix for (3)?

@honnibal: I found another tokenization issue yesterday that was doing my head in. Possibly it's already mentioned above.

Turn on the tv. = turn on the tv . (correct)
Turn on the TV. = turn on the TV. (the trailing dot is made part of POBJ)

This issue should be fixed with the recent updates to the language data.

Re 1./2./3. Hello,world and similar tokens, uppercase abbreviations (K.G.B., E.ON, TV) and common exceptions (a.m.) are now handled correctly. When it comes to unexpected input like .,;:hello:!.world or even .AB.C., we want to stay conservative in segmenting the punctuation.

Re 4. The inconsistency should now be fixed – unless an exception is added, all infix hyphens are split. If you want to add custom tokenization rules, for example to keep next-of-kin as one token, you can whitelist specific words or override the default rules with your own regular expressions (see the sketch below).
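A sketch of both options, assuming a recent spaCy version. The filter on the default infix patterns is an assumption about how the hyphen rule is written, so treat it as illustrative rather than definitive:

import spacy
from spacy.attrs import ORTH
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Option 1: whitelist a specific word so it survives as one token.
nlp.tokenizer.add_special_case("next-of-kin", [{ORTH: "next-of-kin"}])

# Option 2: rebuild the infix patterns without the hyphen rule.
# Assumption: the default hyphen pattern is the only one containing
# an en dash, which is how the current English defaults are written.
infixes = [p for p in nlp.Defaults.infixes if "–" not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("Her next-of-kin is a three-year-old.")])
# ['Her', 'next-of-kin', 'is', 'a', 'three-year-old', '.']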

Re 5. To stay consistent with the parser's training data, spaCy follows the Penn Treebank tokenization scheme, which splits cannot into two tokens. This behaviour can be modified via the tokenizer exceptions, though (see below).
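For anyone looking for the concrete call, adding such an exception at runtime is a one-liner with the public add_special_case API; a minimal sketch:

import spacy
from spacy.attrs import ORTH

nlp = spacy.blank("en")  # or spacy.load(...) for a full pipeline

# Override the default PTB-style exception that splits "cannot".
nlp.tokenizer.add_special_case("cannot", [{ORTH: "cannot"}])

print([t.text for t in nlp("I cannot do that.")])
# ['I', 'cannot', 'do', 'that', '.']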

Sorry to bump this thread, but it seems like the special cases for English (e.g. Mr.) do not work properly in a lowercase setting.

In [1]: import spacy

In [2]: en_nlp = spacy.load('en')

In [3]: [str(token) for token in en_nlp.tokenizer("Mr. Smith says hello.")]
Out[3]: ['Mr.', 'Smith', 'says', 'hello', '.']

In [4]: [str(token) for token in en_nlp.tokenizer("Mr. Smith says hello.".lower())]
Out[4]: ['mr', '.', 'smith', 'says', 'hello', '.']

I am aware that I could just add some exceptions, but I don't think I could catch them all. I was wondering if there's any quick fix on your side.
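A partial workaround (a sketch only, and as noted above it won't catch every exception) is to register lowercased variants of the special cases you care about:

import spacy
from spacy.attrs import ORTH

nlp = spacy.blank("en")

# Mirror a few common title abbreviations in lowercase. The list is
# illustrative; the full set of English exceptions is much larger.
for text in ("mr.", "mrs.", "dr.", "prof."):
    nlp.tokenizer.add_special_case(text, [{ORTH: text}])

print([t.text for t in nlp("mr. smith says hello.")])
# ['mr.', 'smith', 'says', 'hello', '.']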


This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.