explosion / spaCy

πŸ’« Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io


πŸ’« Better, faster and more customisable matcher

ines opened this issue Β· comments

Related issues: #1567, #1711, #1819, #1939, #1945, #1951, #2042

We're currently in the process of rewriting the match loop, fixing long-standing issues and making it easier to extend the Matcher and PhraseMatcher. The community contributions by @GregDubbin and @savkov have already made a big difference – we can't wait to get it all ready and shipped.

This issue discusses some of the planned new features and additions to the match patterns API, including matching by custom extension attributes (Token._.), regular expressions, set membership and rich comparison for numeric values.

New features

Custom extension attributes

spaCy v2.0 introduced custom extension attributes on the Doc, Span and Token. Custom attributes make it easier to attach arbitrary data to the built-in objects, and let users take advantage of spaCy's data structures and the Doc object as the "single source of truth". However, not being able to match on custom attributes was quite limiting (see #1499, #1825).

The new patterns spec will allow an '_' space on token patterns, which can map to a dictionary keyed by the attribute names:

Token.set_extension('is_fruit', getter=lambda token: token.text in ('apple', 'banana'))

pattern = [{'LEMMA': 'have'}, {'_': {'is_fruit': True}}]
matcher.add('HAVING_FRUIT', None, pattern)

Both regular attribute extensions (with a default value) and property extensions (with a getter) will be supported and can be combined for more exact matches.

pattern = [{'_': {'is_fruit': True, 'fruit_color': 'red', 'fruit_rating': 5}}]
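For illustration, the extensions referenced in this combined pattern might be registered roughly like this (a sketch; the getter logic, defaults and fruit names are made up for the example, not part of the proposal):

from spacy.tokens import Token

FRUITS = ('apple', 'banana', 'cherry')

# property extension: computed on the fly via a getter
Token.set_extension('is_fruit', getter=lambda token: token.text in FRUITS)

# regular attribute extensions: default values, set elsewhere in the pipeline
Token.set_extension('fruit_color', default=None)
Token.set_extension('fruit_rating', default=0)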

Rich comparison for numeric values

Token patterns already allow specifying a LENGTH (the token's character length). However, matching tokens of between five and ten characters previously required adding 6 copies of the exact same pattern, introducing unnecessary overhead. Numeric attributes can now also specify a dictionary with the predicate (e.g. '>' or '<=') mapped to the value. For example:

pattern = [{'ENT_TYPE': 'ORG', 'LENGTH': 5}]          # exact length
pattern = [{'ENT_TYPE': 'ORG', 'LENGTH': {'>=': 5}}]  # length with predicate

The second pattern above will match a token with the entity type ORG that's 5 or more characters long. Combined with custom attributes, this allows very powerful queries combining both linguistic features and numeric data:

# match a token based on custom numeric attributes
pattern = [{'_': {'fruit_rating': {'>': 7}, 'fruit_weight': {'>=': 100, '<': 300}}}]

# match a verb with ._.sentiment_score >= 0.5 and one token on each side
pattern = [{}, {'POS': 'VERB', '_': {'sentiment_score': {'>=': 0.5}}}, {}]

Defining predicates and values as a dictionary instead of a single string like '>=5' allows us to avoid string parsing, and lets spaCy handle custom attributes without requiring the user to specify their types upfront. (While we know the type of the built-in LENGTH attribute, spaCy has no way of knowing whether the value '<3' of a custom attribute should be interpreted as "less than 3", or the heart emoticon.)

Set membership

This is another feature that has been requested before and will now be much easier to implement. Similar to the predicate mapping for numeric values, token attributes can now also be defined as dictionaries. The keys IN or NOT_IN can be used to indicate set membership and non-membership.

pattern = [{'LEMMA': {'IN': ['like', 'love']}}, 
           {'LOWER': {'IN': ['apples', 'bananas']}}]

The above pattern will match a token with the lemma "like" or "love", followed by a token whose lowercase form is either "apples" or "bananas". For example, "loving apples" or "likes bananas". Lists can be used for all non-boolean values, including custom _ attributes:

# verb or conjunction followed by custom is_fruit token
pattern = [{'POS': {'IN': ['VERB', 'CONJ', 'CCONJ']}}, 
           {'_': {'is_fruit': True, 'fruit_color': {'NOT_IN': ['red', 'yellow']}}}]

# set membership of numeric custom attributes
pattern = [{'_': {'is_customer': True, 'customer_year': {'IN': [2018, 2017, 2016]}}}]

# combination of predicates and non-membership
pattern = [{'_': {'custom_count': {'<=': 100, 'NOT_IN': [12, 66, 79]}}}]

Regular expressions

Using regular expressions within token patterns is already possible via custom binary flags (see #1567). However, this has some inconvenient limitations – including the patterns not being JSON-serializable. If the solution is to add binary flags, spaCy might as well take care of that. The following example is based on the work by @savkov (see #1833):

pattern = [{'ORTH': {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'}},
           {'LOWER': 'president'}]

'REGEX' as an operator (instead of a top-level property that only matches on the token's text) allows defining rules for any string value, including custom attributes:

# match tokens with fine-grained POS tags starting with 'V'
pattern = [{'TAG': {'REGEX': '^V'}}]

# match custom attribute values with regular expressions
pattern = [{'_': {'country': {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'}}}]

New operators

TL;DR: The new patterns spec will allow two ways of defining properties – attribute values for exact matches and dictionaries using operators for more fine-grained matches.

{
    PROPERTY: value,                  # exact match
    PROPERTY: {OPERATOR: value, ...}  # match with operators
}

The following operators can be used within dictionaries describing attribute values:

  • ==, >=, <=, >, < (int, float) – attribute value is equal to, greater than or equal to, less than or equal to, greater than, or less than the given value. Example: 'LENGTH': {'>': 10}
  • IN (any) – attribute value is a member of the given list. Example: 'LEMMA': {'IN': ['like', 'love']}
  • NOT_IN (any) – attribute value is not a member of the given list. Example: 'POS': {'NOT_IN': ['NOUN', 'PROPN']}
  • REGEX (unicode) – attribute value matches the given regular expression. Example: 'TAG': {'REGEX': '^V'}
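Putting it together, a single token pattern can freely mix exact values with operator dictionaries. A sketch combining the operators above (the attribute choices are purely illustrative):

pattern = [
    {'POS': 'PROPN', 'ENT_TYPE': {'NOT_IN': ['PERSON']}},  # exact match plus set operator
    {'LEMMA': {'IN': ['acquire', 'buy']}},                  # set membership
    {'ENT_TYPE': 'ORG', 'LENGTH': {'>=': 3, '<=': 30}}      # exact match plus rich comparison
]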

API improvements and bug fixes

See @honnibal's comments in #1945 and the feature/better-faster-matcher branch for more details and implementation examples.

Other fixes

  • #1711: Remove hard-coded length limit of 10 on the PhraseMatcher.
  • #1939: Fix pickling of PhraseMatcher.
  • #1951: Matcher.pipe should yield matches instead of Doc objects.
  • #2042: Support deleting rules in PhraseMatcher.
  • Accept "TEXT" as an alternative to "ORTH" (for consistency); see the example below.
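For example, once the alias is accepted, these two patterns should be equivalent (a small sketch):

pattern_orth = [{'ORTH': 'Apple'}, {'LOWER': 'watch'}]
pattern_text = [{'TEXT': 'Apple'}, {'LOWER': 'watch'}]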

Serializing the PhraseMatcher would be great.

Thank you! This will be awesome! I basically need all these new features! Do you have any estimates on when this will be available to try out?

A further improvement on this might be to enable running a matcher after other matchers, i.e. to first recognize a pattern, add that pattern as an attribute and then run a matcher on that attribute again. Example:

I "walked" to the bus stop.
I (tag), " (quote), walked (verb), " (quote), etc...

convert it to:

I (noun), "walked" (verb), to (adv), etc...

And then recognize this pattern:

Noun, Verb, Adv

This might also simplify pattern creation, because you can generalize and match simple patterns and reuse them in later patterns.

For example, I want all "adj + verb" and "verb + adj" patterns to be matched and marked as "ACTION", and then use this in another matcher later:

They run fast.
He is a fast runner.

Then match something like this:
[{"POS": {"IN": ["PRON", "DET"]}}, {}, {"MATCH": "ACTION"}]

Where the key "MATCH" is previous matches and ACTION is the previous match id.

Lastly, this would also solve issues related to "or" patterns for patterns that depend on two or more tokens, since you can then combine the new set feature "IN"/"NOT_IN" with previously matched patterns.

This would allow for much more complex types of patterns. I hope you will consider this extension and I'll be happy to contribute to developing it. I think all that is needed is to set a hierarchy variable on match patterns, and then run the matchers in each hierarchy level after one another, storing previous matches as attributes.

EDIT: I realized after writing this that this can be done using the new logic for custom attributes ("_"), if I'm not mistaken, by adding matches to a custom variable on the doc object and then running the second set of matchers after that. However, it requires quite a bit of custom logic.
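For reference, the two-stage approach described above could look roughly like this once matching on '_' attributes is available (a minimal sketch; the is_action extension, the POS choices and the example sentence are made up, and the exact matches depend on the tagger):

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.load('en_core_web_sm')
Token.set_extension('is_action', default=False)

def mark_action(matcher, doc, i, matches):
    # on_match callback: flag the matched tokens so a later matcher can reuse them
    match_id, start, end = matches[i]
    for token in doc[start:end]:
        token._.is_action = True

first = Matcher(nlp.vocab)
first.add('ACTION', mark_action,
          [{'POS': 'VERB'}, {'POS': {'IN': ['ADJ', 'ADV']}}],
          [{'POS': {'IN': ['ADJ', 'ADV']}}, {'POS': 'VERB'}])

second = Matcher(nlp.vocab)
second.add('SUBJECT_ACTION', None,
           [{'POS': {'IN': ['PRON', 'DET', 'PROPN']}},
            {'_': {'is_action': True}, 'OP': '+'}])

doc = nlp('They run fast.')
first(doc)             # first pass sets the is_action flags
matches = second(doc)  # second pass matches on those flags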

@ohenrik isn't this already possible through an _ parameter? Something like the example with a country pattern above:

# match custom attribute values with regular expressions
pattern = [{'_': {'country': {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'}}}]

An alternative to the hierarchical matching patterns would be to allow for sub patterns directly in the patterns:

At the moment every rule is for one token. So like this:
Pattern = [token_rule, token_rule, token_rule]

But by adding a sub list you could actually create a rule set that can use all the other operators in combination:

Pattern = [ [token_rule, token_rule, token_rule], [token_rule, token_rule, token_rule] ]

Meaning this would be possible:
Pattern = ["pattern NOT_IN": [sub_pattern, sub_pattern], token_rule]

@savkov Yes, I realized it will be possible, but it's a bit of a hassle to get working. I think having nested patterns would solve this better, as I have proposed above. The example might not be that clear though.

@ohenrik @savkov @GregDubbin @thomasopsomer
The feature/better-faster-matcher branch now has a module spacy.matcher2 with the reimplemented Matcher. It passes all tests, including ones which had been marked xfail. If you build the branch, you should be able to compare results of new and old by just importing from spacy.matcher2 and spacy.matcher.

Now we need tests with non-trivial grammars. Benchmarks would be particularly useful. I'm hoping the new matcher is faster, but it's hard to guess. If you have a non-trivial grammar handy, could you try it out and see how it looks?

Once we're confident the reimplementation is correct, we can push a v2.0.8 that makes the replacement, to fix #1945 . Then we can implement the fancy new pattern logic for v2.1.0 πŸŽ‰

Hm so far the PhraseMatcher seems much slower. Running more tests.

@honnibal I can test it out too, if you still need someone to test it for speed? I don't have code using the PhraseMatcher, but I can test most of the other old functionality against the new.

@ohenrik Yes please. I just pushed an update that fixes the speed problems on my PhraseMatcher benchmark, using the data from https://github.com/mpuig/spacy-lookup

Taking the first 1000 documents of IMDB train, and using the 400k patterns in spacy-lookup, we get:

System          Matches   Words per second
matcher          24,861         2,173,959
matcher2         24,861         2,001,323
spacy-lookup     23,693           104,664

So, speed is fine (1.7m tokens per second) -- but I'm getting fewer matches, so it seems like something is wrong. Edit: Fixed.

Once the doc is tokenized, using the PhraseMatcher is 20x faster than using a string-based algorithm like Aho-Corasick. This does make sense: the inputs are significantly shorter in token-space than in string-space, so we get a big advantage from matching over tokens. If we measure from raw strings, Flashtext comes out faster, especially with the current efficiency problems in the tokenizer.

I think the difference in number of matches between the PhraseMatcher and spacy-lookup comes from the handling of overlaps. I think it's nice to be able to return all matches, even if the domains of two patterns overlap. This is different from how regex engines work, but I think it's better semantics for what we want here. If I have patterns for "fresh food" and "food delivery", we do want matches for both if the input text is "fresh food delivery".

I have patterns for "fresh food" and "food delivery", we do want matches for both if the input text is "fresh food delivery".

@honnibal I agree, in most of the use cases I have and foresee I would want to include overlapping matches :) A switch for including overlapping matches or not would be a bonus though.
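To make the overlap semantics concrete, here is a minimal sketch (assuming the new behaviour of returning all candidate matches):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
matcher.add('FOOD_PHRASES', None, nlp(u'fresh food'), nlp(u'food delivery'))

doc = nlp(u'We offer fresh food delivery in town.')
# both overlapping phrases are returned: "fresh food" and "food delivery"
print([doc[start:end].text for _, start, end in matcher(doc)])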

Related to the PhraseMatcher: is it possible to add patterns to this? I.e. not just match exact words in sequence, but match token TAG or POS sequences too?

@ohenrik The PhraseMatcher exploits the fact that the TokenC struct holds a pointer to a LexemeC from the vocabulary. This means that every instance of the word "Bill" is pointing to the same LexemeC data. We can then add a flag on that lexeme for the value B-PER, and run the matcher with only a few patterns. Once we have the match candidates, we filter them so that we don't get combinations of start/middle/end sequences that weren't in the actual phrases.

The trick fundamentally relies on the fact that the lexemes are indexed by ORTH. If we try to make the pattern a sequence of, say, LEMMA attributes, we'll run into the problem that the token's lexeme is indexed by the token's ORTH. For the ORTH attribute, we have a set of flags we can look up -- for the LEMMA, we don't.

How about this. We could build a temporary Doc that uses the attribute you're interested in as the ORTH value:

lemmas = Doc(doc.vocab, words=[token.lemma_ for token in doc])

Then we can pass this lemmas instance into the PhraseMatcher. The start and end offsets of the tokens will be correctly aligned to the original Doc, so we can return them directly. Constructing the secondary Doc object should be fairly cheap. We could build this into the PhraseMatcher, so that we can finally support choice of attribute. If a non-ORTH attribute is set, the PhraseMatcher would construct the temporary doc internally.

(I wish I'd thought of this solution earlier --- it would've worked all along!)
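A rough sketch of that workaround as it could be used today, before it is built into the PhraseMatcher itself (the sentence and phrase are made up, and the exact lemmas depend on the model):

import spacy
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'She bought companies last year.')

# temporary Doc whose ORTH values are the lemmas of the original doc
lemmas = Doc(doc.vocab, words=[token.lemma_ for token in doc])

matcher = PhraseMatcher(doc.vocab)
# the pattern is expressed as lemmas too, so "buy company" matches "bought companies"
matcher.add('BUY_COMPANY', None, Doc(doc.vocab, words=[u'buy', u'company']))

for match_id, start, end in matcher(lemmas):
    print(doc[start:end].text)  # token offsets align with the original doc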

@honnibal Nice explanation and solution! πŸŽ‰ Thank you! πŸ™‚

The feature/better-faster-matcher branch now has the improved code in matcher.pyx, so it could be merged. The only thing I'm worried about right now is that the changed semantics will be a breaking change for some people, especially the inclusion of non-longest matches. I've added a filtering function, and I guess we can have an option that's on by default and that the PhraseMatcher disables.

Now that the Matcher is returning all matches for quantifiers, we can remove the length limit in the PhraseMatcher πŸŽ‰. Previously, if we had a pattern like [{'ORTH': 'a'}, {'ORTH': 'b', 'OP': '+'}, {'ORTH': 'c'}], and we had a sequence "abbc", we would only return a match for "abbc". This meant we couldn't use quantifiers in the PhraseMatcher, because we would need "abc" as a candidate phrase as well. Now that the matcher returns both candidates, we can make the PhraseMatcher patterns use quantifiers, removing the length limit.

This change currently makes the phrase-matcher a bit less efficient: speed drops from 2m tokens per second to 1.3m. This is still much faster than the tokenizer, and we can recover the missing speed if necessary by having more specific patterns for more lengths.

We could build this into the PhraseMatcher, so that we can finally support choice of attribute. If a non-ORTH attribute is set, the PhraseMatcher would construct the temporary doc internally.

Suggested usage / user-facing API for now:

matcher = PhraseMatcher(nlp.vocab, attribute='POS')
matcher.add('PATTERN', None, nlp(u"I like cats"))  # kinda useless example, but you get the idea

matches = matcher(nlp(u"You love dogs"))
assert len(matches) == 1

"We could build this into the PhraseMatcher, so that we can finally support choice of attribute." This would make the Rule based Matcher Explorer much more user friendly for non-linguists (i.e. buddhist monks creating spellchecking rules for Tibetan). Type a real sentence, convert it to a pattern, tweak it till you love it.

I hope I can pickle the PhraseMatcher soon. It's quite inefficient to add millions of rules every time.

So can we expect the PhraseMatcher to become 'pickle-able' in the near future? I think the feature is essential for reusing a huge matcher.

@ngawangtrinley actually, is the Rule-based Matcher Explorer something that can be run locally, or is it just an online demo? I think even for non-Buddhist Tibetan developers it would be a time and sanity saver. It would be nice if it could be run on-prem. Maybe this should be a separate ticket.

@matanster The UI is mostly just an online demo, but the back-end code is available here:
https://github.com/explosion/spacy-services/tree/master/matcher

@ines thanks for the code reference! It's actually the UI that might be interesting to reuse, whatever code repository it sits in.

It looks like the documentation for these changes are not added. Is this officially added to the master repo?

It looks like the documentation for these changes are not added. Is this officially added to the master repo?

No, this issue is actually just the proposal and planning document for the new features (and for discussion and feedback). The new matcher API and pattern options aren't implemented yet – only the new algorithm and efficiency improvements (which should be on develop).


Is there any update on #2042?

1. Can we remove PhraseMatcher labels as we can do in the Matcher?
2. I have a requirement where I need to update and remove some of the phrase labels and their values. Is there any way to do this?
3. I have to create multiple PhraseMatchers for multiple things. If I create a bank base language class, will it affect my pre-trained language models and the language vocab which I downloaded while installing spaCy?

I was curious if there was a feature planned to enable matching on parse tree structure. I would love to be able to match subtree spans based on a combination of deps and heads.

@jessecoleman Yes, see here: #2836 πŸŽ‰

Hi, first of all thanks for this package. It works great!
And yes, is there any update on the issue of pickling PhraseMatcher instances? It would be great to have that. Or has it already been merged into the current build? I can't seem to pickle my PhraseMatcher. I'm working with spaCy v2.0.16.

Is there any update on this issue? I'm specifically looking forward to the set membership and regex functionality mentioned here. I am in the process of building rules (an additional ~8.6K) and getting rid of creating custom flags or token entities would make things much cleaner.
@ines, if there's something blocking this from being released, I'd be glad to take it up. I don't see any blocker here.

Very interested in set membership (IN) to match custom-defined noun phrases such as ((NN|ADJ)+ NN). Hope it will make it in soon!

Progress can be followed here: #3173

Basic tests are passing πŸŽ‰

Hi, I want to use spaCy for recursive rules with '_' as suggested in https://github.com/explosion/spaCy/issues/1971#issuecomment-365251161. The difference is that I want to set the values for the extensions myself, but spacy.matcher.Matcher does not take them into account.

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.load('en')

# define extension 'target' that is later used in a pattern
Token.set_extension('target', default=0)

# parse sentence
doc = nlp('A sentence with five words.')

# set different value for the extension manually
doc[0]._.set('target',1)

# find words where the 'target' extension value is 1
matcher = Matcher(nlp.vocab)
matcher.add('test', None, [ {'_': {'target': 1} } ])

print(matcher(doc)) 
# [] 
# matches nothing

Later on, I want to set custom extension 'test' to 1 for all matched words, and exploit it in later patterns.

What am I doing wrong? Am I misunderstanding the concept of custom extensions?
Please comment, if you need more info.

I am thrilled about your work here (not only here but on spacy and prodigy in general!) and happy to see that it is now included in spacy-nightly release!

In the following example you want to match on United States

# match custom attribute values with regular expressions
pattern = [{'_': {'country': {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'}}}]

But in order for this to work, you first have to merge any occurrence of "United States" into one token, right? But how do you do that if that's what you are actually looking for?

Yes, this is also one of my fav new features! To answer your question: In this example, the pattern is matching on an underscore attribute, i.e. token._.country (mostly to show that this is possible). So you might have a previous process that adds country names to certain tokens, based on some other rules.

In this example, it does assume that the tokens are merged:

pattern = [{'ORTH': {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'}},
           {'LOWER': 'president'}]

This could be from a scenario where you merge all entity spans in the doc.ents, so you'll end up with one token for each predicted entity, including "United States", "US" and so on. And then, later in the pipeline, you want to find this sequence and do something else with it. You could also have a matcher component that uses a list of all countries in the world and their spelling variations, and then merges those into one token. And in a later step, you might still want to use a regular expression to only extract specific sequences of tokens.
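A sketch of that "merge first, match later" scenario (assuming the model predicts "United States" as an entity with exactly those boundaries; entity boundaries depend on the model):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The United States president gave a speech.')

# merge each predicted entity span into a single token
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)

matcher = Matcher(nlp.vocab)
matcher.add('US_PRESIDENT', None,
            [{'ORTH': {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'}},
             {'LOWER': 'president'}])

print([doc[start:end].text for _, start, end in matcher(doc)])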

Ah yes that was actually the example I meant to put there, my mistake. Is the following result not a bug?

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
pattern = [
    {'ORTH': '('},
    {'LIKE_NUM': True},
    {'LOWER': {'IN': ['m', 'million', 'bn', 'billion']}, 'OP': '?'},
    {'ORTH': ')'}]
matcher = Matcher(nlp.vocab)
matcher.add('MONEY_PARENTHESES', None, pattern)
for text in ['(30,0 m)', '(30,0)']:
    doc = nlp(text)
    matches = matcher(doc)
    print('{}'.format([doc[start:end] for _, start, end in matches]))

This outputs

 # [(30,0 m)]
 # []

while I'd expect

 # [(30,0 m)]
 # [(30,0)]

Unless I do not fully understand the OP keyword? I am trying to let m be optional.

Looks like it, yes πŸ€” Added a test for it in a9bf5d9.

Looks like it, yes πŸ€” Added a test for it in a9bf5d9.

Also noted a bug where it returns the same exact match twice. Right now my example is rather big but I might be able to make it minimal. Do you want that?

Another cool feature would be to have priorities on the PhraseMatcher. If I look for phrases like Net Sales and Sales and I have a text Net sales was EUR 10 million, then I want to return Net Sales and not Sales. I can add my own logic on the overlapping matches but I could imagine that others have the same challenge.

I would like to ask whether the following is a bug or not, and how to solve or work around it.

We have a pattern with a custom entity [{"_": {"Anatomy_DE": True}}] and we need to set an OP operator.
And we get strange results with half of the operators.

Here are results:
{'_': {'Anatomy_DE': 'True'}} - 7 matches
{'_': {'Anatomy_DE': 'True'}, 'OP':'!'}- 0 matches
{'_': {'Anatomy_DE': 'True'}, 'OP':'?'} - 7 matches
{'_': {'Anatomy_DE': 'True'}, 'OP':'+'} - KeyError: "[E018] Can't retrieve string for hash '5764616310165154416'."
{'_': {'Anatomy_DE': 'True'}, 'OP':'*'} - KeyError: "[E018] Can't retrieve string for hash '3458773292271136006'."

@mr-bjerre Yes, a simple test case that produces a duplicate match would be helpful, too!

@tomkomt This is interesting! Do you have a simple, reproducible example? You don't have to share your whole code or custom component – maybe just a sentence where you set the Anatomy_DE attribute manually?

Here you go

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.load('en_core_web_sm')  # any pipeline works here, only the vocab is needed
doc = nlp('Hello John Doe!')
matcher = Matcher(nlp.vocab)
Token.set_extension('optional', default=False)
matcher.add('JOHN', None, [{'ORTH': 'Doe'}, {'ORTH': '!', 'OP': '?'}, {'_': {'optional': True}, 'OP': '?'}, {'ORTH': '!', 'OP': '?'}])
assert len(matcher(doc)) == 1

Btw. I'm the same person as nix411 on prodigy forum - just an fyi :-)

Hello @ines, I have something similar to mr-bjerre.

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token

nlp = spacy.load('de', disable=['ner'])
doc = nlp('neblige aktivsten zehenverletzungs: auf proximaler Nervus trochlearis')
Token.set_extension('Anatomy_DE', default=False)
Doc.set_extension('Anatomy_DE', default=[])
[token._.set('Anatomy_DE', 'True') for token in Span(doc, 0, 1)]
[token._.set('Anatomy_DE', 'True') for token in Span(doc, 6, 8)]
matcher = Matcher(nlp.vocab)
matcher.add('Anatomy_DE_', None, [
    {'_': {'Anatomy_DE': 'True'}, 'OP':'?'}
])
matches = matcher(doc)
assert len(matches) == 8

I can reproduce results from Friday on this code.

I think that one is solved. I used the python3 installed from the pkg file instead of the one installed through brew, and it works now.

Found another bug. It seems to ignore the second pattern (the reversed pattern) in

import spacy
from spacy.matcher import Matcher


text = 'The amount is EUR 10 which is the same as 10 EUR.'
nlp = spacy.load('en_core_web_sm')

pattern = [
    {'LOWER': {'IN': ['eur', ]}}, {'LIKE_NUM': True},
]
matcher = Matcher(nlp.vocab)
matcher.add('MONEY', None, pattern, list(reversed(pattern)))

doc = nlp(text)
list(matcher(doc))

If I change pattern to

pattern = [
    {'LOWER': 'eur'}, {'LIKE_NUM': True},
]

then it works.

Found another one related to merging of Span and custom attributes when spans overlap.

import spacy
from spacy.tokens.token import Token

nlp = spacy.load('en_core_web_sm')
doc = nlp('β€œNet sales and result reached record levels for a second quarter.β€œ')

# manual matches Net sales and sales
matches = [(0, 1, 3), (1, 2, 3)]
Token.set_extension('is_matched', default=False)
spans = []  # collect the matched spans here
for match_id, start, end in matches:
    spans.append(doc[start:end])

# merge spans and set token attribute.
for span in spans:
    span.merge()  # merge
    for token in span:
        token._.is_matched = True

assert len([t for t in filter(lambda t: t._.is_matched, doc)]) == 1

When do you plan to release the next nightly?

@mr-bjerre Thanks for the thorough testing! I've added an xfailing test for your first example.

The second problem you describe is more of a problem with merging overlapping spans. After that first loop, the indices of the next span have changed, because parts of it have been merged into the previous token already. You can see this if you print the token texts for each span in the loop. There's not really an easy answer for this, so spaCy should raise an error, which it can't do with Span.merge, because it doesn't know what's coming / what previously happened.

However, if you're using the new Doc.retokenizer, spaCy is able to merge the spans in bulk and raises an error describing the problem:

doc = nlp('β€œNet sales and result reached record levels for a second quarter.β€œ')
spans = [Span(doc, 1, 3), Span(doc, 2, 3)]
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)
ValueError: [E102] Can't merge non-disjoint spans. 'sales' is already part of tokens to merge

Yes I realize that was the problem. However I thought that was the reason to collect the spans in a list in the first place in this example.

I suppose I will use Doc.retokenizer and catch ValueError exceptions and ignore those.

EDIT

I suppose that won't work since it's a bulk change. You are saying there are no easy workarounds?

Hi @ines,
I'm totally thrilled by the new matcher functionalities and am already using it a whole lot :-)
Currently I'm experiencing some unexpected Matcher behaviour when attempting to match on multiple extensions for the same token.

import spacy
from spacy.tokens import Token, Span
from spacy.matcher import Matcher

common_prefix = "feature"

nlp = spacy.load('de', disable=['ner'])
n = 2

print("Set extensions prior to pipeline call.")
for i in range(n):
    Token.set_extension(common_prefix + str(i), default=str(i))
Token.set_extension("hardcode", default="value")

doc = nlp("Das ist ein kurzer Beispielsatz.")

print("Extension values:")
for token in doc:
    print("Token:", token.text)
    for idx in range(n):
        print("    " + str(idx) + ": " + token._.get(common_prefix + str(idx)))
#    print("    hardcode: " + token._.get("hardcode"))
print("Available token extensions: {}.".format(doc[0]._.token_extensions.keys()))
print("----------------------------------")

matcher = Matcher(nlp.vocab)
pattern = [{"_": {common_prefix + str(i): str(i) for i in range(n)}}]*3
print("pattern for MYLABEL: {}".format(pattern))
# Comment to exclude this pattern from matching.
matcher.add("MYLABEL", None, pattern)

pattern = [{'_': {"hardcode": "value"}}]
print("Hardcoded pattern: {}".format(pattern))
# Commenting-out following line leads to no matches of MYLABEL pattern!
matcher.add("Hardcoded", None, pattern)

matches = matcher(doc)

print("Found {} matches.".format(len(matches)))
for match_id, start, end in matches:
    label = (doc.vocab.strings[match_id])
    span = Span(doc, start, end, label=match_id)
    print("Entity " + label + " matched by parsing rule at (" + str(start) + "," + str(end) + "): " + str(span.text))

This code yields matches of the MYLABEL pattern. However, when you comment out the line matcher.add("Hardcoded", None, pattern), the matcher will not yield any matches. It seems as if the different patterns are affecting each other.

Do you have an idea what could cause this behaviour?

@mr-bjerre Well, there's just not an easy answer for how this should be handled by default. If your tokens are A, B and C and your merging specifies A + B = AB, and B + C = BC, that's a state that's not achievable. One token ABC is one possible solution – but if you need that, you'll have to specify the merging like that explicitly.

Each span exposes its start and end token index, so you can write your own filter function that checks whether spans overlap and if so, only choose the longest span for merging. (That's mostly just an algorithmic question and not really specific to spaCy or NLP.)
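A minimal sketch of such a filter, keeping only the longest span when spans overlap (later spaCy versions ship a similar helper as spacy.util.filter_spans):

def filter_overlapping_spans(spans):
    # sort longest first (ties: earlier span wins), then greedily keep non-overlapping spans
    result = []
    seen_tokens = set()
    for span in sorted(spans, key=lambda s: (s.end - s.start, -s.start), reverse=True):
        if not any(i in seen_tokens for i in range(span.start, span.end)):
            result.append(span)
            seen_tokens.update(range(span.start, span.end))
    return sorted(result, key=lambda s: s.start)

# e.g. filter a matcher's output before merging (assumes matcher and doc already exist)
spans_to_merge = filter_overlapping_spans([doc[start:end] for _, start, end in matcher(doc)])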

@paulrinckens Thanks for the report and for testing the nightly! πŸ‘ If you have time, do you think you'd be able to break your example down into a simpler function with the minimum needed to reproduce the problem? It's a bit hard to follow with all the scaffolding code and programmatically generated extensions. Edit: Okay, I think I came up with something similar: see 3af0b2d.

@ines Yes this seems as a similar issue. However, the issue I was describing occurs when referring to multiple extension attributes on one individual token in the matcher pattern.

Here is a more reduced test:

import spacy
from spacy.tokens import Token
from spacy.matcher import Matcher

nlp = spacy.load('de', disable=['ner'])

Token.set_extension("ext_a", default="str_a")
Token.set_extension("ext_b", default="str_b")

doc = nlp("Das ist Text")
matcher = Matcher(nlp.vocab)
pattern = [{"_": {"ext_a": "str_a", "ext_b": "str_b"}}]*3
matcher.add("MY_LABEL", None, pattern)
matches = matcher(doc)

assert len(matches) == 1

@paulrinckens Thanks! While writing the test, I actually uncovered a memory error (leading to a segmentation fault), so there's definitely something wrong under the hood here. I suspect there might only be one or two small issues that end up causing all of the reported problems here, since they're all quite similar.

Hello @ines ,

I think I have a similar issue to the one @paulrinckens describes here, and similar to what I had a few days ago.
I have multiple token extensions and they don't seem to be working together. I can't understand why. I can also reproduce this issue with a single extension.

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token

pipeline = spacy.load('de', disable=['ner'])

matcher = Matcher(pipeline.vocab)
pattern = [
    {'ORTH': 'Es'},
    {'ORTH': 'knallt'},
    {'_': {'Extension_1': 'True'}, 'OP':'*'},
    {'_': {'Extension_2': 'True'}, 'OP':'*'}
]
patterns = []
patterns.append(pattern)
matcher.add('1234ABCD', None, *patterns)

doc = pipeline(u"Es knallt und und und knapp")

Token.set_extension('Extension_1', default=False)
Token.set_extension('Extension_2', default=False)
Doc.set_extension('Extension_1', default=[])
Doc.set_extension('Extension_2', default=[])
[token._.set('Extension_1', 'True') for token in Span(doc, 2, 3)]
[token._.set('Extension_1', 'True') for token in Span(doc, 3, 4)]
[token._.set('Extension_1', 'True') for token in Span(doc, 4, 5)]
[token._.set('Extension_2', 'True') for token in Span(doc, 5, 6)]

matches = matcher(doc)
for match_id, start, end in matches:
    try:
        span = Span(doc, start, end, label=match_id)
        print("Entity with match_id " + str(match_id) + " matched by parsing rule at (" + str(start) + "," + str(end) + "): " + str(span.text))
    except (ValueError, KeyError) as error:
        print(error)
        pass

When I run this code, output is following:

[E084] Error assigning label ID 9223380818676074630 to span: not in StringStore.
Entity with match_id 15832469886398960032 matched by parsing rule at (0,3): Es knallt und
[E084] Error assigning label ID 9223380818676074630 to span: not in StringStore.
Entity with match_id 15832469886398960032 matched by parsing rule at (0,4): Es knallt und und
[E084] Error assigning label ID 9223380818676074630 to span: not in StringStore.
[E084] Error assigning label ID 9223380818676074630 to span: not in StringStore.
[E084] Error assigning label ID 9223380818676074630 to span: not in StringStore.

And when I exclude Extension_2 and use the text "Es knallt und und und", I have the following output:

Entity with match_id 15832469886398960032 matched by parsing rule at (0,2): Es knallt
Entity with match_id 15832469886398960032 matched by parsing rule at (0,3): Es knallt und
Entity with match_id 15832469886398960032 matched by parsing rule at (0,4): Es knallt und und
[E084] Error assigning label ID 13835066821962556849 to span: not in StringStore.

It seems like the last "und" is lost or not recognized.

Do you have an idea what could be wrong?

Another cool feature would be to add additional operators like {..., 'OP': [1, 2, 3]}, allowing for matches between one and three times, instead of only having * and + for multiple matches.

It would also be very cool to allow for an optional sequence of tokens. Imagine '(4,424)' is an optional sequence of tokens in the following text: SEK 4,876 (4,424) million. It would be cool if something like the following were possible.

text = 'SEK 4,876 (4,424) million'
patterns = [
    [
        {'LIKE_NUM': True, 'OP': '+'},
        {'SEQUENCE': [
            {'ORTH': '('},
            {'LIKE_NUM': True, 'OP': '+'},
            {'ORTH': ')'},
        ], 'OP': '?'},
        {'LOWER': {'IN': ['million', 'billion']}, 'OP': '?'},
    ]
]
matcher = Matcher(nlp.vocab)
matcher.add('MATCH_TESTER', None, *patterns)
doc = nlp(text)

print_matches(doc, matcher(doc))

EDIT

It might be better to have a pipeline merging (4,424) into one Token and set a custom attribute like token._.is_money_parentheses and then

patterns = [
    [
        {'LIKE_NUM': True, 'OP': '+'},
        {'_': {'is_money_parentheses': True}, 'OP': '?'},
        {'LOWER': {'IN': ['million', 'billion']}, 'OP': '?'},
    ]
]

What are the differences in performance? Is it better to FIRST find a new custom attribute and THEN use that in later matches OR is it better to find the matches in one go (using more patterns)?

Sorry for the spam today, but an OR feature would be absolutely great, e.g.:

text = 'The current should be matched: USD 410 as well as: $410'
patterns = [
    [
        {'OR': [
            {'LOWER': 'usd'},
            {'NORM': '$'},
        ]},
        {'LIKE_NUM': True, 'OP': '+'},
    ]
]

At the moment you have to add a whole new pattern for each alternative, so the number of patterns grows quickly when you have a lot of those cases.

EDIT

I suppose you could add USD etc. to norm exceptions, but I'm not sure that is the best solution? And does this require me to create a new language and then also train a new model? Hopefully there is an easier way?

It might be better to have a pipeline merging (4,424) into one Token and set a custom attribute like token._.is_money_parentheses

Yes, I guess that's what I would have suggested. There's definitely a point at which patterns become very difficult to read, validate and maintain – so even if what you describe was possible, you'd probably want to break the logic into two steps: first, retokenize the document so that the tokens match your definition of "words" or "logical units", then match based on these units.

Performance-wise, matching a single token based on a simple equality comparison of one attribute is probably better than matching a sequence of tokens with multiple operators. But it likely doesn't matter that much. If you really want the ultimate efficiency, you should encode all your custom attributes as binary flags – but again, it might not actually be noticeable or worth it.
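For completeness, the binary-flag route mentioned above would look roughly like this under the v2.x API (a sketch; the currency list is made up, and it relies on custom flag IDs being usable as pattern keys, as discussed in #1567):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# register a custom binary flag on the vocab; the getter is evaluated once per lexeme
IS_CURRENCY_WORD = nlp.vocab.add_flag(lambda text: text.lower() in ('usd', 'eur', 'dkk', 'sek'))

matcher = Matcher(nlp.vocab)
matcher.add('MONEY', None, [{IS_CURRENCY_WORD: True}, {'LIKE_NUM': True}])

doc = nlp(u'The deal is worth USD 410 according to the report.')
print([doc[start:end].text for _, start, end in matcher(doc)])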

I suppose you could add USD etc. to norm exceptions, but I'm not sure that is the best solution? And does this require me to create a new language and then also train a new model? Hopefully there is an easier way?

The Token.norm and Token.norm_ attributes are writable so you can just overwrite their values at runtime. Alternatively, this might also be a good use case for a custom extension attribute. The getter could look up the token text in a dictionary and default to the regular norm:

def get_custom_norm(token):
    return my_custom_norm_dict.get(token.text, token.norm_)

That was clever with the getter! But there seems to be a bug. It might be related to the other bugs.

custom_norm_exceptions = {
    "dkk": "$",
}

Token.set_extension('norm', getter=lambda t: custom_norm_exceptions.get(t.text.lower(), t.norm_), force=True)

text = 'DKK 4,876 (4,424) million. Another example here $410 and € 415'

I have doc[0]._.norm == '$' as expected, BUT the following pattern returns nothing:

patterns = [
    [
        {'_': {'norm': '$'}},
    ],
]
matcher = Matcher(nlp.vocab)
matcher.add('MATCH_TESTER', None, *patterns)

Of course I could also just have

Token.set_extension('is_currency', getter=lambda t: t._.norm == '$', force=True)

but how is that performance-wise compared to the Matcher with on_match?

@mr-bjerre Well, there's just not an easy answer for how this should be handled by default. If your tokens are A, B and C and your merging specifies A + B = AB, and B + C = BC, that's a state that's not achievable. One token ABC is one possible solution – but if you need that, you'll have to specify the merging like that explicitly.

Each span exposes its start and end token index, so you can write your own filter function that checks whether spans overlap and if so, only choose the longest span for merging. (That's mostly just an algorithmic question and not really specific to spaCy or NLP.)

I agree with you @ines. I made a function that takes care of the overlapping spans so I get ABC as I intend to.

I suppose this shouldn't throw a KeyError though, right?

text = 'DKK 4,876 (4,424) million. Another example here $410.'
nlp = spacy.load('en')
doc = nlp(text)
spans = [doc[0:2], doc[10:12]]

with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

I get

  File "_retokenize.pyx", line 73, in spacy.tokens._retokenize.Retokenizer.__exit__
  File "_retokenize.pyx", line 225, in spacy.tokens._retokenize._bulk_merge
KeyError: 0

@mr-bjerre Yes, this seems to be the same retokenization issue as #3288 (also see my regression test that shows the underlying problem). I think the other problem you reported looks like it's related to the other matcher issue as well. Matching on custom attributes isn't reliable yet in the current nightly. (We already tracked this down, though.)

Alright, thanks for the update. Does "tracked down" mean solved? What about the issue with the ? operator? That one is causing me a lot of trouble. Let me know if I can be of any help.

Btw. @ines is it correct that norm and norm_ are not writable in v2.0.18?

token.norm_ = 'new_norm'

yields error

AttributeError: attribute 'norm_' of 'spacy.tokens.token.Token' objects is not writable

That should be fixed in v2.1 --- if you try spacy-nightly it should work.

Hi @ines

I think I found another bug in the new Matcher. When using regex on custom features with the OP attribute, I get matches that I did not expect to get.

See this code snippet:

import spacy
from spacy.tokens import Token, Span
from spacy.matcher import Matcher

nlp = spacy.load('de', disable=['ner'])
doc = nlp("Das ist Text")

Token.set_extension("a", default="False")
doc[0]._.set("a","x")
doc[1]._.set("a","y")

matcher = Matcher(nlp.vocab)
pattern = [{'_': {'a': {'REGEX': 'x'}}}, {'_': {'a': {'REGEX': 'y'}}, 'OP': '*'}]
matcher.add("MYLABEL", None, pattern)

# Expect matches "Das" and "Das ist"
assert(len(matcher(doc)) == 2)

However, the matches obtained from the Matcher are:
"Das"
"Das ist"
"Text"

Do you have an idea what could have caused this last obviously incorrect match "Text"?

EDIT: This seems to be fixed in 2.1.0a10

spacy version = 2.0.12

Some unexpected behavior with Matcher:

Issue 1

The matcher seems to match {'POS': 'ADJ'} but not {'TEXT': {'NOT_IN': ['good', 'bad']}} in the first token.

pattern =  [{'TEXT': {'NOT_IN': ['good', 'bad']}, 'POS': 'ADJ'}, {'POS': 'NOUN'}] 
doc = nlp('I knew a nice boy and a famous girl.')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
matcher.add('test', None, pattern)
matches = matcher(doc)
matches
Out[47]: [(1618900948208871284, 3, 5), (1618900948208871284, 7, 9)]
for m in matches:
    print(doc[m[1]: m[2]])
    
nice boy
famous girl
[(t.i, t, t.pos_, t.tag_, t.dep_) for t in doc]
Out[54]: 
[(0, I, 'PRON', 'PRP', 'nsubj'),
 (1, knew, 'VERB', 'VBD', 'ROOT'),
 (2, a, 'DET', 'DT', 'det'),
 (3, nice, 'ADJ', 'JJ', 'amod'),
 (4, boy, 'NOUN', 'NN', 'dobj'),
 (5, and, 'CCONJ', 'CC', 'cc'),
 (6, a, 'DET', 'DT', 'det'),
 (7, famous, 'ADJ', 'JJ', 'amod'),
 (8, girl, 'NOUN', 'NN', 'conj'),
 (9, ., 'PUNCT', '.', 'punct')]

Issue 2

LOWER (as opposed to TEXT) triggers a TypeError.

p = [{'LOWER': {'NOT_IN': ['poorly', 'surly']}}, {'POS': 'NOUN'}] 
mat = Matcher(nlp.vocab)
mat.add('text_lower', None, p)
Traceback (most recent call last):
  File "/Users/jlee/chegg/wtai-classifiers/venv2/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-59-fff92a5a3524>", line 1, in <module>
    mat.add('text_lower', None, p)
  File "matcher.pyx", line 273, in spacy.matcher.Matcher.add
  File "matcher.pyx", line 100, in spacy.matcher.init_pattern
TypeError: an integer is required

I tried updating to the latest version of spaCy (and then to spacy-nightly), but the same thing seemed to happen.


This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.