ljvmiranda921 / comments.ljvmiranda921.github.io

Blog comments for my personal blog: ljvmiranda921.github.io

spaCy Internals: Rules-based rules!

utterances-bot opened this issue · comments

spaCy has a comprehensive way to define rules for matching tokens, phrases, entities (and more!) to enhance statistical models. In this blog post, I'll share...

https://ljvmiranda921.github.io/notebook/2022/12/25/rules-based-rules/

Hi, thank you for the great article. I'm having a problem running the assemble command with the ruler.cfg file you provided. The error I'm getting is as follows:

✘ Error parsing config section. Perhaps a section name is wrong?
initialize -> components -> span_ruler	Section 'components' is not defined
{'nlp': {'pipeline': ['tok2vec', 'ner', 'span_ruler']}, 'components': {'ner': {'source': '/content/drive/MyDrive/output_spacy/model-best'}, 'span_ruler': {'factory': 'span_ruler', 'spans_key': None, 'annotate_ents': True, 'ents_filter': {'@misc': 'spacy.prioritize_new_ents_filter.v1'}, 'validate': True, 'overwrite': False}, 'tok2vec': {'source': '/content/drive/MyDrive/output_spacy/model-best'}}, 'initialize': {}}

Can you please help?

Hi, sorry about that. I didn't mention that ruler.cfg is just an excerpt; I'll update the post shortly. To see the full config, I suggest looking at the example project instead (it's from a forked PR that we'll merge into the main projects repository very soon).
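
For anyone hitting the same error: the location it reports (initialize -> components -> span_ruler) suggests that the config has an [initialize.components.span_ruler] block but no parent [initialize.components] section. Below is a minimal sketch of a check for that situation. It uses only Python's standard configparser, not spaCy's own config loader, and the EXCERPT text and the patterns path in it are hypothetical stand-ins rather than the actual ruler.cfg.

```python
from configparser import ConfigParser


def missing_parent_sections(cfg_text: str) -> list[str]:
    """Return dotted parent sections that are used but never declared."""
    parser = ConfigParser(interpolation=None)  # leave ${...} values untouched
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(cfg_text)

    declared = set(parser.sections())
    missing = []
    for section in parser.sections():
        parts = section.split(".")
        # Walk every ancestor of e.g. "initialize.components.span_ruler".
        for depth in range(1, len(parts)):
            parent = ".".join(parts[:depth])
            if parent not in declared and parent not in missing:
                missing.append(parent)
    return missing


# Hypothetical excerpt: [initialize.components] is never declared.
EXCERPT = """
[nlp]
pipeline = ["tok2vec", "ner", "span_ruler"]

[components]

[components.span_ruler]
factory = "span_ruler"

[initialize]

[initialize.components.span_ruler]
patterns = "hypothetical/patterns.jsonl"
"""

print(missing_parent_sections(EXCERPT))
```

If the list is non-empty, declaring each reported section (even as an empty block) before its children, or starting from the full config in the example project, is the likely fix.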

Hi, thanks for clarifying. Much appreciated :)

commented

Hi :)

Many thanks for this post, as it clarified the use of span_ruler a bit more. I do, however, have some trouble understanding the pipeline architecture when using a span_ruler together with spancat.

I previously used simple TEXT/lower patterns that match whole sentences, with sentencizer both as an annotating component and as a component in the pipeline (["sentencizer","tok2vec","spancat"], in this order). This worked even though I had no [components.span_ruler] in my training config.

I have now used a pattern similar to the one you posted, with an additional ENT_TYPE pattern, and training returns 0.00 on all scoring metrics. Do I need to pass any component to annotating_components = []?

Currently, my pipeline components are: ["tok2vec", "spancat", "span_ruler"] and the span_ruler and spancat components are:

[components.span_ruler]
factory = "span_ruler"
spans_key = "ruler"
validate = true
overwrite = false

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "ruler"
threshold = 0.5

Since data debug finds no issues with my training data, I assume the issue must be with either 1) the order in which my components are initialized or 2) the parameters in the config itself.

Thanks a lot for any help, and apologies for reaching out here instead of on GitHub.

commented

To add to that, the config.cfg in my trained model (with 0.00 scores, so not really trained) looks like this:

[components.span_ruler]
factory = "span_ruler"
annotate_ents = false
ents_filter = {"@misc":"spacy.first_longest_spans_filter.v1"}
matcher_fuzzy_compare = {"@misc":"spacy.levenshtein_compare.v1"}
overwrite = false
phrase_matcher_attr = null
spans_filter = null
spans_key = "ruler"
validate = true

[components.span_ruler.scorer]
@scorers = "spacy.overlapping_labeled_spans_scorer.v1"
spans_key = "ruler"

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "ruler"
threshold = 0.5
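
One detail visible in both configs is that span_ruler and spancat use the same spans_key ("ruler"). Whether that is related to the 0.00 scores is not established in this thread. As a sanity check, here is a minimal sketch that lists which components share a spans_key. It relies only on Python's configparser and json, not on spaCy's config loader, and the CONFIG string simply repeats the two training-config blocks from the first comment.

```python
import json
from collections import defaultdict
from configparser import ConfigParser

# The two component blocks quoted from the training config.
CONFIG = """
[components.span_ruler]
factory = "span_ruler"
spans_key = "ruler"
validate = true
overwrite = false

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "ruler"
threshold = 0.5
"""


def components_by_spans_key(cfg_text: str) -> dict:
    """Group [components.<name>] sections by their spans_key value."""
    parser = ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(cfg_text)

    grouped = defaultdict(list)
    for section in parser.sections():
        parts = section.split(".")
        is_component = len(parts) == 2 and parts[0] == "components"
        if is_component and parser.has_option(section, "spans_key"):
            # Values are JSON-like, so "ruler" decodes to the plain string.
            key = json.loads(parser.get(section, "spans_key"))
            grouped[key].append(parts[1])
    return dict(grouped)


print(components_by_spans_key(CONFIG))
```

Any key that maps to more than one component marks a span group that several components write to.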