Examine using spacy

Question

rafelafrance opened this issue 5 years ago

It looks like spacy may handle all of the token and pattern types so far. If this is true and it simplifies my code, then I should replace my parsers with spacy entities, pipelines, etc. Additionally, I'll be able to separate each parser set into its own repository, which will simplify the projects.

Spacy will almost certainly handle label_babel and efloras parsing; however, I do have concerns about vertnet parsing. Testing it on the new bat parsers should settle this one way or another.

Note to self: Having extensive test suites is proving to be critical to this project's success.

The advantages to using spacy are numerous (these are just some of them):

I can leverage a team of experts in NLP who are developing NLP infrastructure full-time
There are numerous resources for tricky NLP problems on the web using spacy
I don't have to reinvent the wheel for everything: word vectors, NLP pipelines, etc.
I can quickly bring to bear many more NLP techniques
Matching on linguistic aspects of words, such as parts of speech
more

The advantages to keeping stacked regex:

I own the code and can change things to suit my needs without hassle
I know it works on my problem set
It's fun to develop. Well, that's not really a "reason", except that it keeps my motivation up to do this

rafe · Answer 1 · Fri Jan 17 2020 19:30:46 GMT+0800 (China Standard Time)

I have completed the vertnet sex trait parser and it works pretty well.

The only major issue is the lack of the ability to tag tokens in rules in a way that is analogous to using regular expression capture groups. The problem has hit other people, and the spacy team has an open issue for it. If they resolve this issue, then the code for probing for tokens in entities will go away and the trait getter functions will be greatly simplified.

See spacy issue #3275, "Sub-pattern labeling for pattern matcher".
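
For anyone reading along who hasn't used them this way, here is roughly what capture groups give me in a plain regular expression. This is a minimal sketch using only Python's re module. The pattern, the group names (key, value), and the parse_sex function are made up for illustration and are not the project's actual parser.

```python
import re

# Hypothetical pattern for a "sex" trait (an assumption for this sketch,
# not taken from the real parsers). The named groups label the parts of
# the match that a trait getter needs.
SEX = re.compile(
    r"\b(?P<key>sex)\s*[:=]?\s*(?P<value>females?|males?)\b",
    re.IGNORECASE,
)


def parse_sex(text):
    """Return the labeled value of the first match and its span, or None."""
    match = SEX.search(text)
    if match is None:
        return None
    return {
        "value": match.group("value").lower(),
        "start": match.start("value"),
        "end": match.end("value"),
    }
```

Because the value is labeled inside the pattern, the getter reads it directly instead of scanning the matched span a second time to find which token is the value. That second scan is the "probing" step that a sub-pattern label in the matcher would remove.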

I'm going to try some more complicated trait parsers now. If I can find a workaround for the above issue then I'll adopt spacy now. If I can't then I'll wait until the above issue is solved.

rafe · Answer 2 · Tue Jan 28 2020 00:26:02 GMT+0800 (China Standard Time)

I'm going to try to add a custom entity (trait) recognizer pipeline step to Spacy to do what I need.

rafe · Answer 3 · Tue Jan 28 2020 23:56:21 GMT+0800 (China Standard Time)

It's probably because I don't understand the appropriate strategies for using the Spacy matchers and rulers (matcher, phrase matcher, and entity ruler), but this isn't working, and not for lack of trying. That's too bad, because Spacy is a stellar project and I learned a lot from reading its code. I still want to circle back to it at a later date.

In the meantime, I'm going to open issues for adding lemmatization, part-of-speech tagging, etc. to the stacked regular expressions. I'm going to need to use tokens as bytes and probably migrate the code to Cython.
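
To make the "tokens as bytes" idea concrete, here is a rough sketch of one way it could work: tag each token with a single character, join the tags into a string, and run a second-level regex over that string so named groups point at token positions. Everything here (the tag letters, TOKEN_RULES, TRAIT, find_sex) is a hypothetical illustration in plain Python, not the existing stacked regex code, and it uses one-character strings rather than real bytes.

```python
import re

# Hypothetical tag set: one character per token category (an assumption
# for this sketch). The first rule that fully matches a token wins.
TOKEN_RULES = [
    ("k", re.compile(r"sex", re.IGNORECASE)),              # key word
    ("v", re.compile(r"females?|males?", re.IGNORECASE)),  # value word
    ("p", re.compile(r"[:=;,]")),                          # punctuation
    ("w", re.compile(r"\w+")),                             # any other word
]


def tokenize(text):
    """Split text into (tag, token) pairs, one tag character per token."""
    tokens = []
    for raw in re.findall(r"\w+|[^\w\s]", text):
        for tag, rule in TOKEN_RULES:
            if rule.fullmatch(raw):
                tokens.append((tag, raw))
                break
        else:
            tokens.append(("x", raw))  # token that no rule recognizes
    return tokens


# Second-level pattern over the tag string. Because every token is exactly
# one character, a match offset in the tag string is also a token index.
TRAIT = re.compile(r"k p? (?P<value>v)", re.VERBOSE)


def find_sex(text):
    """Return the token labeled as the value, or None if no rule matches."""
    tokens = tokenize(text)
    tags = "".join(tag for tag, _ in tokens)
    match = TRAIT.search(tags)
    if match is None:
        return None
    return tokens[match.start("value")][1]
```

In this scheme the tag for a token could in principle come from a lemma or a part-of-speech tagger instead of a regex, which is where the issues mentioned above would plug in. The one-character-per-token constraint is what keeps the offsets aligned.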