rafelafrance / traiter

Extract information from natural history annotations

Examine using spacy

rafelafrance opened this issue · comments

commented

It looks like spacy may handle all of the token and pattern types so far. If this is true and it simplifies my code, then I should replace my parsers with spacy entities, pipelines, etc. Additionally, I'll be able to separate each parser set into its own repository, which will simplify the projects.

Spacy will almost certainly handle label_babel and efloras parsing; however, I do have concerns about vertnet parsing. Testing it on the new bat parsers should settle this one way or another.

Note to self: Having extensive test suites is proving to be critical to this project's success.

The advantages to using spacy are numerous (these are just some of them):

  • I can leverage a team of experts in NLP who are developing NLP infrastructure full-time
  • There are numerous resources for tricky NLP problems on the web using spacy
  • I don't have to reinvent the wheel for everything: word vectors, NLP pipelines, etc.
  • I can quickly bring to bear many more NLP techniques
  • I can match on linguistic aspects of words, such as parts of speech
  • and more

The advantages to keeping stacked regex:

  • I own the code and can change things to suit my needs without hassle
  • I know it works on my problem set
  • It's fun to develop. That's not really a "reason", except that it keeps my motivation up to do this

commented

I have completed the vertnet sex trait parser and it works pretty well.

The only major issue is the lack of the ability to tag tokens in rules in a way that is analogous to using regular expression capture groups. The problem has hit other people, and the spacy team has an open issue for it. If they resolve this issue, then the code for probing for tokens in entities will go away and the trait getter functions will be greatly simplified.

See spacy issue #3275, "Sub-pattern labeling for pattern matcher".
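
For reference, here is a minimal sketch of what regular expression capture groups provide, using only Python's standard `re` module. The pattern, the group names (`label`, `value`), and the `get_sex` function are hypothetical illustrations and are not taken from the traiter code. The point is that a named group gives direct access to the sub-match, so nothing has to probe the matched span afterwards.

```python
import re

# Hypothetical pattern for a sex annotation; the group names are illustrative only.
SEX = re.compile(
    r"\b(?P<label>sex)\s*[:=]?\s*(?P<value>females?|males?|[fm])\b",
    re.IGNORECASE,
)


def get_sex(text):
    """Return the sex value and its character span, or None if no match."""
    match = SEX.search(text)
    if match is None:
        return None
    # The named group points straight at the part of the match we care about.
    return {
        "value": match.group("value").lower(),
        "start": match.start("value"),
        "end": match.end("value"),
    }
```

A rule-based token matcher without sub-pattern labels returns only the span of the whole match, so the equivalent of `match.group("value")` has to be recovered by walking the tokens in that span.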

I'm going to try some more complicated trait parsers now. If I can find a workaround for the above issue then I'll adopt spacy now. If I can't then I'll wait until the above issue is solved.

commented

I'm going to try to add a custom entity (trait) recognizer pipeline step to Spacy to do what I need.

commented

It's probably because I don't understand the appropriate strategies for using the Spacy rulers (matcher, phrase matcher, & entity ruler), but this isn't working -- not for lack of trying, tho. Which is too bad, because Spacy is a stellar project & I learned a lot from reading its code. I still want to circle back to it at a later date.

In the meantime, I'm going to open issues for allowing me to use lemmatization, part-of-speech tagging, etc. with the stacked regular expressions. I'm going to need to use tokens as bytes and probably migrate the code to Cython.
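
As a rough illustration of the idea, here is one way a two-layer ("stacked") regular expression could work. This is a sketch under assumptions, not the actual traiter implementation: the first layer labels tokens, each label is encoded as a single character (a plain `str` here rather than bytes, for simplicity), and the second layer is a regex over that encoded string. All names (`TOKEN`, `CODES`, `RULE`, `parse_sex`) are hypothetical.

```python
import re

# First layer: label each token. Patterns and label names are hypothetical.
TOKEN = re.compile(
    r"(?P<label>\bsex\b)"
    r"|(?P<value>\b(?:females?|males?)\b)"
    r"|(?P<sep>[:=])"
    r"|(?P<word>\w+)",
    re.IGNORECASE,
)

# One character per token label, so a token sequence becomes a string.
CODES = {"label": "L", "value": "V", "sep": "S", "word": "w"}

# Second layer: a regex over label codes. Its capture group selects a token.
RULE = re.compile(r"LS?(?P<sex>V)")


def tokenize(text):
    """Return (label, text) pairs; unmatched characters are skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


def parse_sex(text):
    """Return the sex value found by the second-layer rule, or None."""
    tokens = tokenize(text)
    encoded = "".join(CODES[label] for label, _ in tokens)
    match = RULE.search(encoded)
    if match is None:
        return None
    # Offsets in the encoded string are token indices, one character per token.
    return tokens[match.start("sex")][1].lower()
```

Because each token is one character, capture groups in the second-layer rule map directly back to tokens. Attributes such as a lemma or a part of speech could, in principle, be carried by adding more label codes, though that is an assumption about one possible design.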