Support Bio NER using Stanza processor

Question

Piyush13y opened this issue 2 years ago · comments

Is your feature request related to a problem? Please describe.
We currently have stanza_processor implemented in the forte-wrappers repo, which supports these components: tokenize, pos, lemma, depparse. However, we want to add NER functionality to our processor as well. Stanza itself supports Bio NER, and we can leverage that for our use case.

You can refer to the tutorial linked below on using Bio NER with Stanza. Once you understand those examples, all you have to do is incorporate them into the Forte structure. If you go through stanza_processor, you will see how the other functionalities work through stanza.Pipeline(). You can follow the same design to add the NER component to the processor.

Link: https://stanfordnlp.github.io/stanza/biomed_model_usage
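
For reference, the usage described in that tutorial can be sketched roughly as below. The "mimic" package and the "i2b2" NER model are the names used in the Stanza biomedical documentation; the helper names extract_entities and run_bio_ner are made up for this illustration, and the first call to run_bio_ner downloads the models, so it needs network access.

```python
from typing import Any, Dict, List, Tuple

# Settings from the Stanza biomedical tutorial: the "mimic" package for the
# syntactic processors, with the clinical "i2b2" model for NER.
BIO_NER_KWARGS: Dict[str, Any] = {
    "lang": "en",
    "package": "mimic",
    "processors": {"ner": "i2b2"},
}


def extract_entities(doc: Any) -> List[Tuple[str, str]]:
    """Return (text, type) pairs for every entity span in a Stanza document."""
    return [(ent.text, ent.type) for ent in doc.entities]


def run_bio_ner(text: str) -> List[Tuple[str, str]]:
    """Download the biomedical models if needed and run Bio NER on `text`."""
    import stanza  # imported lazily so the helpers above work without Stanza

    stanza.download("en", package="mimic", processors={"ner": "i2b2"})
    nlp = stanza.Pipeline(**BIO_NER_KWARGS)
    return extract_entities(nlp(text))
```

Inside a Forte processor, the same keyword arguments would presumably be passed to stanza.Pipeline() during initialization, and each extracted span would be written to the data pack as an entity annotation.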

Also, you can refer to the NegationContextAnalyzer processor for help with writing the code according to Forte principles.

Lastly, stanza_processor is defined in the forte-wrappers repository, so the changes will be made in that repo, and the PR can link this issue.

Tianyang Liu · Answer 1 · Mon Jun 13 2022 00:54:32 GMT+0800 (China Standard Time)

I made two changes:

First, I added support for basic NER (not Bio NER) in StanfordNLPProcessor.
For example, it will identify "Forte" as an "ORG".

Second, I added a new processor named StandfordNLPBioNERProcessor.
It takes only one line to use: pipeline.add(StanfordNLPBioNERProcessor())
It runs all the processing steps, including tokenize and pos, based on "mimic" or "i2b2", specifically for the medical domain.

With this processor, "appendicitis" is identified as a "PROBLEM".

I have not opened a pull request yet. Does what I added meet your requirements?

P.S. All tests pass.

Tianyang Liu · Answer 2 · Mon Jun 13 2022 00:58:45 GMT+0800 (China Standard Time)

Or maybe I should change the name StandfordNLPBioNERProcessor?

Tianyang Liu · Answer 3 · Mon Jun 13 2022 01:09:23 GMT+0800 (China Standard Time)

By the way, while I was debugging, some folders and files seem to have been generated automatically, such as downloaded models and resources.json. Should I delete these before I open the PR?

Piyush Yadav · Answer 4 · Mon Jun 13 2022 06:38:09 GMT+0800 (China Standard Time)

Hi,
The aim of this issue is to add the BioNER component to the existing stanza_processor in forte-wrappers rather than to create a new one in ForteHealth. You can look at the file here: https://github.com/asyml/forte-wrappers/blob/main/src/stanza/fortex/stanza/stanza_processor.py
You have to add your changes to this file in the forte-wrappers repository to enable the BioNER component for our pipelines through Stanza. Let me know if you have any questions.
And yes, you can leave all the generated files out of this PR.

Tianyang Liu · Answer 5 · Mon Jun 13 2022 09:29:49 GMT+0800 (China Standard Time)

Hi Piyush,
I did modify the file you pointed to, but I made two kinds of changes:

Added NER support to the class StanfordNLPProcessor in stanza_processor.py
Added a new class, StanfordNLPBioNER, to support Bio NER in stanza_processor.py

That is, stanza_processor.py now has two classes:

StanfordNLPProcessor for basic preprocessing and basic NER
StanfordNLPBioNERProcessor for Bio NER, with all models based on mimic or i2b2

Do these make sense? : )

You can check it on my GitHub: https://github.com/Leolty/forte-wrappers/blob/main/src/stanza/fortex/stanza/stanza_processor.py

Piyush Yadav · Answer 6 · Tue Jun 14 2022 11:04:34 GMT+0800 (China Standard Time)

@Leolty I added this comment to your PR too; adding it here as well for consistency.
As the issue already mentions, we want to support Bio NER through the existing stanza_processor. You can try making it modular if you wish to enable both Bio and non-Bio NER, but for this issue we want to incorporate BioNER into the existing stanza_processor class. We can think about branching it out in the future if need be.
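
One way the modular option could look, as a rough sketch only: a single helper that turns the processor's configs into arguments for stanza.Pipeline() and switches between general and Bio NER. The config keys ner_mode, bio_package and bio_ner_model and the function name build_stanza_kwargs are hypothetical, invented for this illustration; they are not existing forte-wrappers options.

```python
from typing import Any, Dict


def build_stanza_kwargs(configs: Dict[str, Any]) -> Dict[str, Any]:
    """Translate processor configs into keyword arguments for stanza.Pipeline.

    The keys `ner_mode`, `bio_package` and `bio_ner_model` are hypothetical,
    introduced only for this sketch.
    """
    kwargs: Dict[str, Any] = {
        "lang": configs.get("lang", "en"),
        "use_gpu": configs.get("use_gpu", False),
    }
    # Normalize the comma-separated processor list, e.g. "tokenize, pos,ner".
    requested = [
        name.strip()
        for name in configs.get("processors", "").split(",")
        if name.strip()
    ]

    if "ner" in requested and configs.get("ner_mode") == "bio":
        # Bio NER: follow the Stanza biomedical tutorial, where the package
        # supplies the other processors and only the NER model is overridden.
        # Limitation of this sketch: the requested list is not used to trim
        # the processors that the package loads.
        kwargs["package"] = configs.get("bio_package", "mimic")
        kwargs["processors"] = {"ner": configs.get("bio_ner_model", "i2b2")}
    else:
        # General-purpose behavior: pass the processor list through unchanged.
        kwargs["processors"] = ",".join(requested)
    return kwargs
```

With something like this, the existing class would keep a single code path, calling stanza.Pipeline(**build_stanza_kwargs(configs)) once during initialization, and Bio NER would be an opt-in config setting rather than a separate processor class.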