Support Bio NER using Stanza processor
Piyush13y opened this issue · comments
Is your feature request related to a problem? Please describe.
We currently have stanza_processor implemented in the forte-wrappers repo, which supports these components: tokenize, pos, lemma, depparse. However, we want to add NER functionality to our processor as well. Stanza itself supports Bio NER, and we can leverage that for our use case.
You can refer to the following link for a tutorial on using Bio NER with Stanza. Once you understand these examples, all you have to do is incorporate them into the Forte structure. If you go through stanza_processor, you will see how the other functionalities work through stanza.Pipeline(). You can follow the same design to add the NER component to the processor.
Link: https://stanfordnlp.github.io/stanza/biomed_model_usage
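For orientation, here is a minimal sketch of the kind of call the linked tutorial describes: loading the mimic package with the i2b2 NER model through stanza.Pipeline(). The helper names bio_ner_pipeline_kwargs and extract_bio_entities are hypothetical and exist only for this illustration. The stanza import is done lazily so the argument-building part can be read and tested without the library or the models installed.

```python
from typing import Any, Dict, List, Tuple


def bio_ner_pipeline_kwargs(package: str = "mimic",
                            ner_model: str = "i2b2") -> Dict[str, Any]:
    """Build the keyword arguments shared by stanza.download and stanza.Pipeline.

    Hypothetical helper: "mimic" and "i2b2" are the package and NER model
    named in this thread; other biomedical models can be passed instead.
    """
    return {"lang": "en", "package": package, "processors": {"ner": ner_model}}


def extract_bio_entities(text: str,
                         package: str = "mimic",
                         ner_model: str = "i2b2") -> List[Tuple[str, str, int, int]]:
    """Run Stanza on `text` and return (text, type, start_char, end_char) per entity."""
    import stanza  # imported lazily; requires `pip install stanza`

    kwargs = bio_ner_pipeline_kwargs(package, ner_model)
    stanza.download(**kwargs)        # fetches the models on first use
    nlp = stanza.Pipeline(**kwargs)  # same arguments select the same models
    doc = nlp(text)
    return [(ent.text, ent.type, ent.start_char, ent.end_char)
            for ent in doc.entities]
```

Inside a Forte processor, the character offsets returned here are what would be needed to create entity annotations on the data pack. That mapping is not shown because it depends on the existing stanza_processor code.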
Also, you can refer to the NegationContextAnalyzer processor to help you write the code according to Forte principles.
Lastly, stanza_processor is defined in the forte-wrappers repository, so the changes will be made in that repo, and the PR can link this issue.
I made two changes:
First, I added support for basic NER in StanfordNLPProcessor (not Bio NER).
For example, it identifies "Forte" as an "ORG".
Second, I added a new processor named StandfordNLPBioNERProcessor,
which needs only one line: pipeline.add(StanfordNLPBioNERProcessor())
It runs all the processing steps, including tokenize/pos, based on "mimic" or "i2b2", which are specific to the medical domain.
With this processor, the pipeline identifies "appendicitis" as "PROBLEM".
I have not opened a pull request yet. Does what I added meet the requirements you gave?
P.S. All tests pass.
Or maybe I should change the name StandfordNLPBioNERProcessor?
By the way, when I was debugging, some folders and files seem to have been generated automatically, such as downloaded models and resources.json. Is it correct that I should delete these before I open the PR?
Hi,
The aim of this issue is to add the BioNER component to the existing stanza_processor in forte-wrappers instead of creating a new one in ForteHealth. You can look at the file here: https://github.com/asyml/forte-wrappers/blob/main/src/stanza/fortex/stanza/stanza_processor.py
You have to add your changes to this file in the forte-wrappers repository to enable the BioNER component for our pipelines through stanza. Let me know if you have any questions.
And yes, you can skip all the generated files for this PR.
Hi Piyush,
I did modify the file you linked, but I made two kinds of changes:
- support NER in the class StanfordNLPProcessor in stanza_processor.py
- add a new class StanfordNLPBioNER to support Bio NER in stanza_processor.py
That is, in stanza_processor.py I provide two classes:
- StanfordNLPProcessor for basic pre-processing and basic NER
- StanfordNLPBioNERProcessor for Bio NER, with all the models based on mimic or i2b2.
Do these make sense? : )
You may check it in my github: https://github.com/Leolty/forte-wrappers/blob/main/src/stanza/fortex/stanza/stanza_processor.py
@Leolty I added this comment to your PR too, adding it here as well for consistency.
As the issue already mentions, we want to support Bio NER through the existing stanza_processor. You can try making it modular if you wish to enable both Bio and non-Bio NER, but for this issue we want to incorporate BioNER into the existing stanza_processor class. We can think about branching it out in the future if need be.
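One way the "modular" option could look inside a single processor is a small config switch that decides which arguments go to stanza.Pipeline(). The sketch below is only an illustration under assumptions: the config keys bio_ner_model and bio_package and the function select_pipeline_kwargs are hypothetical and are not taken from the existing stanza_processor code.

```python
from typing import Any, Dict


def select_pipeline_kwargs(configs: Dict[str, Any]) -> Dict[str, Any]:
    """Choose stanza.Pipeline arguments for Bio or non-Bio NER from one config.

    Hypothetical keys:
      processors     comma-separated processor names, e.g. "tokenize,pos,ner"
      lang           language code, defaults to "en"
      bio_ner_model  biomedical NER model name such as "i2b2"; unset means general NER
      bio_package    biomedical package for the other processors, defaults to "mimic"
    """
    names = [p.strip() for p in configs.get("processors", "").split(",") if p.strip()]
    lang = configs.get("lang", "en")
    bio_model = configs.get("bio_ner_model")

    if "ner" in names and bio_model:
        # Bio NER: load the biomedical package and override only the NER model.
        return {
            "lang": lang,
            "package": configs.get("bio_package", "mimic"),
            "processors": {"ner": bio_model},
        }

    # General case: pass the requested processors through unchanged.
    return {"lang": lang, "processors": ",".join(names)}
```

With this shape, the same class serves both cases: a config such as {"processors": "tokenize,pos,ner", "bio_ner_model": "i2b2"} selects the biomedical models, while omitting bio_ner_model keeps the current behavior.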