asyml / ForteHealth

The project is in the incubation stage and still under development. ForteHealth is a flexible and powerful ML workflow builder for biomedical and clinical scenarios. This is part of the CASL project: http://casl-project.ai/


Support Bio NER using Stanza processor

Piyush13y opened this issue

Is your feature request related to a problem? Please describe.
We currently have stanza_processor implemented in the forte-wrappers repo, which supports these components: tokenize, pos, lemma, depparse. However, we want to add NER functionality to our processor as well. Stanza itself supports Bio NER, so we can leverage that for our use case.

You can refer to the tutorial linked below on using Bio NER with Stanza. Once you understand these examples, all you have to do is incorporate them into the Forte structure. If you go through stanza_processor, you will see how the other functionalities work through stanza.Pipeline(). You can follow the same design to add the NER component to the processor.

Link: https://stanfordnlp.github.io/stanza/biomed_model_usage
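For orientation, here is a minimal sketch of the Stanza-side usage that tutorial describes, before any Forte integration. It assumes the "mimic" package with the "i2b2" NER model (the combination discussed later in this thread). The helper names load_bio_ner_pipeline and entity_pairs are hypothetical and are not part of Stanza or forte-wrappers.

```python
from typing import Any, List, Tuple


def load_bio_ner_pipeline(package: str = "mimic", ner_model: str = "i2b2") -> Any:
    """Download (if needed) and build a Stanza pipeline with a biomedical NER model.

    Hypothetical helper: the name and default arguments are assumptions
    made for this sketch.
    """
    # Imported lazily so entity_pairs() below can be used without Stanza installed.
    import stanza

    stanza.download("en", package=package, processors={"ner": ner_model})
    return stanza.Pipeline("en", package=package, processors={"ner": ner_model})


def entity_pairs(doc: Any) -> List[Tuple[str, str]]:
    """Return (text, type) for each entity on a Stanza-style document.

    Only relies on doc.entities, where each entity has .text and .type.
    """
    return [(ent.text, ent.type) for ent in doc.entities]


# Usage (downloads models on first run):
#   nlp = load_bio_ner_pipeline()
#   doc = nlp("The patient was diagnosed with appendicitis.")
#   print(entity_pairs(doc))
```

Inside a Forte processor, the (text, type) pairs would be written into the data pack as entity annotations rather than printed. That step is left out here because it depends on the existing stanza_processor code.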

Also, you can refer to the NegationContextAnalyzer processor for help writing the code according to Forte principles.

Lastly, stanza_processor is defined in the forte-wrappers repository, so the changes will be made in that repo, and the PR can link to this issue.

I made two changes:

First, I added basic NER support (not Bio NER) to StanfordNLPProcessor.
For example, it identifies "Forte" as an "ORG".

Second, I added a new processor named StandfordNLPBioNERProcessor.
It can be used with a single line: pipeline.add(StanfordNLPBioNERProcessor())
It runs all the processing steps, including tokenize and pos, using models based on "mimic" or "i2b2" that are specific to the medical domain.

With this processor, "appendicitis" is identified as a "PROBLEM".

I have not opened a pull request yet. Does what I added meet the requirements you gave?

P.S. All tests pass.

Or maybe I should rename StandfordNLPBioNERProcessor?

By the way, while I was debugging, some folders and files seem to have been generated automatically, such as downloaded models and resources.json. Should I delete these before I open the PR?

Hi,
The aim of this issue is to add the Bio NER component to the existing stanza_processor in forte-wrappers, not to create a new one in ForteHealth. You can look at the file here: https://github.com/asyml/forte-wrappers/blob/main/src/stanza/fortex/stanza/stanza_processor.py
You have to add your changes to this file in the forte-wrappers repository to enable the Bio NER component for our pipelines through Stanza. Let me know if you have any questions.
And yes, you can leave all the generated files out of this PR.

Hi Piyush,
I did modify the file you pointed to, but I made two kinds of changes:

  1. Support NER in the class StanfordNLPProcessor in stanza_processor.py
  2. Add a new class StanfordNLPBioNER to support Bio NER in stanza_processor.py

That is, stanza_processor.py now contains two classes:

  1. StanfordNLPProcessor for basic preprocessing and basic NER
  2. StanfordNLPBioNERProcessor for Bio NER, with all models based on mimic or i2b2.

Do these make sense? : )

You can check it on my GitHub: https://github.com/Leolty/forte-wrappers/blob/main/src/stanza/fortex/stanza/stanza_processor.py

@Leolty I added this comment to your PR too; I'm adding it here as well for consistency.
As the issue already mentions, we want to support Bio NER through the existing stanza_processor. You can try making it modular if you wish to enable both Bio and non-Bio NER, but for this issue we want to incorporate Bio NER into the existing stanza_processor class. We can think about branching it out in the future if need be.
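One way to read the "modular" suggestion is a single processor whose configuration selects between general and biomedical NER. The sketch below only shows how such a config could be turned into keyword arguments for stanza.Pipeline. The config keys bio_ner, bio_package and bio_ner_model, and the function stanza_pipeline_kwargs, are hypothetical. They are not taken from the actual stanza_processor, so treat this as an illustration under those assumptions.

```python
from typing import Any, Dict

# Hypothetical defaults for this sketch; the real processor's config may differ.
DEFAULT_CONFIG: Dict[str, Any] = {
    "lang": "en",
    "processors": "tokenize,pos,lemma,depparse",
    "bio_ner": False,          # hypothetical switch for biomedical NER
    "bio_package": "mimic",    # package named in this thread
    "bio_ner_model": "i2b2",   # NER model named in this thread
}


def stanza_pipeline_kwargs(config: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a processor config into keyword arguments for stanza.Pipeline."""
    cfg = {**DEFAULT_CONFIG, **config}
    if cfg["bio_ner"]:
        # Biomedical path: the package provides the syntactic models and
        # the NER model is named explicitly.
        return {
            "lang": cfg["lang"],
            "package": cfg["bio_package"],
            "processors": {"ner": cfg["bio_ner_model"]},
        }
    # General path: pass the comma-separated processor list through, normalized.
    names = [p.strip() for p in cfg["processors"].split(",") if p.strip()]
    return {"lang": cfg["lang"], "processors": ",".join(names)}
```

With this shape, the existing class keeps a single code path for building the pipeline (for example stanza.Pipeline(**stanza_pipeline_kwargs(config))), and Bio NER is enabled by a config flag rather than by a second class.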