asyml / ForteHealth

The project is in the incubation stage and still under development. ForteHealth is a flexible and powerful ML workflow builder for biomedical and clinical scenarios. This is part of the CASL project: http://casl-project.ai/


Add support for Coreference Resolution

Piyush13y opened this issue

Is your feature request related to a problem? Please describe.
Coreference resolution errors are among the most frequently mentioned challenges in information extraction from the biomedical literature. We plan to add coreference resolution support to our pipeline through a CoreferenceProcessor, and this issue should help you get the implementation started.

Describe the solution you'd like
We will develop a wrapper around Hugging Face's NeuralCoref library to suit our use case and leverage its pretrained model for coreference resolution. NeuralCoref is built on spaCy and uses a neural network model in the backend. The GitHub repo for the NeuralCoref project is:
https://github.com/huggingface/neuralcoref

This blog post by Hugging Face describes their coreference resolution approach in more detail:
https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30

Please read the GitHub repository README and the blog post, as they will help with implementing the wrapper around NeuralCoref.
The NeuralCoref model is trained on an English, non-biomedical corpus. The ontologies pertaining to this issue have already been defined, i.e. CoreferenceGroup, EntityMention, MedicalEntityMention. CoreferenceGroup currently works with EntityMention members, and we might have to translate/merge those into MedicalEntityMention for our medical pipeline.

doc._.coref_clusters <=> CoreferenceGroup
doc._.coref_clusters[1].mentions <=> EntityMentions

(Building and Generating Ontologies documentation)

As described in the GitHub repository, if doc._.has_coref is True, doc._.coref_clusters returns a list of all coref clusters, each of which corresponds to one CoreferenceGroup. NeuralCoref mentions are all Span objects, so it is straightforward to define EntityMentions/MedicalEntityMentions from them. These can then be used as the members of a CoreferenceGroup.
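
The cluster-to-group mapping can be sketched without NeuralCoref installed. This is a minimal illustration under stated assumptions: Mention, Cluster, EntityMention and CoreferenceGroup below are hypothetical stand-in dataclasses that keep only character offsets and member lists. They are not the real spaCy, NeuralCoref or Forte classes. In the actual processor the clusters would come from doc._.coref_clusters and the annotations would be added to the data pack.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Mention:
    """Stand-in for a spaCy Span: only character offsets are kept."""
    start_char: int
    end_char: int


@dataclass
class Cluster:
    """Stand-in for a NeuralCoref cluster: a list of mentions."""
    mentions: List[Mention]


@dataclass(frozen=True)
class EntityMention:
    """Hypothetical simplified annotation with character offsets."""
    begin: int
    end: int


@dataclass
class CoreferenceGroup:
    """Hypothetical simplified group holding its member mentions."""
    members: List[EntityMention] = field(default_factory=list)


def clusters_to_groups(clusters: List[Cluster]) -> List[CoreferenceGroup]:
    """Create one group per cluster and one EntityMention per mention."""
    groups = []
    for cluster in clusters:
        group = CoreferenceGroup()
        for mention in cluster.mentions:
            group.members.append(
                EntityMention(mention.start_char, mention.end_char)
            )
        groups.append(group)
    return groups


text = "My sister has a dog. She loves him."
clusters = [
    Cluster([Mention(0, 9), Mention(21, 24)]),    # "My sister", "She"
    Cluster([Mention(14, 19), Mention(31, 34)]),  # "a dog", "him"
]
groups = clusters_to_groups(clusters)
```

Because only character offsets are carried over, the same loop would apply if MedicalEntityMention were used for the members instead.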

Regarding the processor config, the user can provide values for greedyness, max_dist, max_dist_match, blacklist, etc. These parameters are described in the GitHub repository README, which can be consulted for more details.

Example call:

pl.add(
    CoreferenceProcessor(),
    {
        "lang": "en_core_web_sm",
        "greedyness": 0.75,
        "max_dist": 50,
        "max_dist_match": 500,
    },
)
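
How user-supplied values might be combined with defaults can be sketched as follows. This is only an illustration: DEFAULT_CONFIG and resolve_config are hypothetical names, not part of Forte or NeuralCoref, and the default values shown are assumptions that should be checked against the NeuralCoref README. Only the parameter names are taken from that README.

```python
from typing import Any, Dict

# Hypothetical defaults; verify against the NeuralCoref README before use.
DEFAULT_CONFIG: Dict[str, Any] = {
    "lang": "en_core_web_sm",
    "greedyness": 0.5,
    "max_dist": 50,
    "max_dist_match": 500,
    "blacklist": True,
}


def resolve_config(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay user values on the defaults and reject unknown keys."""
    unknown = set(user_config) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    merged = {**DEFAULT_CONFIG, **user_config}
    if not 0.0 <= merged["greedyness"] <= 1.0:
        raise ValueError("greedyness must be between 0 and 1")
    return merged


config = resolve_config({"greedyness": 0.75, "max_dist": 50})
```

Rejecting unknown keys is a design choice in this sketch: it catches a misspelling such as "greediness" early instead of silently ignoring it.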

We must also ensure that neuralcoref is installed along with forte-medical, so it will have to be added to setup.py and the requirements files.

Also, make sure you add unit test cases for the processor. You can refer to any of the test files in https://github.com/asyml/ForteHealth/tree/master/tests/forte_medical/processors for reference.

P.S. You can follow the NegationContextAnalyzer processor for the structure and code design. It can serve as a template when implementing a new processor.

Describe alternatives you've considered
We consulted several papers and a couple of GitHub repositories. E2E was another alternative for coreference resolution, but Hugging Face's NeuralCoref seems straightforward to implement, and since we already have spaCy-based processors in our code base, it should be easier to write this wrapper.

@KiaLAN maybe we can work on this together.
@Piyush13y could you please add me to the collaborator list? Zhiting said I should get my hands dirty with this issue.

@Leolty Sounds good.

To create a NeuralCoref object, I need to create a spaCy pipeline first, then add NeuralCoref to that pipeline:

import spacy
import neuralcoref

nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

Currently my test code adds a SpacyProcessor right before the CoreferenceProcessor.

This is where the problems come in:

  1. If CoreferenceProcessor contains its own spaCy pipeline, that pipeline may differ from the one in SpacyProcessor. The tokenization of SpacyProcessor and CoreferenceProcessor may then differ as well. I am not sure whether this is the behavior we want.
  2. Currently my implementation lets CoreferenceProcessor borrow the pipeline from SpacyProcessor, which ensures the tokenization is the same. But I find that the intermediate result cannot be retrieved from SpacyProcessor, so I have to run the pipeline again inside CoreferenceProcessor, which is not very elegant.
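
One way to reason about the tokenization concern in point 1 is to compare character offsets rather than token indices. The sketch below is an illustrative assumption, not the approach taken in the implementation described above. It checks whether a mention's character span lines up with the token boundaries produced by a different tokenizer. covering_tokens is a hypothetical helper, and tokens and mentions are represented as plain (begin, end) tuples rather than real spaCy or Forte objects.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def covering_tokens(tokens: List[Span], mention: Span) -> Tuple[List[int], bool]:
    """Return the indices of tokens overlapping the mention, and whether the
    mention starts and ends exactly on token boundaries."""
    begin, end = mention
    indices = [
        i for i, (tok_begin, tok_end) in enumerate(tokens)
        if tok_begin < end and tok_end > begin
    ]
    if not indices:
        return [], False
    exact = tokens[indices[0]][0] == begin and tokens[indices[-1]][1] == end
    return indices, exact


text = "The patient's fever rose."
# Tokenizer A splits the possessive: The | patient | 's | fever | rose | .
tokens_a = [(0, 3), (4, 11), (11, 13), (14, 19), (20, 24), (24, 25)]
# Tokenizer B keeps it whole: The | patient's | fever | rose | .
tokens_b = [(0, 3), (4, 13), (14, 19), (20, 24), (24, 25)]

mention = (0, 11)  # "The patient", as found with tokenizer A
aligned_a = covering_tokens(tokens_a, mention)
aligned_b = covering_tokens(tokens_b, mention)
```

With tokenizer A the mention ends on a token boundary, while with tokenizer B it cuts through the token "patient's". A check like this would flag mentions that cannot be attached cleanly to tokens from another processor.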

Another behavior I found:

Since NeuralCoref is trained on everyday language, it does not perform well at resolving medical coreference.


It can help identify coref groups related to a person (like the patient in a discharge note). But it would be nice if we could find other models that do better on medical text.

"CoreferenceGroup currently works with EntityMention members, and we might have to translate/merge those as MedicalEntityMention for our medical pipeline"

When I read this, I think @Piyush13y means we need to do coref resolution for medical entities.

The rationale behind this (using a new ontology name instead of EntityMention) is to let this tool use its own set of ontologies, so that its output does not conflict with the output of other tools, like the ones here: https://github.com/asyml/forte-wrappers.

We would certainly like to do better coref on domain-related entities. If you can find existing models, we can set those up too. But if we don't have good alternatives right now, we can use this to resolve some coref chains for the time being.