Issues with AbbreviationFinderStep deleting entities

Question

simoond opened this issue 5 months ago · comments

I noticed that kazu changed the input text. I'm curious how it made this mistake and how to prevent it.

To recreate: I passed in a text block that contained the text "The CFIm25 deletion leads to 3’ UTR shortening". The third character of the gene is an uppercase I, not a lowercase l.

When running this through kazu, the output is as shown below. It correctly recognises CFIm25 as a gene, but it can't entity link it because it changed the uppercase I in the input text to a lowercase l.

CFlm25:gene:TransformersModelForTokenClassificationNerStep:959:965

Because it changed the input text, the entity mappings are null

David Twomey · Answer 1 · Sat Mar 02 2024 01:11:49 GMT+0800 (China Standard Time)

I spoke too soon. The text was not changed. The original text contained many instances of the gene correctly written and one typo. Kazu did not pick up the correctly named genes but only picked up the typo.

Here is the full input text. Most of the genes are correct with CFIm25 (uppercase i) and those are not picked up, but the typo CFlm25 (lowercase l) is picked up as a gene, though obviously not entity linked. Any thoughts on why the correct gene was not recognized?

text = "Systemic sclerosis (SSc) is a multi-system, fibrotic disease that affects the skin and a variety of internal organs. Persistent myofibroblast activation and associated excessive extracellular matrix protein (ECMs) deposition are hallmarks of the disease. The mechanisms that account for this excessive fibrotic response remain elusive. Despite its high mortality and morbidity, there are no FDA approved medications for fibrotic complications of SSc. Our recent work indicates that a RNA-processing mechanism known as alternative polyadenylation (APA) is critical for the upregulation of profibrotic genes in dermal fibrosis through 3’ UTR tail shortening of key transcripts. Recent evidence demonstrated that a key regulator of APA is Cleavage factor Im 25 (CFIm25). The CFIm25 deletion leads to 3’ UTR shortening. A truncated 3’ UTR tail will often lack microRNA binding sites and evades the microRNA-mediated gene repression. Our preliminary data indicate CFlm25 is downregulated in fibrotic skin of SSc patients and in murine dermal fibrosis models. Knockdown of CFIm25 in normal skin fibroblasts is sufficient to promote the 3’ UTR shortening of key profibrotic genes. Moreover, the central fibrotic cytokine TGFβ suppresses CFIm25 expression through miR-203 upregulation. Overall, our data uncovered a novel mechanism by which TGFβ mediated CFIm25 downregulation leads to 3’ UTR shortening and the over-production of profibrotic factors and ECMs, and contributes to the pathogenesis of SSc. Based on this preliminary data, the main hypothesis of this project is that TGFβ and miR-203 downregulate CFIm25 in fibroblasts, resulting in dermal fibrosis by upregulation of profibrotic genes and ECMs through 3’ UTR shortening. This hypothesis will be examined in the following specific aims: Aim I: Define the fibroblast specific contribution of CFIm25 depletion in dermal fibrosis murine models. This aim will elucidate the downstream effects of CFIm25 depletion on key fibrotic pathways. 
Aim II: Determine the mechanisms for CFIm25 downregulation and assess their potential as therapeutic targets in dermal fibrosis. The mechanisms for TGFβ mediated miR-203 downregulation and subsequent CFIm25 repression by miR-203 will be elucidated. This aim will characterize the upstream events leading to CFIm25 depletion and will identify potential therapeutic targets. Aim III: Characterize CFIm25 in SSc human skin/fibroblasts and identify fibrotic genes dysregulated by APA. Serial dermal fibroblasts and skin samples from patients with early SSc and matched controls will be examined using a novel RNA sequencing technology. This aim will provide for the first time an unbiased, longitudinal view of CFIm25 mediated APA profile in a fibrotic disease. The proposed research links for the first time the CFIm25 mediated 3’ UTR shortening to dermal fibrosis. This can lead to discovering a key mechanism that amplifies the fibrotic response in SSc and ultimately to identifying an entirely novel target for treatment of persons with this potentially devastating disease."

Elliot Ford · Answer 2 · Mon Mar 04 2024 17:35:11 GMT+0800 (China Standard Time)

Hi, thanks for opening the issue and trying out kazu!

This looks like it would have been a frustrating case to look into, sorry about that.

The summary of what's happening is that 'CFIm25' entities are getting deleted because our AbbreviationFinderStep recognises the phrase 'Cleavage factor Im 25 (CFIm25)' as explaining what the abbreviation CFIm25 means, but we don't have an entity for the 'full text' version "Cleavage factor Im 25", so we delete all future "CFIm25" entities in the text. We don't delete the entity with the lower case 'l' because that's different enough not to get affected by that logic (but still recognised as a gene by our Transformers-based NER model).

Deleting these entities may sound like a strange thing to do in this case, but it often helps remove incorrect entities when an author is using an abbreviation that they define in the text and that isn't standard enough to occur in biomedical ontologies, but 'clashes' with a more standard abbreviation for another entity.

We actually see this behaviour adding value in your text: there’s also the phrase “alternative polyadenylation (APA)” which our AbbreviationFinderStep picks up, and uses in the same way to delete entities we had recognised for APA, which is a synonym of an unrelated gene (according to OpenTargets) but isn’t relevant here.

We also get a case where we’re not deleting entities with “Systemic sclerosis (SSc)” - here we do have an entity for “Systemic sclerosis”, so the AbbreviationFinderStep just makes sure that future mentions of SSc get the same entity class (disease) and are linked to the same ID(s).
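The three cases above (CFIm25 and APA deleted, SSc re-classified) can be illustrated with a minimal sketch. This is not Kazu's actual implementation: the `Entity` class, the `apply_abbreviations` function and its `exclude` parameter are hypothetical names chosen for illustration, the abbreviation definitions are passed in by hand rather than detected from the text, and the initial 'gene' class for SSc is made up to show the re-classification.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entity:
    text: str
    entity_class: str
    start: int


def apply_abbreviations(entities, definitions, exclude=frozenset()):
    """Toy post-processing pass (hypothetical, not Kazu's code).

    definitions maps an abbreviation to the long form found in the text,
    e.g. {"APA": "alternative polyadenylation"}.
    """
    # Entity class of every string that was itself recognised as an entity.
    known = {e.text.lower(): e.entity_class for e in entities}
    result = []
    for ent in entities:
        long_form = definitions.get(ent.text)
        if long_form is None or ent.text in exclude:
            result.append(ent)  # not a defined abbreviation, or excluded
            continue
        long_class = known.get(long_form.lower())
        if long_class is None:
            continue  # long form has no entity: drop the abbreviation
        # Long form is known: give the abbreviation the same class.
        result.append(replace(ent, entity_class=long_class))
    return result


entities = [
    Entity("Systemic sclerosis", "disease", 0),
    Entity("SSc", "gene", 20),  # made-up wrong class, for illustration
    Entity("APA", "gene", 500),
    Entity("CFIm25", "gene", 700),
    Entity("CFlm25", "gene", 959),  # the typo with a lowercase l
]
definitions = {
    "SSc": "Systemic sclerosis",
    "APA": "alternative polyadenylation",
    "CFIm25": "Cleavage factor Im 25",
}

kept = apply_abbreviations(entities, definitions)
print([(e.text, e.entity_class) for e in kept])
# [('Systemic sclerosis', 'disease'), ('SSc', 'disease'), ('CFlm25', 'gene')]
```

Passing `exclude={"CFIm25"}` in this sketch keeps the CFIm25 entity, which is the effect an exclusion list is meant to have.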

The easiest/quickest way for you to work around this issue is to specify the abbreviation as ‘excluded’ from the AbbreviationFinderStep. You can do this using a Hydra ‘override’ (which I’ve written to include the existing ‘default values’ for that field as well). If you’re using code similar to this section of the quickstart, you can just add an argument on the command line:

python no_vc/your_script.py 'AbbreviationFinderStep.exclude_abbrvs=[CFIm25,COPD,NSCLC,mCRC,NHL,DEND]'

If you’re using code more like this section, relying on the Hydra ‘compose’ api, this becomes:

    with initialize_config_dir(version_base=HYDRA_VERSION_BASE, config_dir=str(cdir)):
        cfg = compose(config_name="config", overrides=['AbbreviationFinderStep.exclude_abbrvs=[CFIm25,COPD,NSCLC,mCRC,NHL,DEND]'])

We also have an upcoming KAZU release, in which we can add "Cleavage factor Im 25" as a synonym to resolve this issue for everyone. It might not be for a week or two though.

Another option would be to turn off the AbbreviationFinderStep completely, but as described above, in this text and others, it does improve the output most of the time from our investigations.

Let me know how you get on!

David Twomey · Answer 3 · Mon Mar 04 2024 22:24:32 GMT+0800 (China Standard Time)

Thank you, i will try that. Since mostly i am trying to identify genes and diseases in abstracts, missing out on a gene associated with an abstract is a big deal. How large can the overrides be?

Elliot Ford · Answer 4 · Mon Mar 04 2024 22:36:41 GMT+0800 (China Standard Time)

I haven't tried for sure - if you're thinking about having a very large list, it would be easier to edit the kazu config - inside the model pack, there's a config file conf/AbbreviationFinderStep/default.yaml that currently looks like:

_target_: kazu.steps.document_post_processing.abbreviation_finder.AbbreviationFinderStep
exclude_abbrvs:
  - COPD
  - NSCLC
  - mCRC
  - NHL
  - DEND

and you can either edit it directly, or copy it into a new file in the same folder (e.g. with_overrides.yaml), and then pass it to hydra using an override of AbbreviationFinderStep=with_overrides.

I suspect you could likely get hundreds if not thousands of 'exclusions' in here and it would handle it fine.
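If the exclusion list does grow large, the config file could be generated rather than edited by hand. The following is a minimal sketch under that assumption: `render_abbreviation_config` is a hypothetical helper, not part of Kazu, and it only reproduces the file layout shown above using the standard library.

```python
TARGET = "kazu.steps.document_post_processing.abbreviation_finder.AbbreviationFinderStep"


def render_abbreviation_config(exclusions):
    """Return YAML text in the same layout as the default config shown above."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(exclusions))
    lines = [f"_target_: {TARGET}", "exclude_abbrvs:"]
    lines += [f"  - {abbr}" for abbr in unique]
    return "\n".join(lines) + "\n"


defaults = ["COPD", "NSCLC", "mCRC", "NHL", "DEND"]
print(render_abbreviation_config(["CFIm25", "PCSK9"] + defaults))
```

The returned text could then be written to a file such as with_overrides.yaml next to the default config. Note that entries are written unquoted, so an abbreviation that YAML would read as something other than a string (for example `NO` or a purely numeric value) would need quoting.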

That said, that sounds like a lot of overrides for you to maintain - would it be more helpful if I added an option to the AbbreviationFinderStep to turn off just the behaviour of deleting mentions of abbreviations where the full string is unknown? Equally that would cause wrong entities for 'APA' in your document as described above.

Or maybe I can write a script to help use the AbbreviationFinderStep to see if there are 'missing' synonyms for abbreviations for a set of documents, to then add them as synonyms to the whole Kazu pipeline, which would help identify mentions like 'Cleavage factor Im 25' everywhere?
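A rough idea of what such a 'missing synonym' script might look like is sketched below. It is an assumption-laden illustration, not the script proposed here: `find_long_form` is a simplified right-to-left character match in the style of the Schwartz-Hearst abbreviation algorithm, `missing_long_forms` and `ABBR_PATTERN` are hypothetical names, and `known_synonyms` stands in for whatever synonym list the pipeline actually uses.

```python
import re

# Parenthesised short forms such as "(APA)" or "(CFIm25)".
ABBR_PATTERN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def find_long_form(short, preceding):
    """Match the short form's characters right-to-left in the preceding text."""
    s, l = len(short) - 1, len(preceding) - 1
    while s >= 0:
        ch = short[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Walk left until the character matches; the first character of
        # the short form must also be the first character of a word.
        while l >= 0 and (
            preceding[l].lower() != ch
            or (s == 0 and l > 0 and preceding[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        s -= 1
        l -= 1
    return preceding[l + 1:]


def missing_long_forms(text, known_synonyms):
    """Map each defined abbreviation to its long form if that form is unknown."""
    known = {syn.lower() for syn in known_synonyms}
    missing = {}
    for match in ABBR_PATTERN.finditer(text):
        short = match.group(1)
        words = text[: match.start()].split()
        # Only look at a limited window of words before the parenthesis.
        window = " ".join(words[-min(len(short) + 5, len(short) * 2):])
        long_form = find_long_form(short, window)
        if long_form and long_form.lower() not in known:
            missing.setdefault(short, long_form)
    return missing


sample = (
    "Systemic sclerosis (SSc) is a fibrotic disease. A key regulator of "
    "alternative polyadenylation (APA) is Cleavage factor Im 25 (CFIm25)."
)
print(missing_long_forms(sample, known_synonyms={"Systemic sclerosis"}))
# {'APA': 'alternative polyadenylation', 'CFIm25': 'Cleavage factor Im 25'}
```

In this toy run 'alternative polyadenylation' is reported as well, even though deleting APA is the desired behaviour, so the output of a script like this would still need manual review before anything is added as a synonym.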

David Twomey · Answer 5 · Mon Mar 04 2024 22:49:10 GMT+0800 (China Standard Time)

I understand that changing the behaviour of the AbbreviationFinder will make this better but make other things worse. I like your last suggestion though.

If i did want to turn off AbbreviationFinder for testing, is it as easy as commenting out that step in the default.yaml under the pipeline folder?

Elliot Ford · Answer 6 · Mon Mar 04 2024 22:55:00 GMT+0800 (China Standard Time)

Yes exactly, that will turn it off completely. Let me know how you get on, and I can look at working on the script to identify 'missing' synonyms that cause this issue if you think it's going to fit your use case better than the other options (it's very likely a good idea for Kazu to do this ourselves in general, so I'm happy to explore the idea).

David Twomey · Answer 7 · Mon Mar 04 2024 22:57:12 GMT+0800 (China Standard Time)

Here is another example where it doesn't do a great job of finding PCSK9 as a gene. It does tag it as a gene but it doesn't entity link it. Is the AbbreviationFinder the reason in this case as well?

Is the issue that, if the very first mention of a gene symbol appears in parentheses after the full name, and the full name is not in the ontology, then every later mention of that gene symbol is removed?

text = "PROJECT SUMMARY Over 30% of the adult population of the United States has elevated levels of low-density lipoprotein cholesterol (LDL-C), a condition that is correlated with an increased risk of coronary heart disease and stroke. Lifestyle modifications and treatment with statins can be sufficient for the treatment of mildly elevated LDL-C, but a substantial percentage of patients on statins fail to meet recommended LDL-C goals. Thus, new approaches are needed to control LDL-C. Proprotein convertase subtilisin/kexin type 9 (PCSK9) is a molecule that modulates expression of the LDL receptor (LDL-R). Naturally occurring mutations that reduce the activity of PCSK9 are associated with decreased LDL-C levels and reduced risk of cardiovascular disease. More recently, it has been shown in clinical trials that PCSK9-targeted monoclonal antibodies (mAbs) can dramatically reduce LDL-C levels. The goal of this proposal is develop an active vaccination strategy to target PCSK9, as an alternative to mAb therapy. To do this we will use a virus-like particle (VLP) nanoparticle platform, which we have used previously to elicit high-titer antibody responses against PCSK9 and other self- antigen targets. In Aim 1 we will engineer VLP-based vaccines targeting different epitopes in PCSK9, and compare their immunogenicity and ability to reduce lipid levels in mice. In Aim 2 we will test candidate vaccine in hypercholesterolemic and atherosclerotic mouse models. In Aim 3 we will assess the immunogenicity and functionality of our lead PCSK9-VLP vaccine in non-human primates, and test its compatibility with statins. The long-term goal of this research is to generate effective vaccines that target human PCSK9 and reduce LDL-C, as a novel vaccine-based therapeutic treatment for heart disease."

Elliot Ford · Answer 8 · Mon Mar 04 2024 23:11:17 GMT+0800 (China Standard Time)

Hi, yes you're exactly right, it's the AbbreviationFinderStep here also - you're right about the issue description too.

The mention of 'PCSK9' that it does tag as a gene, it looks like it's the string 'human PCSK9' that is the whole 'hit' for the entity, so that's why one of them doesn't get removed.

David Twomey · Answer 9 · Mon Mar 04 2024 23:19:14 GMT+0800 (China Standard Time)

Let me know if i understand this.

The Abbreviation finder sees the text "Proprotein convertase subtilisin/kexin type 9 (PCSK9)". It assumes PCSK9 is an abbreviation for Proprotein convertase subtilisin/kexin type 9. First though, it looks up "Proprotein convertase subtilisin/kexin type 9" in the entities list but doesn't find it, and therefore it removes all mentions of PCSK9.

My question then is why is "Proprotein convertase subtilisin/kexin type 9" not in the entity list since it's the approved name for PCSK9 https://www.genecards.org/cgi-bin/carddisp.pl?gene=PCSK9

Thank you

Elliot Ford · Answer 10 · Mon Mar 04 2024 23:23:54 GMT+0800 (China Standard Time)

hmm, I agree that is strange - sorry I should have checked that. I can see it's a synonym in open targets too. It looks like it was ruled out in Kazu at some point, possibly by a heuristic. Let me check with one of my colleagues and get back to you. It's possible our upcoming release will resolve this, as it updates our 'curations' which decide on which synonyms from the ontologies we use should be treated as too noisy for string matching. On the face of it, I can't see why this synonym would suffer from that.

David Twomey · Answer 11 · Mon Mar 04 2024 23:24:20 GMT+0800 (China Standard Time)

btw: i did confirm that over-riding 'PCSK9' in AbbreviationFinder fixes this so it does confirm the issue.

PCSK9,100,gene,['PCSK9'],['ENSEMBL'],['ENSG00000169174'],"{Mapping(default_label='PCSK9', source='ENSEMBL', parser_name='OPENTARGETS_TARGET', idx='ENSG00000169174', string_match_strategy='ExactMatchMappingStrategy', string_match_confidence=<StringMatchConfidence.HIGHLY_LIKELY: 'HIGHLY_LIKELY'>, disambiguation_confidence=None, disambiguation_strategy='disambiguation_not_required', xref_source_parser_name=None, metadata={'dbXRefs': [], 'approvedName': 'proprotein convertase subtilisin/kexin type 9', 'annotation_score': 7, 'data_origin': '23.09'})}"

David Twomey · Answer 12 · Mon Mar 04 2024 23:26:48 GMT+0800 (China Standard Time)

Thank you for looking in to this

Elliot Ford · Answer 13 · Wed Mar 06 2024 18:29:25 GMT+0800 (China Standard Time)

I've just checked with my colleague, and the synonym "Proprotein convertase subtilisin/kexin type 9" will be included for the next release of Kazu - sorry about that!

I'm also hoping we might be able to try my idea of detecting 'missing synonyms' using the AbbreviationFinderStep sometime soon.

David Twomey · Answer 14 · Thu Mar 07 2024 00:46:54 GMT+0800 (China Standard Time)

Sounds good. Do you think they could quickly double check that all the HGNC approved full gene names are there?
I'm sure PCSK9 and CFIm25 are not the only instances of this.

Thank you

Elliot Ford · Answer 15 · Thu Mar 07 2024 00:50:17 GMT+0800 (China Standard Time)

Hi,

Yes, that's a good idea, we'll do that before the next release! I'll leave this issue open until that's been done.

David Twomey · Answer 16 · Thu Mar 07 2024 01:21:41 GMT+0800 (China Standard Time)

It's also interesting that this issue is not on the BERN2 server. But they probably don't use AbbreviationFinder and they combine protein and gene. Just curious

Elliot Ford · Answer 17 · Thu Mar 07 2024 01:43:32 GMT+0800 (China Standard Time)

Yes, that is interesting: looks like this must also be a case where our distilled version of BERN is performing worse than the full-size model.

In theory it's possible to configure TransformersModelForTokenClassificationNerStep to take the BERN2 model - but in practice, we haven't tried this recently using a gpu and have refactored since the last time we have - KAZU is ultimately a cpu-first framework as it stands, so there may be some pain there. Unless you're seeing other cases where BERN2 performs better I would recommend waiting for the new release where we will mitigate this issue with the AbbreviationFinderStep and synonym-based matching.

David Twomey · Answer 18 · Thu Mar 07 2024 21:25:44 GMT+0800 (China Standard Time)

Hi Elliot, one more thing related to AbbreviationFinder. In that same text above, it comes across VLP in this sentence first
"To do this we will use a virus-like particle (VLP) nanoparticle platform" but it doesn't flag it as an abbreviation. Therefore, later in the text when it comes across this text "we will engineer VLP-based vaccines targeting", it erroneously thinks VLP is a gene with high confidence.

Elliot Ford · Answer 19 · Sat Mar 09 2024 00:45:16 GMT+0800 (China Standard Time)

Hi, sorry this took me some time to look into: unfortunately it's caused by the spacy tokenisation here: we search for the token 'VLP', but spacy has the tokens 'VLP-based' and 'PCSK9-VLP'. It seems like there are two options here for improvement:

1. Tweak the spacy tokenisation behaviour we're using to handle these cases specifically, but then there's always some risk of misalignment of tokens.
2. Rather than checking for 'VLP' as a complete token using a spacy Matcher, just check whether any entity matches have exactly the same string match. I think the only reason we aren't doing this is a 'hangover' of the AbbreviationFinder code coming from scispacy, and not having access to all the information Kazu has at that point. I think this would be a good change to make, but will take some refactoring (there's some risk I'm missing something in my re-reading of the code that makes this approach non-viable).
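The difference between the two checks can be shown with a small sketch. It uses plain whitespace splitting as a stand-in for the spacy tokenisation, and a hand-written list of entity strings as a stand-in for NER output, so both are assumptions rather than what Kazu or spacy actually produce.

```python
sentence = "we will engineer VLP-based vaccines targeting different epitopes in PCSK9"
abbreviation = "VLP"

# Check 1: look for the abbreviation as a complete token.
# Whitespace splitting is only a stand-in for the real tokeniser.
tokens = sentence.split()
token_hits = [tok for tok in tokens if tok == abbreviation]

# Check 2: compare against the text of each recognised entity instead.
# These entity strings are hypothetical NER output for this sentence.
entity_texts = ["VLP", "PCSK9"]
entity_hits = [text for text in entity_texts if text == abbreviation]

print(token_hits)   # [] because the token is 'VLP-based', not 'VLP'
print(entity_hits)  # ['VLP']
```

In this toy setup the token-based check finds nothing, while the comparison on entity text finds the mention that sits inside 'VLP-based'.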

David Twomey · Answer 20 · Thu May 02 2024 00:33:06 GMT+0800 (China Standard Time)

Hi Elliot,
Just wondering if there are plans for an updated release anytime soon

Thank you

Elliot Ford · Answer 21 · Thu May 02 2024 00:36:17 GMT+0800 (China Standard Time)

Hi,

Sorry, the release has been slowed by lower availability of myself and the other core developer than expected, plus some more difficulty finalising the release than expected.

I'm hopeful the release will be next week - but since I over-promised and under-delivered last time, let's say sometime this month should be a realistic timeframe.

sorry about that!

Best,

Elliot

David Twomey · Answer 22 · Fri Jun 07 2024 22:45:47 GMT+0800 (China Standard Time)

Hi Elliot,

I'm trying out the new 2.0 release and i notice it's not doing as good a job of mapping to a MONDO ontology term. Is there a change in the config that turned this off?

Example:
Version 1.5.1
PD,100,disease,"['Parkinson disease', 'Parkinson disease']","['MONDO', 'MONDO']","['MONDO_0005180', 'http://purl.obolibrary.org/obo/MONDO_0005180']","{Mapping(default_label='Parkinson disease', source='MONDO', parser_name='OPENTARGETS_DISEASE', idx='MONDO_0005180', string_match_strategy='ExactMatchMappingStrategy', string_match_confidence=<StringMatchConfidence.HIGHLY_LIKELY: 'HIGHLY_LIKELY'>, disambiguation_confidence=<DisambiguationConfidence.HIGHLY_LIKELY: 'HIGHLY_LIKELY'>, disambiguation_strategy='PreferDefaultLabelMatchDisambiguationStrategy', xref_source_parser_name=None, metadata={'dbXRefs': ['Orphanet:319705', 'OMIMPS:168600', 'ICD9:332.0', 'SCTID:49049000', 'MESH:D010300', 'EFO:0002508', 'UMLS:C0030567', 'ICD9:332', 'NCIT:C26845', 'NIFSTD:birnlex_2098', 'DOID:14330'], 'data_origin': '23.09'}), Mapping(default_label='Parkinson disease', source='MONDO', parser_name='MONDO', idx='http://purl.obolibrary.org/obo/MONDO_0005180', string_match_strategy='ExactMatchMappingStrategy', string_match_confidence=<StringMatchConfidence.HIGHLY_LIKELY: 'HIGHLY_LIKELY'>, disambiguation_confidence=<DisambiguationConfidence.HIGHLY_LIKELY: 'HIGHLY_LIKELY'>, disambiguation_strategy='PreferDefaultLabelMatchDisambiguationStrategy', xref_source_parser_name=None, metadata={'data_origin': '2023-09-12'})}"

Version 2.0.0
PD,100,disease,,,,set()