allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.

Home Page: https://allenai.github.io/scispacy/

Filtering CUI/TUI returned entities?

ddofer opened this issue

When doing NER/NEL to UMLS/CUI entities, is there any way to configure the nlp pipe to exclude candidates by a predefined filtering list of CUIs or TUIs? e.g. to exclude any detected CUIs with TUI: T079 (Temporal Concept)?

Currently I'm doing it by post-hoc filtering, which is inelegant and inefficient, and doesn't help remove noisy detections. That is, if the linker returns only the first detected entity from a text, then post-hoc filtering to remove the TUI means I miss the relevant entities.

Current code extract:

`from itertools import compress  # applies the boolean TUI mask below

# nlp and icu_feature_terms are defined elsewhere (not shown in this extract)
nlp.add_pipe("scispacy_linker",
             config={"resolve_abbreviations": True,
                     "linker_name": "umls",
                     "max_entities_per_mention": 4,  # 5
                     "threshold": 0.87,  # default is 0.8, paper mentions 0.99 as thresh
                     })
# ...

EXCLUDE_TUIS_LIST = ["T079", "T093"]  # UMLS semantic types (TUIs) to exclude

novel_cols_candidates_names = []
no_entities_list = []

novel_candidate_cuis = []
novel_candidate_cuis_nomenclatures = []
TUIs_list = []

# The pipe does not change between documents, so fetch it once
linker = nlp.get_pipe("scispacy_linker")

for f in icu_feature_terms["name"]:
    print(f)
    doc = nlp(f)

    if len(doc.ents) > 0:
        for j, entity in enumerate(doc.ents):
            print(f"Entity #{j}:{entity}")

            # Candidate CUIs for this mention; kb_ents holds (cui, score) pairs
            list_feature_cuis = [i[0] for i in entity._.kb_ents]

            # TUI filter: keep a candidate only if its first semantic type is not excluded
            tui_filter_mask = [linker.kb.cui_to_entity[c][3][0] not in EXCLUDE_TUIS_LIST
                               for c in list_feature_cuis]
            list_feature_cuis = list(compress(list_feature_cuis, tui_filter_mask))

            # Canonical names of the candidates, filtered with the same mask
            list_cuis_nomenclatures = [linker.kb.cui_to_entity[i[0]][1] for i in entity._.kb_ents]
            list_cuis_nomenclatures = list(compress(list_cuis_nomenclatures, tui_filter_mask))

            num_candidates = len(list_feature_cuis)
            for c in list_feature_cuis:
                TUIs_list.append(linker.kb.cui_to_entity[c][3][0])

            # Extend once per entity so that all output lists stay the same length
            novel_cols_candidates_names.extend([f] * num_candidates)
            novel_candidate_cuis.extend(list_feature_cuis)
            novel_candidate_cuis_nomenclatures.extend(list_cuis_nomenclatures)

    else:
        no_entities_list.append(f)
        print(f"No Entity candidates for {f}")

`

Hi, this is not something that exists right now, although it is a reasonable feature request if you want to give implementing it a go! Otherwise, I recommend continuing with what you are doing: post hoc filtering, with the threshold set so that you still get enough candidates after filtering.
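
For anyone landing here later, the recommended workaround could be sketched roughly as follows: ask the linker for more candidates than you need, drop those with an excluded TUI, and only then truncate to the number you want. This is a minimal sketch, not part of scispacy. The helper name `filter_candidates`, the `cui_to_tuis` dictionary, and the toy CUIs are all hypothetical; the only assumption taken from the thread is that `entity._.kb_ents` holds `(cui, score)` pairs. Unlike the extract above, it rejects a candidate if any of its TUIs is excluded, not just the first.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Candidate = Tuple[str, float]  # (CUI, score), the shape of entity._.kb_ents


def filter_candidates(
    kb_ents: Sequence[Candidate],
    cui_to_tuis: Dict[str, List[str]],
    exclude_tuis: Iterable[str],
    top_k: int,
) -> List[Candidate]:
    """Drop candidates that carry any excluded TUI, then keep the top_k by score."""
    excluded = set(exclude_tuis)
    kept = [
        (cui, score)
        for cui, score in kb_ents
        # A CUI with no known types is kept; change this if you prefer to drop it
        if not excluded.intersection(cui_to_tuis.get(cui, []))
    ]
    # Sort explicitly rather than relying on the order of the input
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Hypothetical stand-ins for linker output and for the types stored in the linker's KB
toy_kb_ents = [("C0000001", 0.95), ("C0000002", 0.91), ("C0000003", 0.88)]
toy_types = {"C0000001": ["T079"], "C0000002": ["T047"], "C0000003": ["T033"]}

result = filter_candidates(toy_kb_ents, toy_types, ["T079", "T093"], top_k=2)
print(result)  # [('C0000002', 0.91), ('C0000003', 0.88)]
```

With a real pipeline, `kb_ents` would be `entity._.kb_ents` and `cui_to_tuis` could be built from `linker.kb.cui_to_entity`, taking index `[3]` of each entry as the extract above does. Setting `max_entities_per_mention` well above `top_k` gives the filter room to work, so that a temporal concept ranked first no longer crowds out the relevant candidates.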