CogStack / MedCATtrainer

A simple interface to inspect, improve and add concepts to biomedical NER+L -> MedCAT.

Preferred name from incorrect CDB file

sandertan opened this issue

Hi @tomolopolis, another question and/or bug: I've been cleaning up the concept table with Dutch words that we use for creating the MedCAT CDB file. I noticed that when I use a new CDB file, for a new or existing project, the "Concept Summary" can still display the "Name"/preferred name/pretty name of the concept from a different CDB file. I'm not experienced in Vue, so I haven't been able to pinpoint the root of this issue, but I think this is a bug. Edit: The API returns the wrong pretty_name, I'll look into this tomorrow.

Example

In the last screenshot, the preferred name of "pneumothorax" contains "NAO". This is a suffix in some names from Dutch MedDRA, and not useful for entity linking, so I removed this from the concept table and generated a new CDB file. I tested it in a Jupyter notebook with MedCAT, and the issue seems resolved there:
image

Also, in this new CDB I've added a new concept, "methotrexaat", to verify that MCT uses the updated CDB. (I still need to add a TUI to this concept, so don't worry about that.)

In MedCATtrainer, the new concept "methotrexaat" is correctly identified, so I'm sure the updated CDB is in use. But the preferred name still contains "NAO". I suspect this name is retrieved from a different CDB file in the same MedCATtrainer instance.
image

Seems to be caused around here:

pretty_name = ""

I'll try to debug it.

I think the application uses a general CUI lookup table that spans across projects, because the GET request for filling in this "Concept Summary" can only pass the CUI, not the ConceptDB ID: /api/concepts/?cui=C0032326
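The suspected behaviour can be reproduced in a minimal sketch. This is plain Python, not MedCATtrainer's actual code: the `SharedConceptTable` class, its `import_cdb` method, the rule that existing rows are left untouched on import, the exact old pretty name, and the placeholder CUI for "methotrexaat" are all assumptions made for illustration.

```python
# Hypothetical model of a concept table shared by all projects and keyed by CUI only.
class SharedConceptTable:
    def __init__(self):
        self._rows = {}  # cui -> pretty_name

    def import_cdb(self, cdb):
        """Import concepts from a CDB, modelled here as a dict of cui -> pretty_name.

        Assumption: a CUI that already has a row keeps its existing pretty name.
        """
        for cui, pretty_name in cdb.items():
            self._rows.setdefault(cui, pretty_name)

    def get(self, cui):
        # Like GET /api/concepts/?cui=..., the lookup carries no CDB ID.
        return self._rows.get(cui)


# Illustrative data: the old CDB has the "NAO" suffix, the new one does not.
old_cdb = {"C0032326": "pneumothorax NAO"}
new_cdb = {
    "C0032326": "pneumothorax",
    "X0000001": "methotrexaat",  # placeholder CUI, not a real identifier
}

table = SharedConceptTable()
table.import_cdb(old_cdb)
table.import_cdb(new_cdb)

print(table.get("C0032326"))  # pneumothorax NAO
print(table.get("X0000001"))  # methotrexaat
```

Under these assumptions the new concept shows up, yet the pretty name for C0032326 still comes from the first CDB that was imported, which matches what I see in the "Concept Summary".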

Do you think it would make sense to extract the pretty name from the project-specific CDB instead? In our current use case we're experimenting with different concept databases with different preferred names for CUIs, because we have to do quite some preprocessing to get a clean list of Dutch concept names.

Yes, you're correct in thinking the concepts table has a primary key of the concept CUI from a given CDB. The original intention was to have potentially many projects using one CDB's worth of concepts, and therefore not to force folks to import concepts from each CDB per project, but this has been limiting at times.

We could look to improve the concepts table somehow, or the concept pretty name lookup could be improved to alternatively look within the project-specific CDB.
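As a rough sketch of the second option, the lookup could consult the project's own CDB first and only fall back to the shared concepts table when the CUI is missing there. The function name `resolve_pretty_name` and the dict-based stand-ins for the CDB and the concepts table are hypothetical; this is not the existing MedCATtrainer API.

```python
from typing import Mapping, Optional


def resolve_pretty_name(
    cui: str,
    project_cdb: Mapping[str, str],
    shared_table: Mapping[str, str],
) -> Optional[str]:
    """Return the pretty name for a CUI, preferring the project-specific CDB.

    Both arguments are hypothetical stand-ins mapping cui -> pretty_name.
    """
    name = project_cdb.get(cui)
    if name:
        return name
    # Fall back to the table shared across projects.
    return shared_table.get(cui)


# Illustrative data only; the second CUI is a placeholder.
shared_table = {"C0032326": "pneumothorax NAO", "X0000002": "shared-only concept"}
project_cdb = {"C0032326": "pneumothorax"}

print(resolve_pretty_name("C0032326", project_cdb, shared_table))  # pneumothorax
print(resolve_pretty_name("X0000002", project_cdb, shared_table))  # shared-only concept
```

Keeping the fallback would preserve the "many projects, one CDB's worth of concepts" setup, while projects with their own CDB would get their own preferred names.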

Let me know if you would like me to look into adding this change. I'm not sure though if other parts of the application rely on this "one concept table" paradigm as well.

@tomolopolis We can close this one for now. When using different CDB universes with different pretty names, this issue can be solved by setting up multiple MedCATtrainer instances.