CogStack / MedCATtrainer

A simple interface to inspect, improve and add concepts to biomedical NER+L -> MedCAT.

Unclear error message when "Use for training" is unchecked

sandertan opened this issue

I'm getting an unclear error when uploading a new CDB. The container log doesn't show any errors. Is there another way to debug this?

[screenshot of the error message shown in the MedCATtrainer UI]

Hi @sandertan -

  • which version of the trainer (container version) are you using?
  • which CDB are you trying to upload, and how large is it?

Hi @tomolopolis, I cloned it from git (master branch, commit c7d4471, authored Wed Aug 11). The CDB was made with one of the latest versions of MedCAT (spaCy 3) and is 630 MB.

@sandertan thanks - so it looks like the latest container needs to be rebuilt to include the en_core_sci_lg-0.4.0 model that's compatible with spaCy v3.
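As a quick sanity check (a minimal sketch; nothing MCT-specific assumed beyond the model name above), you can confirm which spaCy and scispaCy model versions are installed inside the container:

import spacy

# Print the installed spaCy version and the en_core_sci_lg model version;
# an incompatible pairing will usually fail at load time
print(spacy.__version__)
nlp = spacy.load('en_core_sci_lg')
print(nlp.meta.get('version'))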

We've had issues with Docker Hub lately as they've restricted AutoBuild. Will rebuild now.

The latest versions have been uploaded to Docker Hub.

Thanks @tomolopolis for looking into this. We're building our own image from a slightly modified Dockerfile that includes downloading a Dutch spaCy model (because it performs better for lemmatization). Not much else has changed.
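For illustration, the kind of change involved is essentially just an extra model download, e.g. in Python (the model name nl_core_news_lg is an example Dutch spaCy model, not necessarily the one we use):

import spacy.cli

# Example only: fetch a Dutch spaCy model at image build time
# ('nl_core_news_lg' is a placeholder; any Dutch spaCy model works here)
spacy.cli.download('nl_core_news_lg')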

About a month ago we didn't have any problems with this, so perhaps a recent MedCAT change caused it. I could build an English model with default settings from MedCAT master and load it into a recent MCT with the default scispaCy model, but that would take me quite some time, so I was wondering what the best way would be to debug this / display the application log.

Also, probably unrelated to this, but I couldn't upload an xlsx dataset because Python was missing the openpyxl library. Not sure what happened there; I didn't have issues with this in the past. CSVs still work, so it's not a huge problem.
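For reference, a minimal sketch of the failing path, assuming the upload is parsed with pandas (which needs openpyxl for .xlsx files; the filename is a placeholder):

import pandas as pd

# Reading an .xlsx file raises ImportError if openpyxl is not installed;
# 'dataset.xlsx' stands in for the uploaded dataset
df = pd.read_excel('dataset.xlsx', engine='openpyxl')
print(df.columns.tolist())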

@sandertan ah right -
Do any CDBs load in your version of MCT, or is it just this latest CDB built with the latest version of MedCAT? If you can copy the offending CDB directly to the container filesystem, you can try running the MedCAT CDB load and CAT instantiation directly:

from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.cat import CAT

# Load the CDB from wherever you copied it onto the container filesystem
cdb = CDB.load('')
vocab = Vocab.load('vocab-mc-v1.dat')

# Instantiate CAT with the CDB's own config and annotate some dummy text
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)
cat('some foo text')

@tomolopolis We figured out the issue: when adding a concept DB, if "Use for training" is unchecked, MCT cannot use the CDB and displays the error message. The same error also appears when "Train model on submit" is unchecked.

For an annotation project (to create a validation set to assess MedCAT on language- and domain-specific datasets), we weren't particularly interested in training the CDB in MCT, but it's also no problem to do it anyway.

It would be nice to fix the functionality or display a clear error message.

Good find - will get this fixed in the next patch release. Yes, ideally you would leave "use for training" unchecked, as having it learn between each document may introduce some sort of bias into the production of your validation set.

@sandertan - just tried to reproduce this on master and it looks like this is fixed. When the next patch version of MedCAT is released, I'll rebuild the container images and check whether this issue is still there.

Hi @tomolopolis, I'm still experiencing this issue with a build on commit 2813910 (Oct 9). I reproduced it in a clean instance. In short: if "Use for training" is unchecked, the new project won't save, and in an existing project I can't select the CDB.

If I check that box and uncheck "Train model on submit" in the project configuration, the CDB won't be altered, right? In that case, this isn't really a problem for us.

Hi @sandertan - okay, will take another look. The setting "Use for training" is supposed to signify whether:

  • checked: the CDB is available in a project for the CUI lookup
  • unchecked: the CDB will not be available in the drop-down under a project, but will be available for "importing concepts" into the Concept Table, which supplies the drop-down for the alternative / new annotation searches

I guess "Use For Training" should actually be renamed to "Use in Projects" or something better... the intention of the above logic was that there would be one 'master' or 'complete' CDB that would contain all concepts for a given terminology. Subsequent CDBs that would be used in projects (have "Use For Training" ticked), might contain filtered down representations, i.e. only just some subset of symptoms or findings etc. Crucially the CUIs would match on both 'complete' and the project specific CDBs, so that when searching and adding missing annotations on any given document you'd still be using the same CUIs.

Yes - if you uncheck "Train model on submit", the model is unaffected. In fact, the model only remains online-learning 'trained' in memory, so we actually recommend that once you've collected annotations you export them and re-train a fresh instance of the CDB using the exported data; this also allows for printing metrics and using a less aggressive learning rate.
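For example (a minimal sketch, assuming the MedCAT v1 API and a downloaded trainer export; filenames are placeholders):

from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.cat import CAT

# Start from a fresh copy of the CDB and vocab (placeholder paths)
cdb = CDB.load('cdb.dat')
vocab = Vocab.load('vocab.dat')
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

# Supervised training on the MedCATtrainer project export (JSON download)
cat.train_supervised(data_path='MedCAT_Export.json', nepochs=1)

# Save the re-trained CDB for use elsewhere
cat.cdb.save('cdb_retrained.dat')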

Also FYI - there's a new release, v2.2.0, that upgrades MedCAT and fixes some file-handling bugs etc. that, if memory serves, you raised :)

👍 Thanks for explaining. We're currently using different master CDBs (with different MedCAT parameters, MedCAT versions, and concept systems). Setting up a new instance or resetting an existing one with Docker is very easy, so the current system is fine for us :)