CogStack / MedCATtrainer

A simple interface to inspect, improve and add concepts to biomedical NER+L -> MedCAT.

Unclear error message when "Use for training" is unchecked

sandertan opened this issue

I'm getting an unclear error when uploading a new CDB. The container log doesn't show any errors. Is there another way to debug this?

[screenshot of the error message shown in the MedCATtrainer UI]

Hi @sandertan -

  • which version of the trainer (container version) are you using?
  • which CDB are you trying to upload, and how large is it?

Hi @tomolopolis, I cloned it from git (master branch, commit c7d4471, authored Wed Aug 11). The CDB was made with one of the latest versions of MedCAT (spaCy 3) and is 630 MB.

@sandertan thanks - so it looks like the latest container needs to be rebuilt to include the en_core_sci_lg-0.4.0 model that's compatible with spaCy v3.
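As a quick sanity check (a minimal sketch; nothing MCT-specific assumed beyond the model name above), you can confirm which spaCy and scispaCy model versions are installed inside the container:

import spacy

# Print the installed spaCy version and the en_core_sci_lg model version;
# an incompatible pairing will usually fail at load time
print(spacy.__version__)
nlp = spacy.load('en_core_sci_lg')
print(nlp.meta.get('version'))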

We've had issues with Docker Hub lately as they've restricted AutoBuild. Will rebuild now.

The latest versions have been uploaded to Docker Hub.

Thanks @tomolopolis for looking into this. We're building our own image from a slightly modified Dockerfile that includes downloading a Dutch spaCy model (because it performs better for lemmatization). Not much else has changed.
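For illustration, the kind of change involved is essentially just an extra model download, e.g. in Python (the model name nl_core_news_lg is an example Dutch spaCy model, not necessarily the one we use):

import spacy.cli

# Example only: fetch a Dutch spaCy model at image build time
# ('nl_core_news_lg' is a placeholder; any Dutch spaCy model works here)
spacy.cli.download('nl_core_news_lg')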

About a month ago we didn't have any problems with this, so perhaps a recent MedCAT change caused it. I could build an English model with default settings from MedCAT master and load it into a recent MCT with the default scispaCy model, but that would take me quite some time, so I was wondering what the best way would be to debug this / display the application log.

Also, probably unrelated to this, but I couldn't upload an xlsx dataset because Python was missing the openpyxl library. Not sure what happened there; I didn't have issues with this in the past. CSVs still work, so it's not a huge problem.
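For reference, a minimal sketch of the failing path, assuming the upload is parsed with pandas (which needs openpyxl for .xlsx files; the filename is a placeholder):

import pandas as pd

# Reading an .xlsx file raises ImportError if openpyxl is not installed;
# 'dataset.xlsx' stands in for the uploaded dataset
df = pd.read_excel('dataset.xlsx', engine='openpyxl')
print(df.columns.tolist())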

@sandertan ah right -
Do any CDBs load in your version of MCT, or is it just this latest CDB built with the latest version of MedCAT? If you can copy the offending CDB directly to the container filesystem, you can try running the MedCAT CDB load and CAT instantiation directly:

from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.cat import CAT

# Load the CDB from wherever you copied it onto the container filesystem
cdb = CDB.load('')
vocab = Vocab.load('vocab-mc-v1.dat')

# Instantiate CAT with the CDB's own config and annotate some dummy text
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)
cat('some foo text')

@tomolopolis We figured out the issue: when adding a concept DB, if "Use for training" is unchecked, MCT cannot use the CDB and displays the error message. The same error also appears when "Train model on submit" is unchecked.

For an annotation project (to create a validation set to assess MedCAT on language- and domain-specific datasets), we weren't particularly interested in training the CDB in MCT, but it's also no problem to do it anyway.

It would be nice to fix the functionality or display a clear error message.

Good find - will get this fixed in the next patch release. Yes, ideally you would leave "use for training" unchecked, as having it learn between each document may introduce some sort of bias into the production of your validation set.

@sandertan - just tried to reproduce this on master and it looks like this is fixed. When the next patch version of MedCAT is released, I'll rebuild the container images and check whether this issue is still there.

Hi @tomolopolis, I'm still experiencing this issue with a build on commit 2813910 (Oct 9). I reproduced it in a clean instance. In short: if "Use for training" is unchecked, the new project won't save, and in an existing project I can't select the CDB.

If I check that box and uncheck "Train model on submit" in the project configuration, the CDB won't be altered, right? In that case, this isn't really a problem for us.

Hi @sandertan - okay, will take another look. The setting "Use for training" is supposed to signify whether:

  • checked: the CDB is available in a project for the CUI lookup
  • unchecked: the CDB will not be available in the drop-down under a project, but will be available for "importing concepts" into the Concept Table, which supplies the drop-down for the alternative / new annotation searches

I guess "Use For Training" should actually be renamed to "Use in Projects" or something better... the intention of the above logic was that there would be one 'master' or 'complete' CDB that would contain all concepts for a given terminology. Subsequent CDBs that would be used in projects (have "Use For Training" ticked), might contain filtered down representations, i.e. only just some subset of symptoms or findings etc. Crucially the CUIs would match on both 'complete' and the project specific CDBs, so that when searching and adding missing annotations on any given document you'd still be using the same CUIs.

Yes - if you uncheck "Train model on submit", the model is unaffected. In fact, the model only remains online-learning 'trained' in memory, so we actually recommend that once you've collected annotations you export them and re-train a fresh instance of the CDB using the exported data; this also allows for printing metrics and using a less aggressive learning rate.
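For example (a minimal sketch, assuming the MedCAT v1 API and a downloaded trainer export; filenames are placeholders):

from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.cat import CAT

# Start from a fresh copy of the CDB and vocab (placeholder paths)
cdb = CDB.load('cdb.dat')
vocab = Vocab.load('vocab.dat')
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

# Supervised training on the MedCATtrainer project export (JSON download)
cat.train_supervised(data_path='MedCAT_Export.json', nepochs=1)

# Save the re-trained CDB for use elsewhere
cat.cdb.save('cdb_retrained.dat')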

Also FYI - there's a new release, v2.2.0, that upgrades MedCAT and fixes some file-handling bugs etc. that, if memory serves, you raised :)

👍 Thanks for explaining. We're currently using different master CDBs (with different MedCAT parameters, MedCAT versions, and concept systems). Setting up a new instance or resetting an existing one with Docker is very easy, so the current system is fine for us :)