The new runs should carry the label `R` before completion
georgemilosh opened this issue · comments
Currently, when the trainer starts a run, its folder is named something like `./7--nfolds__5-- ....`, unless an exception causes it to be renamed to `./F7 ....`, which is convenient.
This works well on CBP but not on the IPSL mesocenter or Jean-zay, the latter being a major French cluster with powerful GPUs. On these clusters, if RAM is oversaturated, the processes are terminated without raising Python exceptions, so the run is not renamed and remains `./7--nfolds__5-- ....`. In `runs.json` such runs do keep the label `RUNNING`, which is still helpful. I don't see a way to make this change to `FAIL`, since the Python script is terminated from the outside.
I think it would be nice if the run folder carried the label `R` in front from the moment it is created. If the run succeeded, the `R` would be removed.
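
A minimal sketch of the proposed scheme, not the trainer's actual code: `run_with_status_prefix`, the `job` callable and the folder names `7--demo` and `8--demo` are hypothetical and only illustrate the renaming logic. The folder is created with the `R` prefix, renamed to the `F` prefix if a Python exception is raised, and renamed to the plain name on success. If the process is killed from the outside, neither rename happens and the folder keeps its `R` prefix.

```python
import os
import tempfile


def run_with_status_prefix(root, run_name, job):
    """Run job(folder) in root/R<run_name>, then rename the folder by outcome.

    Hypothetical helper: the real trainer's folder handling may differ.
    """
    running = os.path.join(root, "R" + run_name)
    os.makedirs(running)  # the folder carries "R" from the moment it exists
    try:
        job(running)
    except Exception:
        # A Python-level failure: mark the folder with "F", as is done today.
        os.rename(running, os.path.join(root, "F" + run_name))
        raise
    # Success: drop the "R" prefix.
    done = os.path.join(root, run_name)
    os.rename(running, done)
    return done


def failing_job(folder):
    # Stand-in for a run that raises an ordinary Python exception.
    raise RuntimeError("simulated failure")


with tempfile.TemporaryDirectory() as root:
    run_with_status_prefix(root, "7--demo", lambda folder: None)
    try:
        run_with_status_prefix(root, "8--demo", failing_job)
    except RuntimeError:
        pass
    print(sorted(os.listdir(root)))  # ['7--demo', 'F8--demo']
```

With this layout, any folder still starting with `R` after the job has left the queue belongs to a run that was terminated without a Python exception.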