georgemilosh / Climate-Learning

How to predict extreme events in climate using rare event algorithms and modern tools of machine learning

Home Page: https://georgemilosh.github.io/Climate-Learning/


The new runs should carry the label `R` before completion

georgemilosh opened this issue

Currently, when the trainer starts a run, the run folder is called something like ./7--nfolds__5-- .... unless an exception causes it to be renamed to ./F7 ...., which is convenient.

This works well on CBP but not on the IPSL mesocenter / Jean Zay, the latter being a major French cluster with powerful GPUs. On these clusters, if RAM is oversaturated, the processes are terminated without raising a Python exception, so the run is not renamed and remains ./7--nfolds__5-- ..... In runs.json such runs do keep the label RUNNING, which is still helpful. I don't see a way to change this to FAIL, since the Python script is terminated from the outside.

I think it would be nice if the run folder carried the label R in front from the moment it is created. If the run succeeds, the R is removed. A folder that still starts with R is then either still running or was killed from the outside.
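
A minimal sketch of this labelling scheme, not the trainer's actual code. The helper name `labelled_run` and the folder name `7--demo` are hypothetical and chosen only for illustration. It assumes the run folder can be created and renamed with plain `os` calls and leaves the runs.json bookkeeping out.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def labelled_run(root, name):
    """Hypothetical helper: create root/R<name> and relabel it when the run ends.

    success   -> root/<name>   (the R is removed)
    exception -> root/F<name>  (same F label as the existing behaviour)
    killed from outside -> neither branch runs, the folder keeps its R
    """
    running = os.path.join(root, "R" + name)
    os.makedirs(running)
    try:
        yield running
    except BaseException:
        # Also covers KeyboardInterrupt; an external kill never reaches this line.
        os.rename(running, os.path.join(root, "F" + name))
        raise
    else:
        os.rename(running, os.path.join(root, name))


# Demo in a throwaway directory
demo_root = tempfile.mkdtemp()
with labelled_run(demo_root, "7--demo") as folder:
    print(os.path.basename(folder))  # folder carries the R while the run is active
print(sorted(os.listdir(demo_root)))  # after success the R is gone
```

With this pattern nothing has to detect an out-of-memory kill: the `R` is simply never removed, so leftover `R...` folders can be listed and cross-checked against the RUNNING entries in runs.json.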