kermitt2 / delft

a Deep Learning Framework for Text

improve directory management for training/nfold-training

lfoppiano opened this issue · comments

This issue covers a revision of the directory management for training and n-fold training.

Right now, training writes directly under data/models/model-name. For plain training this is not a problem, because the write happens at the end.
For n-fold training, data/models/xx/model-name is used as a temporary directory, which can be left behind,
e.g.

-rw-r--r-- 1 lfoppian0 tdm 1.1K Feb 25 11:40 config.json
-rw-r--r-- 1 lfoppian0 tdm 425M Feb 25 12:11 model_weights0.hdf5
-rw-r--r-- 1 lfoppian0 tdm 425M Feb 17 13:55 model_weights1.hdf5
-rw-r--r-- 1 lfoppian0 tdm 425M Feb 17 15:00 model_weights2.hdf5
-rw-r--r-- 1 lfoppian0 tdm 425M Feb 17 15:34 model_weights3.hdf5
-rw-r--r-- 1 lfoppian0 tdm 425M Feb 17 16:19 model_weights4.hdf5
-rw-r--r-- 1 lfoppian0 tdm 425M Feb 17 17:07 model_weights5.hdf5
-rw-r--r-- 1 lfoppian0 tdm 425M Feb 17 17:41 model_weights6.hdf5
-rw-r--r-- 1 lfoppian0 tdm 425M Feb 17 18:12 model_weights7.hdf5
-rw-r--r-- 1 lfoppian0 tdm 425M Feb 17 18:47 model_weights8.hdf5
-rw-r--r-- 1 lfoppian0 tdm 425M Feb 17 19:45 model_weights9.hdf5

The fold 0 weights were overwritten; if for any reason the process stops (e.g. the user interrupts the training, or something crashes), the directory is left half-written.

We could use a temporary directory where the models are saved, and then copy them back into the data/models directory once everything finishes.
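A minimal sketch of that idea, not the actual delft implementation: train into a staging directory created with `tempfile`, and move it into place only after training completes. The `train_fn` callback and the function name are hypothetical; it stands in for whatever writes config.json and the model_weights*.hdf5 files.

```python
import shutil
import tempfile
from pathlib import Path


def train_with_staging(model_name, train_fn, models_root="data/models"):
    """Train into a temporary directory, then move the result into
    data/models/<model_name> only once training finishes successfully.

    `train_fn` is a hypothetical callback that writes the model files
    (config.json, model_weights*.hdf5, ...) into the directory it is given.
    """
    final_dir = Path(models_root) / model_name
    staging = Path(tempfile.mkdtemp(prefix=f"{model_name}-"))
    try:
        train_fn(staging)  # may crash or be interrupted
        # Training succeeded: replace any previous model in one move.
        if final_dir.exists():
            shutil.rmtree(final_dir)
        final_dir.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(staging), str(final_dir))
    finally:
        # If training failed, drop the half-written staging directory;
        # data/models/<model_name> is never left in a partial state.
        shutil.rmtree(staging, ignore_errors=True)
```

With this layout, an interrupted fold only loses the staging directory under the system temp location; the previous contents of data/models/model-name stay intact until the new model is complete.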