Samsung / ONE

On-device Neural Engine

How to save (checkpoint) training state into a file?

chunseoklee opened this issue · comments

Let's define the checkpoint format for onert/onert-micro.

Here, the term checkpoint means any serialization format for resuming the training process later. (Note that a checkpoint in Keras is a set of parameter values without graph information. The SavedModel format, on the other hand, includes a serialized description of the computation defined by the model in addition to the parameter values (checkpoint).)

Things we consider saving:

  • the model's configuration (topology)
  • the model's weights
  • the model's optimizer's state (if any)

Let's first look at the current state, i.e. the current training model formats:

  • onert-micro
    Based on #12892, onert-micro imports 3 files for training: 1) a circle model without weights, 2) a weight file, and 3) a backprop graph.

  • onert
    onert takes only 1 circle(+) model.

I'd like to integrate the checkpoint with the inference format as follows (Q: can we call this circle+?):

  • A checkpoint is a single directory (like nnpackage); a hypothetical directory layout is sketched after this list.

    • It consists of at most 4 files:
      1. circle model (with training info): the training info in the circle model can serve as checkpoint meta information
      2. Weight file: optional if the circle model already contains the weights
      3. Optimizer variables (e.g. Adam state)
      4. Backprop graph
    • Files 2, 3, and 4 are optional, depending on the implementation
    • When the user saves a checkpoint, the training info, weights, and optimizer variables are updated and serialized
  • An inference model can be generated simply by removing files 3 and 4 plus the training info, or the checkpoint can be used for inference as it is.

  • Q: How do we provide an API if the checkpoint is a directory? Note that we usually pass a model file path to the application.
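
For illustration only, a checkpoint directory under this proposal might look like the following (all file names are hypothetical, nothing here is decided):

model.ckpt/          (checkpoint directory, like nnpackage)
  model.circle       1. circle model with training info (checkpoint meta information)
  model.weight       2. weight file (optional if the circle file already contains weights)
  model.optvar       3. optimizer variables (e.g. Adam state)
  model.bpgraph      4. backprop graph (for onert-micro)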

Here is a tentative conclusion from an offline discussion with @Samsung/one_onert:

  • A checkpoint is a single file which holds the weights, optimizer variables, and checkpoint meta information (e.g. step); a rough file-layout sketch follows this list.
  • There are two APIs, one for loading the model and one for loading the checkpoint. To resume training, load the circle model first, then load the checkpoint and continue training.
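
As a rough illustration of the single-file idea (the actual format is discussed in #13037, so every field below is an assumption, not a decision), the file could start with a small header followed by the weight and optimizer-variable blobs:

#include <stdint.h>

/* Hypothetical single-file checkpoint layout (illustration only; the real
   format is discussed in #13037). A small header with meta information is
   followed by the raw weight and optimizer-variable buffers. */
typedef struct
{
  char magic[4];         /* file identifier, e.g. "CKPT" (assumed) */
  uint32_t version;      /* checkpoint format version */
  uint32_t num_weights;  /* number of weight buffers that follow */
  uint32_t num_opt_vars; /* number of optimizer variables (e.g. Adam m/v) */
  uint32_t epoch;        /* checkpoint meta information: last finished epoch */
} ckpt_header_t;
/* ... followed by the serialized weight buffers and optimizer variables ... */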

https://tutorials.pytorch.kr/recipes/recipes/saving_and_loading_a_general_checkpoint.html

The step is not saved into the checkpoint since (IMHO) the training unit period is an epoch, not a step.

https://keras.io/api/callbacks/model_checkpoint/

save_freq: "epoch" or integer. When using "epoch", the callback saves the model after each epoch. When using integer, the callback saves the model at end of this many batches. If the Model is compiled with steps_per_execution=N, then the saving criteria will be checked every Nth batch. Note that if the saving isn't aligned to epochs, the monitored metric may potentially be less reliable (it could reflect as little as 1 batch, since the metrics get reset every epoch). Defaults to "epoch".

Here is my understanding of the workflow for #12997 (comment).

At initial training,

nnfw_load_model_from_file(session, "model.circle")

.... 
nnfw_train()
.....

nnfw_save_checkpoint(session, "/path/model.ckpt")

To resume

nnfw_load_model_from_file(session, "model.circle")
nnfw_load_checkpoint(session, "/path/model.ckpt") 
....
nnfw_train()
....
nnfw_save_checkpoint("/path/another.cpkt")

Here, the weight buffer should be reinitialized from the checkpoint.

@jyoungyun @hseok-oh Please let me know if anything in my understanding looks strange.

Let's discuss the checkpoint format at #13037

@chunseoklee,
#12997 (comment) - looks good to me, thank you!
About saving the result model:
is this a separate method that, using a session

NNFW_STATUS nnfw_train_export_circle(nnfw_session *session, const char *path);
(which holds a pointer to the model), saves the model without using the checkpoint in any way? And if we want to save the model using the best checkpoint, do we first load the model from that checkpoint and then save the model, right?

nnfw_train_export_circle will save the "training done" model (in memory/session) into a circle file; no checkpoint is generated.

If we want to save the model using the best checkpoint, do we first load the model from that checkpoint and then save the model, right?

Yes, though it seems a little bit weird...

Yes, though it seems a little bit weird...

I think it is okay :)

nnfw API for checkpoint:

NNFW_STATUS nnfw_train_export_checkpoint(nnfw_session *session, const char *path);
NNFW_STATUS nnfw_train_import_checkpoint(nnfw_session *session, const char *path);
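
A minimal usage sketch combining these calls with the existing nnfw session API (error handling and the real training loop are omitted; all paths are placeholders). It also covers the "import the best checkpoint, then export the circle model" flow discussed above:

#include "nnfw.h"
#include "nnfw_experimental.h"  /* assumed location of the training/checkpoint APIs */

int main(void)
{
  nnfw_session *session = NULL;
  nnfw_create_session(&session);

  /* Resume training: load the circle model first, then the checkpoint. */
  nnfw_load_model_from_file(session, "model.circle");
  nnfw_train_import_checkpoint(session, "/path/model.ckpt");

  /* ... prepare training, feed batches, and call nnfw_train() per batch,
     as in the workflow sketched earlier in this thread ... */

  /* Save a new checkpoint so training can be resumed later. */
  nnfw_train_export_checkpoint(session, "/path/another.ckpt");

  /* To keep the best model for inference: import the best checkpoint and
     export a circle file (nnfw_train_export_circle generates no checkpoint). */
  nnfw_train_import_checkpoint(session, "/path/best.ckpt");
  nnfw_train_export_circle(session, "trained.circle");

  nnfw_close_session(session);
  return 0;
}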

You can find a draft implementation of the nnfw API based on onert-micro training at https://github.com/chunseoklee/ONE/tree/v3