dmlc / tensorboard

Standalone TensorBoard for visualizing in deep learning

All values are at 0 epoch

ShownX opened this issue · comments

Hello,

I'm using mx.contrib.tensorboard.LogMetricsCallback() in model.fit(batch_end_callback=..., eval_batch_end_callback=...).

It generates the log file, but when I run tensorboard --logdir=...,

it produces a figure like the following:

[screenshot, 2017-06-15]

Any ideas what I did wrong?
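For context, here is a rough sketch of what a no-step metrics callback like this one does. This is illustrative, not the library's exact code: the real LogMetricsCallback takes a logging directory and builds its own SummaryWriter, while the `writer` parameter here is a stand-in. Because add_scalar is called without a step, every point lands at the same step, which is why the default STEP view shows everything clumped at 0.

```python
class LogMetricsCallback:
    """Illustrative sketch (assumed behavior, not the exact library code)
    of a batch-end callback that logs metrics WITHOUT an explicit global
    step. Every point then shares the same step, so TensorBoard's STEP
    mode shows a clump at 0 and only RELATIVE (wall-clock) mode spreads
    the curve out.

    `writer` is assumed to expose add_scalar(tag, value); the real
    callback takes a logging directory and constructs its own writer.
    """
    def __init__(self, writer, prefix=None):
        self.writer = writer
        self.prefix = prefix

    def __call__(self, param):
        # param mimics mxnet's BatchEndParam: it carries an eval_metric
        if param.eval_metric is None:
            return
        for name, value in param.eval_metric.get_name_value():
            if self.prefix is not None:
                name = '%s-%s' % (self.prefix, name)
            self.writer.add_scalar(name, value)  # note: no step argument
```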

Try relative mode on the left.

@zihaolucky what exactly do you mean by relative mode on the left? I'm confused.

@ShownX I haven't added a step param to the scalar logging, so you have to select the RELATIVE mode in TensorBoard.

[screenshot, 2017-06-16]

Thank you very much!

@zihaolucky, could you please point me in the right direction on how to add the step param to the scalar logging? Relative mode is very poor as of now; it gives me very odd results in the training graph.
[screenshot, 2018-03-04]

I am calling the TensorBoard callback on every batch_end:

batch_end_callbacks += [mx.contrib.tensorboard.LogMetricsCallback(training_log)]

TensorBoard logs every batch, but each one individually (no stitching between batches), so in RELATIVE mode I get as many graph segments as there are batches.

Hi @arundasan91

The reason it looks ugly is that we log train and valid/test data points on different time scales. You can write another callback function that passes the step explicitly, then use STEP mode.

Hi @zihaolucky, I was able to figure it out but forgot to update you. I passed params.epoch as global_steps in tensorboard.py and it worked as intended. Thank you so much for the wonderful project!
Do you have any idea why the batch_end_callback gives discontinuous graphs? Some accuracy values are NaN when I download the CSV, but they print perfectly to the shell during training.