omron-sinicx / neural-astar

Official implementation of "Path Planning using Neural A* Search" (ICML-21)

Home Page: https://omron-sinicx.github.io/neural-astar


Metrics visualization during training and evaluation

GigSam opened this issue

On the "minimal" branch, differently from what was done in the example.ipynb file of the previous version of the repo (the one without pytorch lightning, similar to the branch "3-improve-repository-organization"), it seems that you don't use the logs of the Opt, Exp and Hmean metrics when the training is performed. I would like to visualize those metrics, but the "metrics" folder isn't created by running the train.py script. Thank you for your support.

Hi, thank you for your post!

If you open TensorBoard, you can see the progress of those metrics (p_opt, p_exp, and h_mean) as shown here: #4 (comment). Is this what you are looking for?
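
In case it is useful, here is a small sketch (not part of the repo) for reading the logged scalars directly from the event files with TensorBoard's EventAccumulator. The log directory path follows the layout mentioned later in this thread and is otherwise an assumption, so point it at whatever version_* directory you actually have:

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Assumed path based on this thread; adjust to your own lightning_logs/version_* dir.
log_dir = "model/mazes_032_moore_c8/lightning_logs/version_1"

ea = EventAccumulator(log_dir)
ea.Reload()  # parse the events.out.tfevents.* file(s) in that directory

# List whatever scalar tags were actually logged (e.g. p_opt, p_exp, h_mean).
print("scalar tags:", ea.Tags()["scalars"])

for tag in ea.Tags()["scalars"]:
    events = ea.Scalars(tag)  # each event has .step and .value
    print(tag, [(e.step, round(e.value, 4)) for e in events[:5]], "...")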

Hi 😀
I am facing the same issue: I am not able to see the training progress (loss and metrics) because no log files are generated. Is this normal?

Thank you!

Thank you! At least when I was working on #9, all the metrics were logged as intended. I will look into it.

Hi! I've been investigating this issue but am having difficulty reproducing it. If I clone the repository, create a venv, and run train.py, the metrics are logged to TensorBoard as follows.

[screenshot: TensorBoard showing the logged metrics]

My environment is with:

  • WSL2 (Ubuntu 20.04) on Windows 11
  • venv created with python==3.8
  • tensorboard==2.11.0
  • pytorch-lightning==1.8.5.post0

I will try other environments and module versions, but would it be possible to share your environment and the versions of the related modules that cause this logging issue (perhaps the tensorboard and pytorch-lightning versions matter)? Or did you get any warning messages about logging failures? @GigSam @luigidamico100
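
For anyone sharing their setup here, a quick way to dump the relevant versions (a minimal sketch; it only assumes the packages are installed in the same venv used for train.py):

import sys
from importlib.metadata import version, PackageNotFoundError

print("python:", sys.version.split()[0])
for pkg in ("torch", "pytorch-lightning", "tensorboard"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")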

Thank you!

@yonetaniryo my environment is:

  • Windows 11
  • venv created with python==3.10.9
  • tensorboard==2.10.1
  • pytorch-lightning==1.8.5.post0

The problem is that after cloning the repo, creating and activating the venv, and running train.py, I don't see any "metrics" folder or any log produced by the training script, even though the algorithm works fine and no logging warning is produced. I really don't know what's causing this issue.

Thank you for sharing your environment. I just wanted to make sure that, for mazes_032_moore_c8, the logs are stored in model/mazes_032_moore_c8/lightning_logs/version_*, not in a metrics folder. Also, to reduce the repository size, only the checkpoint in model/mazes_032_moore_c8/lightning_logs/version_0 is kept on GitHub. When you clone the repo and start the training, the following directory and files should appear:

model/mazes_032_moore_c8/lightning_logs/version_1:
checkpoints  events.out.tfevents....  hparams.yaml
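
For context, this layout is what PyTorch Lightning's default TensorBoardLogger produces when the Trainer is given a root directory. The snippet below is a minimal self-contained sketch of that behaviour, not the repo's actual train.py; the module, data, and metric name are made up for illustration.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # self.log(...) is what ends up as scalars in events.out.tfevents.*
        self.log("h_mean", loss)  # placeholder metric name for illustration
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)

data = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)

# With only default_root_dir set, Lightning writes to
# <default_root_dir>/lightning_logs/version_<N>/ (N auto-increments per run),
# which is why a fresh run appears as version_1 next to the shipped version_0.
trainer = pl.Trainer(default_root_dir="model/mazes_032_moore_c8", max_epochs=1)
trainer.fit(ToyModule(), data)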

I have checked the logging in an environment as close as possible to that of @GigSam, with python 3.10.9 and tensorboard==2.10.1. However, I'm still not able to reproduce the issue. Can you double-check whether the logs are stored in the model directory? Alternatively, you may try using our Dockerfile, which will give us exactly the same environment. Thank you!
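
A quick way to double-check programmatically whether any event files were written under the model directory (a sketch; the glob pattern simply mirrors the layout described above):

from pathlib import Path

# Look for TensorBoard event files anywhere under the model directory.
hits = sorted(Path("model").glob("**/lightning_logs/version_*/events.out.tfevents.*"))
if hits:
    for p in hits:
        print(p, f"({p.stat().st_size} bytes)")
else:
    print("no event files found under ./model -- logging did not happen or went elsewhere")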

Sorry, but I'm going to close this issue because I cannot reproduce the logging problem. If someone encounters the same problem, please check whether the metrics data are stored in the model directory, and please don't hesitate to re-open the issue if you can reproduce the problem. Thank you for the report!