Updates and Documentation for WandB sweeps

Adamits opened this issue a year ago · comments

Based on other work we're doing, we should add some documentation and make the necessary tweaks for running a W&B sweep with this codebase.

Add documentation and examples of running WandB sweeps with Yoyodyne.
Make updates to codebase so PTL and WandB play nice wrt logging hyperparameters, etc.
Update PTL to log max validation accuracy.
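
As a starting point for the documentation item above, a sweep definition might look roughly like the following. This is only a sketch: the hyperparameter names and ranges are hypothetical placeholders and are not taken from Yoyodyne, and the metric name `val_accuracy` is assumed to match whatever the trainer logs.

```python
# Hypothetical sweep configuration; the hyperparameter names and ranges
# below are illustrative placeholders, not Yoyodyne's actual arguments.
SWEEP_CONFIG = {
    "method": "bayes",
    # Assumes the training run logs a metric called "val_accuracy".
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-2,
        },
        "batch_size": {"values": [16, 32, 64]},
        "dropout": {"distribution": "uniform", "min": 0.0, "max": 0.5},
    },
}


def create_sweep(project: str) -> str:
    """Registers the sweep with W&B and returns the new sweep ID."""
    import wandb  # Imported lazily so the config can be inspected offline.

    return wandb.sweep(sweep=SWEEP_CONFIG, project=project)
```

The returned sweep ID is what an agent would later be pointed at.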

Kyle Gorman · Answer 1 · Fri May 26 2023 04:20:59 GMT+0800 (China Standard Time)

Let me also add that the documentation should probably show how to retrieve the best runs from the wandb API too.
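
For illustration, retrieving the best run could be sketched as below. The assumptions here are mine: the metric is called `val_accuracy`, each run's summary holds either a plain number under that name or a nested entry with a `"max"` key, and the sweep path has the form `entity/project/sweep_id`. The helper names are hypothetical.

```python
from typing import Any, Iterable, Optional


def metric_value(run: Any, metric: str) -> Optional[float]:
    """Extracts a numeric value for `metric` from a run's summary."""
    value = run.summary.get(metric)
    if isinstance(value, (int, float)):
        return float(value)
    # Assumption: a "max" summary may be stored as a nested entry.
    try:
        return float(value["max"])
    except (TypeError, KeyError):
        return None


def best_run(runs: Iterable[Any], metric: str = "val_accuracy") -> Optional[Any]:
    """Returns the run with the highest value for `metric`, if any."""
    scored = [(metric_value(run, metric), run) for run in runs]
    scored = [(value, run) for value, run in scored if value is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


def best_run_in_sweep(sweep_path: str, metric: str = "val_accuracy") -> Optional[Any]:
    """Fetches a sweep's runs through the W&B public API and picks the best."""
    import wandb  # Imported lazily; requires network access and credentials.

    sweep = wandb.Api().sweep(sweep_path)
    return best_run(sweep.runs, metric)
```

Keeping the selection logic separate from the API call means it can be checked without network access.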

Adam · Answer 2 · Fri May 26 2023 04:40:47 GMT+0800 (China Standard Time)

Relatedly, I guess it would also be nice to have a system for easily pointing W&B run IDs to yoyodyne logging, etc.

Adam · Answer 3 · Thu Jun 29 2023 01:24:57 GMT+0800 (China Standard Time)

Working on this now. Was wondering if you think we should add train args so that it can be called in such a way that a wandb agent trains from a sweep (by adding a wandb_sweep_id and max_num_runs arg), or if this should be a separate script that we maintain in the library (something like train_wandb_agent.py).

Other notes:

I was not able to find anything on how to log the max validation accuracy in PTL and have that logging propagate to wandb, so instead I just call wandb.define_metric('val_accuracy', summary='max') when wandb logging is enabled.
PTL tries to log the model hparams to the wandb run when the WandbLogger is enabled, causing a warning, because they also get logged when we start the sweep agent. See here: wandb/wandb#2641. I do not know how to fix this, since it does not seem to be PTL behavior I can toggle, and we need the PTL WandbLogger in order to also log runtime metrics. I think we can just let it happen for now?
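
The define_metric workaround in the first note can be sketched as a small helper. The helper name is hypothetical; it only assumes it is handed an active wandb run object (for example the value returned by `wandb.init()`), which exposes `define_metric`.

```python
def track_max_val_accuracy(run, metric: str = "val_accuracy") -> None:
    """Asks W&B to keep the maximum of `metric` in the run summary.

    `run` is assumed to be an active wandb run; without this call the
    summary would hold the last logged value rather than the maximum.
    """
    run.define_metric(metric, summary="max")
```

With the PTL WandbLogger, the underlying run should be reachable through the logger's `experiment` attribute, so the call would be something like `track_max_val_accuracy(logger.experiment)`.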

Kyle Gorman · Answer 4 · Thu Jun 29 2023 01:32:21 GMT+0800 (China Standard Time)

Working on this now. Was wondering if you think we should add train args so that it can be called in such a way that a wandb agent trains from a sweep (by adding a wandb_sweep_id and max_num_runs arg), or if this should be a separate script that we maintain in the library (something like train_wandb_agent.py).

While I'm not sure I have enough context to get this yet, I think I am fine just including docs and a sample script for doing wandb stuff. It's hard for me to imagine doing this effectively using yoyodyne-train alone, I guess? I assume you did your sweeping using custom Python, right?

I was not able to find anything on how to log the max validation accuracy in PTL and have that logging propagate to wandb, so instead I just call wandb.define_metric('val_accuracy', summary='max') when wandb logging is enabled.

SGTM.

PTL tries to log the model hparams to the wandb run when the WandbLogger is enabled, causing a warning, because they also get logged when we start the sweep agent. See here: [CLI] wandb: WARNING Config item 'hyperparam_name' was locked by 'sweep' (ignored update) wandb/wandb#2641. I do not know how to fix this, since it does not seem to be PTL behavior I can toggle, and we need the PTL WandbLogger in order to also log runtime metrics. I think we can just let it happen for now?

Let's just suppress the warning in __init__.py then, and add a TODO to investigate this at the PTL level later.
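
One way this suppression might look is sketched below. It rests on a loud assumption: that the message reaches Python's `warnings` machinery. If wandb prints it directly to the terminal instead, this filter has no effect and a different approach is needed. The function name is hypothetical.

```python
import warnings


def suppress_sweep_config_warning() -> None:
    """Ignores warnings whose message mentions a sweep-locked config item.

    Assumption: the message is emitted through the `warnings` module.
    """
    # TODO: investigate the duplicate hparam logging at the PTL level.
    warnings.filterwarnings(
        "ignore",
        message=r".*was locked by 'sweep'.*",
    )


suppress_sweep_config_warning()
```

The pattern is matched against the start of the warning text, hence the leading `.*`.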

Adam · Answer 5 · Thu Jun 29 2023 01:42:24 GMT+0800 (China Standard Time)

While I'm not sure I have enough context to get this yet, I think I am fine just including docs and a sample script for doing wandb stuff. It's hard for me to imagine doing this effectively using yoyodyne-train alone, I guess? I assume you did your sweeping using custom Python, right?

Yeah, I just have a train_wandb_agent.py script that calls the functions in train.py. So do we need a directory at the top level of our repository called examples or similar? Or do you think it's better to have train_wandb_agent.py live alongside train.py?
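
For readers who have not seen such a script, a minimal sketch of what a train_wandb_agent.py could look like follows. It is not the actual script: the `--project` flag, the function names, and the placeholder where training happens are all hypothetical, and the real version would call into train.py at that point.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parses the agent's command-line arguments."""
    parser = argparse.ArgumentParser(description="Runs a W&B sweep agent.")
    parser.add_argument("--sweep_id", required=True, help="ID of the sweep.")
    # Hypothetical flag; the project may also come from the environment.
    parser.add_argument("--project", default=None, help="W&B project name.")
    parser.add_argument(
        "--max_num_runs",
        type=int,
        default=None,
        help="Maximum number of runs for this agent (default: no limit).",
    )
    return parser.parse_args(argv)


def train_one_run() -> None:
    """Trains once using the hyperparameters the sweep assigned."""
    import wandb  # Imported lazily so argument parsing works without it.

    with wandb.init() as run:
        hparams = run.config.as_dict()
        # Placeholder: pass `hparams` to the training functions in train.py.
        print(f"Would train with: {hparams}")


def main(argv: Optional[List[str]] = None) -> None:
    """Starts an agent that pulls configurations from the sweep."""
    import wandb

    args = parse_args(argv)
    wandb.agent(
        args.sweep_id,
        function=train_one_run,
        project=args.project,
        count=args.max_num_runs,
    )
```

Calling `main()` at the bottom of the file would make it usable as `python train_wandb_agent.py --sweep_id ... --max_num_runs 10`.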

Let's just suppress the warning in __init__.py then, and add a TODO to investigate this at the PTL level later.

Sounds good!

Kyle Gorman · Answer 6 · Thu Jun 29 2023 01:48:39 GMT+0800 (China Standard Time)

Yeah, I just have a train_wandb_agent.py script that calls the functions in train.py. So do we need a directory at the top level of our repository called examples or similar? Or do you think it's better to have train_wandb_agent.py live alongside train.py?

Yes that's what I'd suggest. I'd have one for running the sweep and, optionally, one for grabbing the results from W&B.

I don't know whether we need to modify the project file so that it registers the existence of that directory but prevents it from being installed as part of the package. Something to look out for: browse the verbose installation info and you should see what happens there.

Adam · Answer 7 · Fri Jun 30 2023 00:53:46 GMT+0800 (China Standard Time)

@kylebgorman Should we leave this open until we've played with the examples and are sure the scripts are sufficient, and documentation is good enough?

Kyle Gorman · Answer 8 · Fri Jun 30 2023 00:56:26 GMT+0800 (China Standard Time)

Okay, sure. I'd like to take it for a spin first.

Adam · Answer 9 · Fri Jun 30 2023 00:56:50 GMT+0800 (China Standard Time)

Sorry, I just meant this issue -- not the PR!

Kyle Gorman · Answer 10 · Fri Jun 30 2023 01:02:01 GMT+0800 (China Standard Time)

Sorry, I just meant this issue -- not the PR!

Got it, yeah, I was confused at first.