ashleve / lightning-hydra-template

PyTorch Lightning + Hydra. A very user-friendly template for ML experimentation. ⚡🔥⚡

Resume multi-run

nurlanov-zh opened this issue

Hi,

Is it possible to resume a multi-run? E.g. if the Optuna hyperparameter search has crashed, can we resume the search from that point without having to sample new runs?

commented

Not possible as far as I'm aware.

I think it's best to write a dedicated task / pipeline for hyperparameter search if you want to be able to resume.
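To make the idea of a dedicated, resumable pipeline concrete, here is a minimal sketch using only the Python standard library. It is not how Hydra or the Optuna sweeper work internally; it swaps in plain random search and a JSON state file, and the names `run_trial`, `sample_params`, and `resumable_search` are hypothetical. The point is only the pattern: persist every finished trial, and on restart continue from the first index that has not been recorded.

```python
import json
import random
from pathlib import Path


def run_trial(params):
    """Hypothetical objective; stands in for a full training run."""
    return (params["lr"] - 0.01) ** 2


def sample_params(rng):
    """Hypothetical search space: log-uniform learning rate."""
    return {"lr": 10 ** rng.uniform(-4, -1)}


def resumable_search(state_path, n_trials, seed=0):
    """Run up to n_trials in total, skipping trials already recorded on disk."""
    state_path = Path(state_path)
    # Load previously completed trials, if a state file exists
    trials = json.loads(state_path.read_text()) if state_path.exists() else []
    for index in range(len(trials), n_trials):
        # Seed per trial index so a resumed search samples the same points
        rng = random.Random(f"{seed}-{index}")
        params = sample_params(rng)
        trials.append({"index": index, "params": params, "value": run_trial(params)})
        # Write to a temp file, then rename, so a crash mid-write keeps the old state
        tmp = state_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(trials))
        tmp.replace(state_path)
    # Return the best trial seen so far (lowest objective value)
    return min(trials, key=lambda t: t["value"])
```

Calling `resumable_search("sweep.json", n_trials=50)` again after a crash picks up at the first missing trial instead of re-running finished ones. A history-dependent sampler would additionally need to rebuild its internal state from the saved trials, which this sketch does not attempt.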

As someone who's implemented Hydra-aware resuming from pre-emption on multi-runs for hyperparameter search, with both wandb's sweeper and another sweeper made by some colleagues, I 100% agree with the suggestion of writing a dedicated pipeline for it. Each sweeper (and its Hydra plugin) operates quite differently and handles resuming from runs very differently. It would be completely infeasible to have this template cover the use case for all Hydra-supported sweepers. You would need to integrate this functionality both in this template and in the Hydra sweeper, i.e., within the sweeper plugin code. Take a look at this gross PR I made for getting it to work for wandb (among other features). It gets messy real fast.

@tesfaldet thanks for the links! I will take a look