Daisy chain batch jobs
adammoody opened this issue
The job array works well to queue up multiple jobs:
https://github.com/stas00/ml-engineering/tree/master/fault-tolerance#queue-up-multiple-training-jobs
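The job-array version can be sketched in one line: a `%1` throttle caps the array at one running task, so the elements execute back to back, each picking up from the latest checkpoint. (The range `1-10` and the script name `train.slurm` are illustrative, and the guard is only there so the snippet is harmless on machines without Slurm.)

```bash
# Submit ten sequential training jobs as a single job array.
# The "%1" throttle allows only one array task to run at a time,
# so the elements execute one after another.
if command -v sbatch >/dev/null 2>&1; then
    sbatch --array=1-10%1 train.slurm
fi
```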
Another common approach is to "daisy chain" jobs by having the job script submit another job that is dependent on itself. For example, in `train.slurm` you'd have a line like:
```bash
# when train.slurm executes, have it submit another job dependent on itself
sbatch --dependency=afterany:$SLURM_JOBID train.slurm
```
This is usually done near the top of the script, before the command that actually launches the run.
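Putting that together, a minimal self-resubmitting script might look like the sketch below. The `#SBATCH` directives are illustrative, and `afterany` makes the next job eligible once this one ends, whether it succeeded, failed, or hit its time limit. The guard on `SLURM_JOBID` keeps the script from submitting anything when run by hand outside a batch allocation.

```bash
#!/bin/bash
#SBATCH --job-name=train      # hypothetical job settings
#SBATCH --time=04:00:00
#SBATCH --nodes=1

# Resubmit this same script, dependent on the current job, before the
# long-running work starts.  The guard skips resubmission when the
# script is run outside of a Slurm allocation.
if [ -n "${SLURM_JOBID:-}" ]; then
    sbatch --dependency=afterany:"$SLURM_JOBID" train.slurm
fi

# ... launch the actual training run here ...
```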
One might also pair that with some logic to stop the chaining when the job is done. For example, the application or the user might touch a "run.done" file when it completes. Then the script can check for that file.
```bash
# exit right away if the "run.done" file is detected
if [ -f run.done ]; then
    exit 0
fi

# otherwise chain up another job
sbatch --dependency=afterany:$SLURM_JOBID train.slurm

# then launch the run
<<launch run>>
```
Additionally, one could check for the "run.done" file after the run and attempt to cancel any already daisy-chained job.
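That cleanup step can be sketched as follows. This combines the pieces above: `--parsable` is a real `sbatch` flag that makes it print just the job id, which the script keeps so the chained job can be cancelled with `scancel`; `run.done` and `train.slurm` are the names assumed in the discussion, and the `SLURM_JOBID` guard only exists so the sketch is inert outside a Slurm allocation.

```bash
#!/bin/bash
# a previous run already finished: do nothing
if [ -f run.done ]; then
    exit 0
fi

# chain up the next job, keeping its id so it can be cancelled later
# (the SLURM_JOBID guard skips this outside a Slurm allocation)
next_jobid=""
if [ -n "${SLURM_JOBID:-}" ]; then
    next_jobid=$(sbatch --parsable --dependency=afterany:"$SLURM_JOBID" train.slurm)
fi

# ... launch the training run here ...

# if the run dropped run.done, the chained job is no longer needed
if [ -f run.done ] && [ -n "$next_jobid" ]; then
    scancel "$next_jobid"
fi
```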
I don't have a list of pros/cons vs the job array, but it's one more method I see in practice.
Thank you for these suggestions, Adam.
`--dependency` is already covered in the SLURM guide, but you're making a good connection with the fault-tolerance section.
And for the file dropping, I used the kill-switch concept, so that's there already.
So I combined both of your suggestions and pushed this:
Closing this for now - but please don't hesitate to continue if more things can be improved.