stas00 / ml-engineering

Machine Learning Engineering Open Book

Home Page: https://stasosphere.com/machine-learning/

Daisy chain batch jobs

adammoody opened this issue

The job array works well to queue up multiple jobs:

https://github.com/stas00/ml-engineering/tree/master/fault-tolerance#queue-up-multiple-training-jobs
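
For reference, a job array with a concurrency limit of 1 queues the same script several times and runs the entries back to back, roughly like this (a sketch; the count of 10 is arbitrary):

# queue 10 copies of train.slurm; %1 allows only one array task to run at a time,
# so they execute one after another
sbatch --array=1-10%1 train.slurm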

Another common approach is to "daisy chain" jobs by having the job script submit another job that is dependent on itself. For example, in train.slurm you'd have a line like:

# when train.slurm executes, have it submit another job dependent on itself
sbatch --dependency=afterany:$SLURM_JOBID train.slurm

This is usually done near the top of the script, before the command that actually launches the run.
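
Put together, the top of such a self-resubmitting train.slurm might look roughly like this (a sketch; the #SBATCH settings are placeholders, not a recommendation):

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=1
#SBATCH --time=01:00:00

# re-submit this same script so the next run starts once this job finishes
sbatch --dependency=afterany:$SLURM_JOBID train.slurm

<<launch run>>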

One might also pair that with some logic to stop the chaining when the job is done. For example, the application or the user might touch a "run.done" file when it completes. Then the script can check for that file.

# exit right away if "run.done" file is detected
if [ -f run.done ] ; then
  exit 0
fi

# otherwise chain up another job
sbatch --dependency=afterany:$SLURM_JOBID train.slurm

# then launch the run
<<launch run>>

Additionally, one could check for the "run.done" file after the run and attempt to cancel any already daisy-chained job.
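
To do that, the id of the chained job needs to be captured at submission time, for example via sbatch's --parsable flag, which prints just the job id. A rough sketch combining it with the run.done check:

# chain up another job and remember its id (--parsable prints just the job id)
NEXT_JOBID=$(sbatch --parsable --dependency=afterany:$SLURM_JOBID train.slurm)

<<launch run>>

# if the run has fully finished, cancel the job that was already chained
if [ -f run.done ] ; then
  scancel $NEXT_JOBID
fi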

I don't have a list of pros/cons vs the job array, but it's one more method I see in practice.

Thank you for these suggestions, Adam.

--dependency is already covered in the SLURM guide, but you're making a good connection with the fault-tolerance section.

And for the file dropping, I used the kill switch concept, so it's already there.
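
For readers who haven't seen that section yet: the general idea is a pre-agreed file whose appearance tells the jobs to shut down. A minimal job-script-level sketch of the idea (the file path is just an example, not necessarily the book's exact mechanism):

# stop the chain on demand: an operator creates this pre-agreed file
KILL_SWITCH=./kill-switch
if [ -f "$KILL_SWITCH" ] ; then
  echo "kill switch detected - exiting without chaining another job"
  exit 0
fi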

So I combined both of your suggestions and pushed this:

a6e0f21

Closing this for now, but please don't hesitate to continue if more things can be improved.