Some feedback and outline of possible next steps
willirath opened this issue · comments
I've been playing with this over in willirath/dask_jobqueue_workshop_materials#6 and can give some feedback.
- Thanks for kicking this off, @lesteve! We're just a few steps away from running no-requirements training on dask jobqueue now.
- This works on Pangeo binder! As Pangeo's binder (currently) offers more resources to the user, we can do meaningful computations with a SLURM cluster running on the same VM as the notebook server.
- Make the whole `slurm.conf` part of the repo and explicitly `COPY` it to the Docker image. This way, it might be easier to have a SLURM admin chime in and help.
- In a final setting, this would lead to a look-and-feel similar to the https://examples.dask.org binder (labview plugin and jupyter_server_plugin).
- Could use help from somebody experienced with administering a SLURM setup:
  - Jobs don't seem to stop (or are very slow to stop) when I kill the scheduler.
  - After canceling jobs, new jobs don't start, with `FrontEndDown` being given as the reason.
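
To make the second SLURM point easier to report to an admin, something like the following could summarize why jobs are stuck. This is only a sketch: the function names (`parse_squeue`, `pending_reasons`, `query_squeue`) and the `SAMPLE` text are made up for illustration, and it assumes `squeue` is on the `PATH` and accepts `-h` (no header) and `-o` with the `%i` (job ID), `%T` (state), and `%r` (reason) fields.

```python
import subprocess
from collections import Counter


def parse_squeue(text):
    """Parse `squeue -h -o '%i|%T|%r'` output into (job_id, state, reason) tuples."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two separators only, in case the reason contains "|".
        job_id, state, reason = line.split("|", 2)
        jobs.append((job_id, state, reason))
    return jobs


def pending_reasons(jobs):
    """Count the reasons SLURM reports for jobs that are still pending."""
    return Counter(reason for _, state, reason in jobs if state == "PENDING")


def query_squeue():
    """Run squeue and return the parsed job list (needs a working SLURM setup)."""
    result = subprocess.run(
        ["squeue", "-h", "-o", "%i|%T|%r"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue(result.stdout)


# Hypothetical squeue output, used only to show the expected input format.
SAMPLE = "12|RUNNING|None\n13|PENDING|FrontEndDown\n14|PENDING|FrontEndDown\n"

if __name__ == "__main__":
    print(pending_reasons(parse_squeue(SAMPLE)))
```

On the binder itself, calling `pending_reasons(query_squeue())` after canceling jobs should show whether every pending job reports `FrontEndDown` or only some of them.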