Custom parallel job queue
andnp opened this issue · comments
GNU parallel is no longer cutting it. A few issues:
- Need to pass custom signals on to children. These signals are always getting eaten by parallel.
- Need finer-grained control over ssh processes. Would love to avoid `srun` where possible due to the extreme cost of interacting with the scheduler. Should be able to replace `srun` with custom ssh + environment build scripts.
- Challenging to have homogeneity across different compute backends. Likely losing access to Compute Canada soon, but don't want these scripts to go to waste! Need some homogeneous way to make use of a Beowulf cluster.
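
For the signal issue, a rough sketch of what the launcher could do: start the child directly and relay a chosen set of signals to it. This is only an illustration, not a settled design. The function name `run_with_forwarding` and the default signal set are assumptions, and it has to be called from the main thread because Python only allows installing handlers there.

```python
import signal
import subprocess


def run_with_forwarding(cmd, signals=(signal.SIGUSR1, signal.SIGUSR2, signal.SIGTERM)):
    """Run `cmd` as a child process and relay each signal in `signals` to it.

    Hypothetical helper: returns the child's exit code.
    """
    child = subprocess.Popen(cmd)

    def relay(signum, _frame):
        # Only relay while the child has not been reaped yet.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the relay handler, remembering what was there before.
    previous = {sig: signal.signal(sig, relay) for sig in signals}
    try:
        # wait() resumes automatically after a handler has run.
        return child.wait()
    finally:
        # Restore the earlier handlers (None means "not set from Python").
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)
```

Something like `run_with_forwarding(["python", "worker.py"])` would then pass a `SIGUSR1` sent to the launcher on to the worker (`worker.py` is a placeholder). Signals that arrive in the short window before the handlers are installed are not relayed, so a real implementation would want to close that gap.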
Notes:
- Need a way to get num_cpus and hostnames from slurm when allocated across nodes.
- Need a way to use MIG when available.
- Need a way to batch jobs within a process (e.g. so one sub-process can handle a batch)
- Nice-to-have: mark a parameter as "batchable". For instance, we can `jax.vmap` over stepsizes, but not neural net sizes. Would be nice to handle that internally, so that we can take advantage of vmap.
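
One possible shape for the "batchable" idea, as a sketch only: group the sweep's configurations by everything that is not batchable, so each group becomes a single job that receives a list of values for the batchable parameters. The name `group_batchable` and the dict-based config format are assumptions made for illustration. It also assumes the non-batchable values are hashable and that every config carries the same keys.

```python
from collections import defaultdict


def group_batchable(configs, batchable):
    """Group configs that agree on every non-batchable parameter.

    Returns a list of (fixed, batched) pairs, where `fixed` holds the shared
    parameters and `batched` maps each batchable name to a list of values.
    """
    groups = defaultdict(list)
    for cfg in configs:
        # Key on the non-batchable items, ordered by name so that the key
        # does not depend on dict insertion order.
        key = tuple(sorted(
            ((k, v) for k, v in cfg.items() if k not in batchable),
            key=lambda kv: kv[0],
        ))
        groups[key].append(cfg)

    jobs = []
    for key, members in groups.items():
        fixed = dict(key)
        names = [k for k in members[0] if k in batchable]
        batched = {k: [m[k] for m in members] for k in names}
        jobs.append((fixed, batched))
    return jobs


if __name__ == "__main__":
    sweep = [{"stepsize": s, "hidden": h} for s in (0.1, 0.01) for h in (32, 64)]
    for fixed, batched in group_batchable(sweep, {"stepsize"}):
        print(fixed, batched)
```

Each `batched` list could then be turned into an array and mapped over with `jax.vmap` inside the worker, while `fixed` (e.g. the network size) stays constant for that process.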