andnp / PyExpUtils

Experiment utility code, specifically designed for use with Compute Canada.

Custom parallel job queue

andnp opened this issue · comments

GNU Parallel is no longer cutting it. A few issues:

  • Need to pass custom signals on to child processes. These signals are always getting eaten by parallel.
  • Need finer-grained control over ssh processes. Would love to avoid srun where possible, since interacting with the scheduler is extremely costly. Should be able to replace srun with custom ssh + environment build scripts.
  • Challenging to get homogeneity across different compute backends. Likely losing access to Compute Canada soon, but don't want these scripts to go to waste! Need some homogeneous way to make use of a Beowulf cluster.
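
For the first point, a minimal sketch of what forwarding signals to a child could look like in plain Python. This is not existing PyExpUtils code: `run_with_forwarding` is a hypothetical name, the choice of signals is an assumption, and it assumes a POSIX system (it relies on `SIGUSR1`).

```python
import signal
import subprocess


def run_with_forwarding(cmd, signals=(signal.SIGTERM, signal.SIGUSR1)):
    """Run cmd as a child process and relay the given signals to it.

    Hypothetical helper: must be called from the main thread, because
    Python only allows installing signal handlers there.
    """
    child = None

    def forward(signum, _frame):
        # Relay the signal only if the child exists and is still running.
        # A signal that arrives before the child is spawned is dropped.
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    # Install the relay, remembering the old handlers so they can be restored.
    previous = {s: signal.signal(s, forward) for s in signals}
    try:
        child = subprocess.Popen(cmd)
        # wait() resumes after each handled signal and returns the exit code
        # (negative signal number if the child was killed by a signal).
        return child.wait()
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler)


# Example (hypothetical script name):
#   exit_code = run_with_forwarding(["python", "experiment.py"])
```

The wrapper itself survives the signal and only passes it along, so the child decides how to react (e.g. checkpoint on `SIGUSR1`).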

Notes:

  • Need a way to get num_cpus and hostnames from Slurm when allocated across nodes.
  • Need a way to use MIG when available.
  • Need a way to batch jobs within a process (e.g. so one sub-process can handle a batch).
  • Nice-to-have: mark a parameter as "batchable". For instance, we can jax.vmap over stepsizes, but not neural net sizes. Would be nice to handle that internally, so that we can take advantage of vmap.
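
A rough sketch of how the last two notes might fit together: expand a sweep, then group the settings that differ only in parameters marked batchable, so one sub-process can receive a whole group. Everything here is an assumption for illustration (`expand`, `group_batchable`, `chunk`, and the sweep layout are hypothetical, not PyExpUtils API), and parameter values are assumed to be hashable.

```python
from itertools import product


def expand(sweep):
    """Yield every parameter setting in the sweep as a dict."""
    keys = list(sweep)
    for values in product(*(sweep[k] for k in keys)):
        yield dict(zip(keys, values))


def group_batchable(sweep, batchable):
    """Group settings that differ only in the batchable parameters."""
    groups = {}
    for params in expand(sweep):
        # The non-batchable values identify which group a setting belongs to.
        fixed = tuple((k, v) for k, v in params.items() if k not in batchable)
        groups.setdefault(fixed, []).append(params)
    return list(groups.values())


def chunk(items, size):
    """Split a list of jobs into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


sweep = {"stepsize": [0.1, 0.01, 0.001], "hidden_units": [32, 64]}
groups = group_batchable(sweep, batchable={"stepsize"})
# One group per value of hidden_units; each group holds all three stepsizes.
```

Each group shares its network size, so the stepsizes inside it could be stacked into an array and handed to `jax.vmap` by whatever runs the group; `chunk` covers the plain case of capping how many jobs one sub-process takes.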