dynamic rate limiting of job submissions?
BenWibking opened this issue · comments
On the cluster I'm using, there is a hard limit of 36 jobs per user that are running or pending in the SLURM queue.
However, I need to run a 200-parameter study. Is there any workaround for this other than splitting this large study up into studies of <= 36 parameters?
It would be ideal if it were possible for the conductor process to wait until jobs complete and then submit new jobs.
Following the route described in the docs here (https://maestrowf.readthedocs.io/en/latest/Maestro/how_to_guides/running_with_flux.html#launch-maestro-external-to-the-batch-jobflux-broker) seems like the best option for my use-case.
I've managed to install Flux via Spack on this cluster. The one remaining issue is that I have to wait until the SLURM job starts before I can do maestro run on the login node.
If I wanted to modify the Maestro conductor code so it polls SLURM to see whether the Flux broker job has started, where should I start to do that? Is this feasible?
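One possible shape for such a poll, assuming the Flux broker's SLURM job ID is known and squeue is on PATH (the helper names here are hypothetical, not part of Maestro's codebase):

```python
import subprocess
import time


def is_pending(state: str) -> bool:
    # SLURM reports 'PENDING' (long form) or 'PD' (short form) for queued jobs.
    return state.strip() in ("PENDING", "PD")


def slurm_job_state(jobid: str) -> str:
    # Query only the state column for this job; empty output means the job
    # is no longer in the queue (finished, cancelled, or unknown).
    result = subprocess.run(
        ["squeue", "-h", "-j", jobid, "-o", "%T"],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def wait_until_running(jobid: str, poll_interval_s: float = 30.0) -> None:
    # Block until the broker's batch job leaves the PENDING state; after
    # that, maestro run could be launched against the broker.
    while is_pending(slurm_job_state(jobid)):
        time.sleep(poll_interval_s)
```

This is only a sketch of the polling idea, not a patch against the conductor; hooking it into Maestro would presumably happen before the conductor starts dispatching steps.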
Hi @BenWibking -- one thing to note is that maestro run also has a throttle option; you could limit the jobs to 36 there. Do keep in mind that it is a universal limit shared between local and scheduled steps, so if you have a lot of local steps ahead of submitted steps, you will artificially limit yourself there.
Adding --throttle 36 solves the problem and works perfectly.
I was a bit thrown off by the wording in the documentation for the --throttle option. It might help to clarify that it refers to the total number of jobs in the (external, non-Maestro) scheduler queue (both running and pending), rather than only those that are actually executing.
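Under that reading, the free-slot accounting at each conductor pass can be sketched like this (a simplified model for illustration, not Maestro's actual implementation):

```python
def available_slots(throttle: int, running: int, pending: int, local: int) -> int:
    """Slots left under --throttle, assuming everything in flight counts:
    scheduler jobs that are RUNNING, scheduler jobs still PENDING in the
    queue, and locally executing steps."""
    in_flight = running + pending + local
    return max(throttle - in_flight, 0)
```

So with --throttle 36 and 36 jobs sitting in the SLURM queue, the conductor would report 0 available slots even if none of those jobs had started executing yet.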
I checked the status of this study today and it seems to have stopped submitting new jobs to SLURM. maestro status reports that several dozen steps are PENDING and dozens more are INITIALIZED, but nothing is in the SLURM queue. Maybe this is related to #441?
The last log entry is:
2024-05-05 11:40:01,492 - maestrowf.conductor:monitor_study:349 - INFO - Checking DAG status at 2024-05-05 11:40:01.492025
2024-05-05 11:40:01,597 - maestrowf.datastructures.core.executiongraph:check_study_status:963 - INFO - Jobs found for user 'bwibking'.
2024-05-05 11:40:01,598 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:916 - INFO - Found 0 available slots...
The full log for this study is here:
medres_compressive.log.zip
The conductor process for this is still running:
login4.stampede3(1011)$ ps aux | grep $USER
bwibking 3092832 0.0 0.0 20660 11896 ? Ss May04 0:01 /usr/lib/systemd/systemd --user
bwibking 3092835 0.0 0.0 202568 6948 ? S May04 0:00 (sd-pam)
bwibking 3093950 0.0 0.0 7264 3472 ? S May04 0:00 /bin/sh -c nohup conductor -t 60 -d 2 /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240504-193253 > /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240504-193253/medres_compressive.txt 2>&1
bwibking 3093951 0.2 0.0 328808 72948 ? S May04 2:17 /scratch/projects/compilers/intel24.0/oneapi/intelpython/python3.9/bin/python3.9 /home1/02661/bwibking/.local/bin/conductor -t 60 -d 2 /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240504-193253
root 3993349 0.0 0.0 39960 12012 ? Ss 11:32 0:00 sshd: bwibking [priv]
bwibking 3993762 0.0 0.0 40144 7516 ? S 11:33 0:00 sshd: bwibking@pts/73
bwibking 3993765 0.0 0.0 18048 6128 pts/73 Ss 11:33 0:00 -bash
bwibking 3998925 0.0 0.0 19236 3652 pts/73 R+ 11:39 0:00 ps aux
bwibking 3998926 0.0 0.0 6432 2336 pts/73 S+ 11:39 0:00 grep --color=auto bwibking
This seems to reliably happen for studies that I run on this machine.
This issue seems to be the same as #441, and that has more informative logs, so I'll close this.