LSF bjobs for all users hangs
vallerul opened this issue · comments
The Active Jobs app hangs when showing jobs for all users if LSF is running thousands of jobs and keeps its active job history for days instead of hours. The CLEAN_PERIOD parameter in the LSF configuration controls how much data bjobs retrieves. CLEAN_PERIOD is usually a day, but when it was increased to 3 days, the app hung indefinitely.
The cause appears to be the bjobs arguments in lib/ood_core/job/adapters/lsf/batch.rb:
    def get_jobs_for_user(user)
      args = %W( -u #{user} -a -w -W )
      parse_bjobs_output(call("bjobs", *args))
    end
Running bjobs -u all -a -w -W is very resource intensive when thousands of jobs are scheduled, and it can take almost forever to return.
I had to make the following change (removing -a) to get it to respond:
    def get_jobs_for_user(user)
      args = %W( -u #{user} -w -W )
      parse_bjobs_output(call("bjobs", *args))
    end
It would be good to make this configurable, rather than requiring a change in the code.
Thanks. Looks like it's here
and here
Does bjobs respond to the environment variable CLEAN_PERIOD?
As far as I remember, it cannot be used as an environment variable.
CLEAN_PERIOD is part of the lsb.params configuration file, and is usually set as part of scheduler policies.
https://www.ibm.com/support/pages/how-increase-default-retention-job-information-lsf-memory.
hmmm ok. yea it seems like we could default to false (not using the flag) and folks can enable it if they choose.
I can't recall what Torque did, but Slurm doesn't keep job info around for very long.
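For illustration, the opt-in idea discussed here (leave out -a by default, let sites turn it on) might look roughly like the sketch below. This is not the actual ood_core code: the class name `BjobsArgs` and the option name `include_finished` are made up for the example, and how the setting would reach the adapter from a cluster config is left open.

```ruby
# Hypothetical sketch, not the real ood_core adapter.
# `BjobsArgs` and `include_finished` are assumed names for illustration only.
class BjobsArgs
  # include_finished: when true, pass `-a` so bjobs also reports recently
  # finished jobs; defaults to false so large sites avoid the expensive query.
  def initialize(include_finished: false)
    @include_finished = include_finished
  end

  # Build the bjobs argument list for the given user (e.g. "all").
  def for_user(user)
    args = ["-u", user.to_s]
    args << "-a" if @include_finished
    args + ["-w", "-W"]
  end
end

# Default: no -a flag.
puts BjobsArgs.new.for_user("all").join(" ")
# Opt in to the previous behaviour.
puts BjobsArgs.new(include_finished: true).for_user("all").join(" ")
```

With something like this, get_jobs_for_user could pass the resulting array to call("bjobs", *args) as before, and only sites with a short CLEAN_PERIOD would need to enable the flag.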