OSC / ood_core

Open OnDemand core library

Home Page: https://osc.github.io/ood_core/

LSF bjobs for all users hangs

vallerul opened this issue

When showing jobs for all users, the Active Jobs app hangs if LSF is running thousands of jobs and keeps job history for days instead of hours. CLEAN_PERIOD in the LSF configuration controls how much data bjobs retrieves. CLEAN_PERIOD is usually a day, but increasing it to 3 days caused an indefinite hang.
The cause appears to be the bjobs arguments in lib/ood_core/job/adapters/lsf/batch.rb:

def get_jobs_for_user(user)
  args = %W( -u #{user} -a -w -W )
  parse_bjobs_output(call("bjobs", *args))
end

Running bjobs -u all -a -w -W is very resource-intensive when thousands of jobs are scheduled, and it can take almost forever to return.

I had to make the following change (removing -a) to get it to respond:

def get_jobs_for_user(user)
  args = %W( -u #{user} -w -W )
  parse_bjobs_output(call("bjobs", *args))
end

It would be good to make this configurable rather than requiring a change in the code.

Thanks. Looks like it's here

args = %W( -u #{user} -a -w -W )

and here

args = %W( -a -w -W #{id.to_s} )

Does bjobs respond to the environment variable CLEAN_PERIOD?

As far as I remember, it cannot be used as an environment variable.
CLEAN_PERIOD is part of the lsb.params configuration file and is usually set as part of scheduler policies.

https://www.ibm.com/support/pages/how-increase-default-retention-job-information-lsf-memory.

hmmm ok. yea it seems like we could default to false (not using the flag) and folks can enable it if they choose.
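
A rough sketch of what that opt-in could look like, for discussion only. This is not the actual ood_core code: the class name BjobsArgs and the option name include_finished are made up for illustration, and how the option would be read from the cluster config is left out. It only shows the argument lists for the two call sites mentioned above, with -a omitted unless the site enables it.

```ruby
# Hypothetical sketch, not the real ood_core implementation.
# `BjobsArgs` and `include_finished` are assumed names for illustration.
# The option defaults to false, so bjobs is called without -a unless a
# site explicitly opts in.
class BjobsArgs
  def initialize(include_finished: false)
    @include_finished = include_finished
  end

  # Arguments for listing one user's jobs ("all" for every user).
  def for_user(user)
    args = ["-u", user.to_s]
    args << "-a" if @include_finished
    args + ["-w", "-W"]
  end

  # Arguments for looking up a single job by id.
  def for_job(id)
    args = []
    args << "-a" if @include_finished
    args + ["-w", "-W", id.to_s]
  end
end

# Default: no -a, matching the workaround in the original report.
default_args = BjobsArgs.new.for_user("all")

# Opt-in: restores the current behaviour for sites that want it.
opt_in_args = BjobsArgs.new(include_finished: true).for_user("all")
```

Whether the single-job lookup should follow the same setting or always keep -a is an open question this sketch does not settle.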

I can't recall what Torque did, but Slurm doesn't keep job info around for very long.