SGE Alarm State

Question

briandoconnor opened this issue 10 years ago · comments

Hi Amish,

Can you add the following to your bug queue? :-)

SGE will put nodes in an alarm state if the load on the system is too high, and I think the load limit will be something like the number of cores. You will probably want to adjust this to about 1.2x the core count, or some other value that lets the system work pretty hard (but not excessively). The reason I bring this up is that I’m seeing the VirtualBox VM go into alarm state:

  Every 2.0s: qstat -f                          Thu Jul 17 23:36:01 2014

  queuename        qtype  resv/used/tot.  load_avg  arch        states
  main.q@master    BIP    0/0/2           5.94      lx26-amd64  a

  PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
  ############################################################################
  27  0.50000  bwa_mem_1_  vagrant  qw  07/17/2014 23:16:03  1

And this means subsequent jobs will wait until the load average falls. This could delay our workflows if SGE constantly does this and could, ultimately, result in a lot of idle computers waiting for their load to fall.
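To make the numbers above concrete, here is a rough sketch of the check that appears to be tripping. It assumes the queue's limit is expressed as `np_load_avg` (load average divided by core count), that the stock threshold is 1.75, and that the VM has 2 cores (the qstat output only shows 2 slots). The helper names are made up for illustration; the real values can be read with `qconf -sq main.q`.

```ruby
# Sketch only: assumes the alarm is driven by np_load_avg, i.e. the raw
# load average divided by the number of cores on the host.
DEFAULT_NP_LOAD_AVG = 1.75 # assumed stock queue setting, not read from SGE

# Normalize a raw load average by the core count.
def np_load_avg(load_avg, cores)
  load_avg / cores.to_f
end

# True when the normalized load exceeds the threshold, which is when the
# queue would be expected to show the "a" (alarm) state in qstat -f.
def alarm?(load_avg, cores, threshold = DEFAULT_NP_LOAD_AVG)
  np_load_avg(load_avg, cores) > threshold
end

# Values taken from the qstat output above; 2 cores is an assumption.
puts np_load_avg(5.94, 2).round(2) # 2.97
puts alarm?(5.94, 2)               # true
puts alarm?(2.0, 2)                # false
```

Under these assumptions a load of 5.94 on 2 cores is well past either 1.75 or 1.2 per core, so raising the threshold slightly would probably not have cleared this particular alarm.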

Anyway, this is something to look into… perhaps it’s best to just make it a param so we can tweak as needed. Let me know if you have questions.
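One way the "make it a param" idea could look is sketched below. It assumes the limit lives in the queue's `load_thresholds` entry as `np_load_avg` and that `qconf -mattr` is an acceptable way to change it; the helper name and the `SGE_NP_LOAD_AVG` environment variable are hypothetical, not something that exists in the project.

```ruby
# Hypothetical helper: builds the qconf call that sets a queue's load
# threshold. Assumes the limit is the np_load_avg entry of load_thresholds.
def load_threshold_command(queue:, np_load_avg:)
  value = Float(np_load_avg) # raises ArgumentError on non-numeric input
  raise ArgumentError, "threshold must be positive" unless value.positive?

  "qconf -mattr queue load_thresholds np_load_avg=#{value} #{queue}"
end

# SGE_NP_LOAD_AVG is a made-up variable name used as the tunable here;
# 1.2 is the starting value suggested above.
threshold = ENV.fetch("SGE_NP_LOAD_AVG", "1.2")
puts load_threshold_command(queue: "main.q", np_load_avg: threshold)
```

In a Vagrantfile the resulting string could be handed to a shell provisioner, for example `config.vm.provision "shell", inline: load_threshold_command(...)`, assuming `qconf` is on the PATH and the provisioning user is allowed to modify the queue.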

Brian

Amish Patel · Answer 1 · Fri Jul 18 2014 21:12:15 GMT+0800 (China Standard Time)

Isn't it normal to have the subsequent jobs wait when a node is under a lot of load? I looked into this a little and found information like this: http://www.rocksclusters.org/roll-documentation/sge/4.2.1/monitoring-sge.html

I don't have the required knowledge to understand how severe this problem is. Do you want me to look into it a bit more? Also, for increasing the cores, is it SGE related or do I need to tweak it in the Vagrantfile or something? I am working on the other SGE issue right now, and then I will hop onto this one.

Denis Yuen · Answer 2 · Thu Sep 11 2014 22:24:08 GMT+0800 (China Standard Time)

I think it actually makes sense for the load average target to stay at 1*(# of cores).
A higher load average is actually less efficient.
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

Stuart Watt · Answer 3 · Thu Sep 11 2014 23:07:59 GMT+0800 (China Standard Time)

Magic trick I just discovered that stops VirtualBox I/O disasters, FYI:

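  Vagrant.configure("2") do |config|
    config.vm.provider "virtualbox" do |v|
      v.memory = 2048
      v.cpus = 2
      # Disable the host I/O cache; the controller name must match the box's storage controller
      v.customize ["storagectl", :id, "--name", "SATA Controller", "--hostiocache", "off"]
    end
  end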

Turning off hostiocache can make a huge difference for VirtualBox. Without this setting, I/O can get to a level where it kills the system, even without mucking about beyond a single core. I saw this with database restores, for example.