jrasell / sherpa

Sherpa is a highly available, fast, and flexible horizontal job scaler for HashiCorp Nomad. It can run in a number of different modes to suit different requirements, and can scale based on Nomad resource metrics or external sources.

job with multiple groups problem

numiralofe opened this issue

hi All,

Env: sherpa: 0.2.0 / nomad 0.9.5

Problem: I have a Nomad job with multiple groups defined: some carry Sherpa meta to enable scaling policies, and others have no scaling policy. The Sherpa policies are defined inside the group block.

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
dynamic     true         1        1       0        1          2019-10-11T10:24:38Z
static      true         1        1       0        1          2019-10-11T10:24:38Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
82adab39  7144d68b  dynamic     0        run      running  2m44s ago  3s ago
f2b5b883  8efd65cc  static      0        run      running  2m44s ago  3s ago

If I curl the API, I can see that the scaling policy for the dynamic group is properly set:

curl localhost:9000/v1/policies
{"jobcheckout":{"dynamic":{"Enabled":true,"MinCount":1,"MaxCount":4,"ScaleOutCount":1,"ScaleInCount":1,"ScaleOutCPUPercentageThreshold":85,"ScaleOutMemoryPercentageThreshold":85,"ScaleInCPUPercentageThreshold":30,"ScaleInMemoryPercentageThreshold":30}}}
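For anyone checking the same thing programmatically, here is a minimal Go sketch that decodes the response above and reports which task groups of a job have no policy. The struct fields are copied from the JSON shown in this issue; the type and function names (`GroupPolicy`, `Policies`, `parsePolicies`, `groupsWithoutPolicy`) are made up for illustration and are not part of Sherpa's API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GroupPolicy mirrors the fields visible in the /v1/policies response
// quoted in this issue. The type name is hypothetical.
type GroupPolicy struct {
	Enabled                           bool
	MinCount                          int
	MaxCount                          int
	ScaleOutCount                     int
	ScaleInCount                      int
	ScaleOutCPUPercentageThreshold    int
	ScaleOutMemoryPercentageThreshold int
	ScaleInCPUPercentageThreshold     int
	ScaleInMemoryPercentageThreshold  int
}

// Policies maps job name -> task group name -> policy.
type Policies map[string]map[string]GroupPolicy

// parsePolicies decodes a /v1/policies response body.
func parsePolicies(body []byte) (Policies, error) {
	var p Policies
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, err
	}
	return p, nil
}

// groupsWithoutPolicy returns the entries of groups that have no
// policy registered under job.
func groupsWithoutPolicy(p Policies, job string, groups []string) []string {
	var missing []string
	for _, g := range groups {
		if _, ok := p[job][g]; !ok {
			missing = append(missing, g)
		}
	}
	return missing
}

// sample is the response body pasted in this issue.
const sample = `{"jobcheckout":{"dynamic":{"Enabled":true,"MinCount":1,"MaxCount":4,"ScaleOutCount":1,"ScaleInCount":1,"ScaleOutCPUPercentageThreshold":85,"ScaleOutMemoryPercentageThreshold":85,"ScaleInCPUPercentageThreshold":30,"ScaleInMemoryPercentageThreshold":30}}}`

func main() {
	p, err := parsePolicies([]byte(sample))
	if err != nil {
		panic(err)
	}
	// Group names taken from the deployment output in this issue.
	fmt.Println(groupsWithoutPolicy(p, "jobcheckout", []string{"dynamic", "static"}))
}
```

With the data from this issue, only the `static` group is reported as having no policy, which matches the job layout described above.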

but after putting some load on the job, I am getting the following in the Sherpa logs:

{"time":"2019-10-11T10:28:36.350750688Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
{"time":"2019-10-11T10:28:36.350815881Z","message":"worker with func exits from panic: goroutine 575 [running]:\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1.1(0xc0001ca600)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:59 +0x13e\npanic(0x9a5b40, 0xf1f150)\n\t/home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).getJobAllocations(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500, 0x0, 0x405585, 0x42db9c, 0xc0001435f8, 0x0, 0xc0001f09a0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:116 +0x174\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).autoscaleJob(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:11 +0x6a\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).workerPoolFunc.func1(0x9474c0, 0xc0001887e0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/handler.go:185 +0xe8\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc0001ca600)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:71 +0xb3\ncreated by github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:49 +0x4d\n"}
{"level":"debug","time":"2019-10-11T10:29:14.365681892Z","message":"meta watcher last index has not changed"}
{"level":"debug","time":"2019-10-11T10:29:17.735299209Z","message":"deployment watcher last index has not changed"}
{"time":"2019-10-11T10:29:36.351848672Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
{"time":"2019-10-11T10:29:36.351949427Z","message":"worker with func exits from panic: goroutine 612 [running]:\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1.1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:59 +0x13e\npanic(0x9a5b40, 0xf1f150)\n\t/home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).getJobAllocations(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500, 0x2, 0x405585, 0x42db9c, 0xc00013f5f8, 0x0, 0xb13ee0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:116 +0x174\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).autoscaleJob(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:11 +0x6a\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).workerPoolFunc.func1(0x9474c0, 0xc0002f4d80)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/handler.go:185 +0xe8\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:71 +0xb3\ncreated by github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:49 +0x4d\n"}
{"time":"2019-10-11T10:30:36.352331770Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
{"time":"2019-10-11T10:30:36.352406052Z","message":"worker with func exits from panic: goroutine 639 [running]:\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1.1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:59 +0x13e\npanic(0x9a5b40, 0xf1f150)\n\t/home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).getJobAllocations(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500, 0x2, 0x0, 0x0, 0x0, 0x0, 0xb13ee0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:116 +0x174\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).autoscaleJob(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:11 +0x6a\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).workerPoolFunc.func1(0x9474c0, 0xc0001e1580)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/handler.go:185 +0xe8\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:71 +0xb3\ncreated by github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:49 +0x4d\n"}
{"time":"2019-10-11T10:31:36.350696901Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
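The trace only shows that the nil pointer dereference happens inside `getJobAllocations`; it does not show the cause. As a purely illustrative sketch (an assumption, not Sherpa's actual code), one common way this kind of panic arises in Go is looking up a per-group policy in a map of pointers and dereferencing the result for a group that has no entry. All types and names below are hypothetical; the guard simply skips groups without a policy.

```go
package main

import "fmt"

// policy is a hypothetical stand-in for a per-group scaling policy.
type policy struct {
	MinCount, MaxCount int
}

// allocation is a hypothetical stand-in for a Nomad allocation.
type allocation struct {
	ID, TaskGroup string
}

// scalableAllocs groups allocation IDs by task group, keeping only groups
// that have a scaling policy. Without the ok/nil check, reading a field of
// policies[a.TaskGroup] for an unknown group would dereference a nil
// pointer and panic.
func scalableAllocs(allocs []allocation, policies map[string]*policy) map[string][]string {
	out := make(map[string][]string)
	for _, a := range allocs {
		p, ok := policies[a.TaskGroup]
		if !ok || p == nil {
			continue // group has no scaling policy; nothing to evaluate
		}
		out[a.TaskGroup] = append(out[a.TaskGroup], a.ID)
	}
	return out
}

func main() {
	// Only the "dynamic" group carries a policy, as in this issue.
	policies := map[string]*policy{"dynamic": {MinCount: 1, MaxCount: 4}}
	allocs := []allocation{
		{ID: "82adab39", TaskGroup: "dynamic"},
		{ID: "f2b5b883", TaskGroup: "static"},
	}
	fmt.Println(scalableAllocs(allocs, policies))
}
```

Here the `static` allocation is skipped rather than causing a panic, and only the `dynamic` allocation is returned for evaluation.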

@numiralofe thanks for the report, I will look into this as soon as possible.

hi,

Thanks for the update :) I updated to 0.2.1, but when I ran the tests again I found the same problem :( With logs at debug level I no longer see any panic error message, but for jobs with more than one group, the groups that have the Sherpa scaling metadata are still not scaling. Just for sanity, I confirmed that going back to a single group works as expected.

Thanks

@numiralofe are you able to provide an example job file to reproduce this? The trace included in the issue pointed to the exact code changed, so I am keen to learn where this stems from and fix it once and for all for you.

You say you're not seeing the panic, which is one problem solved, but now you're not seeing scaling when there should be an event? If you have any debug logs around this period that would also help.

@jrasell I am really sorry for my mistake, and you are absolutely right :)

I did one more test and it works as expected, so the bug is fixed. I think that when I was looking at the WebUI and the logs, I somehow got confused and missed the scaling event.

As a result, I have also submitted a small pull request adding that info to the WebUI. I think that from the operator's perspective it's really useful to see in the WebUI which direction the event took.

Thanks once again and sorry for the confusion.

@numiralofe I appreciate you checking; it's rare in OSS to get this kind of follow-up, so thanks, and I am happy we managed to get the issue sorted. Thanks also for the PR, I'll get onto that now!

@jrasell no worries :) I am going to use Sherpa in a prd setup. Also, I am from "the time" when "giving back" to the community and improving things yourself was a regular practice :)