job with multiple groups problem
numiralofe opened this issue
Hi all,
Env: Sherpa 0.2.0 / Nomad 0.9.5
Problem: I have a Nomad job with multiple groups defined, some with Sherpa meta to enable scaling policies and other groups without any scaling policy. The Sherpa policies are defined inside the group block.
Deployed
Task Group Auto Revert Desired Placed Healthy Unhealthy Progress Deadline
dynamic true 1 1 0 1 2019-10-11T10:24:38Z
static true 1 1 0 1 2019-10-11T10:24:38Z
Allocations
ID Node ID Task Group Version Desired Status Created Modified
82adab39 7144d68b dynamic 0 run running 2m44s ago 3s ago
f2b5b883 8efd65cc static 0 run running 2m44s ago 3s ago
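For context, the job looks roughly like the sketch below. This is illustrative only, not the exact job file: the task details are placeholders, and the `sherpa_*` meta keys are my understanding of Sherpa's meta-policy names, so check the Sherpa docs for the exact spelling.

```hcl
job "jobcheckout" {
  datacenters = ["dc1"]

  # Group with Sherpa scaling meta (should scale).
  group "dynamic" {
    count = 1

    meta {
      # Illustrative Sherpa policy keys; verify names against the Sherpa docs.
      sherpa_enabled   = true
      sherpa_min_count = 1
      sherpa_max_count = 4
    }

    task "app" {
      driver = "docker"
      config {
        image = "example/app:latest"
      }
    }
  }

  # Group with no Sherpa meta at all (should be left alone).
  group "static" {
    count = 1

    task "app" {
      driver = "docker"
      config {
        image = "example/app:latest"
      }
    }
  }
}
```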
If I curl the API I can see that the scaling policy for the dynamic group is properly set:
curl localhost:9000/v1/policies
{"jobcheckout":{"dynamic":{"Enabled":true,"MinCount":1,"MaxCount":4,"ScaleOutCount":1,"ScaleInCount":1,"ScaleOutCPUPercentageThreshold":85,"ScaleOutMemoryPercentageThreshold":85,"ScaleInCPUPercentageThreshold":30,"ScaleInMemoryPercentageThreshold":30}}}
But after putting some load on the job, I am getting the following in the Sherpa logs:
{"time":"2019-10-11T10:28:36.350750688Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
{"time":"2019-10-11T10:28:36.350815881Z","message":"worker with func exits from panic: goroutine 575 [running]:\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1.1(0xc0001ca600)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:59 +0x13e\npanic(0x9a5b40, 0xf1f150)\n\t/home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).getJobAllocations(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500, 0x0, 0x405585, 0x42db9c, 0xc0001435f8, 0x0, 0xc0001f09a0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:116 +0x174\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).autoscaleJob(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:11 +0x6a\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).workerPoolFunc.func1(0x9474c0, 0xc0001887e0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/handler.go:185 +0xe8\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc0001ca600)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:71 +0xb3\ncreated by github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:49 +0x4d\n"}
{"level":"debug","time":"2019-10-11T10:29:14.365681892Z","message":"meta watcher last index has not changed"}
{"level":"debug","time":"2019-10-11T10:29:17.735299209Z","message":"deployment watcher last index has not changed"}
{"time":"2019-10-11T10:29:36.351848672Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
{"time":"2019-10-11T10:29:36.351949427Z","message":"worker with func exits from panic: goroutine 612 [running]:\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1.1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:59 +0x13e\npanic(0x9a5b40, 0xf1f150)\n\t/home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).getJobAllocations(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500, 0x2, 0x405585, 0x42db9c, 0xc00013f5f8, 0x0, 0xb13ee0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:116 +0x174\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).autoscaleJob(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:11 +0x6a\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).workerPoolFunc.func1(0x9474c0, 0xc0002f4d80)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/handler.go:185 +0xe8\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:71 +0xb3\ncreated by github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:49 +0x4d\n"}
{"time":"2019-10-11T10:30:36.352331770Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
{"time":"2019-10-11T10:30:36.352406052Z","message":"worker with func exits from panic: goroutine 639 [running]:\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1.1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:59 +0x13e\npanic(0x9a5b40, 0xf1f150)\n\t/home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).getJobAllocations(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500, 0x2, 0x0, 0x0, 0x0, 0x0, 0xb13ee0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:116 +0x174\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).autoscaleJob(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:11 +0x6a\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).workerPoolFunc.func1(0x9474c0, 0xc0001e1580)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/handler.go:185 +0xe8\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:71 +0xb3\ncreated by github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:49 +0x4d\n"}
{"time":"2019-10-11T10:31:36.350696901Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
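The trace points at a nil pointer dereference in `getJobAllocations` when a job mixes groups with and without policies. As a purely hypothetical illustration (this is not Sherpa's actual code, and `GroupScalingPolicy`/`describePolicy` are invented names), a lookup of a per-group policy map can panic exactly this way when one group has no entry, unless the nil value is checked:

```go
package main

import "fmt"

// GroupScalingPolicy is an illustrative stand-in for Sherpa's internal
// policy type; the real type in github.com/jrasell/sherpa differs.
type GroupScalingPolicy struct {
	Enabled  bool
	MinCount int
	MaxCount int
}

// describePolicy looks up a group's policy defensively. Dereferencing the
// map value without the nil check would panic for a group that carries no
// Sherpa meta, which is the kind of failure the trace above reports.
func describePolicy(policies map[string]*GroupScalingPolicy, group string) string {
	p := policies[group] // nil when the group has no scaling policy
	if p == nil {
		return group + ": no scaling policy, skipping"
	}
	return fmt.Sprintf("%s: enabled=%v min=%d max=%d", group, p.Enabled, p.MinCount, p.MaxCount)
}

func main() {
	// Only "dynamic" carries a policy, mirroring the job in this issue.
	policies := map[string]*GroupScalingPolicy{
		"dynamic": {Enabled: true, MinCount: 1, MaxCount: 4},
	}
	for _, g := range []string{"dynamic", "static"} {
		fmt.Println(describePolicy(policies, g))
	}
}
```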
@numiralofe thanks for the report, I will look into this as soon as possible.
Hi,
Thanks for the update :) I updated to 0.2.1, but running the tests again I am hitting the same problem :( With logs at debug I no longer see any panic message, but for jobs with more than one group, the groups that have the Sherpa scaling metadata in Nomad are still not scaling. Just for sanity, I confirmed that going back to a single group works as expected.
Thanks
@numiralofe are you able to provide an example job file to reproduce this? The trace included in the issue pointed to the exact code changed, so I am keen to learn where this stems from and fix it once and for all for you.
You say you're no longer seeing the panic, which is one problem solved, but now you're not seeing scaling when there should be an event? If you have any debug logs from around that period, they would also help.
@jrasell I am really sorry for my mistake; you are absolutely right :)
I did one more test and it works as expected, so the bug is fixed. I think that when I was looking at the WebUI and the logs I somehow got confused and missed the scaling event.
As a result I have also submitted a small pull request adding that info to the WebUI; from the operator's perspective it's really useful to see in the WebUI which direction a scaling event took.
Thanks once again, and sorry for the confusion.
@numiralofe I appreciate you checking; it's rare in OSS to get this kind of follow-up, so thanks, and I am happy we managed to get the issue sorted. Thanks also for the PR, I'll get onto that now!
@jrasell no worries :) I am going to use Sherpa in a production setup, and I am from "the time" when "giving back" to the community and improving it yourself was a regular practice :)