Workers stuck
d1slike opened this issue
Context
- github.com/gocraft/work v0.5.1
- github.com/gomodule/redigo v1.8.4
- Several (5-6) workers in pods
- Max concurrency = 200
Expected behavior
- Workers process jobs; the job queue stays around zero.
Current behavior
- All worker processes are alive, but none of them processes jobs.
- The job queue starts growing.
- The workers never start processing again on their own.
- It can only be fixed by flushing the Redis DB.
- The value of the lock at this moment is much greater than the max concurrency (workers:jobs:process:lock = 20519 > 200).
How to reproduce
- Prepare a small app:

package main

import (
    "fmt"
    "log"
    "os"
    "os/signal"
    "time"

    "github.com/gocraft/work"
    "github.com/gomodule/redigo/redis"
)

var redisPool = &redis.Pool{
    MaxActive: 5,
    MaxIdle:   5,
    Wait:      true,
    Dial: func() (redis.Conn, error) {
        return redis.Dial("tcp", ":6379", redis.DialDatabase(3))
    },
}

const (
    namespace   = "qwe"
    jobname     = "asd"
    concurrency = 10
)

var enqueuer = work.NewEnqueuer(namespace, redisPool)

type ctx struct{}

// do re-enqueues the job after a short sleep, so the queue never drains
// and the workers keep acquiring and releasing the concurrency lock.
func (*ctx) do(_ *work.Job) error {
    fmt.Println("i'm alive", time.Now())
    time.Sleep(time.Millisecond * 5)
    _, err := enqueuer.Enqueue(jobname, nil)
    return err
}

func main() {
    // Seed the queue with more jobs than there are worker goroutines.
    for i := 0; i < concurrency*2; i++ {
        _, err := enqueuer.Enqueue(jobname, nil)
        if err != nil {
            log.Fatal(err)
        }
    }

    work.
        NewWorkerPool(ctx{}, concurrency, namespace, redisPool).
        JobWithOptions(jobname, work.JobOptions{MaxConcurrency: 200}, (*ctx).do).
        Start()

    // Block until interrupted.
    signalChan := make(chan os.Signal, 1)
    signal.Notify(signalChan, os.Interrupt, os.Kill)
    <-signalChan
}
- Run go mod vendor (the diff below patches the vendored copy).
- Apply the following diff (so we don't have to wait long for the race between the dead pool reaper and an alive pool):
diff --git a/vendor/github.com/gocraft/work/dead_pool_reaper.go b/vendor/github.com/gocraft/work/dead_pool_reaper.go
index e930521e..4e1d4534 100644
--- a/vendor/github.com/gocraft/work/dead_pool_reaper.go
+++ b/vendor/github.com/gocraft/work/dead_pool_reaper.go
@@ -10,9 +10,9 @@ import (
)
const (
- deadTime = 10 * time.Second // 2 x heartbeat
- reapPeriod = 10 * time.Minute
- reapJitterSecs = 30
+ deadTime = 1 * time.Second // 2 x heartbeat
+ reapPeriod = 1 * time.Second
+ reapJitterSecs = 1
requeueKeysPerJob = 4
)
- Run two or more instances of the program simultaneously.
- After some time, both workers stop processing jobs.
- Check the lock value via redis-cli (get qwe:jobs:asd:lock); it will be far greater than the configured MaxConcurrency.
Detailed description
Reapers
- There is a reaper goroutine that periodically checks pool heartbeats to see which workers are still alive (every reapPeriod, 10 minutes by default, plus up to reapJitterSecs seconds of jitter).
- If the heartbeat for a pool ID is found and has not expired, the reaper moves on to the next pool ID.
- If the heartbeat for a pool ID has expired, the jobs from that pool's in-progress queue are moved back to the job queue.
- After the heartbeat is removed, or if no heartbeat was found at all, the pool ID is removed from worker_pools.
- The lock is adjusted down by that pool's count from lock_info, the pool's entry is removed from lock_info, and if the resulting lock is less than zero it is reset to zero (a sketch of this sequence follows the list).
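To make that last step concrete, here is a minimal Go sketch (using redigo) of the adjust-then-clamp sequence, written against the lock / lock_info key layout from this issue. It only illustrates the behavior described above, it is not gocraft/work's actual code; note that the commands run as separate steps, which is exactly what the race below exploits.

package main

import "github.com/gomodule/redigo/redis"

// reapPoolLock mirrors the reaper's lock cleanup for one dead pool, as
// described above (hypothetical helper, not the library's implementation).
func reapPoolLock(conn redis.Conn, lockKey, lockInfoKey, poolID string) error {
    // How many lock slots the dead pool was recorded as holding.
    count, err := redis.Int64(conn.Do("HGET", lockInfoKey, poolID))
    if err == redis.ErrNil {
        return nil // nothing recorded for this pool
    }
    if err != nil {
        return err
    }

    // Give the slots back and drop the pool's lock_info entry.
    if _, err := conn.Do("DECRBY", lockKey, count); err != nil {
        return err
    }
    if _, err := conn.Do("HDEL", lockInfoKey, poolID); err != nil {
        return err
    }

    // The "crutch": clamp a negative lock back to zero.
    lock, err := redis.Int64(conn.Do("GET", lockKey))
    if err != nil {
        return err
    }
    if lock < 0 {
        _, err = conn.Do("SET", lockKey, 0)
    }
    return err
}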
The problem
There is a race condition:
- Race: two reapers concurrently clean up two different dead pool IDs (holding N and M lock slots) that touch the same lock.
- Crutch: each reaper resets the lock to zero if it is negative after adjusting it by lock_info, and this check is not atomic with the adjustment.
- The first reaper drives the lock down to -N.
- The second one drives it to -N - M.
- The first reaper adjusts the lock to -M (by subtracting -N) and, since the lock is negative, sets it to 0.
- The second reaper then adjusts the lock to M.
The lock ends up at M even though no worker holds those slots; each time the race fires, more phantom slots pile up, until the lock value dwarfs the max concurrency (20519 > 200 above) and no worker can acquire the lock at all. A numeric replay of this interleaving follows.
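Replaying the interleaving with concrete, made-up numbers (N = 5, M = 7) shows where the leftover comes from. This toy single-threaded model only mirrors the lock transitions listed above:

package main

import "fmt"

func main() {
    lock := 0 // the shared lock counter in Redis

    lock -= 5 // reaper 1 adjusts by N = 5: lock = -5
    lock -= 7 // reaper 2 adjusts by M = 7: lock = -12
    lock += 5 // reaper 1 adjusts by -N:    lock = -7
    if lock < 0 {
        lock = 0 // reaper 1's crutch fires: lock = 0
    }
    lock += 7 // reaper 2 adjusts by -M:    lock = 7

    // Seven phantom slots are now held by nobody. Every further race
    // stacks more on top, eventually dwarfing MaxConcurrency.
    fmt.Println("leftover lock:", lock)
}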
Possible Solution
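The issue as captured here leaves this section empty, and the PR mentioned in the comments below was not merged. Purely as an illustration (an assumption, not the author's patch): one direction is to fold the lock_info lookup, the decrement, and the non-negative clamp into a single Lua script, so a second reaper can never interleave between the adjustment and the clamp.

package main

import "github.com/gomodule/redigo/redis"

// Hypothetical fix sketch, not part of gocraft/work: release a dead pool's
// lock slots and clamp the lock at zero in one atomic Lua script.
// KEYS[1] = lock key, KEYS[2] = lock_info key, ARGV[1] = dead pool ID.
var reapLockScript = redis.NewScript(2, `
    local count = redis.call('HGET', KEYS[2], ARGV[1])
    if count then
        redis.call('HDEL', KEYS[2], ARGV[1])
        if redis.call('DECRBY', KEYS[1], count) < 0 then
            redis.call('SET', KEYS[1], 0)
        end
    end
    return redis.call('GET', KEYS[1])
`)

func reapLockAtomically(conn redis.Conn, lockKey, lockInfoKey, deadPoolID string) (int64, error) {
    return redis.Int64(reapLockScript.Do(conn, lockKey, lockInfoKey, deadPoolID))
}

With the whole sequence atomic, the interleaving above ends at 0 instead of M, because each reaper's decrement and clamp commit together.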
Hi, this is very well reported. Thanks a lot.
I am facing a similar issue as well.
I can see your PR was also not merged. Did you find any other workarounds to solve this?