Workers stuck
d1slike opened this issue
Context
- github.com/gocraft/work v0.5.1
- github.com/gomodule/redigo v1.8.4
- Several (5-6) workers in pods
- Max concurrency = 200
Expected behavior
- Workers process jobs; the job queue stays around zero.
Current behavior
- All worker processes are alive, but none of them processes jobs.
- The job queue starts growing.
- The workers never start processing again on their own.
- It can only be fixed by flushing the Redis DB.
- The value of the lock at this moment is much greater than the max concurrency (workers:jobs:process:lock = 20519 > 200).
How to reproduce
- Prepare a small app:

package main

import (
    "fmt"
    "log"
    "os"
    "os/signal"
    "time"

    "github.com/gocraft/work"
    "github.com/gomodule/redigo/redis"
)

var redisPool = &redis.Pool{
    MaxActive: 5,
    MaxIdle:   5,
    Wait:      true,
    Dial: func() (redis.Conn, error) {
        return redis.Dial("tcp", ":6379", redis.DialDatabase(3))
    },
}

const (
    namespace   = "qwe"
    jobname     = "asd"
    concurrency = 10
)

var enqueuer = work.NewEnqueuer(namespace, redisPool)

type ctx struct{}

// do re-enqueues the job after a short sleep, so the queue never drains
// and the workers keep acquiring and releasing the concurrency lock.
func (*ctx) do(_ *work.Job) error {
    fmt.Println("i'm alive", time.Now())
    time.Sleep(time.Millisecond * 5)
    _, err := enqueuer.Enqueue(jobname, nil)
    return err
}

func main() {
    // Seed the queue with more jobs than there are worker goroutines.
    for i := 0; i < concurrency*2; i++ {
        _, err := enqueuer.Enqueue(jobname, nil)
        if err != nil {
            log.Fatal(err)
        }
    }

    work.
        NewWorkerPool(ctx{}, concurrency, namespace, redisPool).
        JobWithOptions(jobname, work.JobOptions{MaxConcurrency: 200}, (*ctx).do).
        Start()

    // Block until interrupted.
    signalChan := make(chan os.Signal, 1)
    signal.Notify(signalChan, os.Interrupt, os.Kill)
    <-signalChan
}
- Run go mod vendor (the diff below patches the vendored copy).
- Apply the following diff (so we don't have to wait long for the race between the dead pool reaper and an alive pool):
diff --git a/vendor/github.com/gocraft/work/dead_pool_reaper.go b/vendor/github.com/gocraft/work/dead_pool_reaper.go
index e930521e..4e1d4534 100644
--- a/vendor/github.com/gocraft/work/dead_pool_reaper.go
+++ b/vendor/github.com/gocraft/work/dead_pool_reaper.go
@@ -10,9 +10,9 @@ import (
)
const (
- deadTime = 10 * time.Second // 2 x heartbeat
- reapPeriod = 10 * time.Minute
- reapJitterSecs = 30
+ deadTime = 1 * time.Second // 2 x heartbeat
+ reapPeriod = 1 * time.Second
+ reapJitterSecs = 1
requeueKeysPerJob = 4
)
- Run two or more instances of the program simultaneously.
- After some time, both workers stop processing jobs.
- Check the lock value via redis-cli (get qwe:jobs:asd:lock); it will be far greater than the configured MaxConcurrency.
Detailed description
Reapers
- There is a reaper goroutine that periodically checks pool heartbeats to see which workers are still alive (every reapPeriod, 10 minutes by default, plus up to reapJitterSecs seconds of jitter).
- If the heartbeat for a pool ID is found and has not expired, the reaper moves on to the next pool ID.
- If the heartbeat for a pool ID has expired, the jobs from that pool's in-progress queue are moved back to the job queue.
- After the heartbeat is removed, or if no heartbeat was found at all, the pool ID is removed from worker_pools.
- The lock is adjusted down by that pool's count from lock_info, the pool's entry is removed from lock_info, and if the resulting lock is less than zero it is reset to zero (a sketch of this sequence follows the list).
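To make that last step concrete, here is a minimal Go sketch (using redigo) of the adjust-then-clamp sequence, written against the lock / lock_info key layout from this issue. It only illustrates the behavior described above, it is not gocraft/work's actual code; note that the commands run as separate steps, which is exactly what the race below exploits.

package main

import "github.com/gomodule/redigo/redis"

// reapPoolLock mirrors the reaper's lock cleanup for one dead pool, as
// described above (hypothetical helper, not the library's implementation).
func reapPoolLock(conn redis.Conn, lockKey, lockInfoKey, poolID string) error {
    // How many lock slots the dead pool was recorded as holding.
    count, err := redis.Int64(conn.Do("HGET", lockInfoKey, poolID))
    if err == redis.ErrNil {
        return nil // nothing recorded for this pool
    }
    if err != nil {
        return err
    }

    // Give the slots back and drop the pool's lock_info entry.
    if _, err := conn.Do("DECRBY", lockKey, count); err != nil {
        return err
    }
    if _, err := conn.Do("HDEL", lockInfoKey, poolID); err != nil {
        return err
    }

    // The "crutch": clamp a negative lock back to zero.
    lock, err := redis.Int64(conn.Do("GET", lockKey))
    if err != nil {
        return err
    }
    if lock < 0 {
        _, err = conn.Do("SET", lockKey, 0)
    }
    return err
}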
The problem
There is a race condition:
- Race: two reapers concurrently clean up two different dead pool IDs (holding N and M lock slots) that touch the same lock.
- Crutch: each reaper resets the lock to zero if it is negative after adjusting it by lock_info, and this check is not atomic with the adjustment.
- The first reaper drives the lock down to -N.
- The second one drives it to -N - M.
- The first reaper adjusts the lock to -M (by subtracting -N) and, since the lock is negative, sets it to 0.
- The second reaper then adjusts the lock to M.
The lock ends up at M even though no worker holds those slots; each time the race fires, more phantom slots pile up, until the lock value dwarfs the max concurrency (20519 > 200 above) and no worker can acquire the lock at all. A numeric replay of this interleaving follows.
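Replaying the interleaving with concrete, made-up numbers (N = 5, M = 7) shows where the leftover comes from. This toy single-threaded model only mirrors the lock transitions listed above:

package main

import "fmt"

func main() {
    lock := 0 // the shared lock counter in Redis

    lock -= 5 // reaper 1 adjusts by N = 5: lock = -5
    lock -= 7 // reaper 2 adjusts by M = 7: lock = -12
    lock += 5 // reaper 1 adjusts by -N:    lock = -7
    if lock < 0 {
        lock = 0 // reaper 1's crutch fires: lock = 0
    }
    lock += 7 // reaper 2 adjusts by -M:    lock = 7

    // Seven phantom slots are now held by nobody. Every further race
    // stacks more on top, eventually dwarfing MaxConcurrency.
    fmt.Println("leftover lock:", lock)
}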
Possible Solution
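The issue as captured here leaves this section empty, and the PR mentioned in the comments below was not merged. Purely as an illustration (an assumption, not the author's patch): one direction is to fold the lock_info lookup, the decrement, and the non-negative clamp into a single Lua script, so a second reaper can never interleave between the adjustment and the clamp.

package main

import "github.com/gomodule/redigo/redis"

// Hypothetical fix sketch, not part of gocraft/work: release a dead pool's
// lock slots and clamp the lock at zero in one atomic Lua script.
// KEYS[1] = lock key, KEYS[2] = lock_info key, ARGV[1] = dead pool ID.
var reapLockScript = redis.NewScript(2, `
    local count = redis.call('HGET', KEYS[2], ARGV[1])
    if count then
        redis.call('HDEL', KEYS[2], ARGV[1])
        if redis.call('DECRBY', KEYS[1], count) < 0 then
            redis.call('SET', KEYS[1], 0)
        end
    end
    return redis.call('GET', KEYS[1])
`)

func reapLockAtomically(conn redis.Conn, lockKey, lockInfoKey, deadPoolID string) (int64, error) {
    return redis.Int64(reapLockScript.Do(conn, lockKey, lockInfoKey, deadPoolID))
}

With the whole sequence atomic, the interleaving above ends at 0 instead of M, because each reaper's decrement and clamp commit together.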
Hi, this is very well reported. Thanks a lot.
I am facing a similar issue as well.
I can see your PR was also not merged. Did you find any other workarounds to solve this?