gocraft / work

Process background jobs in Go

Workers stuck

d1slike opened this issue

Context

  • github.com/gocraft/work v0.5.1
  • github.com/gomodule/redigo v1.8.4
  • Several (5-6) workers running in pods
  • Max concurrency = 200

Expected behavior

  • Workers process jobs. The job queue stays at around zero.

Current behavior

  • All worker processes are alive, but none of them process jobs.
  • The job queue starts growing.
  • The workers never start processing again on their own.
  • It can be fixed by flushing the Redis DB.
  • The value of the lock at this moment is much greater than the max concurrency (workers:jobs:process:lock = 20519 > 200); see the snippet below for reading these keys.

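To confirm the symptom, the lock counter and the per-pool lock_info hash can be read directly with redigo. This is a minimal sketch using the namespace and job name from the reproduction below; the lock_info key name is an assumption about the library's key layout, not a documented API.

package main

import (
	"fmt"
	"log"

	"github.com/gomodule/redigo/redis"
)

func main() {
	// Same Redis instance/DB as the reproduction below; adjust to your setup.
	conn, err := redis.Dial("tcp", ":6379", redis.DialDatabase(3))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// <namespace>:jobs:<jobname>:lock holds the concurrency counter;
	// <namespace>:jobs:<jobname>:lock_info maps pool id -> in-progress count
	// (key names assumed from this report).
	lock, err := redis.Int(conn.Do("GET", "qwe:jobs:asd:lock"))
	if err != nil && err != redis.ErrNil {
		log.Fatal(err)
	}
	lockInfo, err := redis.IntMap(conn.Do("HGETALL", "qwe:jobs:asd:lock_info"))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("lock:", lock)          // stuck workers show lock >> MaxConcurrency
	fmt.Println("lock_info:", lockInfo) // should roughly sum to the jobs actually running
}
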
How to reproduce

  • Prepare a small app
package main

import (
	"fmt"
	"log"
	"os"
	"os/signal"
	"time"

	"github.com/gocraft/work"
	"github.com/gomodule/redigo/redis"
)

var redisPool = &redis.Pool{
	MaxActive: 5,
	MaxIdle:   5,
	Wait:      true,
	Dial: func() (redis.Conn, error) {
		return redis.Dial("tcp", ":6379", redis.DialDatabase(3))
	},
}

const (
	namespace   = "qwe"
	jobname     = "asd"
	concurrency = 10
)

var enqueuer = work.NewEnqueuer(namespace, redisPool)

type ctx struct{}

// do re-enqueues the job it just processed, so the queue never drains on its own.
func (*ctx) do(_ *work.Job) error {
	fmt.Println("i'm alive", time.Now())
	time.Sleep(time.Millisecond * 5)
	_, err := enqueuer.Enqueue(jobname, nil)
	return err
}

func main() {
	// Seed the queue with enough jobs to keep all workers busy.
	for i := 0; i < concurrency*2; i++ {
		_, err := enqueuer.Enqueue(jobname, nil)
		if err != nil {
			log.Fatal(err)
		}
	}
	work.
		NewWorkerPool(ctx{}, concurrency, namespace, redisPool).
		JobWithOptions(jobname, work.JobOptions{MaxConcurrency: 200}, (*ctx).do).
		Start()
	// Block until interrupted; os.Kill is omitted because SIGKILL cannot be caught.
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, os.Interrupt)
	<-signalChan
}
  • go mod vendor
  • apply the following diff (so we don't have to wait long for the race between the dead pool reaper and a live pool)
diff --git a/vendor/github.com/gocraft/work/dead_pool_reaper.go b/vendor/github.com/gocraft/work/dead_pool_reaper.go
index e930521e..4e1d4534 100644
--- a/vendor/github.com/gocraft/work/dead_pool_reaper.go
+++ b/vendor/github.com/gocraft/work/dead_pool_reaper.go
@@ -10,9 +10,9 @@ import (
 )
  
 const (
-       deadTime          = 10 * time.Second // 2 x heartbeat
-       reapPeriod        = 10 * time.Minute
-       reapJitterSecs    = 30
+       deadTime          = 1 * time.Second // 2 x heartbeat
+       reapPeriod        = 1 * time.Second
+       reapJitterSecs    = 1
        requeueKeysPerJob = 4
 )
  • run two or more instances of the program simultaneously
  • after some time, both instances stop processing jobs
  • check the lock value via redis-cli (get qwe:jobs:asd:lock)

Detailed description

Reapers

  1. There is a reaper goroutine that periodically (every reapPeriod, plus a random jitter) checks heartbeats to determine whether worker pools are still alive.
  2. If the heartbeat for a pool id is found and has not expired, the reaper moves on to the next pool id.
  3. If the heartbeat for a pool id has expired, the jobs in that pool's in-progress queue are moved back onto the job queue.
  4. After the heartbeat is removed, or if no heartbeat is found at all, the pool id is removed from worker_pools.
  5. The lock is decremented by the value stored in lock_info for that pool id, the entry is removed from lock_info, and if the lock ends up below zero it is reset to zero. (A simplified sketch of this flow follows.)

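The flow above can be sketched in plain Go. This is not the library's actual implementation (which performs the requeue and the lock adjustment inside Lua scripts); the key names follow the <namespace>:jobs:<jobname>:... layout seen in this report and are assumptions for illustration only.

package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

// reapDeadPool sketches steps 2-5 above for a single job type of a single pool id.
// Assumed key layout (based on the keys mentioned in this report):
//   <ns>:worker_pools                   - set of registered pool ids
//   <ns>:worker_pools:<poolID>          - that pool's heartbeat
//   <ns>:jobs:<job>                     - the job queue
//   <ns>:jobs:<job>:<poolID>:inprogress - that pool's in-progress queue
//   <ns>:jobs:<job>:lock / ...:lock_info - concurrency counter and per-pool counts
func reapDeadPool(conn redis.Conn, ns, job, poolID string) error {
	heartbeatKey := ns + ":worker_pools:" + poolID
	jobsKey := ns + ":jobs:" + job
	inProgressKey := jobsKey + ":" + poolID + ":inprogress"
	lockKey := jobsKey + ":lock"
	lockInfoKey := jobsKey + ":lock_info"

	// Step 2: a present heartbeat means the pool is alive; skip it.
	// (The real reaper compares the heartbeat timestamp against deadTime.)
	if alive, err := redis.Bool(conn.Do("EXISTS", heartbeatKey)); err != nil {
		return err
	} else if alive {
		return nil
	}

	// Step 3: move the dead pool's in-progress jobs back onto the job queue.
	for {
		_, err := redis.String(conn.Do("RPOPLPUSH", inProgressKey, jobsKey))
		if err == redis.ErrNil {
			break // in-progress queue drained
		}
		if err != nil {
			return err
		}
	}

	// Step 4: forget the dead pool id.
	if _, err := conn.Do("SREM", ns+":worker_pools", poolID); err != nil {
		return err
	}

	// Step 5: give back the dead pool's share of the concurrency lock, then clamp
	// the lock at zero if it went negative (the "crutch" discussed below).
	count, err := redis.Int(conn.Do("HGET", lockInfoKey, poolID))
	if err != nil && err != redis.ErrNil {
		return err
	}
	if _, err := conn.Do("DECRBY", lockKey, count); err != nil {
		return err
	}
	if _, err := conn.Do("HDEL", lockInfoKey, poolID); err != nil {
		return err
	}
	if cur, err := redis.Int(conn.Do("GET", lockKey)); err == nil && cur < 0 {
		if _, err := conn.Do("SET", lockKey, 0); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	conn, err := redis.Dial("tcp", ":6379", redis.DialDatabase(3))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// "some-dead-pool-id" is a placeholder; the real reaper discovers pool ids itself.
	if err := reapDeadPool(conn, "qwe", "asd", "some-dead-pool-id"); err != nil {
		log.Fatal(err)
	}
}
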
The problem

There is a race condition:

  • the race: two reapers concurrently reaping two different dead pool ids
  • the crutch: setting the lock to zero whenever it is negative after adjusting it by lock_info
    • the first reaper drives the lock down to -N
    • the second reaper drives it further, to -N - M
    • the first reaper then applies its adjustment, bringing the lock to -M (by subtracting -N), and since the lock is negative the crutch resets it to 0
    • the second reaper then applies its adjustment, leaving the lock at M

The lock now reads M even though no jobs are actually in progress, which is how it can end up far above MaxConcurrency (20519 > 200 above), at which point no worker can acquire it again. The arithmetic is played out in the sketch below.

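A minimal replay of that interleaving with plain integers. N and M are arbitrary example counts; this only models the sequence of adjustments described above, not the actual Redis scripts.

package main

import "fmt"

func main() {
	const N, M = 7, 5 // amounts the two reapers adjust the lock by

	lock := 0

	lock -= N // first reaper: lock == -N
	lock -= M // second reaper: lock == -N - M

	lock += N // first reaper's adjustment: lock == -M
	if lock < 0 {
		lock = 0 // the "crutch": a negative lock is reset to zero
	}

	lock += M // second reaper's adjustment: lock == M

	// M phantom "in-progress" slots are now permanently consumed; repeated races
	// accumulate until the lock exceeds MaxConcurrency.
	fmt.Println("final lock value:", lock)
}
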
Possible Solution

  • do not adjust the lock if it is less than zero (see work/redis.go, line 214 at 5959e69: redis.call('set', lock, 0)); a sketch of this guard follows the list
  • remove the dead pool id from worker_pools only after the lock has been adjusted

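For illustration, here is roughly what the guard from the first bullet could look like, written as a Lua script driven from Go via redigo. This is a hedged sketch, not the actual patch: the key names, the handling of the lock_info entry, and the script wiring are assumptions; the real change would live in the library's redis.go.

package main

import (
	"fmt"
	"log"

	"github.com/gomodule/redigo/redis"
)

// adjustLockScript applies the dead pool's lock_info adjustment only while the
// lock is non-negative, instead of adjusting unconditionally and clamping later.
// KEYS[1] = lock key, KEYS[2] = lock_info key, ARGV[1] = dead pool id.
var adjustLockScript = redis.NewScript(2, `
	local count = tonumber(redis.call('hget', KEYS[2], ARGV[1]) or 0)
	local lock  = tonumber(redis.call('get', KEYS[1]) or 0)
	if lock >= 0 and count > 0 then
		redis.call('decrby', KEYS[1], count)
		-- keep the existing clamp so the counter never stays negative
		if tonumber(redis.call('get', KEYS[1])) < 0 then
			redis.call('set', KEYS[1], 0)
		end
	end
	redis.call('hdel', KEYS[2], ARGV[1])
	return redis.call('get', KEYS[1])
`)

func main() {
	conn, err := redis.Dial("tcp", ":6379", redis.DialDatabase(3))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Illustrative keys following the layout from this report; "dead-pool-id" is a placeholder.
	v, err := adjustLockScript.Do(conn, "qwe:jobs:asd:lock", "qwe:jobs:asd:lock_info", "dead-pool-id")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("lock after adjustment:", v)
}
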
Hi, this is very well reported. Thanks a lot.

I am facing a similar issue as well.
I can see your PR was not merged either. Did you find any other workaround to solve this?