OptimalBits / bull

Premium Queue package for handling distributed jobs and messages in NodeJS.

Stalled/failed jobs stay in queue

wangkunmeng opened this issue

Description

Stalled/failed jobs stay in the queue. This problem has already been fixed in bullmq. Is there any plan to fix it in bull as well?

taskforcesh/bullmq#1171
taskforcesh/bullmq@38486cb

Bull version

4.10.2

Can you be more specific about what issue is present in Bull and how to reproduce it?

import Bull from 'bull';
import { setTimeout } from 'timers/promises';
import { randomUUID } from 'crypto';

// Lock settings are deliberately aggressive so the job stalls quickly:
// the lock expires after 250 ms while the processor runs for 8 s.
const queue = new Bull('running-stalled-job-' + randomUUID(), {
    redis: { port: 6379, host: '127.0.0.1' },
    defaultJobOptions: {
        attempts: 2,
        removeOnComplete: true,
        removeOnFail: true // expectation: the job is removed once it finally fails
    },
    settings: {
        lockRenewTime: 2500,
        lockDuration: 250,
        stalledInterval: 1500,
        maxStalledCount: 1
    }
});

let processedCount = 0;

// The processor never finishes before the lock expires, so the job keeps being
// picked up as stalled until maxStalledCount is exceeded.
queue.process(() => {
    processedCount++;
    return setTimeout(8000);
});

queue.on('completed', () => {
    console.error(new Error('should not complete'));
});

queue.on('failed', (job, err) => {
    /**
     * failed processedCount: 2 job stalled more than allowable limit /Users/tiger/Desktop/workspace/ts_study/node_modules/_bull@4.11.3@bull/lib/queue.js:1020
            new Error('job stalled more than allowable limit'),
            ^
        Error: job stalled more than allowable limit
            at <anonymous> (/Users/tiger/Desktop/workspace/ts_study/node_modules/_bull@4.11.3@bull/lib/queue.js:1020:13)
            at processTicksAndRejections (node:internal/process/task_queues:96:5)
            at async Promise.all (index 0)
     */
    console.log('failed', 'processedCount: ' + processedCount, job.failedReason, err);
    process.exit();
});

queue.add({ foo: 'bar' }, { jobId: 'test' }); // the fixed jobId makes the leftover job visible: re-adding 'test' later does not work while the failed job remains

I cannot add another job with the jobId 'test'. Can the job be removed when it fails after stalling more than the allowed number of times?
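For illustration, a sketch of a manual cleanup using the Job API (not a confirmed workaround for the removeOnFail behaviour, just what is available on the job object):

// Assumes the 'failed' event fires for the stalled job, as in the log above.
queue.on('failed', async (job, err) => {
    if (err.message === 'job stalled more than allowable limit') {
        await job.remove(); // drop the leftover job so the jobId 'test' can be reused
    }
});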

Can you explain what you are trying to achieve with the settings "lockRenewTime, lockDuration and stalledInterval"? The values you chose are going to create problems for sure, as your lockDuration is much lower than your lockRenewTime. Most likely you will not need to change these settings at all.
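For reference, a minimal sketch with coherent lock settings; the values shown match Bull's documented defaults and are meant as illustration only (queue name is arbitrary):

import Bull from 'bull';

const queue = new Bull('my-queue', {
    redis: { port: 6379, host: '127.0.0.1' },
    settings: {
        lockDuration: 30000,    // how long a lock is held before the job can be considered stalled
        lockRenewTime: 15000,   // how often the worker renews the lock; keep this well below lockDuration
        stalledInterval: 30000, // how often the stalled-job check runs
        maxStalledCount: 1      // how many times a job may stall before it is failed
    }
});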

I guess I am experiencing the same issue.
In my use case, I inserted a couple of jobs into the queue (let's say job1 and job2). Then I called moveToFailed on the waiting job (job2).
My expectation was that after the worker finished with job1, it would not start processing job2.
I'm wondering whether this is planned to be fixed or whether it is by design.

Versions:

  • "@nestjs/bull": "10.0.1"
  • "bull": "4.11.3"

@evgi99 it would be better if you provided some code to back your issue as it is difficult for me to understand what you are actually doing.

@manast Sure. I'll try to demonstrate:

this is my Processor

import { Logger } from '@nestjs/common';
import {
  InjectQueue,
  OnGlobalQueueFailed,
  OnQueueActive,
  OnQueueCompleted,
  OnQueueFailed,
  Process,
  Processor,
} from '@nestjs/bull';
import { Job, Queue } from 'bull';
// newJobDTO is the DTO type defined elsewhere in the project.

@Processor('job-queue')
export class WorkerProcessor {
  private readonly logger: Logger;

  constructor(@InjectQueue('job-queue') private readonly jobsQueue: Queue) {
    this.logger = new Logger(WorkerProcessor.name);
  }

  @Process({
    name: 'calculate-job',
  })
  async handleJob(job: Job<newJobDTO>) {
    // Simulate 15 seconds of work, then check whether the job was marked as
    // failed while it was being processed.
    await new Promise((resolve) => setTimeout(resolve, 15000));
    const state = await job.getState();
    if (state === 'failed') return -1;
    return job.data.Nth * 10;
  }

  @OnQueueCompleted()
  async onCompleted(job: Job, result: any) {
    this.logger.log(`Complete handling job ${job.id} with result ${result}`);
  }

  @OnQueueFailed()
  async handlerFailedJob(job: string, err: string) {
    try {
      if (err === 'job canceled by user') {
        throw new Error(`Job ${job} cancelled`);
      } else {
        throw new Error(`Job ${job} failed`);
      }
    } catch (e) {
      this.logger.error(e.message);
    }
  }

  @OnGlobalQueueFailed()
  async onGlobalFailedHandler(job: string) {
    const runningJob = await this.jobsQueue.getJob(job);
    await runningJob.discard();
    this.handlerFailedJob(job, runningJob.failedReason);
  }

  @OnQueueActive()
  async onActive(job: Job<newJobDTO>) {
    this.logger.log(
      `Start processing id=${job.id}, jobData=${JSON.stringify(job.data)}`,
    );
  }
}

And in the controller, in the POST handler, I push two jobs and cancel the second one:

  @Post()
  async enQueue(): Promise<{ jobIds: string[] }> {
    const job1 = await this.jobsQueue.add('calculate-job', { Nth: 10 });
    await new Promise((resolve) => setTimeout(resolve, 500)); // 0.5 second wait
    const job2 = await this.jobsQueue.add('calculate-job', { Nth: 12 });
    await new Promise((resolve) => setTimeout(resolve, 2000)); // 2 second wait
    const jobObj = await this.jobsQueue.getJob(job2.id.toString());
    const jobStatus = await jobObj.getState();
    if (jobStatus === 'active' || jobStatus === 'waiting') {
      await jobObj.moveToFailed({ message: 'job canceled by user' }, true);
    }

    return { jobIds: [job1.id.toString(), job2.id.toString()] };
  }

As you can see, it starts consuming the second job immediately after finishing the first one:
(screenshot: worker log showing job2 being picked up right after job1 completes)

Another thing that is strange to me: while the handler function is running, the state of job2 is already 'failed'. I checked this by adding the state condition and returning -1.

You cannot "cancel" a job like that; if the job is being processed, only the worker that is processing it can move it to failed.

Thanks @manast for the quick response.

  1. Here the job is not being processed when moveToFailed is called; the job is waiting in the queue.

  2. In my case, after the process is finished, the "failed" job appears in [bull:job-queue:completed] as well as in [bull:job-queue:failed], and that does not make sense to me. My pain is that I have an API to retrieve the job status by jobId (essentially the sketch below) which returns "completed" with the result, while I expected to see "failed" with the failedReason.

Is there an alternative way to mark a waiting job as "failed" (with a failedReason) on demand, and to dequeue it from the queue at the same time?
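For context, a hypothetical sketch of such a status endpoint, assuming the usual getJob()/getState() pattern inside the same controller (@Get and @Param come from @nestjs/common; the route and method name are made up for illustration):

  @Get(':id/status')
  async getJobStatus(@Param('id') id: string) {
    const job = await this.jobsQueue.getJob(id);
    if (!job) {
      return { status: 'not_found' };
    }
    const state = await job.getState();
    // When a job ends up in both the completed and failed sets, getState()
    // can report 'completed' even though a failedReason is present.
    return { status: state, result: job.returnvalue, failedReason: job.failedReason };
  }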

How do you know the job is actually waiting, and not active, at the moment you call moveToFailed?
I don't know what criteria you are using to decide which jobs you want to manually "moveToFailed", but why not move those criteria inside the processor function and simply throw a custom error when they are met?
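A sketch of that suggestion applied to the processor above; isCancelledByUser() is a made-up helper standing in for whatever cancellation criteria the application keeps (for example a flag in Redis or a database):

  @Process({
    name: 'calculate-job',
  })
  async handleJob(job: Job<newJobDTO>) {
    // The worker itself fails the job, so the final state and failedReason stay consistent.
    if (await this.isCancelledByUser(job.id)) {
      throw new Error('job canceled by user');
    }
    return job.data.Nth * 10;
  }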

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.