OptimalBits / bull

Premium Queue package for handling distributed jobs and messages in NodeJS.

Stalled/failed jobs stay in queue

wangkunmeng opened this issue

Description

Stalled/failed jobs stay in the queue. This problem has already been fixed in bullmq. Is there any plan to fix it in bull as well?

taskforcesh/bullmq#1171
taskforcesh/bullmq@38486cb

Bull version

4.10.2

Can you be more specific about what issue is present in Bull and how to reproduce it?

import Bull from 'bull';
import { setTimeout } from 'timers/promises';
import { randomUUID } from 'crypto';

// Lock settings are deliberately aggressive so the job stalls quickly:
// the lock expires after 250 ms while the processor runs for 8 s.
const queue = new Bull('running-stalled-job-' + randomUUID(), {
    redis: { port: 6379, host: '127.0.0.1' },
    defaultJobOptions: {
        attempts: 2,
        removeOnComplete: true,
        removeOnFail: true // expectation: the job is removed once it finally fails
    },
    settings: {
        lockRenewTime: 2500,
        lockDuration: 250,
        stalledInterval: 1500,
        maxStalledCount: 1
    }
});

let processedCount = 0;

// The processor never finishes before the lock expires, so the job keeps being
// picked up as stalled until maxStalledCount is exceeded.
queue.process(() => {
    processedCount++;
    return setTimeout(8000);
});

queue.on('completed', () => {
    console.error(new Error('should not complete'));
});

queue.on('failed', (job, err) => {
    /**
     * failed processedCount: 2 job stalled more than allowable limit /Users/tiger/Desktop/workspace/ts_study/node_modules/_bull@4.11.3@bull/lib/queue.js:1020
            new Error('job stalled more than allowable limit'),
            ^
        Error: job stalled more than allowable limit
            at <anonymous> (/Users/tiger/Desktop/workspace/ts_study/node_modules/_bull@4.11.3@bull/lib/queue.js:1020:13)
            at processTicksAndRejections (node:internal/process/task_queues:96:5)
            at async Promise.all (index 0)
     */
    console.log('failed', 'processedCount: ' + processedCount, job.failedReason, err);
    process.exit();
});

queue.add({ foo: 'bar' }, { jobId: 'test' }); // the fixed jobId makes the leftover job visible: re-adding 'test' later does not work while the failed job remains

I cannot add another job with the jobId 'test'. Can the job be removed when it fails after stalling more than the allowed number of times?
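For illustration, a sketch of a manual cleanup using the Job API (not a confirmed workaround for the removeOnFail behaviour, just what is available on the job object):

// Assumes the 'failed' event fires for the stalled job, as in the log above.
queue.on('failed', async (job, err) => {
    if (err.message === 'job stalled more than allowable limit') {
        await job.remove(); // drop the leftover job so the jobId 'test' can be reused
    }
});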

Can you explain what you are trying to achieve with the settings "lockRenewTime, lockDuration and stalledInterval"? The values you chose are going to create problems for sure, as your lockDuration is much lower than your lockRenewTime. Most likely you will not need to change these settings at all.
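For reference, a minimal sketch with coherent lock settings; the values shown match Bull's documented defaults and are meant as illustration only (queue name is arbitrary):

import Bull from 'bull';

const queue = new Bull('my-queue', {
    redis: { port: 6379, host: '127.0.0.1' },
    settings: {
        lockDuration: 30000,    // how long a lock is held before the job can be considered stalled
        lockRenewTime: 15000,   // how often the worker renews the lock; keep this well below lockDuration
        stalledInterval: 30000, // how often the stalled-job check runs
        maxStalledCount: 1      // how many times a job may stall before it is failed
    }
});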

I guess I am experiencing the same issue.
In my use case, I inserted a couple of jobs into the queue (let's say job1 and job2). Then I called moveToFailed on the waiting job (job2).
My expectation was that after the worker finished with job1, it would not start processing job2.
I'm wondering whether this is planned to be fixed or whether it is by design.

Versions:

  • "@nestjs/bull": "10.0.1"
  • "bull": "4.11.3"

@evgi99 it would be better if you provided some code to back your issue as it is difficult for me to understand what you are actually doing.

@manast Sure. I'll try to demonstrate:

this is my Processor

import { Logger } from '@nestjs/common';
import {
  InjectQueue,
  OnGlobalQueueFailed,
  OnQueueActive,
  OnQueueCompleted,
  OnQueueFailed,
  Process,
  Processor,
} from '@nestjs/bull';
import { Job, Queue } from 'bull';
// newJobDTO is the DTO type defined elsewhere in the project.

@Processor('job-queue')
export class WorkerProcessor {
  private readonly logger: Logger;

  constructor(@InjectQueue('job-queue') private readonly jobsQueue: Queue) {
    this.logger = new Logger(WorkerProcessor.name);
  }

  @Process({
    name: 'calculate-job',
  })
  async handleJob(job: Job<newJobDTO>) {
    // Simulate 15 seconds of work, then check whether the job was marked as
    // failed while it was being processed.
    await new Promise((resolve) => setTimeout(resolve, 15000));
    const state = await job.getState();
    if (state === 'failed') return -1;
    return job.data.Nth * 10;
  }

  @OnQueueCompleted()
  async onCompleted(job: Job, result: any) {
    this.logger.log(`Complete handling job ${job.id} with result ${result}`);
  }

  @OnQueueFailed()
  async handlerFailedJob(job: string, err: string) {
    try {
      if (err === 'job canceled by user') {
        throw new Error(`Job ${job} cancelled`);
      } else {
        throw new Error(`Job ${job} failed`);
      }
    } catch (e) {
      this.logger.error(e.message);
    }
  }

  @OnGlobalQueueFailed()
  async onGlobalFailedHandler(job: string) {
    const runningJob = await this.jobsQueue.getJob(job);
    await runningJob.discard();
    this.handlerFailedJob(job, runningJob.failedReason);
  }

  @OnQueueActive()
  async onActive(job: Job<newJobDTO>) {
    this.logger.log(
      `Start processing id=${job.id}, jobData=${JSON.stringify(job.data)}`,
    );
  }
}

And in the controller, in the POST handler, I push two jobs and cancel the second one:

  @Post()
  async enQueue(): Promise<{ jobIds: string[] }> {
    const job1 = await this.jobsQueue.add('calculate-job', { Nth: 10 });
    await new Promise((resolve) => setTimeout(resolve, 500)); // 0.5 second wait
    const job2 = await this.jobsQueue.add('calculate-job', { Nth: 12 });
    await new Promise((resolve) => setTimeout(resolve, 2000)); // 2 second wait
    const jobObj = await this.jobsQueue.getJob(job2.id.toString());
    const jobStatus = await jobObj.getState();
    if (jobStatus === 'active' || jobStatus === 'waiting') {
      await jobObj.moveToFailed({ message: 'job canceled by user' }, true);
    }

    return { jobIds: [job1.id.toString(), job2.id.toString()] };
  }

As you can see, it starts consuming the second job immediately after finishing the first one:
(screenshot: worker log showing job2 being picked up right after job1 completes)

Another thing that is strange to me: while the handler function is running, the state of job2 is already 'failed'. I checked this by adding the state condition and returning -1.

You cannot "cancel" a job like that; if the job is being processed, only the worker that is processing it can move it to failed.

Thanks @manast for the quick response.

  1. Here the job is not being processed when moveToFailed is called; the job is waiting in the queue.

  2. In my case, after the process is finished, the "failed" job appears in [bull:job-queue:completed] as well as in [bull:job-queue:failed], and that does not make sense to me. My pain is that I have an API to retrieve the job status by jobId (essentially the sketch below) which returns "completed" with the result, while I expected to see "failed" with the failedReason.

Is there an alternative way to mark a waiting job as "failed" (with a failedReason) on demand, and to dequeue it from the queue at the same time?
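For context, a hypothetical sketch of such a status endpoint, assuming the usual getJob()/getState() pattern inside the same controller (@Get and @Param come from @nestjs/common; the route and method name are made up for illustration):

  @Get(':id/status')
  async getJobStatus(@Param('id') id: string) {
    const job = await this.jobsQueue.getJob(id);
    if (!job) {
      return { status: 'not_found' };
    }
    const state = await job.getState();
    // When a job ends up in both the completed and failed sets, getState()
    // can report 'completed' even though a failedReason is present.
    return { status: state, result: job.returnvalue, failedReason: job.failedReason };
  }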

How do you know the job is actually waiting, and not active, at the moment you call moveToFailed?
I don't know what criteria you are using to decide which jobs you want to manually "moveToFailed", but why not move those criteria inside the processor function and simply throw a custom error when they are met?
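A sketch of that suggestion applied to the processor above; isCancelledByUser() is a made-up helper standing in for whatever cancellation criteria the application keeps (for example a flag in Redis or a database):

  @Process({
    name: 'calculate-job',
  })
  async handleJob(job: Job<newJobDTO>) {
    // The worker itself fails the job, so the final state and failedReason stay consistent.
    if (await this.isCancelledByUser(job.id)) {
      throw new Error('job canceled by user');
    }
    return job.data.Nth * 10;
  }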

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.