
After update to bullmq: Error: could not renew lock for job #3056

Open · 1 task done
Twisterking opened this issue Feb 5, 2025 · 19 comments

Comments

@Twisterking

Version

v5.34.2

Platform

NodeJS

What happened?

We used the predecessor bull successfully and very heavily for many months at our company.
Now we have updated to bullmq and, to be honest, we are having quite a few issues.

Our queues get stuck quite frequently (we never had this issue before!), and we sometimes run into this error:

Error: could not renew lock for job xyz

It just keeps repeating and never resolves until we do a restart, which is quite bad for us.

I also could not really find anything about this in other issues. What can we do here?

How to reproduce.

No response

Relevant log output

Code of Conduct

  • I agree to follow this project's Code of Conduct
@Twisterking Twisterking added the bug (Something isn't working) label on Feb 5, 2025
@manast
Contributor

manast commented Feb 5, 2025

As far as I know, BullMQ is at least as stable as Bull and has a much larger test suite, so in general you should have fewer issues, not more. However, it is possible that during the migration you made some assumptions about how BullMQ works that do not hold true coming from Bull. The best thing would be to post a case that reproduces those issues so we can give you hints, or look deeper into it if it happens to be a bug.

Furthermore, you mentioned that you run into a given error. That error is only produced via an event, and is only triggered if a lock cannot be renewed for a given job. This is quite unusual, so it is probably related to the migration work. I also wonder, are you using TypeScript?
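For reference, a minimal sketch of how such errors can be observed on a worker (the queue name, processor, and connection settings are placeholders, not taken from this issue):

import { Worker } from 'bullmq';

// Hypothetical worker for illustration only.
const worker = new Worker(
  'imports',
  async (job) => {
    // job processing...
  },
  { connection: { host: 'localhost', port: 6379 } }
);

// Lock-renewal problems surface as 'error' events on the worker, with a
// message like "could not renew lock for job <id>".
worker.on('error', (err) => {
  console.error('Worker error:', err.message);
});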

@melihplt

melihplt commented Feb 6, 2025

I am having the same issue. It happens randomly and I cannot even destroy the queue. I have to restart each time.

@roggervalf
Collaborator

roggervalf commented Feb 6, 2025

Hi folks, just out of curiosity, how did you migrate from bull to bullmq? Did you create new queues for bullmq, or did you use a different prefix?

@Twisterking
Author

Twisterking commented Feb 6, 2025

Hello everyone,

No, we do not use TypeScript, just vanilla JS.
The thing is that we continue to run into this issue. We have now even set these two options for our workers:

{
  maxStalledCount: 0, // do NOT allow a stalled job to be "retried"; retrying CAN lead to a situation where MULTIPLE workers work on the same job!
  stalledInterval: 1 * 60 * 1000 // 1 minute
}

... and we continue to have this issue. We need to restart the whole Node instance (Docker container) to make it start up again.

We run into the failed event with the error: Error: could not renew lock for job xyz.

We are already trying not to overload the CPU as best we can. Of course we "fluctuate around 100%", but to me this does not mean that we really leave ZERO headroom for the CPU to even renew the lock. :/

Migration:

We are using a different prefix (bullmq). So we basically discontinued the old bull queue and deployed all our instances in the right order to make the transition as smooth as possible.
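For context, a minimal sketch of how a separate prefix keeps the new BullMQ keys apart from the old bull keys in Redis (the queue name and connection are placeholders):

import { Queue, Worker } from 'bullmq';

// Placeholder connection and queue name for illustration only.
const connection = { host: 'localhost', port: 6379 };

// With prefix: 'bullmq', all keys live under "bullmq:*" instead of the
// default "bull:*" namespace, so the old bull data is never touched.
const queue = new Queue('imports', { connection, prefix: 'bullmq' });
const worker = new Worker('imports', async (job) => { /* ... */ }, { connection, prefix: 'bullmq' });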

@manast
Contributor

manast commented Feb 6, 2025

9 times out of 10, errors of this nature stem from passing wrong options or arguments when not using TypeScript, especially coming from Bull, which does not have the same signatures.

It is difficult to assess whether your issue is related to high CPU usage, as you mentioned that you are sometimes at 100%. Without more information about the specifics of your use case and a test case that shows the problem, we really do not have much chance of helping you.
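To illustrate the kind of signature difference meant here, a hedged sketch (the queue name and connection are placeholders):

// Bull: the processor and most options hang off the queue itself.
const BullQueue = require('bull');
const bullQueue = new BullQueue('imports', 'redis://localhost:6379');
bullQueue.process(async (job) => { /* ... */ });

// BullMQ: producing and consuming are split into separate classes, and
// options such as lockDuration or stalledInterval belong to the Worker.
const { Queue, Worker } = require('bullmq');
const connection = { host: 'localhost', port: 6379 };
const queue = new Queue('imports', { connection });
const worker = new Worker('imports', async (job) => { /* ... */ }, { connection });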

@manast manast added the cannot reproduce label and removed the bug (Something isn't working) label on Feb 6, 2025
@manast
Contributor

manast commented Feb 6, 2025

I am having the same issue. It happens randomly and I cannot even destroy the queue. I have to restart each time.

It is highly unlikely that you are having "the same issue", especially when we do not even know yet what the issue is. So if you have an issue, please post a reproducible case in a new issue and we will look into it.

@Twisterking
Author

Twisterking commented Feb 6, 2025

Thanks for the reply @manast. Could you please add more details on the "wrong passing of options"?

I don't understand how some "code bug" on our end could trigger this particular error.
My understanding was that the Error: could not renew lock for job xyz error should in 90% of cases not happen at all, but IF it does, it is most often triggered by a stalled job. Maybe I got this wrong?

Side note: we do use TS checks in our VS Code setup and do not get any errors about e.g. wrongly passed options to Queue or Worker or anything like that. Us passing completely wrong options somewhere seems unlikely to me.

On that note: we do have a queue events listener on the stalled event and do NOT see any logging from it.
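For reference, a minimal sketch of one way such a listener can look (via QueueEvents; the queue name and connection are placeholders):

import { QueueEvents } from 'bullmq';

// Placeholder queue name and connection for illustration only.
const queueEvents = new QueueEvents('imports', {
  connection: { host: 'localhost', port: 6379 },
});

// Emitted when the stall checker marks an active job as stalled.
queueEvents.on('stalled', ({ jobId }) => {
  console.warn(`Job ${jobId} stalled`);
});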

It is almost impossible for me to give you a reproduction example, also because of how randomly the issue occurs for us.

Our usecase:

We use the queue to connect our main (MeteorJS) app to our workers (plain Node.js apps). These workers handle huge data imports: basically all our jobs consist of reading data from files or APIs, building MongoDB bulk update operations, and running these bulk operations on our MongoDB.

@manast
Contributor

manast commented Feb 6, 2025

But when this happens, what is the status of the job whose lock could not be renewed?

@Twisterking
Author

Twisterking commented Feb 6, 2025

We will implement some more logging today and get back to you. Thanks a lot for your responsiveness, highly appreciated!

We do use bullboard, and for some reason I cannot find these jobIds in our "failed" list. So I am also confused about where these jobs disappear to.

@manast
Contributor

manast commented Feb 6, 2025

How many jobs do you usually run concurrently?

@Twisterking
Author

On these workers, only 1!
We do have 2 Docker containers, but each only runs 1 job at a time. So "in total" you could say 2 jobs might run concurrently, but they run in separate Docker containers on separate Node processes.

@manast
Contributor

manast commented Feb 6, 2025

Are these jobs blocking the NodeJS event loop? Did you try using sandboxed processors instead?

@melihplt

melihplt commented Feb 6, 2025

@Twisterking to find a pattern, I want to ask if you have the same setup as me.

  • Are you using Heroku or some container?
  • Do you add new jobs inside a worker?
  • Do you connect a websocket inside the worker?

@Twisterking
Author

Quick update from my end:

It looks like we did indeed identify some nested for loops and such that block the event loop.
It just took us a very long time to find them. 😬

Will report back when I know (even) more!

@melihplt

  • we use self-hosted Docker containers on AWS EC2
  • yes, we do add jobs inside the workers
  • actually yes, we do. We have Meteor's DDP connection in place to connect to our main server.

@melihplt

melihplt commented Feb 9, 2025

According to some logging, in my case the job "sometimes" gets stuck while connecting to Discord via a socket. But I don't understand why I cannot force the worker process to be killed. I will dig more too. Thanks @Twisterking.

@manast
Contributor

manast commented Feb 9, 2025

@melihplt could it be a bug in NodeJS where the connection enters an infinite loop? Have you tried with a different runtime such as Bun to see if you get the same result?

@Twisterking
Author

Hello again,

I have some updates!
We were able to improve the situation a bit, but we still run into the could not renew lock error very frequently.

What we do not understand at all is this: we have the following 2 settings set on ALL workers of the affected queue:

{
  maxStalledCount: 0, // do NOT allow a stalled job to be "retried"; retrying CAN lead to a situation where MULTIPLE workers work on the same job!
  stalledInterval: 3 * 60 * 1000 // 3 minutes
}

We just realized that there are also 2 more options we did NOT change, which are therefore set to their defaults: lockDuration and lockRenewTime.

But even then, with e.g. lockDuration at 30 seconds, how is it possible that we see the error in our logs this often (see the timestamps at the very left!):

[Screenshot of log output showing the repeated "could not renew lock" errors with timestamps]

I would really appreciate your input! We need to give our workers enough time so that jobs do NOT stall. We DO have some parts of the code that sometimes block Node's event loop for over 30 seconds, which is fine for us.

But we are confused about which settings we need in order to make this possible.
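For reference, a sketch of how these options fit together on a worker, assuming the processor can block the event loop for somewhat over 30 seconds (the queue name, connection, and concrete values are only illustrative):

import { Worker } from 'bullmq';

// Placeholder worker; queue name, processor and connection are illustrative.
const worker = new Worker('imports', async (job) => {
  // CPU-heavy, event-loop-blocking work...
}, {
  connection: { host: 'localhost', port: 6379 },
  lockDuration: 120000,    // must outlast the longest stretch in which the event loop is blocked
  lockRenewTime: 60000,    // how often the lock is extended (defaults to lockDuration / 2)
  stalledInterval: 180000, // how often active jobs are checked for lost locks
  maxStalledCount: 0,      // a job whose lock is lost fails instead of being retried
});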

@manast
Contributor

manast commented Feb 10, 2025

Why don't you use sandboxed processors, which are designed precisely for cases where you keep the NodeJS event loop busy?
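For anyone landing here, a minimal sketch of a sandboxed processor in its CommonJS form (the file layout, queue name and connection are placeholders):

// processor.js – runs in a separate child process, so blocking the event
// loop here does not stop the parent process from renewing the job lock.
module.exports = async (job) => {
  // CPU-heavy import work...
  return { done: true };
};

// worker.js
const { Worker } = require('bullmq');
const path = require('path');

const worker = new Worker('imports', path.join(__dirname, 'processor.js'), {
  connection: { host: 'localhost', port: 6379 },
});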

@Twisterking
Author

@manast

I tried this back then with bull and ran into huge issues.
Since bull was an "old-school" require() package, we ran into issues inside our "type": "module" ESM Node app.

For this, and other reasons, I would like to avoid doing this.
