
After update to bullmq: Error: could not renew lock for job #3056

Open · 1 task done
Twisterking opened this issue Feb 5, 2025 · 19 comments

Comments

@Twisterking

Version

v5.34.2

Platform

NodeJS

What happened?

We used the predecessor bull successfully and very heavily for many months at our company.
Now we have updated to bullmq and, to be honest, we are having quite a few issues.

Our queues get stuck quite frequently (we never had this issue before!), and we sometimes run into this error:

Error: could not renew lock for job xyz

It just keeps repeating and never resolves until we do a restart, which is quite bad for us.

I also could not really find anything about this in other issues. What can we do here?

How to reproduce.

No response

Relevant log output

Code of Conduct

  • I agree to follow this project's Code of Conduct
@Twisterking Twisterking added the bug (Something isn't working) label on Feb 5, 2025
@manast
Contributor

manast commented Feb 5, 2025

As far as I know, BullMQ is at least as stable as Bull and has a much larger test suite, so in general you should have fewer issues, not more. However, it is possible that during the migration you made some assumptions about how BullMQ works that do not hold true coming from Bull. The best thing would be to post a case that reproduces those issues so we can give you hints, or look deeper into it if it happens to be a bug.

Furthermore, you mentioned that you run into a given error. That error is only produced via an event, and is only triggered if a lock cannot be renewed for a given job. This is quite unusual, so it is probably related to the migration work. I also wonder, are you using TypeScript?
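For reference, a minimal sketch of how such errors can be observed on a worker (the queue name, processor, and connection settings are placeholders, not taken from this issue):

import { Worker } from 'bullmq';

// Hypothetical worker for illustration only.
const worker = new Worker(
  'imports',
  async (job) => {
    // job processing...
  },
  { connection: { host: 'localhost', port: 6379 } }
);

// Lock-renewal problems surface as 'error' events on the worker, with a
// message like "could not renew lock for job <id>".
worker.on('error', (err) => {
  console.error('Worker error:', err.message);
});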

@melihplt

melihplt commented Feb 6, 2025

I am having the same issue. It happens randomly and I cannot even destroy the queue. I have to restart each time.

@roggervalf
Collaborator

roggervalf commented Feb 6, 2025

Hi folks, just out of curiosity, how did you migrate from bull to bullmq? Did you create new queues for bullmq, or did you use a different prefix?

@Twisterking
Author

Twisterking commented Feb 6, 2025

Hello everyone,

No, we do not use TypeScript, just vanilla JS.
The thing is that we continue to run into this issue. We have now even set these two options for our workers:

{
  maxStalledCount: 0, // do NOT allow a stalled job to be "retried"; retrying CAN lead to a situation where MULTIPLE workers work on the same job!
  stalledInterval: 1 * 60 * 1000 // 1 minute
}

... and we continue to have this issue. We need to restart the whole Node instance (Docker container) to make it start up again.

We run into the failed event with the error: Error: could not renew lock for job xyz.

We are already trying not to overload the CPU as best we can. Of course we "fluctuate around 100%", but to me this does not mean that we really leave ZERO headroom for the CPU to even renew the lock. :/

Migration:

We are using a different prefix (bullmq). So we basically discontinued the old bull queue and deployed all our instances in the right order to make the transition as smooth as possible.
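For context, a minimal sketch of how a separate prefix keeps the new BullMQ keys apart from the old bull keys in Redis (the queue name and connection are placeholders):

import { Queue, Worker } from 'bullmq';

// Placeholder connection and queue name for illustration only.
const connection = { host: 'localhost', port: 6379 };

// With prefix: 'bullmq', all keys live under "bullmq:*" instead of the
// default "bull:*" namespace, so the old bull data is never touched.
const queue = new Queue('imports', { connection, prefix: 'bullmq' });
const worker = new Worker('imports', async (job) => { /* ... */ }, { connection, prefix: 'bullmq' });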

@manast
Contributor

manast commented Feb 6, 2025

9 times out of 10, errors of this nature stem from passing wrong options or arguments when not using TypeScript, especially coming from Bull, which does not have the same signatures.

It is difficult to assess whether your issue is related to high CPU usage, as you mentioned that you are sometimes at 100%. Without more information about the specifics of your use case and a test case that shows the problem, we really do not have much chance of helping you.
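To illustrate the kind of signature difference meant here, a hedged sketch (the queue name and connection are placeholders):

// Bull: the processor and most options hang off the queue itself.
const BullQueue = require('bull');
const bullQueue = new BullQueue('imports', 'redis://localhost:6379');
bullQueue.process(async (job) => { /* ... */ });

// BullMQ: producing and consuming are split into separate classes, and
// options such as lockDuration or stalledInterval belong to the Worker.
const { Queue, Worker } = require('bullmq');
const connection = { host: 'localhost', port: 6379 };
const queue = new Queue('imports', { connection });
const worker = new Worker('imports', async (job) => { /* ... */ }, { connection });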

@manast manast added the cannot reproduce label and removed the bug (Something isn't working) label on Feb 6, 2025
@manast
Contributor

manast commented Feb 6, 2025

I am having the same issue. It happens randomly and I cannot even destroy the queue. I have to restart each time.

It is highly unlikely that you are having "the same issue", especially when we do not even know yet what the issue is. So if you have an issue, please post a reproducible case in a new issue and we will look into it.

@Twisterking
Author

Twisterking commented Feb 6, 2025

Thanks for the reply @manast. Could you please add more details on the "wrong passing of options"?

I don't understand how some "code bug" on our end could trigger this particular error.
My understanding was that the Error: could not renew lock for job xyz error should in 90% of cases not happen at all, but IF it does, it is most often triggered by a stalled job. Maybe I got this wrong?

Side note: we do use TS checks in our VS Code setup and do not get any errors about e.g. wrongly passed options to Queue or Worker or anything like that. Us passing completely wrong options somewhere seems unlikely to me.

On that note: we do have a queue events listener on the stalled event and do NOT see any logging from it.
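For reference, a minimal sketch of one way such a listener can look (via QueueEvents; the queue name and connection are placeholders):

import { QueueEvents } from 'bullmq';

// Placeholder queue name and connection for illustration only.
const queueEvents = new QueueEvents('imports', {
  connection: { host: 'localhost', port: 6379 },
});

// Emitted when the stall checker marks an active job as stalled.
queueEvents.on('stalled', ({ jobId }) => {
  console.warn(`Job ${jobId} stalled`);
});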

It is almost impossible for me to give you a reproduction example, also because of how randomly the issue occurs for us.

Our usecase:

We use the queue to connect our main (MeteorJS) app to our workers (plain Node.js apps). These workers handle huge data imports: basically all our jobs consist of reading data from files or APIs, building MongoDB bulk update operations, and running these bulk operations on our MongoDB.

@manast
Contributor

manast commented Feb 6, 2025

But when this happens, what is the status of the job whose lock could not be renewed?

@Twisterking
Author

Twisterking commented Feb 6, 2025

We will implement some more logging today and get back to you. Thanks a lot for your responsiveness, highly appreciated!

We do use bullboard, and for some reason I cannot find these jobIds in our "failed" list. So I am also confused about where these jobs disappear to.

@manast
Contributor

manast commented Feb 6, 2025

How many jobs do you usually run concurrently?

@Twisterking
Author

On these workers, only 1!
We do have 2 Docker containers, but each only runs 1 job at a time. So "in total" you could say 2 jobs might run concurrently, but they run in separate Docker containers on separate Node processes.

@manast
Contributor

manast commented Feb 6, 2025

Are these jobs blocking the NodeJS event loop? Did you try using sandboxed processors instead?

@melihplt

melihplt commented Feb 6, 2025

@Twisterking to find a pattern, I want to ask if you have the same setup as me.

  • Are you using Heroku or some container?
  • Do you add new jobs inside a worker?
  • Do you connect a websocket inside the worker?

@Twisterking
Author

Quick update from my end:

It looks like we did indeed identify some nested for loops and such that block the event loop.
It just took us a very long time to find them. 😬

Will report back when I know (even) more!

@melihplt

  • we use self-hosted Docker containers on AWS EC2
  • yes, we do add jobs inside the workers
  • actually yes, we do. We have Meteor's DDP connection in place to connect to our main server.

@melihplt

melihplt commented Feb 9, 2025

According to some logging, in my case the job "sometimes" gets stuck while connecting to Discord via a socket. But I don't understand why I cannot force the worker process to be killed. I will dig more too. Thanks @Twisterking.

@manast
Contributor

manast commented Feb 9, 2025

@melihplt could it be a bug in NodeJS where the connection enters an infinite loop? Have you tried with a different runtime such as Bun to see if you get the same result?

@Twisterking
Author

Hello again,

I have some updates!
We were able to improve the situation a bit, but we still run into the could not renew lock error very frequently.

What we do not understand at all is this: we have the following 2 settings set on ALL workers of the affected queue:

{
  maxStalledCount: 0, // do NOT allow a stalled job to be "retried"; retrying CAN lead to a situation where MULTIPLE workers work on the same job!
  stalledInterval: 3 * 60 * 1000 // 3 minutes
}

We just realized that there are also 2 more options we did NOT change, which are therefore set to their defaults: lockDuration and lockRenewTime.

But even then, with e.g. lockDuration at 30 seconds, how is it possible that we see the error in our logs this often (see the timestamps at the very left!):

[Screenshot of log output showing the repeated "could not renew lock" errors with timestamps]

I would really appreciate your input! We need to give our workers enough time so that jobs do NOT stall. We DO have some parts of the code that sometimes block Node's event loop for over 30 seconds, which is fine for us.

But we are confused about which settings we need in order to make this possible.
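For reference, a sketch of how these options fit together on a worker, assuming the processor can block the event loop for somewhat over 30 seconds (the queue name, connection, and concrete values are only illustrative):

import { Worker } from 'bullmq';

// Placeholder worker; queue name, processor and connection are illustrative.
const worker = new Worker('imports', async (job) => {
  // CPU-heavy, event-loop-blocking work...
}, {
  connection: { host: 'localhost', port: 6379 },
  lockDuration: 120000,    // must outlast the longest stretch in which the event loop is blocked
  lockRenewTime: 60000,    // how often the lock is extended (defaults to lockDuration / 2)
  stalledInterval: 180000, // how often active jobs are checked for lost locks
  maxStalledCount: 0,      // a job whose lock is lost fails instead of being retried
});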

@manast
Contributor

manast commented Feb 10, 2025

Why don't you use sandboxed processors, which are designed precisely for cases where you keep the NodeJS event loop busy?
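For anyone landing here, a minimal sketch of a sandboxed processor in its CommonJS form (the file layout, queue name and connection are placeholders):

// processor.js – runs in a separate child process, so blocking the event
// loop here does not stop the parent process from renewing the job lock.
module.exports = async (job) => {
  // CPU-heavy import work...
  return { done: true };
};

// worker.js
const { Worker } = require('bullmq');
const path = require('path');

const worker = new Worker('imports', path.join(__dirname, 'processor.js'), {
  connection: { host: 'localhost', port: 6379 },
});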

@Twisterking
Author

@manast

I tried this back then with bull and ran into huge issues.
Since bull was an "old-school" require() package, we ran into issues inside our "type": "module" ESM Node app.

For this, and other reasons, I would like to avoid doing this.
