
Timeout for stuck workers #65

Open
gbhrdt opened this issue Aug 29, 2018 · 6 comments · May be fixed by #87
gbhrdt commented Aug 29, 2018

It can happen that workers get stuck silently. We were using a node-resque worker before, which handled this scenario very well. With goworker, jobs just stay shown as running in resque-web, and the worker count in resque-web keeps increasing (it should be 4).

[screenshot: resque-web worker list, 2018-08-29 13:26]

mingan commented Aug 29, 2018

Are you restarting the goworker application? We see this behaviour when the application is hard-stopped and doesn't have time to clean up its records in Redis (the workers and worker:[node] keys).

gbhrdt commented Aug 29, 2018

@mingan Yes, sometimes we re-deploy the Docker containers while jobs are still running, so that might be the cause. I think we should definitely clean up when starting up again.

Edit:
node-resque does something like this to clean up:

```js
const shutdown = async () => {
  await scheduler.end();
  await worker.end();
  process.exit();
};

process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
```
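A rough Go analogue could hook the same signals. This is a minimal sketch only: the `cleanupKeys` helper and the key names are illustrative (modelled on Resque's Redis conventions), not goworker's actual API.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// cleanupKeys lists the Redis keys a worker should remove on shutdown so
// resque-web stops showing it as running. The names mirror Resque's
// conventions (the workers set plus per-worker records) and are illustrative.
func cleanupKeys(namespace, workerID string) []string {
	return []string{
		namespace + ":workers", // SREM workerID from this set
		namespace + ":worker:" + workerID,
		namespace + ":worker:" + workerID + ":started",
	}
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		<-sigs
		// Delete this worker's records before exiting.
		for _, key := range cleanupKeys("resque", "host:123:high") {
			fmt.Println("would DEL/SREM:", key)
		}
		os.Exit(0)
	}()

	// ... start goworker here, e.g. goworker.Work() ...
}
```

The same cleanup would also need to run on startup to catch records left behind by a previous hard stop, since a SIGKILL (e.g. from a Docker stop timeout) never reaches the handler.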

mingan commented Aug 29, 2018

Yeah, the problem is figuring out a safe mechanism to do so while keeping it compatible with the Resque gem.


gbhrdt commented Aug 29, 2018

Those stuck jobs don't cause us any disadvantage other than Redis memory consumption, right? So concurrency still works as expected, and the stuck jobs are no longer considered by goworker?

mingan commented Aug 29, 2018

If it's the same issue we have experienced, there are extra values in the set of workers, and the dead workers appear to still be working in the UI (there are records under the given prefix). I'm not sure whether the jobs themselves are failed or abandoned; that might be an issue.

There's similar code in goworker (https://github.com/benmanns/goworker/blob/master/signals.go) which stops polling and stops idle workers. I don't remember it exactly and don't have time to look it up at the moment, but I think it doesn't force a running worker to stop, so unless it finishes normally, it might hang.

xescugc commented Feb 3, 2021

This logic should be added the way it's done in the "main" Resque: https://github.com/resque/resque/blob/master/lib/resque/worker.rb#L599

It basically consists of a heartbeat, plus a prune function that runs when a worker starts and expires old workers.

I'll try to work on this and add it to the lib. Would this be something that would be merged if implemented? (cc @benmanns)

xescugc linked pull request #87 on Jun 4, 2021 that will close this issue.