
Timeout for stuck workers #65

Open
gbhrdt opened this issue Aug 29, 2018 · 6 comments · May be fixed by #87
gbhrdt commented Aug 29, 2018

It can happen that workers get stuck silently. We were using a node-resque worker before, which handled this scenario very well. With goworker, jobs just stay shown as running in resque-web, and the worker count in resque-web keeps increasing (it should be 4).

[screenshot: resque-web worker list, 2018-08-29 13:26]

mingan commented Aug 29, 2018

Are you restarting the goworker application? We see this behaviour when the application is hard-stopped and doesn't have time to clean up its records in Redis (the workers and worker:[node] keys).

gbhrdt commented Aug 29, 2018

@mingan Yes, sometimes we re-deploy the Docker containers while jobs are still running, so that might be the cause. I think we should definitely clean up when starting up again.

Edit:
node-resque does something like this to clean up:

```js
const shutdown = async () => {
  await scheduler.end();
  await worker.end();
  process.exit();
};

process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
```
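A rough Go analogue could hook the same signals. This is a minimal sketch only: the `cleanupKeys` helper and the key names are illustrative (modelled on Resque's Redis conventions), not goworker's actual API.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// cleanupKeys lists the Redis keys a worker should remove on shutdown so
// resque-web stops showing it as running. The names mirror Resque's
// conventions (the workers set plus per-worker records) and are illustrative.
func cleanupKeys(namespace, workerID string) []string {
	return []string{
		namespace + ":workers", // SREM workerID from this set
		namespace + ":worker:" + workerID,
		namespace + ":worker:" + workerID + ":started",
	}
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		<-sigs
		// Delete this worker's records before exiting.
		for _, key := range cleanupKeys("resque", "host:123:high") {
			fmt.Println("would DEL/SREM:", key)
		}
		os.Exit(0)
	}()

	// ... start goworker here, e.g. goworker.Work() ...
}
```

The same cleanup would also need to run on startup to catch records left behind by a previous hard stop, since a SIGKILL (e.g. from a Docker stop timeout) never reaches the handler.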

mingan commented Aug 29, 2018

Yeah, the problem is figuring out a safe mechanism to do so while keeping it compatible with the Resque gem.


gbhrdt commented Aug 29, 2018

Those stuck jobs don't cause us any disadvantage other than Redis memory consumption, right? So concurrency still works as expected, and the stuck jobs are no longer considered by goworker?

mingan commented Aug 29, 2018

If it's the same issue we have experienced, there are extra values in the set of workers, and the dead workers appear to still be working in the UI (there are records under the given prefix). I'm not sure whether the jobs themselves are failed or abandoned; that might be an issue.

There's similar code in goworker (https://github.com/benmanns/goworker/blob/master/signals.go) which stops polling and stops idle workers. I don't remember it exactly and don't have time to look it up at the moment, but I think it doesn't force a running worker to stop, so unless it finishes normally, it might hang.

xescugc commented Feb 3, 2021

This logic should be added the way it's done in the "main" Resque: https://github.com/resque/resque/blob/master/lib/resque/worker.rb#L599

It basically consists of a heartbeat, plus a prune function that runs when a worker starts and expires old workers.

I'll try to work on this and add it to the lib. Would this be something that would be merged if implemented? (cc @benmanns)

xescugc linked pull request #87 on Jun 4, 2021 that will close this issue.