Replies: 2 comments
-
This is not supported natively. You need to manually update the status of that particular job in the database before restarting it. The reason for not having such a feature natively is that Spring Batch cannot, just by looking at the status of the job in the database, distinguish a job that is effectively running from one that failed abruptly. There is a note about this case in the docs here, and I also covered this topic here. I think one clean way to address this with Kubernetes is to develop a custom job controller that updates the database accordingly when a pod fails. A fresh pod will then restart the job from where it left off without any issue. This feature would be awesome, as we would have "self-healing" jobs.
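For reference, here is a minimal sketch of what that manual status update can look like when done through the Spring Batch APIs rather than raw SQL. It assumes Spring Batch 5 (LocalDateTime timestamps) and that your application already knows the id of the stale execution; how you obtain and guard that id is up to you:

```java
import java.time.LocalDateTime;

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobOperator;
import org.springframework.batch.core.repository.JobRepository;

public class StaleExecutionRecoverer {

    private final JobExplorer jobExplorer;
    private final JobRepository jobRepository;
    private final JobOperator jobOperator;

    public StaleExecutionRecoverer(JobExplorer jobExplorer, JobRepository jobRepository,
            JobOperator jobOperator) {
        this.jobExplorer = jobExplorer;
        this.jobRepository = jobRepository;
        this.jobOperator = jobOperator;
    }

    /**
     * Marks a job execution left in STARTED by a dead pod as FAILED, then
     * restarts it. Only call this when you know no other process is
     * actually running the execution.
     */
    public Long recover(long staleExecutionId) throws Exception {
        JobExecution execution = jobExplorer.getJobExecution(staleExecutionId);
        // Mark step executions that were left in a running state as failed,
        // so the restart can resume them cleanly.
        for (StepExecution stepExecution : execution.getStepExecutions()) {
            if (stepExecution.getStatus().isRunning()) {
                stepExecution.setStatus(BatchStatus.FAILED);
                stepExecution.setEndTime(LocalDateTime.now());
                jobRepository.update(stepExecution);
            }
        }
        execution.setStatus(BatchStatus.FAILED);
        execution.setEndTime(LocalDateTime.now());
        jobRepository.update(execution);
        // The restart creates a new execution that resumes from the last commit point.
        return jobOperator.restart(staleExecutionId);
    }
}
```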
-
Thanks for your answer @fmbenhassine. I understand that Spring Batch can't handle automatic retries or anything more "intelligent", because it has no knowledge of the business and of what can safely be done. But I'm really surprised that there is no API in the JobOperator to easily perform the operations required to "retake" a partitioned job on the manager side.

As it stands, everyone who wants to "retake" a partition manager job at some point has to write hardcoded queries against a model they don't own. That is of course a solution, but it sounds more like a workaround to me.

From a resilience point of view, our architecture is really good at never losing anything, but because of the waiting nature of the partitioned job, I need a stable server to host the launcher of the partitioned job. That looks like a real weak point to me. Basically, I have multiple RabbitMQ listeners that consume "task" events to be executed (I say task and not job because it's our internal model, decoupled from Spring Batch). But when the task is a partitioned job, the server running the partition manager can't fail, otherwise the next execution gets an "already running" exception. I understand why this exception happens, and it makes sense for the framework. But why not allow explicitly "retaking" the execution of the partition manager instead of throwing the exception? It's not a restart, it's not a relaunch, it's just "continue waiting over the partitions as the manager". With such a solution, it would be far easier to implement resilient job execution on top of resilient AMQP, which looks like the standard way to implement resilience here, IMHO.

NOTE: it's important to think about this feature from the manager's point of view in a remote partitioned job.
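To make the feature request concrete, here is a purely hypothetical sketch of what such an operation could look like. This method does not exist in Spring Batch; the name and semantics are invented here for illustration only:

```java
import org.springframework.batch.core.launch.JobOperator;
import org.springframework.batch.core.launch.NoSuchJobExecutionException;

// Hypothetical API, NOT part of Spring Batch: a "retake" operation on top
// of the existing JobOperator, expressing "continue the manager's wait".
public interface ResumableJobOperator extends JobOperator {

    /**
     * Re-attach to a remote-partitioned job execution whose manager process
     * died: instead of failing with JobExecutionAlreadyRunningException,
     * resume polling the job repository for the worker step executions and
     * aggregate their results when they complete.
     */
    Long retake(long executionId) throws NoSuchJobExecutionException;
}
```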
-
I'm currently using Spring Batch with remote partitioning and facing resilience challenges on the manager side.
The setup is straightforward: I have a job composed of multiple partitioned steps, distributed via RabbitMQ using persistent queues. On the worker side, resilience is simple to achieve—if a pod crashes, the message remains in the queue and is picked up by another healthy pod. Each reader is restartable, thanks to the use of CounterReader, and everything works smoothly.
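For context, the manager side looks roughly like the following trimmed-down sketch (assuming Spring Batch 5 with spring-batch-integration and Spring Integration AMQP; the channel, routing key, step names and partitioner are placeholders for our actual configuration):

```java
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.integration.config.annotation.EnableBatchIntegration;
import org.springframework.batch.integration.partition.RemotePartitioningManagerStepBuilderFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.amqp.dsl.Amqp;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.integration.dsl.IntegrationFlow;

@Configuration
@EnableBatchIntegration
public class ManagerStepConfiguration {

    @Bean
    public DirectChannel requests() {
        return new DirectChannel();
    }

    // Outbound flow: partition requests go to a persistent RabbitMQ queue.
    @Bean
    public IntegrationFlow outboundFlow(RabbitTemplate rabbitTemplate) {
        return IntegrationFlow.from(requests())
                .handle(Amqp.outboundAdapter(rabbitTemplate).routingKey("partition.requests"))
                .get();
    }

    // No reply channel: the manager polls the job repository until all
    // worker step executions are done, then aggregates their statuses.
    @Bean
    public Step managerStep(RemotePartitioningManagerStepBuilderFactory factory,
            Partitioner partitioner) {
        return factory.get("managerStep")
                .partitioner("workerStep", partitioner)
                .outputChannel(requests())
                .pollInterval(5000) // check worker progress every 5 seconds
                .build();
    }
}
```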
On the manager side, job execution is also triggered via a persistent RabbitMQ queue, as part of a broader job lifecycle. When a pod crashes and the job is retried by another pod, it often results in a JobExecutionAlreadyRunningException, which is expected.
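Roughly, the trigger looks like this sketch (the queue name and TaskEvent type stand in for our internal model); on redelivery after a manager pod crash, the jobLauncher.run call is where the exception surfaces:

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.core.repository.JobExecutionAlreadyRunningException;
import org.springframework.stereotype.Component;

@Component
public class TaskEventListener {

    private final JobLauncher jobLauncher;
    private final Job partitionedJob;

    public TaskEventListener(JobLauncher jobLauncher, Job partitionedJob) {
        this.jobLauncher = jobLauncher;
        this.partitionedJob = partitionedJob;
    }

    // Placeholder for our internal task model, decoupled from Spring Batch.
    public record TaskEvent(String taskId) {}

    @RabbitListener(queues = "task.events")
    public void onTaskEvent(TaskEvent event) throws Exception {
        try {
            jobLauncher.run(partitionedJob, new JobParametersBuilder()
                    .addString("taskId", event.taskId())
                    .toJobParameters());
        }
        catch (JobExecutionAlreadyRunningException e) {
            // A previous pod died while the manager was waiting for the
            // workers: the execution is still marked STARTED in the job
            // repository, so the relaunch is rejected. Rethrowing lets the
            // broker redeliver, but the status has to be fixed manually.
            throw e;
        }
    }
}
```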
My question is: what's the recommended strategy for handling this scenario?
I've seen discussions like this, but I'm hesitant to resort to hardcoded database queries outside the Spring Batch API to resolve the issue.
I understand what’s happening internally and can manually clean things up when necessary. But what I really need is a way to tell Spring Batch: “Take back control of this job, I promise no one else is running it.”
If Spring Batch doesn’t support this natively, what's the cleanest, most robust way to achieve it without breaking encapsulation or relying on custom SQL?