Replies: 2 comments
-
This is not supported natively. You need to manually update the status of that particular job in the database before restarting it. The reason for not having such a feature natively is that Spring Batch cannot, just by looking at the status of the job in the database, distinguish a job that is effectively running from one that failed abruptly. There is a note about this case in the docs here, and I also covered this topic here. I think one clean way to address this with Kubernetes is to develop a custom job controller that updates the database accordingly when a pod fails. A fresh pod will then restart the job from where it left off without any issue. This feature would be awesome, as we would have "self-healing" jobs.
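For reference, here is a minimal sketch of what that manual status update can look like when done through the Spring Batch APIs rather than raw SQL. It assumes Spring Batch 5 (LocalDateTime timestamps) and that your application already knows the id of the stale execution; how you obtain and guard that id is up to you:

```java
import java.time.LocalDateTime;

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobOperator;
import org.springframework.batch.core.repository.JobRepository;

public class StaleExecutionRecoverer {

    private final JobExplorer jobExplorer;
    private final JobRepository jobRepository;
    private final JobOperator jobOperator;

    public StaleExecutionRecoverer(JobExplorer jobExplorer, JobRepository jobRepository,
            JobOperator jobOperator) {
        this.jobExplorer = jobExplorer;
        this.jobRepository = jobRepository;
        this.jobOperator = jobOperator;
    }

    /**
     * Marks a job execution left in STARTED by a dead pod as FAILED, then
     * restarts it. Only call this when you know no other process is
     * actually running the execution.
     */
    public Long recover(long staleExecutionId) throws Exception {
        JobExecution execution = jobExplorer.getJobExecution(staleExecutionId);
        // Mark step executions that were left in a running state as failed,
        // so the restart can resume them cleanly.
        for (StepExecution stepExecution : execution.getStepExecutions()) {
            if (stepExecution.getStatus().isRunning()) {
                stepExecution.setStatus(BatchStatus.FAILED);
                stepExecution.setEndTime(LocalDateTime.now());
                jobRepository.update(stepExecution);
            }
        }
        execution.setStatus(BatchStatus.FAILED);
        execution.setEndTime(LocalDateTime.now());
        jobRepository.update(execution);
        // The restart creates a new execution that resumes from the last commit point.
        return jobOperator.restart(staleExecutionId);
    }
}
```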
-
Thanks for your answer @fmbenhassine. I understand that Spring Batch can't handle automatic retries or anything more "intelligent", because it has no knowledge of the business and of what can safely be done. But I'm really surprised that there is no API in the JobOperator to easily perform the operations required to "retake" a partitioned job on the manager side.

As it stands, everyone who wants to "retake" a partition manager job at some point has to write hardcoded queries against a model they don't own. That is of course a solution, but it sounds more like a workaround to me.

From a resilience point of view, our architecture is really good at never losing anything, but because of the waiting nature of the partitioned job, I need a stable server to host the launcher of the partitioned job. That looks like a real weak point to me. Basically, I have multiple RabbitMQ listeners that consume "task" events to be executed (I say task and not job because it's our internal model, decoupled from Spring Batch). But when the task is a partitioned job, the server running the partition manager can't fail, otherwise the next execution gets an "already running" exception. I understand why this exception happens, and it makes sense for the framework. But why not allow explicitly "retaking" the execution of the partition manager instead of throwing the exception? It's not a restart, it's not a relaunch, it's just "continue waiting over the partitions as the manager". With such a solution, it would be far easier to implement resilient job execution on top of resilient AMQP, which looks like the standard way to implement resilience here, IMHO.

NOTE: it's important to think about this feature from the manager's point of view in a remote partitioned job.
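To make the feature request concrete, here is a purely hypothetical sketch of what such an operation could look like. This method does not exist in Spring Batch; the name and semantics are invented here for illustration only:

```java
import org.springframework.batch.core.launch.JobOperator;
import org.springframework.batch.core.launch.NoSuchJobExecutionException;

// Hypothetical API, NOT part of Spring Batch: a "retake" operation on top
// of the existing JobOperator, expressing "continue the manager's wait".
public interface ResumableJobOperator extends JobOperator {

    /**
     * Re-attach to a remote-partitioned job execution whose manager process
     * died: instead of failing with JobExecutionAlreadyRunningException,
     * resume polling the job repository for the worker step executions and
     * aggregate their results when they complete.
     */
    Long retake(long executionId) throws NoSuchJobExecutionException;
}
```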
-
I'm currently using Spring Batch with remote partitioning and facing resilience challenges on the manager side.
The setup is straightforward: I have a job composed of multiple partitioned steps, distributed via RabbitMQ using persistent queues. On the worker side, resilience is simple to achieve—if a pod crashes, the message remains in the queue and is picked up by another healthy pod. Each reader is restartable, thanks to the use of CounterReader, and everything works smoothly.
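For context, the manager side looks roughly like the following trimmed-down sketch (assuming Spring Batch 5 with spring-batch-integration and Spring Integration AMQP; the channel, routing key, step names and partitioner are placeholders for our actual configuration):

```java
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.integration.config.annotation.EnableBatchIntegration;
import org.springframework.batch.integration.partition.RemotePartitioningManagerStepBuilderFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.amqp.dsl.Amqp;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.integration.dsl.IntegrationFlow;

@Configuration
@EnableBatchIntegration
public class ManagerStepConfiguration {

    @Bean
    public DirectChannel requests() {
        return new DirectChannel();
    }

    // Outbound flow: partition requests go to a persistent RabbitMQ queue.
    @Bean
    public IntegrationFlow outboundFlow(RabbitTemplate rabbitTemplate) {
        return IntegrationFlow.from(requests())
                .handle(Amqp.outboundAdapter(rabbitTemplate).routingKey("partition.requests"))
                .get();
    }

    // No reply channel: the manager polls the job repository until all
    // worker step executions are done, then aggregates their statuses.
    @Bean
    public Step managerStep(RemotePartitioningManagerStepBuilderFactory factory,
            Partitioner partitioner) {
        return factory.get("managerStep")
                .partitioner("workerStep", partitioner)
                .outputChannel(requests())
                .pollInterval(5000) // check worker progress every 5 seconds
                .build();
    }
}
```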
On the manager side, job execution is also triggered via a persistent RabbitMQ queue, as part of a broader job lifecycle. When a pod crashes and the job is retried by another pod, it often results in a JobExecutionAlreadyRunningException, which is expected.
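Roughly, the trigger looks like this sketch (the queue name and TaskEvent type stand in for our internal model); on redelivery after a manager pod crash, the jobLauncher.run call is where the exception surfaces:

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.core.repository.JobExecutionAlreadyRunningException;
import org.springframework.stereotype.Component;

@Component
public class TaskEventListener {

    private final JobLauncher jobLauncher;
    private final Job partitionedJob;

    public TaskEventListener(JobLauncher jobLauncher, Job partitionedJob) {
        this.jobLauncher = jobLauncher;
        this.partitionedJob = partitionedJob;
    }

    // Placeholder for our internal task model, decoupled from Spring Batch.
    public record TaskEvent(String taskId) {}

    @RabbitListener(queues = "task.events")
    public void onTaskEvent(TaskEvent event) throws Exception {
        try {
            jobLauncher.run(partitionedJob, new JobParametersBuilder()
                    .addString("taskId", event.taskId())
                    .toJobParameters());
        }
        catch (JobExecutionAlreadyRunningException e) {
            // A previous pod died while the manager was waiting for the
            // workers: the execution is still marked STARTED in the job
            // repository, so the relaunch is rejected. Rethrowing lets the
            // broker redeliver, but the status has to be fixed manually.
            throw e;
        }
    }
}
```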
My question is: what's the recommended strategy for handling this scenario?
I've seen discussions like this, but I'm hesitant to resort to hardcoded database queries outside the Spring Batch API to resolve the issue.
I understand what’s happening internally and can manually clean things up when necessary. But what I really need is a way to tell Spring Batch: “Take back control of this job, I promise no one else is running it.”
If Spring Batch doesn’t support this natively, what's the cleanest, most robust way to achieve it without breaking encapsulation or relying on custom SQL?