AER-4181 Added a startup guard to the schedulers #115
Merged
Hilbrand merged 3 commits into aerius:main on Mar 20, 2026
This guard blocks a scheduler from starting while there are still tasks on the worker queue. This helps when the taskmanager is restarted during operation while tasks are still on the queue. When the taskmanager is restarted it loses the information about which tasks are still in progress, and therefore operates as if there are no tasks. As a result it adds new tasks to the worker queue even though it should wait for the tasks that are already on the queue. A restart also skewed the load metrics: the taskmanager would report no load while there were tasks on the queue. This could result in scaling down, because it looked like there was nothing to do. With this change the taskmanager waits until the worker queue is actually empty before doing anything, so it starts fresh. It also won't report any new load metrics until that moment. This keeps the last load metric that was sent active, which should result in no action being taken by the scaling mechanism.
Member
BertScholten
left a comment
Not too sure about this, it might backfire a bit:
If there is one task on the chunker-worker queue that takes a day to process and the taskmanager resets, then the taskmanager won't handle any new tasks until that long-running task is done? I think it might be better if some scaling does occur at that point (which, if it's busy, shouldn't happen anyway, as there should be a bit of a wait period).
Member
Author
@BertScholten I realized that as well. It could block if a long-running task is still on the queue, so I'll improve on this and find a better solution. For now I've put the PR in draft. I'll update when I'm finished.
… that were still on the queue
Stopping at startup when there are still messages on the queue can potentially block when a very long task is still in progress. Therefore the implementation was changed to not block until the queue reaches 0, but to block only until the first update from RabbitMQ has been retrieved. By waiting for that information the taskmanager can be initialized with the number of messages that are still on the queue. The WorkerPool and the metrics reporter can then initialize themselves before actual scheduling starts, and can then handle the tasks already on the queue when they finish.
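The revised approach can be illustrated with a minimal sketch. It is not the code from this PR: the class `StartupGuardSketch` and its methods are hypothetical names, and it assumes that queue-size updates arrive through a callback. It only shows the idea of blocking until the first update has been seen, rather than until the queue is empty.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names are illustrative and not the taskmanager's actual API.
class StartupGuardSketch {
  private final CountDownLatch firstUpdate = new CountDownLatch(1);
  private volatile int initialMessages;

  // Assumed callback: invoked each time a queue-size update is received.
  synchronized void onQueueUpdate(final int messagesOnQueue) {
    // Only the first update is recorded; later updates do not change the initial value.
    if (firstUpdate.getCount() > 0) {
      initialMessages = messagesOnQueue;
      firstUpdate.countDown();
    }
  }

  // Blocks until the first update has arrived or the timeout expires.
  // Returns true if the update arrived, false on timeout or interruption.
  boolean awaitFirstUpdate(final long timeoutMillis) {
    try {
      return firstUpdate.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  // Number of messages that were on the queue when the first update arrived.
  int initialMessages() {
    return initialMessages;
  }
}
```

In such a setup the scheduler would call `awaitFirstUpdate` before it starts scheduling, and then hand `initialMessages()` to whatever tracks running tasks and load. A long-running task on the queue no longer blocks startup, because the guard releases on the first update regardless of the reported count.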
BertScholten
approved these changes
Mar 18, 2026
.../taskmanager/src/test/java/nl/aerius/taskmanager/metrics/PerformanceMetricsReporterTest.java
Outdated
source/taskmanager/src/test/java/nl/aerius/taskmanager/WorkerPoolTest.java
Outdated
The startup guard waits until the first information about the number of messages on the worker queue is received from RabbitMQ. This information is used to initialize the WorkerPool and the load metrics.
If the Taskmanager is started while there are still messages on the worker queue, it takes those messages into account when determining whether new messages can be put on the queue and when calculating the load. That way it won't report incorrect information while messages are still on the queue. This situation only occurs when the Taskmanager is started while there are messages on the worker queue. Because the load information can be used for scaling, an incorrect load value could result in undesired scaling triggers. For example, when the messages already on the queue are not taken into account, the load is reported as if fewer workers are active, which could trigger a scale down because the system thinks it's less busy, potentially killing tasks that are still running. Especially for long-running tasks this could result in the task being restarted.
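To make the effect on the load value concrete, here is a small sketch. The class `LoadSketch`, its methods, and the load formula (running tasks divided by the number of workers) are assumptions for illustration only; the actual WorkerPool and metrics code may calculate this differently.

```java
// Hypothetical sketch; names and the load formula are assumptions, not the PR's code.
class LoadSketch {
  private final int workers;
  private int runningTasks;

  LoadSketch(final int workers, final int initialMessagesOnQueue) {
    this.workers = workers;
    // Messages already on the queue at startup are counted as work in progress.
    this.runningTasks = initialMessagesOnQueue;
  }

  void taskDispatched() {
    runningTasks++;
  }

  void taskFinished() {
    runningTasks = Math.max(0, runningTasks - 1);
  }

  // New messages may only be put on the queue while a worker is free.
  boolean hasFreeWorker() {
    return runningTasks < workers;
  }

  // Load as a percentage of workers in use.
  double loadPercentage() {
    return workers == 0 ? 0.0 : 100.0 * runningTasks / workers;
  }
}
```

With 4 workers and 3 messages still on the queue at startup, `new LoadSketch(4, 0)` reports a load of 0, which is the misleading value described above, while `new LoadSketch(4, 3)` reports 75 and leaves room for only one new task.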