
AER-4181 Added a startup guard to the schedulers#115

Merged
Hilbrand merged 3 commits into aerius:main from Hilbrand:startup-guard
Mar 20, 2026


@Hilbrand Hilbrand commented Mar 13, 2026

The startup guard waits until the first information from RabbitMQ about the number of messages on the worker queue has been received. This information is used to initialize the WorkerPool/load metrics.
If the Taskmanager is started while there are still messages on the worker queue, it now takes those messages into account when determining whether new messages can be put on the queue and when calculating the load. That way it won't report incorrect information while there are still messages on the queue. This situation only occurs when the Taskmanager is started while there are messages on the worker queue. Because the load information can be used for scaling, an incorrect load value could result in undesired scaling triggers. For example, when the messages already on the queue are not taken into account, the load is reported as if fewer workers are active, which could trigger a scale-down because the system thinks it is less busy. That could kill tasks that are still running on the queue, and especially for long-running tasks this could result in the task being restarted.
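
As an illustration of the waiting behaviour described above, here is a minimal Java sketch of such a guard. It is not the code from this PR: the class name `StartupGuardSketch`, its methods, and the idea of a queue watcher calling `onQueueUpdate` are all assumptions made for the example. Only the first update opens the guard, and the count it carried is what the scheduler would be initialized with.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: holds back scheduling until the first queue-size update arrives. */
final class StartupGuardSketch {
  private final CountDownLatch firstUpdate = new CountDownLatch(1);
  private int initialMessages = -1;

  /** Assumed to be called by a RabbitMQ queue watcher on every update; only the first call opens the guard. */
  synchronized void onQueueUpdate(final int messagesOnQueue) {
    if (firstUpdate.getCount() > 0) {
      initialMessages = messagesOnQueue;
      firstUpdate.countDown();
    }
  }

  /** Blocks until the first update was seen; returns that message count, or -1 on timeout or interrupt. */
  int awaitInitialMessages(final long timeoutMillis) {
    try {
      if (!firstUpdate.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
        return -1;
      }
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
      return -1;
    }
    synchronized (this) {
      return initialMessages;
    }
  }

  /** True once the first update has been received. */
  boolean isOpen() {
    return firstUpdate.getCount() == 0;
  }
}
```

A scheduler thread would call `awaitInitialMessages(...)` once before its scheduling loop and use the returned count to initialize its worker pool and load metrics. Later updates do not change the recorded initial value.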

This guard will block a scheduler from starting if there are still tasks on the worker queue.
This will help when the taskmanager is restarted during operation and there are still tasks on the queue.
When the taskmanager is restarted it loses the information on which tasks are still in progress and therefore operates as if there are no tasks.
This results in new tasks being added to the worker queue even though it should wait until the tasks already on the queue are finished.
A restart also skewed the load metrics, because it would report no load while there were still tasks on the queue.
This could result in scaling down because it looked like there was nothing to do.
With this change the taskmanager will not do anything until the worker queue is actually empty, so it starts fresh.
It also won't report any new load metrics until that moment. This keeps the last load metric sent active, which should result in no action being taken by the scaling mechanism.

@BertScholten BertScholten left a comment


not too sure about this, it might backfire a bit:

if there is one task on the chunker-worker queue that will take a day to process, and taskmanager resets, then taskmanager won't handle any new tasks until that long-running task is done? Think it might be better if some scaling does occur at that point (which, if it's busy, shouldn't happen anyway, as there should be a bit of a wait period anyway)?

@Hilbrand Hilbrand marked this pull request as draft March 13, 2026 17:02
@Hilbrand Hilbrand commented

@BertScholten I realized that also. It could block if a long running task is still on the queue. So I'll improve on this and find a better solution. For now I've put the pr on draft. I'll update when I'm finished.

… that were still on the queue

Stopping at startup when there are still messages on the queue can potentially block for a long time when a very long task is still in progress.
Therefore changed the implementation to not block while waiting for the queue to reach 0, but to block only until the first update from RabbitMQ has been retrieved.
By waiting for that information the taskmanager can be initialized with the number of messages that are still on the queue.
The WorkerPool and the Metrics reporter can then initialize themselves before actual scheduling is started,
and they can then handle the tasks already on the queue when they are finished.
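
To make the effect of that initialization concrete, here is a small hypothetical Java sketch. It is not the real WorkerPool from this repository: the class `WorkerPoolSketch`, its methods, and the load formula (occupied workers divided by total workers) are assumptions chosen for illustration. The point is that seeding the pool with the messages already on the queue changes both the reported load and whether a new task may be scheduled.

```java
/** Hypothetical sketch of a worker pool that is seeded with the messages already on the queue. */
final class WorkerPoolSketch {
  private final int totalWorkers;
  private int occupied;

  WorkerPoolSketch(final int totalWorkers, final int initialMessagesOnQueue) {
    this.totalWorkers = totalWorkers;
    // Tasks that were on the queue before startup count as occupied workers.
    this.occupied = initialMessagesOnQueue;
  }

  /** Reserves a worker for a new task; returns false when all workers are accounted for. */
  synchronized boolean tryReserve() {
    if (occupied >= totalWorkers) {
      return false;
    }
    occupied++;
    return true;
  }

  /** Called when a task finishes, including tasks that predate the restart. */
  synchronized void release() {
    if (occupied > 0) {
      occupied--;
    }
  }

  /** Assumed load definition: fraction of workers in use, capped at 1.0. */
  synchronized double load() {
    return totalWorkers == 0 ? 0.0 : Math.min(1.0, (double) occupied / totalWorkers);
  }
}
```

With 4 workers and 3 messages already on the queue, `new WorkerPoolSketch(4, 3).load()` is 0.75 and only one more task can be reserved. A pool created as `new WorkerPoolSketch(4, 0)` would report 0.0 and accept four new tasks, which is the skew the description warns about.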
@Hilbrand Hilbrand marked this pull request as ready for review March 16, 2026 10:39
@Hilbrand Hilbrand requested a review from BertScholten March 16, 2026 10:39

@BertScholten BertScholten left a comment


LGTM

@Hilbrand Hilbrand merged commit 86a0fc6 into aerius:main Mar 20, 2026
1 check passed
@Hilbrand Hilbrand deleted the startup-guard branch March 20, 2026 09:13