Commit 3d8bbe3
Reworking scheduling of jobs to runners
The main change in this commit is a modification of the way jobs are
handled in the NSX-T agent. Please see the JobRerunner class for an
in-depth explanation of the changes.
Before this commit, jobs were added to one of two queues, called active
and passive. The active queue contained all requests coming in via API
calls, while the passive queue was filled with maintenance and resync
jobs. Both queues were priority queues that allowed each element to be
added only once.
Jobs were taken from the active queue until it was empty; then jobs from
the passive queue were added to the active queue.
Jobs taken from the active queue were then submitted to a worker
pool allowing up to 40 greenthreads to run the jobs concurrently.
However, to avoid race conditions, only one job is allowed to run per
OpenStack ID. If more than one job for the same ID is scheduled to run
on the pool, the additional jobs wait on a lock and block their worker
threads until the first job is done.
This means that the agent can be blocked and appear fully occupied,
handling 40 tasks simultaneously, while in reality most or all tasks are
waiting for each other.
Instead of scheduling all jobs to the worker pool immediately and
risking a blocked worker, we now first check whether the same job is
already running. If it is, we rerun the job after it has finished and
schedule another job that can actually run on the worker instead.
We need to rerun the job because a job can run for several seconds and
new API requests could arrive during that time.
With this change we also prevent rerunning the same job more than once
when additional requests arrive while the job is already marked for
re-execution.
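
The rerun logic described above can be sketched roughly as follows. This
is a simplified, single-threaded illustration and not the agent's actual
JobRerunner; the class name RerunScheduler, its methods, and the use of a
(function name, OpenStack ID) tuple as job key are assumptions made for
the example.

```python
from collections import deque


class RerunScheduler:
    """Hypothetical sketch; not the agent's actual JobRerunner."""

    def __init__(self):
        self.queue = deque()   # jobs waiting for a free worker
        self.running = set()   # jobs currently executing
        self.rerun = set()     # running jobs that must run once more

    def submit(self, job):
        if job in self.running:
            # Same job is busy: remember a single rerun instead of
            # blocking a worker on a lock.
            self.rerun.add(job)
        elif job not in self.queue:
            self.queue.append(job)

    def start_next(self):
        """Hand the next job to a worker; return it, or None if idle."""
        if not self.queue:
            return None
        job = self.queue.popleft()
        self.running.add(job)
        return job

    def finished(self, job):
        """Called by the worker when a job is done."""
        self.running.discard(job)
        if job in self.rerun:
            # Requests arrived while the job ran: queue it exactly once.
            self.rerun.discard(job)
            self.queue.append(job)
```

Because the rerun marker is a set, any number of requests arriving while
the job runs collapse into one additional execution.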
While implementing these changes, we found that some API calls,
and thus the resulting jobs, receive a dictionary rather than a string
as parameter, although the code indicates otherwise.
To support these calls, we have to handle that case as well. They are:
address_group_update and {enable, disable, update}_policy_logging
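
One way to accept both parameter shapes is a small normalizing helper.
The sketch below is illustrative only: the function name extract_os_id
and the assumption that the ID is stored under the key "id" are not
taken from the commit.

```python
def extract_os_id(arg):
    """Return the OpenStack ID from a plain string or a dict payload."""
    if isinstance(arg, dict):
        # Assumption: the dictionary carries the ID under the key "id".
        return arg["id"]
    return arg
```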
Additional fixes and enhancements in this commit:
UniqPriorityQueue:
- fixing add()
If a job was about to be added a second time, but with a higher priority,
the job was correctly not added again, but the priority of the existing
job was not updated. This meant that a job queued from the passive queue
kept its lower priority and was always executed last, even if a
high-priority request for it arrived via API call.
We changed the active queue to a FIFO to prevent passive jobs from never
being executed and to keep the execution order of API calls if possible.
With the fix in place, however, we can switch back to a priority queue
if needed.
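
The add() fix can be illustrated with a minimal unique priority queue.
This is a sketch under assumptions, not the real UniqPriorityQueue: the
class name UniqueQueueSketch is made up, and it assumes that a lower
number means a higher priority.

```python
import heapq
import itertools


class UniqueQueueSketch:
    """Hypothetical stand-in; lower number = higher priority (assumed)."""

    def __init__(self):
        self._heap = []                # entries: [priority, seq, item]
        self._entries = {}             # item -> live heap entry
        self._seq = itertools.count()  # tie-breaker keeps insertion order

    def add(self, item, priority):
        entry = self._entries.get(item)
        if entry is not None:
            if priority >= entry[0]:
                return                 # queued with equal or better priority
            entry[2] = None            # invalidate old entry (lazy deletion)
        entry = [priority, next(self._seq), item]
        self._entries[item] = entry
        heapq.heappush(self._heap, entry)

    def pop(self):
        while self._heap:
            _, _, item = heapq.heappop(self._heap)
            if item is not None:       # skip invalidated entries
                del self._entries[item]
                return item
        raise IndexError("pop from empty queue")

    def __len__(self):
        return len(self._entries)
```

Adding an already queued item with a better priority keeps it unique but
moves it ahead, which is the behaviour the fix restores.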
Runnable:
- fix hash() and make repr more verbose
The Runnable class did not follow the requirement that objects which
compare equal must also have the same hash value. In addition, Runnable
only took the OpenStack ID into account, not the name of the function.
Thus a Runnable could not, for example, be used correctly as a key in
dictionaries.
- __repr__
repr was updated to include the name of the function,
so we see what kind of update is being executed in the logs.
- __lt__
Runnable now orders items with the same priority by age, which prevents
jobs from overtaking each other.
- add timing info for logging
We currently do not get good information about the timings or basic
statistics of running jobs. This commit adds timing info to Runnable and
a method to extract it as a string for logging.

1 parent: 4d7abed
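
As a rough illustration of the Runnable changes listed above (consistent
hash and equality, a more verbose repr, ordering by age, and timing
info), a class could look like the sketch below. All names and fields in
it, including RunnableSketch, created, and timing(), are assumptions and
not the actual implementation.

```python
import time


class RunnableSketch:
    """Hypothetical illustration; field names are assumptions."""

    def __init__(self, os_id, fn, priority=0):
        self.os_id = os_id
        self.fn = fn
        self.priority = priority
        self.created = time.monotonic()
        self.started = None
        self.finished = None

    def _key(self):
        # Identity covers both the OpenStack ID and the function name.
        return (self.os_id, self.fn.__name__)

    def __eq__(self, other):
        if not isinstance(other, RunnableSketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Equal objects hash equally because both use the same key.
        return hash(self._key())

    def __lt__(self, other):
        # Same priority: the older job sorts first.
        return (self.priority, self.created) < (other.priority, other.created)

    def __repr__(self):
        return (f"RunnableSketch({self.fn.__name__}, {self.os_id}, "
                f"prio={self.priority})")

    def run(self):
        self.started = time.monotonic()
        try:
            return self.fn(self.os_id)
        finally:
            self.finished = time.monotonic()

    def timing(self):
        """Timing summary as a string; only valid after run()."""
        waited = self.started - self.created
        ran = self.finished - self.started
        return f"{self!r} waited {waited:.3f}s, ran {ran:.3f}s"
```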
2 files changed, +678 -22 lines changed
- networking_nsxv3
- common
- tests/unit/realization