The main change in this PR is the modification of the way jobs are
handled in the NSX-T agent. Please see the JobRerunner class for an
in-depth explanation of the changes.
Before this commit, jobs were added to one of two queues, called active
and passive. The active queue contained all requests coming in via API
calls, while the passive queue was filled with maintenance and resync
jobs. Both were priority queues that allowed each element to be
added only once.
Jobs were taken from the active queue until it was empty; then jobs from
the passive queue were added to the active queue.
Jobs taken from the active queue were then submitted to a worker
pool that allows up to 40 greenthreads to run jobs concurrently.
However, to avoid race conditions, only one job is allowed to run per
OpenStack-ID. If more than one job for the same ID is scheduled on the
pool, the additional jobs wait on a lock and block their worker threads
until the first job is done.
This means that the agent can be blocked and appear fully occupied,
handling 40 tasks simultaneously, while in reality most or all tasks are
waiting for each other.
Instead of scheduling all jobs to the worker pool immediately, risking
a lock, we now first check if the same job is already running, and if
this is the case we will rerun the job after it has finished.
The worker can then be given another job that is able to run instead.
The rerun is needed because a job can run for several seconds and
new API requests could arrive during that time.
With this change we also prevent rerunning the same job more than once,
when additional requests arrive while the job is already marked for
re-execution.
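The bookkeeping described above can be sketched roughly as follows. This is
only an illustration and not the actual JobRerunner code: the class name
RerunTracker, its methods and the tuple used as job key are all hypothetical,
and the sketch assumes it is called from a single scheduling loop.

```python
class RerunTracker:
    """Hypothetical helper: tracks running jobs and pending reruns."""

    def __init__(self):
        self._running = set()  # jobs currently executing on the pool
        self._rerun = set()    # running jobs that must run once more

    def try_start(self, job):
        """Return True if the job may be handed to a worker now.

        If the same job is already running, it is marked for one rerun
        and False is returned, so the caller can schedule another job.
        """
        if job in self._running:
            # Set semantics: any number of requests that arrive while the
            # job runs collapse into a single pending rerun.
            self._rerun.add(job)
            return False
        self._running.add(job)
        return True

    def finish(self, job):
        """Mark the job as done; return True if it has to be run again."""
        self._running.discard(job)
        if job in self._rerun:
            self._rerun.remove(job)
            return True
        return False


# Usage: three requests for the same job arrive while it is running.
tracker = RerunTracker()
job = ("port-1", "port_update")       # assumed job key, for illustration
started = tracker.try_start(job)      # True: goes to the worker pool
tracker.try_start(job)                # False: marked for rerun
tracker.try_start(job)                # False: still only one rerun
needs_rerun = tracker.finish(job)     # True: run it exactly once more
```

No worker is ever parked on a lock in this scheme; a job that cannot run is
simply remembered and a different job takes the free slot.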
Additional fixes and enhancements in this commit:
UniqPriorityQueue:
- fixing add()
If a job was added a second time, but with a higher priority,
the job was correctly not added again, but the priority of the existing
job was not updated. This meant a job queued from the passive queue, with
its lower priority, would still be executed last, even if a high-priority
request for the same job arrived via API call.
We changed the active queue to a FIFO queue, so that passive jobs cannot
be starved and the execution order of API calls is kept where possible.
With the fix in place, however, we can switch back to a priority queue if
needed.
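As an illustration of the add() behaviour described above, here is a minimal
sketch of a unique priority queue that raises the priority of an already
queued item. It is not the UniqPriorityQueue from this commit: the class
name, the heapq-based layout and the convention that a lower number means a
higher priority are assumptions made for the example.

```python
import heapq
import itertools


class UniqPriorityQueueSketch:
    """Hypothetical unique priority queue; lower number = higher priority."""

    def __init__(self):
        self._heap = []                # entries: (priority, seq, item)
        self._live = {}                # item -> (priority, seq) of valid entry
        self._seq = itertools.count()  # tie-breaker, keeps insertion order

    def add(self, item, priority):
        """Queue the item once; re-adding may only raise its priority."""
        current = self._live.get(item)
        if current is not None and current[0] <= priority:
            return False  # already queued with the same or a higher priority
        # New item, or the same item with a higher priority: push a fresh
        # entry. An older entry for the item becomes stale and is skipped
        # in pop().
        entry = (priority, next(self._seq))
        self._live[item] = entry
        heapq.heappush(self._heap, entry + (item,))
        return True

    def pop(self):
        """Return the queued item with the highest priority."""
        while self._heap:
            priority, seq, item = heapq.heappop(self._heap)
            if self._live.get(item) == (priority, seq):
                del self._live[item]
                return item
        raise IndexError("pop from empty queue")

    def __len__(self):
        return len(self._live)


# Usage: a passive (low-priority) job is promoted by a later API request.
queue = UniqPriorityQueueSketch()
queue.add("port-1", 10)   # queued by a resync with low priority
queue.add("port-2", 5)
queue.add("port-1", 1)    # same job arrives via API: priority is raised
first = queue.pop()       # "port-1" now runs before "port-2"
```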
Runnable:
- fix hash() and make repr more verbose
The Runnable class did not follow the requirement that objects which
compare equal must also have the same hash value. Also, Runnable only
took the OpenStack ID into account, not the name of the function.
Thus a Runnable could not, for example, be used correctly as a key in
dictionaries.
- __repr__
repr was updated to include the name of the function,
so we see what kind of update is being executed in the logs.
- __lt__
Runnable now orders items with the same priority by age, preventing
jobs from overtaking each other.
- add timing info for logging
We currently do not get good information about the timings or basic
statistics of running jobs. This commit adds timing info to Runnable and
a method to extract it as a string for logging.
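The Runnable changes listed above could look roughly like the following
sketch. It is not the class from this commit: the name RunnableSketch, the
attribute names, the timing_info() method and the convention that a lower
number means a higher priority are assumptions made for illustration.

```python
import time


class RunnableSketch:
    """Hypothetical job wrapper keyed by OpenStack ID and function name."""

    def __init__(self, os_id, fn, priority=0):
        self.os_id = os_id
        self.fn = fn
        self.priority = priority
        self.created = time.monotonic()  # used for age and wait time
        self.started = None
        self.finished = None

    def _key(self):
        # Identity of a job: which object it touches and what it does.
        return (self.os_id, self.fn.__name__)

    def __eq__(self, other):
        if not isinstance(other, RunnableSketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Built from the same key as __eq__, so equal objects hash equally.
        return hash(self._key())

    def __lt__(self, other):
        # Order by priority first, then by age: older jobs come first.
        return (self.priority, self.created) < (other.priority, other.created)

    def __repr__(self):
        return "Runnable(id={}, fn={}, priority={})".format(
            self.os_id, self.fn.__name__, self.priority)

    def __call__(self):
        self.started = time.monotonic()
        try:
            return self.fn(self.os_id)
        finally:
            self.finished = time.monotonic()

    def timing_info(self):
        """Return wait and run time as a string for logging."""
        if self.started is None:
            return "not started"
        waited = self.started - self.created
        if self.finished is None:
            return "waited={:.3f}s, still running".format(waited)
        ran = self.finished - self.started
        return "waited={:.3f}s, ran={:.3f}s".format(waited, ran)
```

With equality and hash derived from the same key, two requests for the same
update on the same object collapse into one dictionary or set entry, while
a different update on that object stays a separate job.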