Commit cf23b99

Reworking scheduling of jobs to runners
The main change in this commit is a modification of how jobs are handled in the NSX-T agent. See the JobRerunner class for an in-depth explanation of the changes.

Before this commit, jobs were added to one of two queues, called active and passive. The active queue contains all requests coming in via API calls, while the passive queue is filled with maintenance and resync jobs. Both queues used to be priority queues that allow each element to be added only once. Jobs were taken from the active queue until it was empty, then jobs from the passive queue were moved to the active queue. Jobs taken from the active queue were submitted to a worker pool running up to 40 greenthreads concurrently. However, to avoid race conditions, only one job is allowed to run per OpenStack ID. If more than one job for the same ID is scheduled on the pool, the additional jobs wait on a lock and block their worker threads until the first job is done. This means the agent can be blocked and appear fully occupied, seemingly handling 40 tasks simultaneously, while in reality most or all of the tasks are waiting for each other.

Instead of scheduling all jobs to the worker pool immediately and risking a lock, we now first check whether the same job is already running. If it is, we mark it to be rerun after it has finished and schedule a different, runnable job on the worker instead. Rerunning the job is necessary because a job can run for several seconds and new API requests may arrive during that time. This change also prevents rerunning the same job more than once when additional requests arrive while the job is already marked for re-execution.

While implementing these changes, we found that some API calls, and thus the resulting jobs, receive a dictionary and not a string as parameter, although the code indicates otherwise.
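The rerun-marking idea above can be sketched as follows. This is a minimal illustration, not the agent's actual code: the class and method names (`try_schedule`, `finish`) and the use of plain `threading` instead of greenthreads are assumptions for the example.

```python
import threading


class JobRerunner:
    """Sketch: at most one job runs per OpenStack ID; a duplicate
    request marks the running job for a single rerun instead of
    blocking a worker thread on a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = set()  # OpenStack IDs currently executing
        self._rerun = set()    # IDs to re-execute once finished

    def try_schedule(self, os_id):
        """Return True if the caller may submit the job to the pool.
        If the same job is already running, mark it for one rerun
        (marking a second time is a no-op) and return False."""
        with self._lock:
            if os_id in self._running:
                self._rerun.add(os_id)  # set semantics: at most one rerun
                return False
            self._running.add(os_id)
            return True

    def finish(self, os_id):
        """Called when a job completes; return True if it must be
        rerun because new requests arrived while it was running."""
        with self._lock:
            self._running.discard(os_id)
            if os_id in self._rerun:
                self._rerun.discard(os_id)
                return True
            return False
```

Because the rerun markers live in a set, any number of duplicate requests arriving during one execution collapse into a single rerun, which is exactly the deduplication behavior described above.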
To support these calls, we have to handle that case as well. The affected calls are: address_group_update and {enable, disable, update}_policy_logging.

Additional fixes and enhancements in this commit:

UniqPriorityQueue:
- fix add()
  If a job is about to be added a second time, but with a higher priority, the job is correctly not added again, but the priority of the existing entry was not updated. As a result, jobs from the passive queue, which have a lower priority, were always executed last, even if a high-priority job arrived via API call. We changed the active queue to a FIFO to prevent passive jobs from never getting executed and to keep the execution order of API calls where possible. With the fix in place, however, we can switch back to a priority queue if needed.

Runnable:
- fix hash() and make repr more verbose
  The Runnable class did not follow the requirement that objects which compare equal must also have the same hash value. In addition, the Runnable only took the OpenStack ID into account, not the name of the function. Thus a Runnable could, e.g., not be used correctly as a key in dictionaries.
- __repr__
  __repr__ was updated to include the name of the function, so the logs show what kind of update is being executed.
- __lt__
  Runnables with the same priority are now ordered by age, preventing jobs from overtaking each other.
- add timing info for logging
  We currently do not get good information about the timings or basic stats of the jobs running. This commit adds timing info to Runnable and a method to extract it as a string for logging.
1 parent daad4c2 commit cf23b99

2 files changed: +678 −22 lines

