Fix blocking worker threads, possible race conditions and improve logging #145
Conversation
first glance over it
Marked the PR as draft, it needs some more work -- contrary to what the code and its comments suggest, jid is not always a string containing an OS-ID; sometimes it's a dictionary with the parameters for the actual change. This of course breaks a lot of assumptions.
done will then choose a job from the list that is supposed to run again, remove it from the list, and return it to the JobRerunner. This might be the same job that was just finished, or it could be a different one.
With "a different one" you mean the same job but with other parameters, right?
yes, will clarify
if self._job_identifier is not None:
    if job.identifier != self._job_identifier:
        raise ValueError("Can only add jobs of same type to a JobList")
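The check above can be illustrated with a minimal, self-contained sketch. This is not the project's actual class, and the Job fields are assumptions; it only shows the intended contract: a JobList is homogeneous, its identifier is fixed by the first job added, and adding a job of a different type is a programming error that raises immediately.

```python
# Hypothetical sketch of the JobList identifier check (names are assumptions,
# not the project's actual implementation).
class Job:
    def __init__(self, identifier, os_id):
        self.identifier = identifier  # kind of job, e.g. "port_update"
        self.os_id = os_id            # OpenStack ID the job operates on


class JobList:
    def __init__(self):
        self._job_identifier = None
        self._runnables = []

    def add(self, job):
        if self._job_identifier is None:
            # First job fixes the type of the whole list.
            self._job_identifier = job.identifier
        elif job.identifier != self._job_identifier:
            # Programming error: lists are homogeneous by design, so this
            # should never happen at runtime.
            raise ValueError("Can only add jobs of same type to a JobList")
        self._runnables.append(job)
```

As discussed below, the exception is not expected to fire in normal operation; it guards against a coding mistake rather than a runtime condition.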
What happens if we raise this exception? Might a job not be added to the queue?
Yes, this is a programming error and should never happen -- there is not much we can do at this point other than raise an exception and log it as an error.
if job == existing_job:
    count += 1
    self._runnables[index] = (count, existing_job)
    if count > 1:
Does not work as stated in the comment.
Fixed -- please check; I reworked the surrounding code.
The main change in this commit is a modification of the way jobs are handled in the NSX-T agent. Please see the JobRerunner class for an in-depth explanation of the changes.

Before this commit, jobs were added to one of two queues, called active and passive. The active queue contains all requests coming in via API calls, while the passive queue is filled with maintenance and resync jobs. Both queues used to be priority queues allowing each element to be added only once. Jobs were taken from the active queue until it was empty; then jobs from the passive queue were added to the active queue. Jobs taken from the active queue were submitted to a worker pool allowing up to 40 greenthreads to run the jobs concurrently.

However, to avoid race conditions, only one job is allowed to run per OpenStack-ID. If more than one such job is scheduled on the pool, the additional jobs wait on a lock and block their worker threads until the first job is done. This means the agent can be blocked and appear fully occupied, handling 40 tasks simultaneously, while in reality most or all tasks are waiting for each other.

Instead of scheduling all jobs to the worker pool immediately, risking a lock, we now first check whether the same job is already running; if so, we rerun the job after it has finished, and schedule another job that can actually run on the worker instead. We need to rerun the job because a job can run for several seconds, and new API requests could arrive during that time. With this change we also prevent rerunning the same job more than once when additional requests arrive while the job is already marked for re-execution.

While implementing these changes, we found out that some API calls, and thus the resulting jobs, receive a dictionary and not a string as parameter, although the code indicates otherwise.
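The scheduling change described above can be sketched in a few lines. This is a hedged illustration, not the agent's actual JobRerunner: the class and method names are assumptions, and the real implementation involves the queues and worker pool described in the commit message. The sketch only shows the core idea: a second job for an OpenStack-ID that is already running is recorded for exactly one re-execution instead of blocking a worker thread on a lock.

```python
import threading

# Illustrative sketch of the rerun bookkeeping (hypothetical names, not the
# agent's actual code): instead of letting a second job for the same
# OpenStack-ID block a worker on a lock, we mark the ID for one rerun.
class RerunScheduler:
    def __init__(self):
        self._lock = threading.Lock()
        self._running = set()  # OpenStack-IDs currently being worked on
        self._rerun = set()    # IDs marked for exactly one re-execution

    def try_start(self, os_id):
        """Return True if the job may run now; otherwise mark it for rerun."""
        with self._lock:
            if os_id in self._running:
                # Adding to a set is idempotent, so the job is rerun at
                # most once no matter how many requests arrive meanwhile.
                self._rerun.add(os_id)
                return False
            self._running.add(os_id)
            return True

    def finish(self, os_id):
        """Called when a job completes; returns True if it must run again."""
        with self._lock:
            self._running.discard(os_id)
            if os_id in self._rerun:
                self._rerun.remove(os_id)
                self._running.add(os_id)  # the rerun claims the slot again
                return True
            return False
```

Because `try_start` returns False instead of blocking, the worker thread stays free to pick up a different job that can actually make progress.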
To support these calls, we have to handle that case as well; they are: address_group_update and {enable, disable, update}_policy_logging.

Additional fixes and enhancements in this commit:

UniqPriorityQueue:
- fix add(): if a job is about to be added a second time, but with a higher priority, the job is correctly not added, but the priority of the existing job was not updated. This means jobs from the passive queue, which have a lower priority, would always be executed last, even if a high-priority job arrived via API call. We changed the active queue to a FIFO to prevent passive jobs from never getting executed and to keep the execution order of API calls where possible. With the fix in place, however, we can switch back to a priority queue if needed.

Runnable:
- fix __hash__() and make __repr__ more verbose: the Runnable class did not follow the requirement that objects which compare equal must also have the same hash value. Also, Runnable only took the OpenStack ID into account, not the name of the function. Thus a Runnable could, e.g., not be used correctly as a key in dictionaries.
- __repr__: updated to include the name of the function, so we can see in the logs what kind of update is being executed.
- __lt__: orders items with the same priority by age, preventing jobs from overtaking each other.
- add timing info for logging: we currently do not get good info about the timings or basic stats of the running jobs. This commit adds timing info to Runnable and a method to extract it as a string for logging.
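The Runnable fixes above rest on Python's hash/eq contract: objects that compare equal must have equal hashes, or they misbehave as dictionary keys. A minimal sketch of a class honoring that contract under the assumptions stated in the commit message (identity is OpenStack ID plus function name; equal-priority items order by age) might look as follows. The names and fields are illustrative, not the project's actual class; age is modeled with a monotonically increasing sequence number rather than wall-clock time.

```python
import itertools

# Illustrative sketch (hypothetical, not the agent's actual Runnable class).
class Runnable:
    _counter = itertools.count()  # monotonically increasing "age" stamp

    def __init__(self, os_id, fn, priority=0):
        self.os_id = os_id
        self.fn = fn
        self.priority = priority          # lower value = runs earlier
        self._seq = next(Runnable._counter)

    def __eq__(self, other):
        # Identity includes the function name, not just the OpenStack ID,
        # so e.g. a port_update and a policy_update on the same ID differ.
        return (isinstance(other, Runnable)
                and self.os_id == other.os_id
                and self.fn.__name__ == other.fn.__name__)

    def __hash__(self):
        # Must be consistent with __eq__ so Runnables work as dict keys.
        return hash((self.os_id, self.fn.__name__))

    def __lt__(self, other):
        # Same priority -> older job wins, so jobs cannot overtake each other.
        return (self.priority, self._seq) < (other.priority, other._seq)

    def __repr__(self):
        # Include the function name so logs show what kind of update runs.
        return f"Runnable({self.fn.__name__}, {self.os_id})"
```

With `__eq__` and `__hash__` aligned like this, two Runnables for the same ID and function collapse to one dictionary key, which is what the duplicate-counting code in the diff relies on.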
With this PR we modify the scheduling of jobs to the worker pool, improve logging, and allow single unit tests to be executed.
Along the way we fixed several bugs and inconsistencies in the existing code that had to be resolved to implement the changes.
Please refer to the commit messages of the individual changes; the explanation for the changes to the scheduling is rather long.