Fix blocking worker threads, possible race conditions and improve logging #145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

mutax merged 4 commits into stable/yoga-m3 from i583594/race_fix_and_queue_log

Mar 6, 2025

Contributor

mutax commented Feb 14, 2025 •

edited

Loading

With this PR we modify the scheduling of jobs to the worker pool, improve logging and allow single unit tests to be executed.

On the way we fixed several bugs and inconsistencies in the existing code, required to implement the changes.

Please refer to the commit message of the individual changes, the explanation for the changes in the scheduling is rather long.

mutax force-pushed the i583594/race_fix_and_queue_log branch 8 times, most recently from 9411bea to 2f1a77a Compare

February 19, 2025 13:44

mutax marked this pull request as ready for review

February 19, 2025 14:46

mutax requested a review from sven-rosenzweig

February 19, 2025 14:46

mutax changed the title ~~wip: I583594/race fix and queue log~~ Fix blocking worker threads, possible race conditions and improve logging

mutax force-pushed the i583594/race_fix_and_queue_log branch 8 times, most recently from 3a73a6b to 9d0b866 Compare

February 20, 2025 14:06

sven-rosenzweig reviewed

View reviewed changes

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

mutax force-pushed the i583594/race_fix_and_queue_log branch 4 times, most recently from 431a684 to 9444e90 Compare

February 20, 2025 19:57

mutax mentioned this pull request

Prevent Jobs from Starving in the Active Queue #144

Closed

mutax changed the base branch from fix/stuck_in_active_queue to stable/yoga-m3

February 20, 2025 19:59

mutax force-pushed the i583594/race_fix_and_queue_log branch from 9444e90 to 7ff517d Compare

February 21, 2025 10:11

github-actions bot commented Feb 21, 2025 •

edited

Loading

Name                                                                      Stmts   Miss  Cover
---------------------------------------------------------------------------------------------
networking_nsxv3/api/rpc.py                                                 233    110    53%
networking_nsxv3/common/config.py                                            16      0   100%
networking_nsxv3/common/constants.py                                         23      0   100%
networking_nsxv3/common/locking.py                                           35     11    69%
networking_nsxv3/common/synchronization.py                                  379     72    81%
networking_nsxv3/db/db.py                                                    94     19    80%
networking_nsxv3/extensions/nsxtoperations.py                               104     40    62%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/agent.py                   162     52    68%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/cli.py                     299    195    35%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/client_nsx.py              187     49    74%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/constants_nsx.py             6      0   100%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/extensions/firewall.py      27      0   100%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/provider.py                169     10    94%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/provider_nsx_policy.py     765    115    85%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/realization.py             203     33    84%
networking_nsxv3/plugins/ml2/drivers/nsxv3/driver.py                        129     74    43%
networking_nsxv3/prometheus/exporter.py                                      19      5    74%
networking_nsxv3/services/logapi/drivers/nsxv3/driver.py                     41      1    98%
networking_nsxv3/services/qos/drivers/nsxv3/qos.py                           34      4    88%
networking_nsxv3/services/trunk/drivers/nsxv3/trunk.py                       71      3    96%
---------------------------------------------------------------------------------------------
TOTAL                                                                      2996    793    74%

mutax force-pushed the i583594/race_fix_and_queue_log branch from 7ff517d to 44d56dd Compare

February 21, 2025 10:29

sven-rosenzweig reviewed

View reviewed changes

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

mutax force-pushed the i583594/race_fix_and_queue_log branch 4 times, most recently from 5df2f2c to 69f5e8f Compare

February 21, 2025 15:21

sven-rosenzweig approved these changes

View reviewed changes

mutax added 3 commits

February 21, 2025 16:37


          [tests] allow running only a subset of unit tests

2451f8a


          [cli] print the invalid command for debugging the unittests

3bf78ec


          [tests] replace assertEquals for membership test with assertIn

4d7abed

mutax force-pushed the i583594/race_fix_and_queue_log branch 3 times, most recently from 1358eec to c655d7f Compare

February 21, 2025 16:21

felix-kaestner reviewed

View reviewed changes

networking_nsxv3/tests/unit/realization/test_coordination.py Outdated Show resolved Hide resolved

joker-at-work reviewed

View reviewed changes

Contributor

joker-at-work left a comment

first glance over it

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

mutax marked this pull request as draft

February 24, 2025 13:40

Contributor Author

mutax commented Feb 24, 2025

marked PR as draft, need some more work -- contradicting the code and comments in the code, jid is not always a string containing an OS-ID, sometimes its a dictionary with the parameters for the actual change. This of course breaks a lot of assumptions.

mutax force-pushed the i583594/race_fix_and_queue_log branch 3 times, most recently from e516072 to 876350f Compare

February 27, 2025 15:26

mutax marked this pull request as ready for review

February 27, 2025 17:31

mutax force-pushed the i583594/race_fix_and_queue_log branch 3 times, most recently from ba3b888 to 126b797 Compare

March 5, 2025 12:10

felix-kaestner reviewed

View reviewed changes

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated Show resolved Hide resolved

sven-rosenzweig reviewed

View reviewed changes

networking_nsxv3/common/synchronization.py Show resolved Hide resolved

networking_nsxv3/common/synchronization.py Outdated

+                      done will then choose a job from the list, that is supposed to run again,
+                      remove it from the list and return it to the JobRerunner.
+                      This might be the same job that was just finished or it could be a different one.

Contributor

sven-rosenzweig Mar 6, 2025

With different one == same job but other parameters, right?

Contributor Author

mutax Mar 6, 2025

yes, will clarify

networking_nsxv3/common/synchronization.py

+                      if self._job_identifier is not None:
+                          if job.identifier != self._job_identifier:
+                              raise ValueError("Can only add jobs of same type to a JobList")

Contributor

sven-rosenzweig Mar 6, 2025

What happens if we raise this exception? Might a job not be added to the queue?

Contributor Author

mutax Mar 6, 2025

Yes, this is a programming error and should never happen -- there is not much we can do at this point than raise an exception and log it as an error.

networking_nsxv3/common/synchronization.py Outdated

+                          if job == existing_job:
+                              count += 1
+                              self._runnables[index] = (count, existing_job)
+                              if count > 1:

Contributor

sven-rosenzweig Mar 6, 2025

Does not work as stated in the comment.

Contributor Author

mutax Mar 6, 2025

fixed - please check, reworked the surrounding code.

networking_nsxv3/common/synchronization.py Show resolved Hide resolved

felix-kaestner approved these changes

View reviewed changes

sven-rosenzweig approved these changes

View reviewed changes


          Reworking scheduling of jobs to runners

3d8bbe3

The main change in this commit is a modification of the way jobs are
handled in the NSX-T agent. Please see the JobRerunner class for an
in-depth explanation of the changes.

Before this commit jobs are added to one of two queues, called active
and passive. The active queue contains all requests coming in via API
calls, while the passive queue is filled with maintenance and resync
jobs. Both queues used to be priority queues allowing each element to be
added only once.

Jobs then were taken from the active queue until empty, then jobs from
the passive queue would be added to the active queue.

Jobs taken from the active queue would then be submitted to a worker
pool allowing up to 40 greenthreads to run the jobs concurrently

However, to avoid race conditions, only one job is allowed to run per
OpenStack-ID. If more than one job is scheduled to run on the pool,
these additional jobs will wait on a lock and block the worker thread
until the first job is done.

This means that the agent can be blocked and appear fully occupied,
handling 40 tasks simultaneously, while in reality most or all tasks are
waiting for each other.

Instead of scheduling all jobs to the worker pool immediately, risking
a lock, we now first check if the same job is already running, and if
this is the case we will rerun the job after it has finished.
We then can schedule another job that can run to the worker instead.

We need to rerun the job, because a jobs can run for several seconds and
new API requests could arrive during that time.

With this change we also prevent rerunning the same job more than once,
when additional requests arrive while the job is already marked for
re-execution.

While implementing these changes, we found out that some api calls,
and thus the resulting jobs, will get a dictionary and not a string as
parameter, although indicated differently in the code.

To support these calls, we have to handle that case as well, they are:
  address_group_update and {enable, disable, update}_policy_logging

Additional fixes and enhancements in this commit:

UniqPriorityQueue:

- fixing add()

If a job is about to be added a second time, but with a higher priority,
the job will correctly not be added, but the priority of the existing
job was not updated. This means jobs from the passive queue, that have a
lower priority, will always be executed last, even if a high-priority
job arrived via API call.

We changed the active queue to a Fifo, to prevent passive jobs to never
get executed and keep execution order of api calls if possible.
With the fix in place, however, we can switch back to a prio queue if
needed.

Runnable:

- fix hash() and make repr more verbose

The Runnable class was not following the requirements for objects that
compare equal to also have the same hashvalue. Also the Runnable was
only taking the OpenStack ID into account, not the name of the function.
Thus a Runnable could, e.g., not be used correctly as a key in
dictionaries.

- __repr__

repr was updated to include the name of the function,
so we see what kind of update is being executed in the logs.

- __lt__

making Runnable order items with same priority by age, preventing
jobs  from overtaking each other.

- add timing info for logging

We currently do not get good info about the timings or basic stats
of the jobs running. This commit adds timing info to Runnable and
a method to extract them as string for logging.

mutax force-pushed the i583594/race_fix_and_queue_log branch from 6cd1c8c to 3d8bbe3 Compare

March 6, 2025 14:56

mutax merged commit cf23b99 into stable/yoga-m3

4 checks passed

mutax deleted the i583594/race_fix_and_queue_log branch

March 6, 2025 15:03

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet