Skip to content

Conversation

mutax
Copy link
Contributor

@mutax mutax commented Feb 14, 2025

With this PR we modify the scheduling of jobs to the worker pool, improve logging and allow single unit tests to be executed.

On the way we fixed several bugs and inconsistencies in the existing code, required to implement the changes.

Please refer to the commit message of the individual changes, the explanation for the changes in the scheduling is rather long.

@mutax mutax force-pushed the i583594/race_fix_and_queue_log branch 8 times, most recently from 9411bea to 2f1a77a Compare February 19, 2025 13:44
@mutax mutax marked this pull request as ready for review February 19, 2025 14:46
@mutax mutax changed the title wip: I583594/race fix and queue log Fix blocking worker threads, possible race conditions and improve logging Feb 19, 2025
@mutax mutax force-pushed the i583594/race_fix_and_queue_log branch 8 times, most recently from 3a73a6b to 9d0b866 Compare February 20, 2025 14:06
@mutax mutax force-pushed the i583594/race_fix_and_queue_log branch 4 times, most recently from 431a684 to 9444e90 Compare February 20, 2025 19:57
@mutax mutax changed the base branch from fix/stuck_in_active_queue to stable/yoga-m3 February 20, 2025 19:59
@mutax mutax force-pushed the i583594/race_fix_and_queue_log branch from 9444e90 to 7ff517d Compare February 21, 2025 10:11
Copy link

github-actions bot commented Feb 21, 2025

Name                                                                      Stmts   Miss  Cover
---------------------------------------------------------------------------------------------
networking_nsxv3/api/rpc.py                                                 233    110    53%
networking_nsxv3/common/config.py                                            16      0   100%
networking_nsxv3/common/constants.py                                         23      0   100%
networking_nsxv3/common/locking.py                                           35     11    69%
networking_nsxv3/common/synchronization.py                                  379     72    81%
networking_nsxv3/db/db.py                                                    94     19    80%
networking_nsxv3/extensions/nsxtoperations.py                               104     40    62%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/agent.py                   162     52    68%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/cli.py                     299    195    35%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/client_nsx.py              187     49    74%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/constants_nsx.py             6      0   100%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/extensions/firewall.py      27      0   100%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/provider.py                169     10    94%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/provider_nsx_policy.py     765    115    85%
networking_nsxv3/plugins/ml2/drivers/nsxv3/agent/realization.py             203     33    84%
networking_nsxv3/plugins/ml2/drivers/nsxv3/driver.py                        129     74    43%
networking_nsxv3/prometheus/exporter.py                                      19      5    74%
networking_nsxv3/services/logapi/drivers/nsxv3/driver.py                     41      1    98%
networking_nsxv3/services/qos/drivers/nsxv3/qos.py                           34      4    88%
networking_nsxv3/services/trunk/drivers/nsxv3/trunk.py                       71      3    96%
---------------------------------------------------------------------------------------------
TOTAL                                                                      2996    793    74%

@mutax mutax force-pushed the i583594/race_fix_and_queue_log branch from 7ff517d to 44d56dd Compare February 21, 2025 10:29
@mutax mutax force-pushed the i583594/race_fix_and_queue_log branch 4 times, most recently from 5df2f2c to 69f5e8f Compare February 21, 2025 15:21
@mutax mutax force-pushed the i583594/race_fix_and_queue_log branch 3 times, most recently from 1358eec to c655d7f Compare February 21, 2025 16:21
Copy link
Contributor

@joker-at-work joker-at-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first glance over it

@mutax mutax marked this pull request as draft February 24, 2025 13:40
@mutax
Copy link
Contributor Author

mutax commented Feb 24, 2025

marked PR as draft, need some more work -- contradicting the code and comments in the code, jid is not always a string containing an OS-ID, sometimes its a dictionary with the parameters for the actual change. This of course breaks a lot of assumptions.

@mutax mutax force-pushed the i583594/race_fix_and_queue_log branch 3 times, most recently from e516072 to 876350f Compare February 27, 2025 15:26
@mutax mutax marked this pull request as ready for review February 27, 2025 17:31
@mutax mutax force-pushed the i583594/race_fix_and_queue_log branch 3 times, most recently from ba3b888 to 126b797 Compare March 5, 2025 12:10
done will then choose a job from the list, that is supposed to run again,
remove it from the list and return it to the JobRerunner.

This might be the same job that was just finished or it could be a different one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With different one == same job but other parameters, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, will clarify


if self._job_identifier is not None:
if job.identifier != self._job_identifier:
raise ValueError("Can only add jobs of same type to a JobList")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we raise this exception? Might a job not be added to the queue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a programming error and should never happen -- there is not much we can do at this point than raise an exception and log it as an error.

if job == existing_job:
count += 1
self._runnables[index] = (count, existing_job)
if count > 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not work as stated in the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed - please check, reworked the surrounding code.

The main change in this commit is a modification of the way jobs are
handled in the NSX-T agent. Please see the JobRerunner class for an
in-depth explanation of the changes.

Before this commit jobs are added to one of two queues, called active
and passive. The active queue contains all requests coming in via API
calls, while the passive queue is filled with maintenance and resync
jobs. Both queues used to be priority queues allowing each element to be
added only once.

Jobs then were taken from the active queue until empty, then jobs from
the passive queue would be added to the active queue.

Jobs taken from the active queue would then be submitted to a worker
pool allowing up to 40 greenthreads to run the jobs concurrently

However, to avoid race conditions, only one job is allowed to run per
OpenStack-ID. If more than one job is scheduled to run on the pool,
these additional jobs will wait on a lock and block the worker thread
until the first job is done.

This means that the agent can be blocked and appear fully occupied,
handling 40 tasks simultaneously, while in reality most or all tasks are
waiting for each other.

Instead of scheduling all jobs to the worker pool immediately, risking
a lock, we now first check if the same job is already running, and if
this is the case we will rerun the job after it has finished.
We then can schedule another job that can run to the worker instead.

We need to rerun the job, because a jobs can run for several seconds and
new API requests could arrive during that time.

With this change we also prevent rerunning the same job more than once,
when additional requests arrive while the job is already marked for
re-execution.

While implementing these changes, we found out that some api calls,
and thus the resulting jobs, will get a dictionary and not a string as
parameter, although indicated differently in the code.

To support these calls, we have to handle that case as well, they are:
  address_group_update and {enable, disable, update}_policy_logging

Additional fixes and enhancements in this commit:

UniqPriorityQueue:

- fixing add()

If a job is about to be added a second time, but with a higher priority,
the job will correctly not be added, but the priority of the existing
job was not updated. This means jobs from the passive queue, that have a
lower priority, will always be executed last, even if a high-priority
job arrived via API call.

We changed the active queue to a Fifo, to prevent passive jobs to never
get executed and keep execution order of api calls if possible.
With the fix in place, however, we can switch back to a prio queue if
needed.

Runnable:

- fix hash() and make repr more verbose

The Runnable class was not following the requirements for objects that
compare equal to also have the same hashvalue. Also the Runnable was
only taking the OpenStack ID into account, not the name of the function.
Thus a Runnable could, e.g., not be used correctly as a key in
dictionaries.

- __repr__

repr was updated to include the name of the function,
so we see what kind of update is being executed in the logs.

- __lt__

making Runnable order items with same priority by age, preventing
jobs  from overtaking each other.

- add timing info for logging

We currently do not get good info about the timings or basic stats
of the jobs running. This commit adds timing info to Runnable and
a method to extract them as string for logging.
@mutax mutax force-pushed the i583594/race_fix_and_queue_log branch from 6cd1c8c to 3d8bbe3 Compare March 6, 2025 14:56
@mutax mutax merged commit cf23b99 into stable/yoga-m3 Mar 6, 2025
4 checks passed
@mutax mutax deleted the i583594/race_fix_and_queue_log branch March 6, 2025 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants