Add a delay between killing teamd processes #3325

Open
wants to merge 7 commits into
base: master

Conversation

saiarcot895
Contributor

What I did

When killing 10 or more teamd processes, add a delay of 0.1 seconds after every 10 kill signals/processes. The LAG scale tests (`ecmp/inner_hashing/test_inner_hashing_lag.py` in sonic-mgmt) may create 100 LAGs, and when they are all destroyed, some of them may fail to be properly destroyed, leaving stale port channels behind. This seems to be because the netlink socket buffers on which the teamd processes receive notifications fill up with events from the other port channels/interfaces going down.

Why I did it

As a workaround, add short delays between killing the teamd processes so that the netlink buffers do not fill up and cause messages to be dropped.

This delay was chosen arbitrarily, and it seems to work well with 100 LAGs on a KVM. It can probably be made a bit more aggressive if needed (e.g. 0.05 seconds every 20 processes).
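The batching described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the names `signal_in_batches` and `terminate_in_batches` are hypothetical, the use of SIGTERM is an assumption (the description only says "kill signals"), and the signal and pause actions are passed in as callables so that the batch size and delay can be tuned without touching the loop.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

#include <signal.h>
#include <sys/types.h>

// Hypothetical helper (not the PR's code): call `send` for every PID and call
// `pause` after each complete batch of `batch_size` signals.
// Returns how many times `pause` was invoked.
std::size_t signal_in_batches(const std::vector<pid_t>& pids,
                              const std::function<void(pid_t)>& send,
                              const std::function<void()>& pause,
                              std::size_t batch_size = 10)
{
    if (batch_size == 0)
    {
        batch_size = 1;  // guard against modulo by zero
    }

    std::size_t pauses = 0;
    for (std::size_t i = 0; i < pids.size(); ++i)
    {
        send(pids[i]);
        if ((i + 1) % batch_size == 0)
        {
            // A full batch has been signalled; give netlink listeners
            // time to drain their socket buffers before continuing.
            pause();
            ++pauses;
        }
    }
    return pauses;
}

// Convenience wrapper using the values from the description:
// 0.1 s pause after every 10 signals. SIGTERM is an assumption.
std::size_t terminate_in_batches(const std::vector<pid_t>& pids)
{
    return signal_in_batches(
        pids,
        [](pid_t pid) { ::kill(pid, SIGTERM); },  // return value ignored in this sketch
        [] { std::this_thread::sleep_for(std::chrono::milliseconds(100)); });
}
```

With these values, stopping 100 teamd processes adds about 1 second in total (10 pauses of 0.1 s). The more aggressive variant mentioned above (0.05 seconds every 20 processes) would add about 0.25 seconds for the same 100 processes.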

How I verified it

On a KVM testbed with t0-116 topology and slightly more than 100 LAGs, stop teamd using `sudo systemctl stop teamd`, then verify that all of the LAGs were deleted and that the kernel logged no messages similar to the following:

Oct 12 21:33:03 vlab-04 kernel: PortChannel41 (unregistering): Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel17 (unregistering): Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel22: Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel22: Failed to send port change of device Ethernet136 via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel22: Port device Ethernet136 removed
Oct 12 21:33:03 vlab-04 kernel: PortChannel43: Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel43: Failed to send port change of device Ethernet174 via netlink (err -105)

Details if related

Partial fix for sonic-net/sonic-buildimage#19310.

When killing 10 or more teamd processes, add a delay of 0.1 seconds
after every 10 kill signals/processes. The LAG scale tests (in
`ecmp/inner_hashing/test_inner_hashing_lag.py` in sonic-mgmt) may create
100 LAGs, and when they are all destroyed, some of them may fail to be
properly destroyed, leaving stale port channels behind. This seems to be
because the netlink socket buffers on which the teamd processes receive
notifications fill up with events from the other port
channels/interfaces going down.

As a workaround, add short delays between killing the teamd processes so
that the netlink buffers do not fill up and cause messages to be dropped.

This delay was chosen arbitrarily, and it seems to work well with 100
LAGs on a KVM. It can probably be made a bit more aggressive if needed
(e.g. 0.05 seconds every 20 processes).

Signed-off-by: Saikrishna Arcot <[email protected]>
This requires overriding some libc functions and capturing information
about kill signals sent or intercepting file open operations.

Signed-off-by: Saikrishna Arcot <[email protected]>
@saiarcot895
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

Azure Pipelines successfully started running 1 pipeline(s).

3 participants