Add a delay between killing teamd processes #3325

saiarcot895 · 2024-10-14T01:13:49Z

What I did

When killing 10 or more teamd processes, add a delay of 0.1 seconds after every 10 kill signals/processes. This is needed because the LAG scale tests (`ecmp/inner_hashing/test_inner_hashing_lag.py` in sonic-mgmt) may create 100 LAGs, and when all of them are destroyed, some LAGs may fail to be destroyed properly, leaving stale port channels behind. This seems to happen because the netlink socket buffers on which the teamd processes receive notifications fill up with events from the other port channels/interfaces going down.

Why I did it

As a workaround, add some delays when killing the teamd processes so that the netlink buffers don't fill up and cause messages to be dropped.

This delay was chosen arbitrarily, and it seems to work well with 100 LAGs on a KVM. It can probably be made a bit more aggressive if needed (e.g. 0.05 seconds every 20 processes).
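The batching described above can be sketched as follows. This is only an illustration of the idea in Python, not the code in this PR: the function name `kill_in_batches`, its parameters, and the choice of `SIGTERM` are assumptions made for the example. The `kill` and `sleep` callables are injectable so that a test can record what was sent without signalling real processes.

```python
import os
import signal
import time


def kill_in_batches(pids, batch_size=10, delay=0.1, sig=signal.SIGTERM,
                    kill=os.kill, sleep=time.sleep):
    """Send `sig` to each PID, pausing `delay` seconds after every
    `batch_size` signals. Returns the number of signals sent.

    All names and defaults here are hypothetical; they mirror the
    0.1 seconds / 10 processes values mentioned in the description.
    """
    sent = 0
    for pid in pids:
        try:
            kill(pid, sig)
        except ProcessLookupError:
            # The process already exited; nothing to signal.
            continue
        sent += 1
        # Pause after each full batch so pending netlink events can drain.
        if sent % batch_size == 0:
            sleep(delay)
    return sent
```

With fewer than 10 PIDs no sleep happens at all; with 25 PIDs the function sleeps twice (after the 10th and 20th signal). Passing `batch_size=20, delay=0.05` would give the more aggressive variant mentioned above.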

How I verified it

On a KVM testbed with the t0-116 topology and a bit more than 100 LAGs, stop teamd using `sudo systemctl stop teamd`, and verify that all of the LAGs are deleted and that the kernel logs no messages similar to the following:

Oct 12 21:33:03 vlab-04 kernel: PortChannel41 (unregistering): Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel17 (unregistering): Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel22: Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel22: Failed to send port change of device Ethernet136 via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel22: Port device Ethernet136 removed
Oct 12 21:33:03 vlab-04 kernel: PortChannel43: Failed to send options change via netlink (err -105)
Oct 12 21:33:03 vlab-04 kernel: PortChannel43: Failed to send port change of device Ethernet174 via netlink (err -105)
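On Linux, errno 105 is `ENOBUFS` ("No buffer space available"), which is consistent with the netlink buffers filling up. As a rough aid for the verification step, a small script could scan a saved kernel log for these messages. The sketch below is an assumption-laden example, not part of this PR: the function name `find_netlink_failures` and the regular expression are made up here and only match the message shapes shown above.

```python
import re

# Matches lines such as:
#   "... kernel: PortChannel41 (unregistering): Failed to send options
#    change via netlink (err -105)"
NETLINK_ERR = re.compile(
    r"kernel: (?P<dev>PortChannel\d+)(?: \(unregistering\))?: "
    r"Failed to send .* via netlink \(err (?P<err>-?\d+)\)"
)


def find_netlink_failures(log_lines):
    """Return a sorted list of port channel names that logged a
    netlink send failure in the given log lines."""
    devices = set()
    for line in log_lines:
        match = NETLINK_ERR.search(line)
        if match:
            devices.add(match.group("dev"))
    return sorted(devices)
```

An empty result after `sudo systemctl stop teamd` would suggest that no netlink notifications were dropped during that run; it does not by itself prove that every LAG was removed, so the port channel list should still be checked.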

Details if related

Partial fix for sonic-net/sonic-buildimage#19310.

saiarcot895 · 2024-10-22T01:20:05Z

/azpw run

mssonicbld · 2024-10-22T01:20:07Z

/AzurePipelines run

azure-pipelines · 2024-10-22T01:20:17Z

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Saikrishna Arcot <[email protected]>

saiarcot895 requested a review from judyjoseph as a code owner October 14, 2024 01:13

dgsudharsan added the Request for 202405 Branch label Oct 16, 2024

saiarcot895 added 2 commits October 21, 2024 17:45

Update LAG removal code to use the same logic as cleaning up all LAGs

f4fd3ab

Signed-off-by: Saikrishna Arcot <[email protected]>

Update tests to test LAG cleanup and to test with the new code

7b6fc53

This requires overriding some libc functions and capturing information about kill signals sent or intercepting file open operations.

Signed-off-by: Saikrishna Arcot <[email protected]>

saiarcot895 requested a review from prsunny as a code owner October 22, 2024 00:47

Merge remote-tracking branch 'origin/master' into teamd-delay-kill

27f6d3c

saiarcot895 and others added 3 commits October 22, 2024 15:59

Merge remote-tracking branch 'origin/master' into teamd-delay-kill

bdd47c7

Add more tests to cover more cases

c5d84cf

Signed-off-by: Saikrishna Arcot <[email protected]>

Merge branch 'master' into teamd-delay-kill

1dd20a0
