cmdDel fails releasing the device when kubelet deletes pause container #126
Comments
I think the long-term solution would be to wait for the kubelet fix. As for the workaround:
I tried switching to the init namespace; however, the device is not visible there. The device only becomes visible on the host after cmdDel has been called for all the devices.
We're running into this currently. We have 4-6 interfaces in use by the CNI, but often find 1 or 2 left with a bad interface name and various settings that weren't reverted. The host usually has enough information to fix the handed-back/abandoned interfaces after the failed/incomplete cmdDel. The struggle is that we're then racing the cleanup against the pods spinning back up and requesting new interfaces; if they hit one of the abandoned interfaces before cleanup, things go south. Also, when something like the Mellanox E-Switch is involved, the host doesn't have enough information to safely nuke entries when MACs are being reused.
@zshi-redhat
Do you know why they aren't visible in the init netns? I figured that once the pod netns is deleted, the devices would return to the init netns. The SR-IOV CNI could detect that the pod netns is deleted but continue on and verify the device is in the appropriate state.
So from within the sriov-cni process, when I tried to list devices in the init ns, the devices don't show up. Only after the last cmdDel invocation finishes do the devices show up in the host ns (at least that's what I remember).
In our use case the job uses all IB devices for training. If even one device is not healthy, the job will not run.
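For anyone debugging the "not visible until the last cmdDel finishes" observation above, here is a small sketch (mine, not from sriov-cni) for checking from the host whether a VF's netdev has come back: sysfs only lists an entry under the device's net/ directory while the interface is in the namespace sysfs was mounted in, so the entry reappears once the kernel returns the VF to the host netns. The PCI address is just an example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hostNetdevForVF returns the netdev name for a VF PCI address if the
// interface is currently visible in this (host) network namespace; sysfs
// hides the net/ entry while the interface lives in another netns.
func hostNetdevForVF(pciAddr string) (string, error) {
	entries, err := os.ReadDir(filepath.Join("/sys/bus/pci/devices", pciAddr, "net"))
	if err != nil || len(entries) == 0 {
		return "", fmt.Errorf("VF %s netdev not visible in this namespace", pciAddr)
	}
	return entries[0].Name(), nil
}

func main() {
	// Example PCI address only; substitute the VF under test.
	if name, err := hostNetdevForVF("0000:05:00.1"); err != nil {
		fmt.Println(err)
	} else {
		fmt.Println("VF visible on host as", name)
	}
}
```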
As a follow-up, our post-cmdDel() failure cleanup now resets the VF, which causes all E-Switch entries related to that VF to be removed as well. This prevents MAC collisions in the E-Switch, as we are changing MACs for bonding. Since the host namespace only sees VFs that aren't assigned out, we can 'safely' reset all VFs we see without concern about their state. Though we still have the race condition where a released VF in a bad state might be assigned out before we clean it.
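For reference, a hedged sketch of one way such a VF reset could be implemented (driver unbind/rebind through sysfs); this is not necessarily the mechanism used above, and whether it also clears the Mellanox E-Switch entries depends on the driver/firmware.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resetVF unbinds and rebinds the VF's driver via sysfs, which recreates the
// netdev with default settings. pciAddr is e.g. "0000:05:00.1". This is a
// sketch: it assumes the VF is currently bound and visible on the host.
func resetVF(pciAddr string) error {
	driverDir, err := filepath.EvalSymlinks(filepath.Join("/sys/bus/pci/devices", pciAddr, "driver"))
	if err != nil {
		return fmt.Errorf("resolve driver for %s: %w", pciAddr, err)
	}
	if err := os.WriteFile(filepath.Join(driverDir, "unbind"), []byte(pciAddr), 0200); err != nil {
		return fmt.Errorf("unbind %s: %w", pciAddr, err)
	}
	if err := os.WriteFile(filepath.Join(driverDir, "bind"), []byte(pciAddr), 0200); err != nil {
		return fmt.Errorf("bind %s: %w", pciAddr, err)
	}
	return nil
}

func main() {
	if err := resetVF("0000:05:00.1"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```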
Any updates here? Using the sriov-cni with dhcp ipam seems to exacerbate the issue as well. |
Hi @YitzyD @blackgold, a question: can you share your pod yaml? Just to be sure, are you using
Also, which container runtime are you using? I tried this with CRI-O and am not able to reproduce the issue after using #220
What happened?
Kubelet doesn't guarantee that the pause container stays alive while the CNI deletes all the devices attached to the pod. When the pause container is deleted, the netns is no longer available for cmdDel to release the device from. This leaves the device on the host with the wrong name, a missing IP, and wrong settings.
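To make the failure mode concrete, here is a rough sketch (my own, not the actual sriov-cni code) of how a DEL handler could tolerate the netns already being gone and fall back to host-only cleanup instead of failing; hostOnlyCleanup and fullCleanup are hypothetical placeholders.

```go
package main

import (
	"log"
	"os"

	"github.com/containernetworking/cni/pkg/skel"
)

// cmdDel sketches a DEL handler that tolerates the pod netns already being
// gone: args.Netns is either empty or points at a /proc path that no longer
// exists once the pause container has been removed.
func cmdDel(args *skel.CmdArgs) error {
	nsGone := args.Netns == ""
	if !nsGone {
		if _, err := os.Stat(args.Netns); os.IsNotExist(err) {
			nsGone = true
		}
	}
	if nsGone {
		log.Printf("netns %q already gone, doing host-only cleanup", args.Netns)
		return hostOnlyCleanup(args)
	}
	return fullCleanup(args)
}

func hostOnlyCleanup(args *skel.CmdArgs) error { return nil } // placeholder: reset host-side VF state
func fullCleanup(args *skel.CmdArgs) error     { return nil } // placeholder: enter netns, move interface back, then reset

func main() {
	// In a real plugin this would be wired up through the CNI skel package;
	// omitted here to keep the sketch self-contained.
	_ = cmdDel
}
```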
What did you expect to happen?
Kubelet should provide some guarantee that the netns is available until the CNI has deleted all attached devices.
What are the minimal steps needed to reproduce the bug?
Attach at least 4 SR-IOV devices to a pod. Kill the pod.
To consistently reproduce the error, add a 1-second sleep in cmdDel (a toy illustration follows).
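A toy illustration of the race (names and structure are mine, not the plugin's): DEL is invoked once per attached device, so a 1-second delay per device gives kubelet several seconds in which to remove the pause container mid-teardown.

```go
package main

import (
	"fmt"
	"time"
)

// cmdDel stands in for the plugin's DEL handler; only the added delay matters.
func cmdDel(device string) error {
	time.Sleep(1 * time.Second) // artificial delay to widen the race window
	fmt.Println("released", device)
	return nil
}

func main() {
	// Simulate the sequential DEL invocations for four attached SR-IOV devices.
	for _, dev := range []string{"net1", "net2", "net3", "net4"} {
		if err := cmdDel(dev); err != nil {
			fmt.Println(err)
		}
	}
}
```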
Anything else we need to know?
Raised the issue with Kubernetes but was unable to get any positive response.
kubernetes/kubernetes#89440
As a workaround, we run a daemon that periodically tries to fix the broken devices on the host.
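A rough sketch of what such a daemon might look like; the PF name, the expected VF naming scheme, and the use of ip link are all assumptions here, not the author's actual tool.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
	"time"
)

// fixVFNames walks the VFs of one PF via sysfs and renames any VF netdev
// that is visible on the host but no longer carries its expected name.
// VFs still inside a pod netns have no net/ entry here and are skipped.
func fixVFNames(pf string) {
	vfLinks, _ := filepath.Glob(filepath.Join("/sys/class/net", pf, "device", "virtfn*"))
	for _, vfLink := range vfLinks {
		entries, err := os.ReadDir(filepath.Join(vfLink, "net"))
		if err != nil || len(entries) == 0 {
			continue // VF assigned to a pod, or no netdev bound
		}
		current := entries[0].Name()
		idx := strings.TrimPrefix(filepath.Base(vfLink), "virtfn")
		expected := fmt.Sprintf("%svf%s", pf, idx) // assumed naming scheme
		if current == expected {
			continue
		}
		// Interfaces must be down to be renamed; bring the VF back to its expected name.
		_ = exec.Command("ip", "link", "set", "dev", current, "down").Run()
		if err := exec.Command("ip", "link", "set", "dev", current, "name", expected).Run(); err != nil {
			fmt.Fprintf(os.Stderr, "rename %s -> %s: %v\n", current, expected, err)
		}
	}
}

func main() {
	for {
		fixVFNames("enp5s0") // example PF name; adjust as needed
		time.Sleep(30 * time.Second)
	}
}
```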
Component Versions
Please fill in the below table with the version numbers of applicable components used.
Config Files
Config file locations may be config dependent.
CNI config (Try '/etc/cni/net.d/')
Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kubeconfig file
SR-IOV Network Custom Resource Definition
Logs
SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)
Added some custom logs to print cmdArgs and netns:
time="2020-04-24T17:22:52Z" level=info msg="read from cache &{NetConf:{CNIVersion:0.3.1 Name:sriov-network Type:sriov Capabilities:map[] IPAM:{Type:} DNS:{Nameservers:[] Domain: Search:[] Options:[]} RawPrevResult:map[dns:map[] interfaces:[map[name:net1 sandbox:/proc/4281/ns/net]]] PrevResult:} DPDKMode:false Master:enp5s0 MAC: AdminMAC: EffectiveMAC: Vlan:0 VlanQoS:0 DeviceID:0000:05:00.1 VFID:0 HostIFNames:net1 ContIFNames:net1 MinTxRate: MaxTxRate: SpoofChk: Trust: LinkState: Delegates:[{CNIVersion:0.3.1 Name:sbr Type:sbr Capabilities:map[] IPAM:{Type:} DNS:{Nameservers:[] Domain: Search:[] Options:[]} RawPrevResult:map[] PrevResult:}] RuntimeConfig:{Mac:} IPNet:}"
time="2020-04-24T17:22:52Z" level=info msg="empty netns , error = failed to Statfs "/proc/4281/ns/net": no such file or directory"
time="2020-04-24T17:22:52Z" level=info msg="ReleaseVF "
time="2020-04-24T17:22:52Z" level=error msg="failed to get netlink device with name net1"
Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)
Mar 23 21:04:42 dgx0098 kubelet[29124]: 2020-03-23T21:04:42Z [error] Multus: error in invoke Delegate del - "sriov": error in removing device from net namespace: 1failed to get netlink device with name net3: Link not found
Mar 23 21:04:42 dgx0098 kubelet[29124]: 2020-03-23T21:04:42Z [debug] delegateDel: , net2, &{{0.3.1 sriov-network sriov map[] {} {[] [] []}} { []} false false [123 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 101 108 101 103 97 116 101 115 34 58 91 123 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 110 9 101 34 58 34 115 98 114 34 44 34 116 121 112 101 34 58 34 115 98 114 34 125 93 44 34 100 101 118 105 99 101 73 68 34 58 34 48 48 48 48 58 48 99 58 48 48 46 49 34 44 34 110 97 109 101 34 58 34 115 114 105 111 118 45 110 101 116 119 111 114 107 34 44 34 116 121 112 101 34 58 34 115 114 105 111 118 34 125]}, &{cfba15035e7ef328153ba5c88853b52f97740560bc27a0707ab2f5b536a8f863 /proc/32764/ns/net net2 [[IgnoreUnknown 1] [K8S_POD_NAMESPACE user] [K8S_POD_NAME 847138-worker-1] [K8S_POD_INFRA_CONTAINER_ID cfba15035e7ef328153ba5c88853b52f97740560bc27a0707ab2f5b536a8f863]] map[] }, /opt/cni/bin
Mar 23 21:04:42 dgx0098 kubelet[29124]: 2020-03-23T21:04:42Z [verbose] Del: user:847138-worker-1:sriov-network:net2 {"cniVersion":"0.3.1","delegates":[{"cniVersion":"0.3.1","name":"sbr","type":"sbr"}],"deviceID":"0000:0c:00.1","name":"sriov-network","type":"sriov"}
Mar 23 21:04:46 dgx0098 kubelet[29124]: I0323 21:04:46.544632 29124 plugins.go:391] Calling network plugin cni to tear down pod "847138-worker-1_user"