Race issue after node reboot #1221
Comments
Just an update: adding -f to the copy command looks like it fixes the issue.
Coincidentally, we also saw this error crop up yesterday with one of our edge clusters after rebooting.
As an FYI, I see that different deployment YAMLs use different ways to copy the CNI binary in the init container; the first one[1] is at multus-cni/deployments/multus-daemonset.yml, line 207 in 8e5060b. I'm not sure that copying the file atomically will solve the above issue, though (see the sketch after this comment).
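A minimal sketch of what an atomic copy could look like, assuming the shim paths used by the upstream daemonset (/usr/src/multus-cni/bin in the image, /host/opt/cni/bin on the host; both paths are assumptions here). rename(2) replaces the directory entry without ever opening the running binary for writing, so it avoids the "Text file busy" failure, though whether that fixes the race described in this issue is exactly the open question.

```sh
# Hypothetical atomic install: copy to a temp name on the same filesystem,
# then rename over the target. rename(2) is atomic and does not open the
# destination binary for writing, so a running shim does not block it.
cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/.multus-shim.tmp
mv /host/opt/cni/bin/.multus-shim.tmp /host/opt/cni/bin/multus-shim
```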
This should hopefully be addressed with #1213.
Saw this in minikube today. No rebooting, just starting up a new minikube cluster.
I also got a reproduction after rebooting a node and having multus restart. I mitigated it by deleting /opt/cni/bin/multus-shim on the node.
Seems I can make this happen any time I ungracefully restart a node, worker or master: it creates this error and stops pod network sandbox recreation completely on that node. The fix mentioned above does work, but this likely means a power outage on a node will require manual intervention, whereas without multus that is not required. This error should be handled properly.
+1. This seems like a pretty serious issue. Can we get a fix merged for it soon, please?
Can additionally confirm this behavior. As @dougbtv mentioned, removing /opt/cni/bin/multus-shim works around it.
+1, happened to me as well, cluster did not come up. Any chance of fixing this soon?
Same here, on a kubespray 1.29 cluster.
This certainly needs to be fixed right away.
@dougbtv: Hit exactly the same issue. Deleting /opt/cni/bin/multus-shim helps. When could this be fixed?
Hit the same issue with kube-ovn. Already posted it there (kubeovn/kube-ovn#4470).
Also hit me today on a node that crashed. Any indication that this fix is going to be picked up any time soon?
Had the same problem today on a Talos Kubernetes cluster. I modified the kube-multus-ds init containers to check for an existing multus-shim file before copying; the original and new commands are sketched below. This worked for me 👍
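A sketch of what the original and new commands would roughly look like, assuming the shim paths from the upstream daemonset (the exact image paths may differ):

```sh
# Original command (assumed): unconditionally copy the shim. This fails
# with "Text file busy" if crio is still executing the old binary.
cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim

# New command: only copy when no shim is present on the host yet.
test -f /host/opt/cni/bin/multus-shim || \
  cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim
```

Note the caveat raised below: checking only for existence means an upgrade will leave the old shim in place.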
Thanks @iSenne, I already use this code in my Kubernetes cluster.
Keep in mind that upgrading multus would leave you with an old shim file if you only check for existence.
Hey, we are also really blocked by this issue. What can we do to push this forward?
An immediate mitigation that will get Multus running temporarily is to edit the DaemonSet directly and modify the cp command to add the -f flag: scroll to the multus-installer initContainer and adjust its command as sketched below.
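For illustration, the edited command would look something like this; the source and destination paths are assumptions based on the upstream daemonset:

```sh
# With -f, when cp cannot open the destination for writing (e.g. "Text
# file busy" because the old shim is still executing), it unlinks the
# destination and retries instead of failing.
cp -f /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim
```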
I think the concern at this point for folks wanting to use multus is not about having a "workaround", but the seeming inability to get a fix upstreamed, leading to questions about the health of the multus project. #1213, for example, has been open since Jan 18 and hasn't gotten any comment since Aug 12. Please don't see this comment as knocking the devs' hard work. It is very much appreciated, really. Just trying to gauge the health of the project, though.
So crazy this has been ignored by maintainers this long. 🙄
FYI: I made a PR to add the -f flag.
This has been bothering me for quite some time: whenever I do node maintenance, the whole cluster does not come up and I have to delete the shim binary manually.
Hi, it looks like there is an issue after a node reboot where we can have a race in multus that will prevent pods from starting.

The problem is mainly that, after a reboot, multus-shim gets called by crio to start pods, but the multus pod is not able to start because the init container fails to cp the shim.

The reason the copy fails is that crio called the shim, which is stuck waiting for communication with the multus pod.
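For anyone wanting to see the underlying failure mode in isolation, here is a minimal reproduction on a plain Linux shell (no multus involved; the file names are made up for the demo):

```sh
# Simulate the race: a binary is still executing while we try to overwrite it.
cp /bin/sleep /tmp/shim
/tmp/shim 60 &              # stands in for multus-shim invoked by crio

cp /bin/sleep /tmp/shim     # fails: cp: cannot create regular file
                            # '/tmp/shim': Text file busy

# Unlinking first (what cp -f or a rename-based install effectively does)
# succeeds, because removing the directory entry of a running binary is fine.
cp -f /bin/sleep /tmp/shim
```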