Drift replacement stuck due to "Cannot disrupt NodeClaim" #1684

willthames · 2024-09-18T03:12:41Z

Description

Observed Behavior:

Two nodes were replaced during drift replacement, next one seems stuck with

  Normal   DisruptionBlocked      19s (x29 over 57m)  karpenter  Cannot disrupt Node: state node is marked for deletion

There is nothing in karpenter's logs to explain this. We did see similar behaviour during the karpenter 1.0.1 upgrade but put that down to API version mismatches but we don't seem to have any such mismatches this time.

k get nodeclaims -o custom-columns='APIVER:.apiVersion,NAME:.metadata.name,OWNER_API_VER:.metadata.ownerReferences[0].apiVersion,OWNERKIND:.metadata.ownerReferences[0].kind'
APIVER            NAME                               OWNER_API_VER     OWNERKIND
karpenter.sh/v1   bottlerocket-general-amd64-6k89s   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-general-amd64-cmdfv   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-general-amd64-hw9p6   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-general-amd64-jjmnz   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-general-amd64-ml9q4   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-general-amd64-z7r8c   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-smaller-amd64-gc2nt   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-smaller-amd64-sr2lr   karpenter.sh/v1   NodePool

Expected Behavior:

All nodes get replaced during drift replacement

Reproduction Steps (Please include YAML):

Versions:

Chart Version: 1.0.2
Kubernetes Version (kubectl version):

Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.30.3-eks-2f46c53

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

The text was updated successfully, but these errors were encountered:

k8s-ci-robot · 2024-09-18T03:12:49Z

This issue is currently awaiting triage.

If Karpenter contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

willthames · 2024-09-19T05:15:07Z

I've done some more digging on this and can add some additional information.

The node and nodeclaim both have deletionTimestamp set, which means they're just waiting for their finalizers to be removed before entering Terminating.

Looking at the code it seems that the node finalizer ensures that all the nodeclaims related to the node are deleted before it finishes by calling cloudProvider.Delete for the nodeclaim, then the node will finalize and terminate and then the nodeclaims will finalize and terminate.

I can't see any evidence that the deletion request has made it to the cloud (I did a athena search over cloudtrail for all non read-only requests in ec2 in the region, no calls to TerminateInstances (or similar) were made) so it seems to be something in between.

The only thing I can see that would cause the cloud termination never to be reached, without any errors being logged, is if one of the two actions that cause a reconcile requeue never finish - namely the drain, or the volume attachment tidy up.

Actually it can't be the drain, because a failed drain causes a node event to be published.

leoryu · 2024-09-19T08:30:19Z

Are you using aws provider? Karpenter will create an new nodeclaim firstly to ensure the pods can be scheduled. Cloud you see your controller logs to check whether the provider is trying to create an nodeclaim or not?

willthames · 2024-09-20T04:31:46Z

It is the volume attachments.

If I do kubectl get volumeattachments there is still a volume attached to the node.

However, the persistent volume associated with the attachment is in Released status (the PVC, and the pod that was associated with the claim, no longer exist, presumably having been drained).

I know that the fix is to ignore released attachments in filterVolumeAttachments, I'm just struggling to create a test case that fails without the fix and passes with the fix!

willthames · 2024-09-20T05:06:31Z

An easy way to validate that this was the problem (in hindsight, obviously!) is that
kubectl delete volumeattachment on the volume attachment associated with the stuck node causes the node to finally terminate.

wmgroot · 2024-09-20T22:08:14Z

We are seeing similar behavior with karpenter v1, nodeclaims stuck in a drifted state without ever being disrupted.

In my case, I have a node with no volumeattachments

$ kubectl get volumeattachments | grep ip-10-115-210-142.us-east-2.compute.internal
<nothing>

My node says that disruption is blocked due to a pending pod, but I have no pending pods in my cluster, and the node in question has a taint to allow only a single do-not-disrupt pod to schedule there as a test case.

$ kubectl describe node ip-10-115-210-142.us-east-2.compute.internal

Taints:             test-disruption=true:NoSchedule
Unschedulable:      false

Events:
  Type     Reason                   Age                   From                   Message
  ----     ------                   ----                  ----                   -------
  Normal   NodeHasSufficientMemory  47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeAllocatableEnforced  47m                   kubelet                Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientPID     47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeHasNoDiskPressure    47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   Starting                 47m                   kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      47m                   kubelet                invalid capacity 0 on image filesystem
  Normal   RegisteredNode           47m                   node-controller        Node ip-10-115-210-142.us-east-2.compute.internal event: Registered Node ip-10-115-210-142.us-east-2.compute.internal in Controller
  Normal   Synced                   47m                   cloud-node-controller  Node synced successfully
  Normal   DisruptionBlocked        47m                   karpenter              Cannot disrupt Node: state node isn't initialized
  Normal   NodeReady                47m                   kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeReady
  Normal   DisruptionBlocked        39m (x4 over 45m)     karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        37m                   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        30m (x4 over 36m)     karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        29m                   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        4m58s (x13 over 29m)  karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        45s (x2 over 2m46s)   karpenter              Cannot disrupt Node: state node is nominated for a pending pod

$ kubectl get pod -n default -o wide
NAME                                 READY   STATUS    RESTARTS   AGE   IP              NODE                                           NOMINATED NODE   READINESS GATES
hello-world-nginx-5964768b4c-fnrxp   1/1     Running   0          77m   10.75.173.246   ip-10-115-217-230.us-east-2.compute.internal   <none>           <none>

I am using terminationGracePeriod on this nodeclaim, and I expect that disruption via drift should make the node unschedulable and create a new nodeclaim for pods to reschedule on.

wmgroot · 2024-09-20T23:19:51Z

Opened up a new issue here since my problem looks to be unrelated to the issue with volumeattachments, even though it results in similar behavior.
#1702

willthames · 2024-09-23T07:24:01Z

Thanks to @AndrewSirenko for providing some valuable insight in #1700 by suggesting that maybe the CSI drivers weren't handling volume detachment correctly. I had also missed that only the EBS CSI volumes were affected, and the EFS volumes were being handled fine.

I decided to check the EBS controller logs during node termination, only to discover I no longer had an EBS controller on the node, because the EBS node daemonset didn't tolerate the termination taints.

Once I changed the tolerations so that the EBS node controller remained alive during termination that meant that the volumes could be cleaned up appropriately and the drift replacement now works perfectly again.

It looks like #1294 was released with 1.0.1 and was a breaking change for us due to our incorrect EBS CSI configuration which we'd previously got away with!

willthames · 2024-09-23T07:25:08Z

Closing this now

AndrewSirenko · 2024-09-24T00:11:43Z

I decided to check the EBS controller logs during node termination, only to discover I no longer had an EBS controller on the node, because the EBS node daemonset didn't tolerate the termination taints.

Glad you root caused this @willthames, and thanks for sharing this tricky failure mode. I'll make sure we over at EBS CSI Driver add this to some kind of Karpenter + EBS CSI FAQ/Troubleshooting guide.

Just curious, but what version of the EBS CSI Driver were you running? v1.29.0 added a check for the the karpenter.sh/disrupting taint in PR#1969, when karpenter changed to a custom taint last year. If your installation ≥ v1.29.0 then this is something for the EBS team to investigate... Cheers!

willthames · 2024-09-24T00:34:45Z

I've just checked the running version in as yet unfixed cluster, it's v1.34.0 - so the version shouldn't be a problem (we have a github action that regularly checks our helm charts and bumps them so we're rarely too far off the leading edge)

I'll validate that the correct taints are being applied and watched for when I apply the AMI bump to our remaining cluster

willthames · 2024-09-24T01:05:07Z

@AndrewSirenko I've raised kubernetes-sigs/aws-ebs-csi-driver#2158 now - it seems that the taint has changed with v1 to karpenter.sh/disrupted

willthames added the kind/bug Categorizes issue or PR as related to a bug. label Sep 18, 2024

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 18, 2024

willthames mentioned this issue Sep 20, 2024

fix: Ignore released volumes when deleting a node #1699

Closed

willthames mentioned this issue Sep 20, 2024

Only wait for volume attachments for drainable nodes #1700

Closed

njtran assigned jmdeal Sep 20, 2024

wmgroot mentioned this issue Sep 20, 2024

Drifted NodeClaims with a TerminationGracePeriod are not considered for disruption #1702

Closed

willthames closed this as completed Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drift replacement stuck due to "Cannot disrupt NodeClaim" #1684

Drift replacement stuck due to "Cannot disrupt NodeClaim" #1684

willthames commented Sep 18, 2024

k8s-ci-robot commented Sep 18, 2024

willthames commented Sep 19, 2024

leoryu commented Sep 19, 2024

willthames commented Sep 20, 2024

willthames commented Sep 20, 2024

wmgroot commented Sep 20, 2024

wmgroot commented Sep 20, 2024

willthames commented Sep 23, 2024

willthames commented Sep 23, 2024

AndrewSirenko commented Sep 24, 2024

willthames commented Sep 24, 2024

willthames commented Sep 24, 2024

Drift replacement stuck due to "Cannot disrupt NodeClaim" #1684

Drift replacement stuck due to "Cannot disrupt NodeClaim" #1684

Comments

willthames commented Sep 18, 2024

Description

k8s-ci-robot commented Sep 18, 2024

willthames commented Sep 19, 2024

leoryu commented Sep 19, 2024

willthames commented Sep 20, 2024

willthames commented Sep 20, 2024

wmgroot commented Sep 20, 2024

wmgroot commented Sep 20, 2024

willthames commented Sep 23, 2024

willthames commented Sep 23, 2024

AndrewSirenko commented Sep 24, 2024

willthames commented Sep 24, 2024

willthames commented Sep 24, 2024