Drifted NodeClaims with a TerminationGracePeriod are not considered for disruption #1702

Closed
wmgroot opened this issue Sep 20, 2024 · 7 comments
Labels: kind/bug, needs-triage

wmgroot (Contributor) commented Sep 20, 2024

Description

Observed Behavior:
NodeClaims enter a Drifted state, but are never disrupted.

$ kubectl get nodeclaim worker-test-disruption-xz5jd -o yaml
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    compatibility.karpenter.k8s.aws/cluster-name-tagged: "true"
    compatibility.karpenter.k8s.aws/kubelet-drift-hash: "460580454793782748"
    karpenter.k8s.aws/ec2nodeclass-hash: "4188936271944326608"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
    karpenter.k8s.aws/tagged: "true"
    karpenter.sh/nodepool-hash: "12987612610332357418"
    karpenter.sh/nodepool-hash-version: v3
    test: "1"
  creationTimestamp: "2024-09-20T20:50:38Z"
  finalizers:
  - karpenter.sh/termination
  generateName: worker-test-disruption-
  generation: 1
  labels:
    apps.indeed.com/karpenter: "true"
    apps.indeed.com/karpenter-node-pool: worker-test-disruption
    apps.indeed.com/network-zone: qa
    foo: barrrrrr
    karpenter.k8s.aws/instance-category: c
    karpenter.k8s.aws/instance-cpu: "8"
    karpenter.k8s.aws/instance-cpu-manufacturer: aws
    karpenter.k8s.aws/instance-ebs-bandwidth: "4750"
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "false"
    karpenter.k8s.aws/instance-family: c6g
    karpenter.k8s.aws/instance-generation: "6"
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "16384"
    karpenter.k8s.aws/instance-network-bandwidth: "2500"
    karpenter.k8s.aws/instance-size: 2xlarge
    karpenter.sh/capacity-type: on-demand
    karpenter.sh/nodepool: worker-test-disruption
    kubernetes.io/arch: arm64
    kubernetes.io/os: linux
    node.indeed.com/env: infra
    node.kubernetes.io/instance-type: c6g.2xlarge
    role.node.kubernetes.io/worker: "true"
    test-disruption: "true"
    topology.istio.io/subzone: awscmhdev3
    topology.k8s.aws/zone-id: use2-az2
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2b
  name: worker-test-disruption-xz5jd
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: worker-test-disruption
    uid: b401cf94-ad8d-4341-9eb2-5cf97cee2d46
  resourceVersion: "871675110"
  uid: b207ad83-829a-44b2-99c0-c636dac60ad0
spec:
  expireAfter: 720h
  nodeClassRef:
    group: karpenter.k8s.aws
    kind: EC2NodeClass
    name: worker-test-disruption
  requirements:
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: foo
    operator: In
    values:
    - barrrrrr
  - key: apps.indeed.com/karpenter-node-pool
    operator: In
    values:
    - worker-test-disruption
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values:
    - c
    - m
    - r
  - key: role.node.kubernetes.io/worker
    operator: In
    values:
    - "true"
  - key: karpenter.k8s.aws/instance-cpu
    operator: Gt
    values:
    - "7"
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - worker-test-disruption
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values:
    - "2"
  - key: apps.indeed.com/network-zone
    operator: In
    values:
    - qa
  - key: apps.indeed.com/karpenter
    operator: In
    values:
    - "true"
  - key: node.indeed.com/env
    operator: In
    values:
    - infra
  - key: test-disruption
    operator: In
    values:
    - "true"
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
    - arm64
  - key: topology.istio.io/subzone
    operator: In
    values:
    - awscmhdev3
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c4.2xlarge
    - c5.2xlarge
    - c5a.2xlarge
    - c5a.4xlarge
    - c5ad.2xlarge
    - c5d.2xlarge
    - c5n.2xlarge
    - c6a.2xlarge
    - c6a.4xlarge
    - c6g.2xlarge
    - c6g.4xlarge
    - c6gd.2xlarge
    - c6gd.4xlarge
    - c6gn.2xlarge
    - c6i.2xlarge
    - c6id.2xlarge
    - c6in.2xlarge
    - c7a.2xlarge
    - c7g.2xlarge
    - c7g.4xlarge
    - c7gd.2xlarge
    - c7gn.2xlarge
    - c7i-flex.2xlarge
    - c7i.2xlarge
    - m4.2xlarge
    - m5.2xlarge
    - m5a.2xlarge
    - m5ad.2xlarge
    - m5d.2xlarge
    - m5dn.2xlarge
    - m5n.2xlarge
    - m6a.2xlarge
    - m6g.2xlarge
    - m6g.4xlarge
    - m6gd.2xlarge
    - m6i.2xlarge
    - m6id.2xlarge
    - m6in.2xlarge
    - m7a.2xlarge
    - m7g.2xlarge
    - m7gd.2xlarge
    - m7i-flex.2xlarge
    - m7i.2xlarge
    - r4.2xlarge
    - r5.2xlarge
    - r5a.2xlarge
    - r5ad.2xlarge
    - r5b.2xlarge
    - r5d.2xlarge
    - r5n.2xlarge
    - r6a.2xlarge
    - r6g.2xlarge
    - r6gd.2xlarge
    - r6i.2xlarge
    - r6id.2xlarge
    - r7a.2xlarge
    - r7g.2xlarge
    - r7gd.2xlarge
    - r7i.2xlarge
    - r8g.2xlarge
  resources:
    requests:
      cpu: 3157m
      ephemeral-storage: 1752Mi
      memory: "6612603620"
      pods: "15"
  startupTaints:
  - effect: NoSchedule
    key: node.cilium.io/agent-not-ready
  taints:
  - effect: NoSchedule
    key: test-disruption
    value: "true"
  terminationGracePeriod: 24h0m0s
status:
  allocatable:
    cpu: 7700m
    ephemeral-storage: 213504Mi
    memory: "14828122931"
    pods: "43"
    vpc.amazonaws.com/pod-eni: "38"
  capacity:
    cpu: "8"
    ephemeral-storage: 250Gi
    memory: 15096Mi
    pods: "43"
    vpc.amazonaws.com/pod-eni: "38"
  conditions:
  - lastTransitionTime: "2024-09-20T21:01:09Z"
    message: ""
    reason: ConsistentStateFound
    status: "True"
    type: ConsistentStateFound
  - lastTransitionTime: "2024-09-20T20:51:32Z"
    message: ""
    reason: Consolidatable
    status: "True"
    type: Consolidatable
  - lastTransitionTime: "2024-09-20T21:02:18Z"
    message: NodePoolDrifted
    reason: NodePoolDrifted
    status: "True"
    type: Drifted
  - lastTransitionTime: "2024-09-20T20:51:32Z"
    message: ""
    reason: Initialized
    status: "True"
    type: Initialized
  - lastTransitionTime: "2024-09-20T20:50:40Z"
    message: ""
    reason: Launched
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-09-20T20:51:32Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-20T20:51:01Z"
    message: ""
    reason: Registered
    status: "True"
    type: Registered
  imageID: ami-06ab2496c2a927ca6
  nodeName: ip-10-115-210-142.us-east-2.compute.internal
  providerID: aws:///us-east-2b/i-0625839770e0cf9bf
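
For context, the NodePool that produced this NodeClaim is not shown here. A minimal sketch of the fields relevant to this report, reconstructed from the NodeClaim above (the exact template change used to trigger drift is an assumption, not the actual config):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: worker-test-disruption
spec:
  template:
    metadata:
      labels:
        # changing a template label (or annotation) changes karpenter.sh/nodepool-hash,
        # which marks existing NodeClaims as Drifted with reason NodePoolDrifted
        foo: barrrrrr
    spec:
      terminationGracePeriod: 24h0m0s   # copied to NodeClaim spec.terminationGracePeriod
      taints:
      - effect: NoSchedule
        key: test-disruption
        value: "true"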

No volumeattachments are involved with this issue.

$ kubectl get volumeattachments | grep ip-10-115-210-142.us-east-2.compute.internal
<nothing>

The node's events say that disruption is blocked by a pending pod, but I have no pending pods in my cluster, and the node in question is tainted so that only a single do-not-disrupt pod can schedule there as a test case.

$ kubectl describe node ip-10-115-210-142.us-east-2.compute.internal

Taints:             test-disruption=true:NoSchedule
Unschedulable:      false

Events:
  Type     Reason                   Age                   From                   Message
  ----     ------                   ----                  ----                   -------
  Normal   NodeHasSufficientMemory  47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeAllocatableEnforced  47m                   kubelet                Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientPID     47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeHasNoDiskPressure    47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   Starting                 47m                   kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      47m                   kubelet                invalid capacity 0 on image filesystem
  Normal   RegisteredNode           47m                   node-controller        Node ip-10-115-210-142.us-east-2.compute.internal event: Registered Node ip-10-115-210-142.us-east-2.compute.internal in Controller
  Normal   Synced                   47m                   cloud-node-controller  Node synced successfully
  Normal   DisruptionBlocked        47m                   karpenter              Cannot disrupt Node: state node isn't initialized
  Normal   NodeReady                47m                   kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeReady
  Normal   DisruptionBlocked        39m (x4 over 45m)     karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        37m                   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        30m (x4 over 36m)     karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        29m                   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        4m58s (x13 over 29m)  karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        45s (x2 over 2m46s)   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
$ kubectl get pod -n default -o wide
NAME                                 READY   STATUS    RESTARTS   AGE   IP              NODE                                           NOMINATED NODE   READINESS GATES
hello-world-nginx-5964768b4c-fnrxp   1/1     Running   0          77m   10.75.173.246   ip-10-115-217-230.us-east-2.compute.internal   <none>           <none>

$ kubectl get pod -n default hello-world-nginx-5964768b4c-fnrxp -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"
  creationTimestamp: "2024-09-20T20:49:39Z"
  generateName: hello-world-nginx-5964768b4c-
  labels:
    app: hello-world-nginx
    pod-template-hash: 5964768b4c
  name: hello-world-nginx-5964768b4c-fnrxp
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: hello-world-nginx-5964768b4c
    uid: 19b2047d-90ab-494d-ac65-94f2ad42971d
  resourceVersion: "871647910"
  uid: 4748ed5b-1cb5-4c00-a5f7-996e73c5ce97
spec:
  containers:
  - image: docker-registry.awscmhdev3.k8s.indeed.tech/dockerhub-proxy/nginxinc/nginx-unprivileged:1.23-alpine
    imagePullPolicy: IfNotPresent
    name: hello-world-nginx
    ports:
    - containerPort: 8080
      protocol: TCP
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/nginx
      name: nginx-conf
      readOnly: true
    - mountPath: /var/log/nginx
      name: log
    - mountPath: /var/cache/nginx
      name: cache
    - mountPath: /var/run
      name: run
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5mthz
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-115-217-230.us-east-2.compute.internal
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
    test-disruption: "true"
  preemptionPolicy: Never
  priority: 100
  priorityClassName: tier-3
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000
    runAsGroup: 1000
    runAsUser: 1000
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: test-disruption
    value: "true"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      items:
      - key: nginx.conf
        path: nginx.conf
      name: hello-world-nginx
    name: nginx-conf
  - emptyDir: {}
    name: log
  - emptyDir: {}
    name: cache
  - emptyDir: {}
    name: run
  - name: kube-api-access-5mthz
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

Expected Behavior:
Nodes with a TerminationGracePeriod set that host do-not-disrupt or PDB-blocked pods can still be disrupted when their NodeClaim drifts, and are eventually drained. A new NodeClaim is created as soon as disruption of the old NodeClaim begins.

Reproduction Steps (Please include YAML):

Versions:

  • Chart Version: 1.0.1
  • Kubernetes Version (kubectl version): 1.29

$ kubectl version
Client Version: v1.25.1
Kustomize Version: v4.5.7
Server Version: v1.29.7-eks-2f46c53

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@wmgroot added the kind/bug label Sep 20, 2024
@k8s-ci-robot added the needs-triage label Sep 20, 2024
k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

wmgroot (Contributor, Author) commented Sep 20, 2024

For anyone experiencing similar symptoms without TerminationGracePeriod, check whether you might be encountering the volumeattachments problem described in #1684.

njtran (Contributor) commented Sep 20, 2024

Adding correspondence from Slack:

Two potential influencing factors here:

  • We wait for full instance termination before removing a node at v1. This impacts NodePool Disruption Budgets (see the sketch below).
  • Karpenter includes pods on deleting nodes in scheduling simulations, so do-not-disrupt pods on a node that is being deleted are included in scheduling simulations under the assumption that the node will be cleaned up soon. Karpenter could think that these pods are "rescheduling" to a drifted node, and thus not consider that node a candidate for disruption.
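
To illustrate the first factor: NodePool disruption budgets cap how many nodes may be voluntarily disrupted at once, and at v1 a disrupted node counts against that cap until its instance has fully terminated. A minimal sketch (the budget value is illustrative, not the reporter's actual config):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: worker-test-disruption
spec:
  disruption:
    budgets:
    # with a budget of 1, a single node that drains slowly (e.g. under a 24h
    # terminationGracePeriod) blocks any further voluntary disruption in this NodePool
    - nodes: "1"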

I'm guessing you're more impacted by #2 based on the events I see. This is likely a difference from v0.37: since we didn't have TGP there and didn't enqueue nodes for deletion that had do-not-disrupt/PDB-blocking pods in the first place, you're now more likely to end up with indefinitely draining nodes. Not to mention that we also now block eviction on do-not-disrupt, so the average drain time might be higher than in v0.37.

If I had to guess, the best way to fix this would be for us to solve our preferences story (#666) and add a PreferNoSchedule taint for drifted nodes. We discussed adding that taint, but we didn't have a consistent story around how consolidation, disruption, and provisioning could all align so we don't get any flapping issues.

njtran (Contributor) commented Sep 20, 2024

More correspondence from Slack (some of which I know is included here already):
Yeah, we should definitely dive into it; once again, these were just theories. Can you open the issue with the logs/events and check which pods are being considered here? There should be pod nomination events and node nomination events.
It'd be interesting to see the following (a rough sketch of commands follows the list):

  • if the pod is scheduled to a node that is draining
  • if it's the same pod that is blocking another node from being drifted
  • if the pod has a do-not-disrupt annotation or a PDB
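
A rough sketch of kubectl checks for those three questions (the pod and node names are simply the ones from this issue, purely illustrative):

# 1. Is the pod scheduled to a node that is draining?
$ kubectl get pod -n default hello-world-nginx-5964768b4c-fnrxp -o jsonpath='{.spec.nodeName}'
$ kubectl get nodeclaim -o custom-columns=NAME:.metadata.name,NODE:.status.nodeName,DELETING:.metadata.deletionTimestamp

# 2. Which pods/nodes is Karpenter nominating? (pod and node nomination events)
$ kubectl get events -A | grep -i nominat

# 3. Does the pod carry do-not-disrupt, and is it covered by a PDB?
$ kubectl get pod -n default hello-world-nginx-5964768b4c-fnrxp -o yaml | grep do-not-disrupt
$ kubectl get pdb -A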

@njtran self-assigned this Sep 20, 2024
njtran (Contributor) commented Oct 7, 2024

@wmgroot any thoughts here? Did you get a chance to validate what I was saying?

wmgroot (Contributor, Author) commented Oct 8, 2024

@cnmcavoy identified a bug in Karpenter's logic that tracks nodes marked for deletion. There's an error case that can fail to unmark a marked node, resulting in disruption budgets being exhausted while no progress is made. We've got a patch that we've been testing for the last week and plan to open a PR for it soon.

We think that TGP is not directly related to this problem, but was exacerbating the issue since nodes in a terminating state take up space in the disruption budget while they're pending termination.

wmgroot (Contributor, Author) commented Oct 8, 2024

Ultimately, the bug was introduced in some of our own patched code that addresses issues with single-node and multi-node consolidation. We plan to work further with the maintainers on improvements to consolidation to avoid the need to run a forked version of Karpenter.

After addressing the bug in our patch, we have seen our disruption frequencies and cluster scale return to pre-v1 levels. We'll re-open or create a new issue if we notice anything else amiss with TGP and drift disruption.

@wmgroot closed this as completed Oct 8, 2024