Karpenter stuck with pod Pod should schedule on: nodeclaim/... #432

Open

darklight147 opened this issue Jul 11, 2024 · 6 comments

@darklight147
Version

Karpenter Version: v0.0.0

Kubernetes Version: v1.0.0

Expected Behavior

Create a new node

Actual Behavior

[screenshot] Describing a pending pod shows the message above ("Pod should schedule on: nodeclaim/..."), but no new nodes are created.
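The message appears in the pending pod's events; a minimal way to view them (the pod name below is a hypothetical placeholder, not taken from the issue):

# Show the pending pod's events, including the "Pod should schedule on: nodeclaim/..." message.
# "nginx-6d6565499c-abcde" is an illustrative pod name.
kubectl describe pod nginx-6d6565499c-abcde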

Steps to Reproduce the Problem

AKS Cluster with node auto provisioning enabled

Scale a deployment (Nginx, for example) to 20 replicas with a memory request of 8Gi; see the sketch below.
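A minimal sketch of such a deployment, assuming a generic Nginx image (the name, labels, and image tag are illustrative, not taken from the issue):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-scale-test   # hypothetical name
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx-scale-test
  template:
    metadata:
      labels:
        app: nginx-scale-test
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        resources:
          requests:
            memory: 8Gi

With 20 replicas requesting 8Gi each (roughly 160Gi in total), the pending pods exceed the existing capacity shown in the NodePool status below (106086Mi), so Karpenter should provision additional nodes.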

Resource Specs and Logs

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "12393960163388511505"
    karpenter.sh/nodepool-hash-version: v2
    kubernetes.io/description: General purpose NodePool for generic workloads
    meta.helm.sh/release-name: aks-managed-karpenter-overlay
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-07-11T00:34:44Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: karpenter-overlay-main-adapter-helmrelease
    helm.toolkit.fluxcd.io/namespace: 668f23a48709cf00012ccf73
  name: default
  resourceVersion: "485239"
  uid: 19ab7243-b704-4a1c-b9d8-279243f12865
spec:
  disruption:
    budgets:
    - nodes: 100%
    consolidationPolicy: WhenUnderutilized
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - D
status:
  resources:
    cpu: "16"
    ephemeral-storage: 128G
    memory: 106086Mi
    pods: "110"

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Bryce-Soghigian (Collaborator) commented Jul 11, 2024

Looking at the cluster ID you shared, I see logs like: "creating instance, insufficient capacity: regional on-demand vCPU quota limit for subscription has been reached. To scale beyond this limit, please review the quota increase process here: https://learn.microsoft.com/en-us/azure/quotas/regional-quota-requests"

If you do kubectl get events | grep karp, do you see any events like this?
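For example, a quick way to run that check across all namespaces, sorted by time (the grep pattern is just an illustration):

# List recent events cluster-wide and filter for Karpenter-related ones.
kubectl get events -A --sort-by=.lastTimestamp | grep -i karpenter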

@Bryce-Soghigian (Collaborator) commented Jul 11, 2024

You can unblock your scale-up by requesting additional quota for the subscription in that region, following the steps at https://learn.microsoft.com/en-us/azure/quotas/regional-quota-requests
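As a quick check before requesting an increase, current regional compute quota usage can be listed with the Azure CLI (the region name below is a placeholder):

# Show per-family vCPU usage and limits for the region, e.g. the D-series family relevant to this NodePool.
az vm list-usage --location westeurope -o table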

@darklight147 (Author) commented Jul 12, 2024

@Bryce-Soghigian
The command returns no events.

[screenshot: empty output of kubectl get events | grep karp]

Also, here is the current quota, sorted by current usage:

[screenshot: Azure quota usage]

@darklight147 (Author)

@Bryce-Soghigian hey, any update on this? Thank you 🚀

@maulik13

We are seeing a similar behaviour in our cluster. Node claims are created, but they do not get into the ready state. We do not see anything special in the events; they only say "Pod should schedule on: nodeclaim/app-g5gcw" and "Cannot disrupt NodeClaim" for existing nodes.

We have also checked that we have not reached our quota.
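One way to see why a claim is stuck is to inspect its status conditions (the nodeclaim name is taken from the event message above; the condition names are typically Karpenter's Launched/Registered/Initialized set):

# List node claims and their readiness.
kubectl get nodeclaims
# Inspect the stuck claim's status conditions and recent events.
kubectl describe nodeclaim app-g5gcw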

@maulik13

Ref: #438. Running az aks update -n cluster -g rg fixed the issue for us.
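For anyone hitting the same state, the workaround above is just a no-op update, which appears to trigger a reconcile of the managed cluster (the cluster and resource group names below are placeholders):

# Reconcile the AKS managed cluster without changing any settings.
az aks update -n <cluster-name> -g <resource-group>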
