
Node Claim failing - Private AKS cluster BYO Subnet #459

Open
krishnpr opened this issue Aug 14, 2024 · 2 comments
@krishnpr

Version

Karpenter Version: v0.5.1

Kubernetes Version: v1.28.10

Expected Behavior

It should provision a node for the NodeClaim without any errors, following the README steps:

https://github.com/Azure/karpenter-provider-azure/blob/main/README.md#create-nodepool
https://github.com/Azure/karpenter-provider-azure/blob/main/README.md#scale-up-deployment

Actual Behavior

It keeps creating a node and then deleting it, and is unable to update the following values:
  • customData
  • zone
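
The churn can be observed by watching the NodeClaim objects; something like the following (the claim name is just one example taken from the logs below):

```sh
# Watch NodeClaims being created and removed in a loop
kubectl get nodeclaims -w

# Inspect the status and conditions of one of the short-lived claims
kubectl describe nodeclaim general-purpose-jcswr
```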

Steps to Reproduce the Problem

Followed the README step by step (create the NodePool, then scale up the sample deployment); roughly the commands shown below.
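
Roughly the commands used (the manifest file names below are local copies of the README examples, not exact repo paths):

```sh
# NodePool / AKSNodeClass manifest from the README's "Create NodePool" section
kubectl apply -f general-purpose.yaml

# Sample "inflate" workload from the README's "Scale up deployment" section
kubectl apply -f inflate.yaml
kubectl scale deployment inflate --replicas 5
```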

Resource Specs and Logs

{"level":"DEBUG","time":"2024-08-14T17:17:57.880Z","logger":"controller.provisioner","message":"25 out of 398 instance types were excluded because they would breach limits","commit":"46b4276","nodepool":"general-purpose"}
{"level":"INFO","time":"2024-08-14T17:17:57.883Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"46b4276","pods":"default/inflate-5f57665f58-66xht","duration":"14.67177ms"}
{"level":"INFO","time":"2024-08-14T17:17:57.883Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"46b4276","nodeclaims":1,"pods":1}
{"level":"DEBUG","time":"2024-08-14T17:17:57.884Z","logger":"controller.provisioner","message":"25 out of 398 instance types were excluded because they would breach limits","commit":"46b4276","nodepool":"general-purpose"}
{"level":"INFO","time":"2024-08-14T17:17:57.887Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"46b4276","pods":"default/inflate-5f57665f58-66xht","duration":"20.233186ms"}
{"level":"INFO","time":"2024-08-14T17:17:57.887Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"46b4276","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-08-14T17:17:57.900Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"46b4276","nodepool":"general-purpose","nodeclaim":"general-purpose-s5mr4","requests":{"cpu":"1400m","memory":"1020Mi","pods":"7"},"instance-types":"Standard_D11_v2, Standard_D12_v2, Standard_D13_v2, Standard_D16ls_v5, Standard_D2_v2 and 55 other(s)"}
{"level":"INFO","time":"2024-08-14T17:17:57.912Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"46b4276","nodepool":"general-purpose","nodeclaim":"general-purpose-jcswr","requests":{"cpu":"1400m","memory":"1020Mi","pods":"7"},"instance-types":"Standard_D11_v2, Standard_D12_v2, Standard_D13_v2, Standard_D2_v2, Standard_D2_v3 and 55 other(s)"}
{"level":"INFO","time":"2024-08-14T17:17:57.920Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_D2ls_v5","commit":"46b4276","nodeclaim":"general-purpose-s5mr4"}
{"level":"INFO","time":"2024-08-14T17:17:57.921Z","logger":"controller.nodeclaim.lifecycle","message":"Resolved image /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2containerd/versions/202408.12.0 for instance type Standard_D2ls_v5","commit":"46b4276","nodeclaim":"general-purpose-s5mr4"}
{"level":"DEBUG","time":"2024-08-14T17:17:57.921Z","logger":"controller.nodeclaim.lifecycle","message":"Returning 1 IPv4 backend pools: [/subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes]","commit":"46b4276","nodeclaim":"general-purpose-s5mr4"}
{"level":"DEBUG","time":"2024-08-14T17:17:57.921Z","logger":"controller.nodeclaim.lifecycle","message":"Creating network interface aks-general-purpose-s5mr4","commit":"46b4276","nodeclaim":"general-purpose-s5mr4"}
{"level":"INFO","time":"2024-08-14T17:17:57.932Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_D2ls_v5","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"INFO","time":"2024-08-14T17:17:57.932Z","logger":"controller.nodeclaim.lifecycle","message":"Resolved image /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2containerd/versions/202408.12.0 for instance type Standard_D2ls_v5","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"DEBUG","time":"2024-08-14T17:17:57.932Z","logger":"controller.nodeclaim.lifecycle","message":"Returning 1 IPv4 backend pools: [/subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes]","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"DEBUG","time":"2024-08-14T17:17:57.932Z","logger":"controller.nodeclaim.lifecycle","message":"Creating network interface aks-general-purpose-jcswr","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"INFO","time":"2024-08-14T17:17:57.953Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_D2ls_v5","commit":"46b4276","nodeclaim":"general-purpose-s5mr4"}
{"level":"INFO","time":"2024-08-14T17:17:57.955Z","logger":"controller.nodeclaim.lifecycle","message":"Resolved image /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2containerd/versions/202408.12.0 for instance type Standard_D2ls_v5","commit":"46b4276","nodeclaim":"general-purpose-s5mr4"}
{"level":"DEBUG","time":"2024-08-14T17:17:57.956Z","logger":"controller.nodeclaim.lifecycle","message":"Returning 1 IPv4 backend pools: [/subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes]","commit":"46b4276","nodeclaim":"general-purpose-s5mr4"}
{"level":"DEBUG","time":"2024-08-14T17:17:57.956Z","logger":"controller.nodeclaim.lifecycle","message":"Creating network interface aks-general-purpose-s5mr4","commit":"46b4276","nodeclaim":"general-purpose-s5mr4"}
{"level":"INFO","time":"2024-08-14T17:17:57.963Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_D2ls_v5","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"INFO","time":"2024-08-14T17:17:57.963Z","logger":"controller.nodeclaim.lifecycle","message":"Resolved image /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2containerd/versions/202408.12.0 for instance type Standard_D2ls_v5","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"DEBUG","time":"2024-08-14T17:17:57.963Z","logger":"controller.nodeclaim.lifecycle","message":"Returning 1 IPv4 backend pools: [/subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes]","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"DEBUG","time":"2024-08-14T17:17:57.963Z","logger":"controller.nodeclaim.lifecycle","message":"Creating network interface aks-general-purpose-jcswr","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"DEBUG","time":"2024-08-14T17:18:02.962Z","logger":"controller.nodeclaim.lifecycle","message":"Successfully created network interface: /subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Network/networkInterfaces/aks-general-purpose-jcswr","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"DEBUG","time":"2024-08-14T17:18:02.962Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine aks-general-purpose-jcswr (Standard_D2ls_v5)","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"DEBUG","time":"2024-08-14T17:18:03.147Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"46b4276"}
{"level":"DEBUG","time":"2024-08-14T17:18:03.180Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"46b4276"}
{"level":"DEBUG","time":"2024-08-14T17:18:03.287Z","logger":"controller.nodeclaim.lifecycle","message":"Successfully created network interface: /subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Network/networkInterfaces/aks-general-purpose-s5mr4","commit":"46b4276","nodeclaim":"general-purpose-s5mr4"}
{"level":"DEBUG","time":"2024-08-14T17:18:03.287Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine aks-general-purpose-s5mr4 (Standard_D2ls_v5)","commit":"46b4276","nodeclaim":"general-purpose-s5mr4"}
{"level":"ERROR","time":"2024-08-14T17:18:03.740Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine "aks-general-purpose-jcswr" failed: PUT https://management.azure.com/subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Compute/virtualMachines/aks-general-purpose-jcswr\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: NotFound\n--------------------------------------------------------------------------------\n{\n "error": {\n "details": [],\n "code": "NotFound",\n "message": "Resource /subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Network/networkInterfaces/aks-general-purpose-jcswr not found."\n }\n}\n--------------------------------------------------------------------------------\n","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"ERROR","time":"2024-08-14T17:18:03.892Z","logger":"controller.nodeclaim.lifecycle","message":"networkInterface.Delete for aks-general-purpose-jcswr failed: GET https://management.azure.com/subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Network/networkInterfaces/aks-general-purpose-jcswr\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: NotFound\n--------------------------------------------------------------------------------\n{\n "error": {\n "code": "NotFound",\n "message": "Resource /subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Network/networkInterfaces/aks-general-purpose-jcswr not found.",\n "details": []\n }\n}\n--------------------------------------------------------------------------------\n","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"ERROR","time":"2024-08-14T17:18:03.892Z","logger":"controller.nodeclaim.lifecycle","message":"failed to cleanup resources for node claim general-purpose-jcswr, %!w(*errors.joinError=&{[0xc0018ffc00]})","commit":"46b4276","nodeclaim":"general-purpose-jcswr"}
{"level":"DEBUG","time":"2024-08-14T17:18:04.147Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"46b4276"}
{"level":"DEBUG","time":"2024-08-14T17:18:04.181Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"46b4276"}
{"level":"ERROR","time":"2024-08-14T17:18:04.920Z","logger":"controller","message":"Reconciler error","commit":"46b4276","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"general-purpose-jcswr"},"namespace":"","name":"general-purpose-jcswr","reconcileID":"280cbf2e-6f89-4c35-9f43-0cce6f7bf481","error":"launching nodeclaim, creating instance, virtualMachine.BeginCreateOrUpdate for VM "aks-general-purpose-jcswr" failed: PUT https://management.azure.com/subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Compute/virtualMachines/aks-general-purpose-jcswr\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: NotFound\n--------------------------------------------------------------------------------\n{\n "error": {\n "details": [],\n "code": "NotFound",\n "message": "Resource /subscriptions/xxxxxxxxxxxxxxxx/resourceGroups/mc_xxxxxxxxxxxxxxxx/providers/Microsoft.Network/networkInterfaces/aks-general-purpose-jcswr not found."\n }\n}\n--------------------------------------------------------------------------------\n"}
{"level":"DEBUG","time":"2024-08-14T17:18:05.148Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"46b4276"}
{"level":"DEBUG","time":"2024-08-14T17:18:05.182Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"46b4276"}

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@tallaxes added the area/networking and needs-triage labels on Sep 30, 2024
@h4wkmoon

Karpenter wants the CLUSTER_ENDPOINT, but for private clusters this DNS record can't be resolved from the nodes (as seen in the kubelet logs). AKS itself uses the cluster's private endpoint IP, not the FQDN.

IMO, the main issue is that Karpenter requires the hostname part of CLUSTER_ENDPOINT to be more than 33 characters. For private clusters, the first 33 characters mean nothing, and this DNS record can't be used from the nodes.

As ugly as it is, I tried manually changing the cluster endpoint in /var/lib/kubelet/bootstrap-kubeconfig and /var/lib/kubelet/kubeconfig, setting it to the IP of the API server, just as AKS does. It works: the node registers, works, and stays.

This proves that Karpenter can work with private clusters if the 33-character constraint is removed.
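
Roughly, the manual change on the node looked like the following (the IP is only an example; use your cluster's private API server address):

```sh
# Point kubelet at the API server's private IP instead of the unresolvable private FQDN
API_SERVER_IP=10.224.0.4   # example value, not from this cluster
sudo sed -i "s|server: https://.*|server: https://${API_SERVER_IP}:443|" \
  /var/lib/kubelet/bootstrap-kubeconfig /var/lib/kubelet/kubeconfig
sudo systemctl restart kubelet
```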

@Bryce-Soghigian self-assigned this on Oct 21, 2024
@krishnpr
Author

I am unable to reach a stable state because nodes are continuously being created and reclaimed. This leads to a loop where the nodes, along with related resources like network interfaces and disks, are repeatedly created and deleted. The only way to stop the loop is to scale down to 0 or delete the pods/deployments.
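
For example, with the sample workload from the reproduction above:

```sh
# Remove the demand so Karpenter stops trying to provision new capacity
kubectl scale deployment inflate --replicas 0

# or delete the test deployment entirely
kubectl delete deployment inflate
```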

Additionally, we're using this feature for multi-tenant clusters, and I’m unsure how DNS could be contributing to this issue.
