Commit 2418ea1

Merge pull request #10200 from ryanberg-aquent/nap-tsg-update
AB#8179: PR#1947 - Troubleshoot node auto provisioning (NAP) in Azure Kubernetes Service (AKS) (Nap tsg update migration from public to private repo)
2 parents 3109126 + 45e484c commit 2418ea1

File tree: 2 files changed (+375 -0 lines changed)

Lines changed: 373 additions & 0 deletions
@@ -0,0 +1,373 @@

---
title: Troubleshoot Node Auto-Provisioning Managed Add-on
description: Learn how to troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS).
ms.service: azure-kubernetes-service
author: JarrettRenshaw
ms.author: jarrettr
manager: dcscontentpm
ms.topic: troubleshooting
ms.date: 09/05/2025
editor: bsoghigian
ms.reviewer: phwilson, v-ryanberg, v-gsitser
#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve node auto-provisioning managed add-ons so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
ms.custom: sap:Extensions, Policies and Add-Ons
---

# Troubleshoot node auto-provisioning (NAP) in Azure Kubernetes Service (AKS)

This article discusses how to troubleshoot node auto-provisioning (NAP). NAP is a managed add-on that's based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure, and manages scaling events at the virtual machine (VM) or node level.

When you enable NAP, you might encounter issues that are associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common issues that affect NAP but aren't covered in the Karpenter [FAQ][karpenter-faq] or [troubleshooting guide][karpenter-troubleshooting].

## Prerequisites

Make sure that the following tools are installed and configured:

- The [Azure Command-Line Interface (CLI)](/cli/azure/install-azure-cli).
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) command-line client. To install kubectl by using the Azure CLI, run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
- NAP, enabled on your cluster. For more information, see the [node auto provisioning documentation][nap-main-docs].

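If you're not sure whether NAP is enabled, the following sketch shows one way to turn it on for an existing cluster. The resource group and cluster names are placeholders, and the `--node-provisioning-mode` flag might require a recent Azure CLI version:

```azurecli-interactive
# Install kubectl by using the Azure CLI
az aks install-cli

# Enable node auto-provisioning (NAP) on an existing cluster
az aks update --resource-group <resource-group> --name <cluster-name> --node-provisioning-mode Auto
```
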
## Common issues

### Nodes aren't removed

**Symptoms**

Underused or empty nodes remain in the cluster longer than you expect.

**Debugging steps**

1. **Check node usage**

    Run the following command:

    ```azurecli-interactive
    kubectl top nodes
    kubectl describe node <node-name>
    ```

    You can also use the open-source [AKS Node Viewer](https://github.com/Azure/aks-node-viewer) tool to visualize node usage.

2. **Look for blocking pods**

    Run the following command:

    ```azurecli-interactive
    kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
    ```

3. **Check for disruption blocks**

    Run the following command:

    ```azurecli-interactive
    kubectl get events | grep -i "disruption\|consolidation"
    ```

**Cause**

Common causes include:

- Pods that have no proper tolerations
- DaemonSets that prevent drain
- Pod disruption budgets (PDBs) that aren't correctly set
- Nodes that are marked by a `do-not-disrupt` annotation
- Locks that block changes

**Solution**

Possible solutions include:

- Add proper tolerations to pods.
- Review `DaemonSet` configurations.
- Adjust PDBs to allow disruption.
- Remove the `do-not-disrupt` annotations, as appropriate (see the example after this list).
- Review lock configurations.

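The following sketch shows one way to check for and clear these blockers. The pod, node, PDB, and namespace names are placeholders, and the PDB patch assumes the budget uses `maxUnavailable`:

```azurecli-interactive
# Check whether a pod opts out of disruption through the karpenter.sh/do-not-disrupt annotation
kubectl get pod <pod-name> -o yaml | grep "karpenter.sh/do-not-disrupt"

# Remove the annotation from a pod or node that can safely be disrupted
kubectl annotate pod <pod-name> karpenter.sh/do-not-disrupt-
kubectl annotate node <node-name> karpenter.sh/do-not-disrupt-

# Review PDBs and allow at least one disruption (only for a PDB that uses maxUnavailable)
kubectl get pdb --all-namespaces
kubectl patch pdb <pdb-name> -n <namespace> --type merge -p '{"spec":{"maxUnavailable":1}}'
```
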
## Networking issues

For most networking-related issues, you can use either of the available levels of networking observability:

- [Container network metrics][aks-container-metrics] (default): Enables observation of node-level metrics.
- [Advanced container network metrics][advanced-container-network-metrics]: Enables observation of pod-level metrics, including fully qualified domain name (FQDN) metrics for troubleshooting.

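As a sketch (flag availability depends on your Azure CLI version), pod-level metrics are typically enabled by turning on Advanced Container Networking Services for the cluster:

```azurecli-interactive
# Enable Advanced Container Networking Services to get pod-level and FQDN metrics
az aks update --resource-group <resource-group> --name <cluster-name> --enable-acns
```
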
### Pod connectivity issues

**Symptoms**

Pods can't communicate with other pods or external services.

**Debugging steps**

1. **Test basic connectivity**

    Run the following command:

    ```azurecli-interactive
    # From within a pod
    kubectl exec -it <pod-name> -- ping <target-ip>
    kubectl exec -it <pod-name> -- nslookup kubernetes.default
    ```

    Another option is to use the open-source [goldpinger](https://github.com/bloomberg/goldpinger) tool.

2. **Check network plugin status**

    Run the following command:

    ```azurecli-interactive
    kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
    ```

    If you're using Azure Container Networking Interface (CNI) in overlay mode, verify that your nodes have these labels:

    ```yaml
    kubernetes.azure.com/azure-cni-overlay: "true"
    kubernetes.azure.com/network-name: aks-vnet-<redacted>
    kubernetes.azure.com/network-resourcegroup: <redacted>
    kubernetes.azure.com/network-subscription: <redacted>
    ```

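    For example, you can list the labels across all nodes and filter for the overlay label:

    ```azurecli-interactive
    kubectl get nodes --show-labels | grep azure-cni-overlay
    ```
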
3. **Verify the CNI configuration files**

    The CNI conflist files define network plugin configurations. Check which files are present:

    ```azurecli-interactive
    # List CNI configuration files
    ls -la /etc/cni/net.d/

    # Example output:
    # 10-azure.conflist 15-azure-swift-overlay.conflist
    ```

    **Understanding conflist files**

    This scenario includes two kinds of conflist files:

    - `10-azure.conflist`: Standard Azure CNI configuration for traditional (non-overlay) networking.
    - `15-azure-swift-overlay.conflist`: Azure CNI Overlay networking (used by Cilium or in overlay mode).

    **Inspect the configuration content**

    Run the following command:

    ```azurecli-interactive
    # Check the actual CNI configuration
    cat /etc/cni/net.d/*.conflist

    # Look for key fields:
    # - "type": should be "azure-vnet" for Azure CNI
    # - "mode": "bridge" for standard, "transparent" for overlay
    # - "ipam": IP address management configuration
    ```

    **Common conflist issues**

    Common conflist issues include:

    - Missing or corrupted configuration files
    - Incorrect network mode for your cluster setup
    - Mismatched IP Address Management (IPAM) configuration
    - Wrong plugin order in the configuration chain

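    To rule out missing or malformed files, the following sketch (which assumes `jq` is available on the node) parses each conflist and prints its plugin chain:

    ```azurecli-interactive
    # Validate that each conflist file is well-formed JSON and list its plugin types
    for f in /etc/cni/net.d/*.conflist; do
      echo "== $f"
      jq -r '.plugins[].type' "$f" || echo "Could not parse $f"
    done
    ```
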
4. **Check CNI-to-Azure Container Networking Service (CNS) communication**

    Run the following command:

    ```azurecli-interactive
    # Check CNS logs for IP allocation requests from CNI
    kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100
    ```

    **CNI-to-CNS troubleshooting**

    - **If CNS logs show "no IPs available"**: Indicates a problem with the CNS or AKS watch on the NodeNetworkConfig (NNC) custom resource.
    - **If CNI calls don't appear in CNS logs**: Usually indicates that the wrong CNI is installed. Verify that the correct CNI plugin is deployed.

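    As a sketch (the resource name and namespace can vary by CNI version), you can inspect the NodeNetworkConfig resources directly to see how many IPs were requested and allocated for a node:

    ```azurecli-interactive
    # List NodeNetworkConfig (NNC) resources; each one corresponds to a node
    kubectl get nodenetworkconfigs -n kube-system

    # Inspect the allocation status for a specific node's NNC
    kubectl describe nodenetworkconfig <node-name> -n kube-system
    ```
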
**Causes**

Common causes include:

- Network security group (NSG) rules
- Incorrect subnet configuration
- CNI plugin issues
- DNS resolution problems

**Solution**

Possible solutions include:

- Review the [Network Security Group][network-security-group-docs] rules for required traffic.
- Verify the subnet configuration in `AKSNodeClass`. For more information, see [AKSNodeClass documentation][aksnodeclass-subnet-config].
- Restart the CNI plugin pods (see the sketch after this list).
- Check the `CoreDNS` configuration. For more information, see [CoreDNS documentation][coredns-troubleshoot].

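For the subnet and restart items, the following sketch shows one approach. The `AKSNodeClass` name `default` is a placeholder, and the field layout depends on the Karpenter Azure provider version:

```azurecli-interactive
# Inspect the subnet configured on an AKSNodeClass
kubectl get aksnodeclass default -o yaml | grep -i subnet

# Restart the Azure CNS pods; the DaemonSet re-creates them
kubectl delete pod -n kube-system -l k8s-app=azure-cns
```
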
### DNS service IP issues

> [!NOTE]
> The `--dns-service-ip` parameter is supported only for NAP clusters and isn't available for self-hosted Karpenter installations.

**Symptoms**

Pods can't resolve DNS names, or kubelet doesn't register with the API server because of DNS resolution failures.

**Debugging steps**

1. **Check kubelet DNS configuration**

    Run the following command:

    ```azurecli-interactive
    # SSH to the Karpenter node and check kubelet config
    sudo cat /var/lib/kubelet/config.yaml | grep -A 5 clusterDNS

    # Expected output should show the correct DNS service IP
    # clusterDNS:
    # - "10.0.0.10"  # This should match your cluster's DNS service IP
    ```

2. **Verify that the DNS service IP matches the cluster configuration**

    Run the following command:

    ```azurecli-interactive
    # Get the actual DNS service IP from your cluster
    kubectl get service -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'

    # Compare with what AKS reports
    az aks show --resource-group <rg> --name <cluster-name> --query "networkProfile.dnsServiceIp" -o tsv
    ```

3. **Test DNS resolution from the node**

    Run the following command:

    ```azurecli-interactive
    # SSH to the Karpenter node and test DNS resolution
    # Test using the DNS service IP directly
    dig @10.0.0.10 kubernetes.default.svc.cluster.local

    # Test using the system resolver
    nslookup kubernetes.default.svc.cluster.local

    # Test external DNS resolution
    dig azure.com
    ```

4. **Check DNS pods status**

    Run the following command:

    ```azurecli-interactive
    # Verify CoreDNS pods are running
    kubectl get pods -n kube-system -l k8s-app=kube-dns

    # Check CoreDNS logs for errors
    kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
    ```

5. **Verify network connectivity to the DNS service**

    Run the following command:

    ```azurecli-interactive
    # From the Karpenter node, test connectivity to the DNS service
    telnet 10.0.0.10 53  # Replace with your actual DNS service IP
    # Or use nc if telnet isn't available
    nc -zv 10.0.0.10 53
    ```

**Cause**

Common causes include:

- The `--dns-service-ip` parameter in `AKSNodeClass` is incorrect.
- The DNS service IP isn't in the service Classless Inter-Domain Routing (CIDR) range.
- Network connectivity issues exist between the node and the DNS service.
- `CoreDNS` pods aren't running or are misconfigured.
- Firewall rules block DNS traffic.

**Solution**

Possible solutions include:

- Verify that the `--dns-service-ip` value matches the actual DNS service. To verify, run the following command:

  ```azurecli-interactive
  kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'
  ```

- Make sure that the DNS service IP is within the service CIDR range that you specified during cluster creation.
- Check whether Karpenter nodes can reach the service subnets.
- Restart `CoreDNS` pods if they're in an error state. To restart, run the following command:

  ```azurecli-interactive
  kubectl rollout restart deployment/coredns -n kube-system
  ```

- Verify that NSG rules allow traffic on port 53 (TCP and UDP).
- Run a connectivity analysis by using the [Azure Virtual Network Verifier](/azure/virtual-network-manager/overview) to verify outbound connectivity.

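As a quick cross-check for the first two items, you can pull the DNS service IP and the service CIDR in a single query and confirm that the IP falls inside the range:

```azurecli-interactive
# Show the DNS service IP and service CIDR side by side
az aks show --resource-group <rg> --name <cluster-name> \
  --query "{dnsServiceIp:networkProfile.dnsServiceIp, serviceCidr:networkProfile.serviceCidr}" -o table
```
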
## Azure-specific issues

### Spot VM issues

**Symptoms**

Unexpected node terminations occur when you use spot instances.

**Debugging steps**

1. **Check node events**

    Run the following command:

    ```azurecli-interactive
    kubectl get events | grep -i "spot\|evict"
    ```

2. **Monitor spot VM pricing**

    Run the following command:

    ```azurecli-interactive
    az vm list-sizes --location <region> --query "[?contains(name, 'Standard_D2s_v3')]"
    ```

**Solution**

Possible solutions include:

- Use diverse instance types for better availability.
- Implement proper pod disruption budgets.
- Consider mixed spot and on-demand strategies.
- Use workloads that are tolerant of node preemption.

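To see how your current mix looks, you can list each NAP-provisioned node's capacity type and VM size. This sketch assumes the upstream Karpenter and Kubernetes node label conventions:

```azurecli-interactive
# Show whether each node is spot or on-demand, and its VM size
kubectl get nodes -L karpenter.sh/capacity-type -L node.kubernetes.io/instance-type
```
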
### Quota exceeded

**Symptoms**

VM creation fails and generates "quota exceeded" errors.

**Debugging steps**

1. **Check current quota usage**

    Run the following command:

    ```azurecli-interactive
    az vm list-usage --location <region> --query "[?currentValue >= limit]"
    ```

**Solution**

Possible solutions include:

- Request quota increases through the Azure portal.
- Expand your NodePool custom resource definitions (CRDs) to include more VM sizes (see the sketch after this list). For more information, see the [NodePool configuration documentation][nap-nodepool-docs]. For example, a NodePool that allows several D-family VM sizes is less likely to hit a quota limit that blocks VM creation than a NodePool that's pinned to a single VM size.

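The following sketch (the NodePool name `default` is a placeholder) shows how you might check which VM sizes a NodePool currently allows, alongside the remaining regional quota:

```azurecli-interactive
# List the scheduling requirements (including allowed VM sizes or families) on a NodePool
kubectl get nodepool default -o jsonpath='{.spec.template.spec.requirements}'

# Check regional vCPU usage against quota for the D family
az vm list-usage --location <region> -o table | grep -i "Dv3"
```
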
[!INCLUDE [Third-party disclaimer](~/includes/third-party-disclaimer.md)]

[!INCLUDE [Third-party contact disclaimer](~/includes/third-party-contact-disclaimer.md)]

[!INCLUDE [Azure Help Support](~/includes/azure-help-support.md)]

support/azure/azure-kubernetes/toc.yml

Lines changed: 2 additions & 0 deletions

@@ -257,6 +257,8 @@ items:
      href: extensions/troubleshoot-managed-namespaces.md
    - name: Troubleshoot network isolated clusters
      href: extensions/troubleshoot-network-isolated-cluster.md
+   - name: Troubleshoot node auto provisioning
+     href: extensions/troubleshoot-node-auto-provision.md
    - name: KEDA add-on
      items:
      - name: Breaking changes in KEDA add-on 2.15 and 2.14
