Deploy NVIDIA Dynamo on Azure Kubernetes Service with H100 GPU nodes for disaggregated LLM inference, including KEDA autoscaling driven by TTFT p95 latency via Azure Managed Prometheus.
```bash
# 0. Prerequisites: set your NGC API key (https://ngc.nvidia.com/setup/api-key)
cp .envrc.example .envrc # or edit .envrc directly
export NGC_API_KEY=<your-ngc-api-key>
# 1. Create the AKS cluster (OIDC + Workload Identity + Managed GPU + Azure Monitor)
./setup.sh -x create-cluster
# 2. Install Dynamo Platform (operator + NATS + CRDs + NGC pull secret)
source .envrc
./install-dynamo.sh -x install
# 3. Install KEDA + wire Azure Managed Prometheus autoscaling
./install-keda.sh -x install
# 4. Expose the vllm-agg frontend
kubectl patch svc vllm-agg-frontend -n dynamo-cloud \
-p '{"spec":{"type":"LoadBalancer","ports":[{"port":8000,"targetPort":8000}]}}'
# 5. Wait for EXTERNAL-IP, then smoke test
EXTERNAL_IP=$(kubectl get svc vllm-agg-frontend -n dynamo-cloud \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl http://$EXTERNAL_IP:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Hi!"}]}'
# 6. Load test — watch KEDA scale decode workers as TTFT p95 exceeds 300 ms
aiperf --base-url http://$EXTERNAL_IP:8000 \
--model Qwen/Qwen3-0.6B \
--num-prompts 200 --concurrency 32
kubectl get hpa -n dynamo-cloud -w
```
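
Before load testing, it can help to confirm the autoscaling signal actually exists in AMP, using the same bearer-token flow KEDA uses. A minimal sketch, assuming your CLI identity has Monitoring Data Reader on the workspace and that the TTFT histogram is named as shown (confirm against the frontend's `/metrics` output):

```bash
# Query TTFT p95 from Azure Managed Prometheus directly (sketch).
# PROMETHEUS_ENDPOINT comes from .envrc; the metric name is an assumption.
TOKEN=$(az account get-access-token \
  --resource https://prometheus.monitor.azure.com \
  --query accessToken -o tsv)
curl -s -H "Authorization: Bearer $TOKEN" \
  "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket[2m])) by (le))'
```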
```
.
├── .envrc                  # Environment defaults (source before running scripts)
├── setup.sh                # Create AKS cluster (system + H100 GPU pool)
├── install-dynamo.sh       # Install Dynamo Platform v1.0.2 via Helm
├── install-keda.sh         # Install KEDA + Azure Workload Identity + apply manifests
├── manifests/
│   ├── agg.yaml            # DynamoGraphDeployment: Frontend×2 + VllmDecodeWorker×2
│   ├── autoscale_ttf.yaml  # KEDA ScaledObject: TTFT p95 → 2–4 decode workers
│   ├── config_prome.yaml   # AMA ConfigMap: enable pod-annotation scraping (dynamo-cloud)
│   └── quickstart.yaml     # DynamoGraphDeploymentRequest (Qwen3-0.6B, single GPU)
└── blog/
    └── 2026-05-08/
        └── nvidia-dynamo-on-aks/
            └── index.md    # Blog post (Docusaurus format)
```
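
`manifests/autoscale_ttf.yaml` is what turns the TTFT signal into replica counts. A minimal sketch of its shape, assuming the deployment name, metric name, and endpoint shown here (the file in this repo is authoritative; check `kubectl get deploy -n dynamo-cloud` for the real scale target):

```bash
# Sketch only — field values below are assumptions, not the repo's exact manifest.
kubectl apply -n dynamo-cloud -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-prometheus-auth
spec:
  podIdentity:
    provider: azure-workload   # do NOT add authModes here (see the notes at the end)
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-decode-ttft
spec:
  scaleTargetRef:
    name: vllm-agg-vllmdecodeworker   # assumed deployment name
  minReplicaCount: 2
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://<amw>.eastus2.prometheus.monitor.azure.com  # = PROMETHEUS_ENDPOINT
        query: histogram_quantile(0.95, sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket[2m])) by (le))
        threshold: "0.3"              # 300 ms TTFT p95
      authenticationRef:
        name: azure-prometheus-auth
EOF
```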
| Setting | Value |
|---|---|
| Cluster name | dicasati-dynamo |
| Resource group | rg-dynamo |
| Location | eastus2 |
| Subscription | ME-MngEnv330367-dicasati-1 |
| Kubernetes version | 1.34.0 |
| System pool VM | Standard_D4ds_v5 (1 node) |
| GPU pool VM | Standard_NC40ads_H100_v5 (1–4) |
| GPU driver | AKS Managed GPU (--enable-managed-gpu) |
| OIDC / Workload ID | Enabled |
| Azure Monitor | Managed Prometheus (EastUS2) |
| Grafana | dynamo in rg-dynamo (westus3) |
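
`setup.sh` wraps the cluster creation; a minimal sketch of the equivalent `az` calls, distilled from the table above (the exact flag set in `setup.sh` is authoritative, the GPU pool name `gpupool` is a placeholder, and `--enable-managed-gpu` may require the aks-preview extension):

```bash
# Sketch of the cluster the table above describes; setup.sh is the source of truth.
az aks create -g rg-dynamo -n dicasati-dynamo -l eastus2 \
  --kubernetes-version 1.34.0 \
  --node-vm-size Standard_D4ds_v5 --node-count 1 \
  --enable-oidc-issuer --enable-workload-identity \
  --enable-azure-monitor-metrics \
  --generate-ssh-keys
az aks nodepool add -g rg-dynamo --cluster-name dicasati-dynamo -n gpupool \
  --node-vm-size Standard_NC40ads_H100_v5 \
  --enable-cluster-autoscaler --min-count 1 --max-count 4 \
  --enable-managed-gpu   # per the table above; preview feature
```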
```
aiperf load → Frontend (×2) → VllmDecodeWorker (×2–4)
                                       ↑
                             KEDA ScaledObject
                    (TTFT p95 > 300 ms → scale up)
                                       ↑
                   Azure Managed Prometheus (AMP)
                 (pod annotations scraped by AMA)
                                       ↑
                   UAMI keda-prometheus-reader
                 (Monitoring Data Reader on AMW)
           federated credential → keda-operator SA
```
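
`install-keda.sh` wires up the bottom half of this diagram. A minimal sketch of that identity chain, assuming KEDA runs in the `keda` namespace and using `<amw-name>` as a placeholder for the Azure Monitor workspace the script resolves:

```bash
# UAMI that KEDA assumes via workload identity (names match the diagram).
az identity create -g rg-dynamo -n keda-prometheus-reader
PRINCIPAL_ID=$(az identity show -g rg-dynamo -n keda-prometheus-reader --query principalId -o tsv)
CLIENT_ID=$(az identity show -g rg-dynamo -n keda-prometheus-reader --query clientId -o tsv)

# Grant it read access on the Azure Monitor workspace.
AMW_ID=$(az monitor account show -g rg-dynamo -n <amw-name> --query id -o tsv)
az role assignment create --assignee "$PRINCIPAL_ID" \
  --role "Monitoring Data Reader" --scope "$AMW_ID"

# Federate the UAMI to the keda-operator ServiceAccount and annotate it.
ISSUER=$(az aks show -g rg-dynamo -n dicasati-dynamo --query oidcIssuerProfile.issuerUrl -o tsv)
az identity federated-credential create -g rg-dynamo \
  --identity-name keda-prometheus-reader -n keda-operator \
  --issuer "$ISSUER" --subject system:serviceaccount:keda:keda-operator \
  --audiences api://AzureADTokenExchange
kubectl annotate serviceaccount keda-operator -n keda \
  azure.workload.identity/client-id="$CLIENT_ID" --overwrite
```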
| Variable | Description |
|---|---|
| `CLUSTER_NAME` | AKS cluster name |
| `RESOURCE_GROUP` | Resource group |
| `LOCATION` | Azure region |
| `KUBECONFIG` | Path to kubeconfig (defaults to `./cluster.config`) |
| `PROMETHEUS_ENDPOINT` | Azure Managed Prometheus query endpoint URL |
| `NGC_API_KEY` | NGC API key for nvcr.io image pulls (required) |
| `HF_TOKEN` | HuggingFace token (optional; needed for gated models) |
| `GRAFANA_NAME` | Azure Managed Grafana resource name |
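
A filled-in `.envrc` for the cluster in the tables above might look like this (the `PROMETHEUS_ENDPOINT` value is a placeholder; copy the query endpoint from your Azure Monitor workspace):

```bash
export CLUSTER_NAME=dicasati-dynamo
export RESOURCE_GROUP=rg-dynamo
export LOCATION=eastus2
export KUBECONFIG=./cluster.config
export PROMETHEUS_ENDPOINT=https://<amw>.eastus2.prometheus.monitor.azure.com
export NGC_API_KEY=<your-ngc-api-key>
export HF_TOKEN=<optional-hf-token>   # only needed for gated models
export GRAFANA_NAME=dynamo
```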
- `authModes: "bearer"` must NOT be set when using `provider: azure-workload` in a TriggerAuthentication. The workload identity provider handles auth automatically; setting it causes KEDA to look for a static secret and fail.
- The aks-preview extension overrides `az aks nodepool update` and injects a `gpuProfile.nvidia.managementMode` change that is rejected by the API. Use `az rest` with a PUT to the ARM API to enable the cluster autoscaler on GPU pools (see the sketch after this list).
- The `dynamo-crds` chart was removed as a standalone chart in Dynamo v1.0.x; the CRDs are now bundled inside `dynamo-platform`. `install-dynamo.sh` handles this gracefully.
- The NGC pull secret must be created after the `dynamo-system` namespace exists. `install-dynamo.sh` now creates the namespace first.
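
A minimal sketch of that `az rest` workaround, assuming `$SUBSCRIPTION_ID` is set, a GPU pool named `gpupool`, and a current-enough `api-version` (all placeholders; adjust to your environment):

```bash
# GET the agent pool, toggle autoscaling in the JSON, PUT it back via ARM.
POOL_URL="https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/rg-dynamo/providers/Microsoft.ContainerService/managedClusters/dicasati-dynamo/agentPools/gpupool?api-version=2024-05-01"
az rest --method get --url "$POOL_URL" > pool.json
jq '.properties.enableAutoScaling = true | .properties.minCount = 1 | .properties.maxCount = 4' \
  pool.json > pool-updated.json
az rest --method put --url "$POOL_URL" \
  --headers "Content-Type=application/json" \
  --body @pool-updated.json
```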