Deploy NVIDIA Dynamo on Azure Kubernetes Service with H100 GPU nodes for disaggregated LLM inference, including KEDA autoscaling driven by TTFT p95 latency via Azure Managed Prometheus.
```bash
# 0. Prerequisites: set your NGC API key (https://ngc.nvidia.com/setup/api-key)
cp .envrc.example .envrc # or edit .envrc directly
export NGC_API_KEY=<your-ngc-api-key>
# 1. Create the AKS cluster (OIDC + Workload Identity + Managed GPU + Azure Monitor)
./setup.sh -x create-cluster
# 2. Install Dynamo Platform (operator + NATS + CRDs + NGC pull secret)
source .envrc
./install-dynamo.sh -x install
# 3. Install KEDA + wire Azure Managed Prometheus autoscaling
./install-keda.sh -x install
# 4. Expose the vllm-agg frontend
kubectl patch svc vllm-agg-frontend -n dynamo-cloud \
-p '{"spec":{"type":"LoadBalancer","ports":[{"port":8000,"targetPort":8000}]}}'
# 5. Wait for EXTERNAL-IP, then smoke test
EXTERNAL_IP=$(kubectl get svc vllm-agg-frontend -n dynamo-cloud \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl http://$EXTERNAL_IP:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Hi!"}]}'
# 6. Load test — watch KEDA scale decode workers as TTFT p95 exceeds 300 ms
aiperf --base-url http://$EXTERNAL_IP:8000 \
--model Qwen/Qwen3-0.6B \
--num-prompts 200 --concurrency 32
kubectl get hpa -n dynamo-cloud -w
```
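
Before load testing, it can help to confirm the autoscaling signal actually exists in AMP, using the same bearer-token flow KEDA uses. A minimal sketch, assuming your CLI identity has Monitoring Data Reader on the workspace and that the TTFT histogram is named as shown (confirm against the frontend's `/metrics` output):

```bash
# Query TTFT p95 from Azure Managed Prometheus directly (sketch).
# PROMETHEUS_ENDPOINT comes from .envrc; the metric name is an assumption.
TOKEN=$(az account get-access-token \
  --resource https://prometheus.monitor.azure.com \
  --query accessToken -o tsv)
curl -s -H "Authorization: Bearer $TOKEN" \
  "$PROMETHEUS_ENDPOINT/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket[2m])) by (le))'
```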
```
.
├── .envrc                  # Environment defaults (source before running scripts)
├── setup.sh                # Create AKS cluster (system + H100 GPU pool)
├── install-dynamo.sh       # Install Dynamo Platform v1.0.2 via Helm
├── install-keda.sh         # Install KEDA + Azure Workload Identity + apply manifests
├── manifests/
│   ├── agg.yaml            # DynamoGraphDeployment: Frontend×2 + VllmDecodeWorker×2
│   ├── autoscale_ttf.yaml  # KEDA ScaledObject: TTFT p95 → 2–4 decode workers
│   ├── config_prome.yaml   # AMA ConfigMap: enable pod-annotation scraping (dynamo-cloud)
│   └── quickstart.yaml     # DynamoGraphDeploymentRequest (Qwen3-0.6B, single GPU)
└── blog/
    └── 2026-05-08/
        └── nvidia-dynamo-on-aks/
            └── index.md    # Blog post (Docusaurus format)
```
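
`manifests/autoscale_ttf.yaml` is what turns the TTFT signal into replica counts. A minimal sketch of its shape, assuming the deployment name, metric name, and endpoint shown here (the file in this repo is authoritative; check `kubectl get deploy -n dynamo-cloud` for the real scale target):

```bash
# Sketch only — field values below are assumptions, not the repo's exact manifest.
kubectl apply -n dynamo-cloud -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-prometheus-auth
spec:
  podIdentity:
    provider: azure-workload   # do NOT add authModes here (see the notes at the end)
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-decode-ttft
spec:
  scaleTargetRef:
    name: vllm-agg-vllmdecodeworker   # assumed deployment name
  minReplicaCount: 2
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://<amw>.eastus2.prometheus.monitor.azure.com  # = PROMETHEUS_ENDPOINT
        query: histogram_quantile(0.95, sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket[2m])) by (le))
        threshold: "0.3"              # 300 ms TTFT p95
      authenticationRef:
        name: azure-prometheus-auth
EOF
```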
| Setting | Value |
|---|---|
| Cluster name | dicasati-dynamo |
| Resource group | rg-dynamo |
| Location | eastus2 |
| Subscription | ME-MngEnv330367-dicasati-1 |
| Kubernetes version | 1.34.0 |
| System pool VM | Standard_D4ds_v5 (1 node) |
| GPU pool VM | Standard_NC40ads_H100_v5 (1–4) |
| GPU driver | AKS Managed GPU (--enable-managed-gpu) |
| OIDC / Workload ID | Enabled |
| Azure Monitor | Managed Prometheus (EastUS2) |
| Grafana | dynamo in rg-dynamo (westus3) |
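
`setup.sh` wraps the cluster creation; a minimal sketch of the equivalent `az` calls, distilled from the table above (the exact flag set in `setup.sh` is authoritative, the GPU pool name `gpupool` is a placeholder, and `--enable-managed-gpu` may require the aks-preview extension):

```bash
# Sketch of the cluster the table above describes; setup.sh is the source of truth.
az aks create -g rg-dynamo -n dicasati-dynamo -l eastus2 \
  --kubernetes-version 1.34.0 \
  --node-vm-size Standard_D4ds_v5 --node-count 1 \
  --enable-oidc-issuer --enable-workload-identity \
  --enable-azure-monitor-metrics \
  --generate-ssh-keys
az aks nodepool add -g rg-dynamo --cluster-name dicasati-dynamo -n gpupool \
  --node-vm-size Standard_NC40ads_H100_v5 \
  --enable-cluster-autoscaler --min-count 1 --max-count 4 \
  --enable-managed-gpu   # per the table above; preview feature
```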
```
aiperf load → Frontend (×2) → VllmDecodeWorker (×2–4)
                                       ↑
                             KEDA ScaledObject
                    (TTFT p95 > 300 ms → scale up)
                                       ↑
                   Azure Managed Prometheus (AMP)
                 (pod annotations scraped by AMA)
                                       ↑
                   UAMI keda-prometheus-reader
                 (Monitoring Data Reader on AMW)
           federated credential → keda-operator SA
```
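
`install-keda.sh` wires up the bottom half of this diagram. A minimal sketch of that identity chain, assuming KEDA runs in the `keda` namespace and using `<amw-name>` as a placeholder for the Azure Monitor workspace the script resolves:

```bash
# UAMI that KEDA assumes via workload identity (names match the diagram).
az identity create -g rg-dynamo -n keda-prometheus-reader
PRINCIPAL_ID=$(az identity show -g rg-dynamo -n keda-prometheus-reader --query principalId -o tsv)
CLIENT_ID=$(az identity show -g rg-dynamo -n keda-prometheus-reader --query clientId -o tsv)

# Grant it read access on the Azure Monitor workspace.
AMW_ID=$(az monitor account show -g rg-dynamo -n <amw-name> --query id -o tsv)
az role assignment create --assignee "$PRINCIPAL_ID" \
  --role "Monitoring Data Reader" --scope "$AMW_ID"

# Federate the UAMI to the keda-operator ServiceAccount and annotate it.
ISSUER=$(az aks show -g rg-dynamo -n dicasati-dynamo --query oidcIssuerProfile.issuerUrl -o tsv)
az identity federated-credential create -g rg-dynamo \
  --identity-name keda-prometheus-reader -n keda-operator \
  --issuer "$ISSUER" --subject system:serviceaccount:keda:keda-operator \
  --audiences api://AzureADTokenExchange
kubectl annotate serviceaccount keda-operator -n keda \
  azure.workload.identity/client-id="$CLIENT_ID" --overwrite
```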
| Variable | Description |
|---|---|
| `CLUSTER_NAME` | AKS cluster name |
| `RESOURCE_GROUP` | Resource group |
| `LOCATION` | Azure region |
| `KUBECONFIG` | Path to kubeconfig (defaults to `./cluster.config`) |
| `PROMETHEUS_ENDPOINT` | Azure Managed Prometheus query endpoint URL |
| `NGC_API_KEY` | NGC API key for nvcr.io image pulls (required) |
| `HF_TOKEN` | HuggingFace token (optional; needed for gated models) |
| `GRAFANA_NAME` | Azure Managed Grafana resource name |
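
A filled-in `.envrc` for the cluster in the tables above might look like this (the `PROMETHEUS_ENDPOINT` value is a placeholder; copy the query endpoint from your Azure Monitor workspace):

```bash
export CLUSTER_NAME=dicasati-dynamo
export RESOURCE_GROUP=rg-dynamo
export LOCATION=eastus2
export KUBECONFIG=./cluster.config
export PROMETHEUS_ENDPOINT=https://<amw>.eastus2.prometheus.monitor.azure.com
export NGC_API_KEY=<your-ngc-api-key>
export HF_TOKEN=<optional-hf-token>   # only needed for gated models
export GRAFANA_NAME=dynamo
```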
- `authModes: "bearer"` must NOT be set when using `provider: azure-workload` in a TriggerAuthentication. The workload identity provider handles auth automatically; setting it causes KEDA to look for a static secret and fail.
- The aks-preview extension overrides `az aks nodepool update` and injects a `gpuProfile.nvidia.managementMode` change that is rejected by the API. Use `az rest` with a PUT to the ARM API to enable the cluster autoscaler on GPU pools (see the sketch after this list).
- The `dynamo-crds` chart was removed as a standalone chart in Dynamo v1.0.x; the CRDs are now bundled inside `dynamo-platform`. `install-dynamo.sh` handles this gracefully.
- The NGC pull secret must be created after the `dynamo-system` namespace exists. `install-dynamo.sh` now creates the namespace first.
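
A minimal sketch of that `az rest` workaround, assuming `$SUBSCRIPTION_ID` is set, a GPU pool named `gpupool`, and a current-enough `api-version` (all placeholders; adjust to your environment):

```bash
# GET the agent pool, toggle autoscaling in the JSON, PUT it back via ARM.
POOL_URL="https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/rg-dynamo/providers/Microsoft.ContainerService/managedClusters/dicasati-dynamo/agentPools/gpupool?api-version=2024-05-01"
az rest --method get --url "$POOL_URL" > pool.json
jq '.properties.enableAutoScaling = true | .properties.minCount = 1 | .properties.maxCount = 4' \
  pool.json > pool-updated.json
az rest --method put --url "$POOL_URL" \
  --headers "Content-Type=application/json" \
  --body @pool-updated.json
```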