
# NVIDIA Dynamo on AKS

Deploy NVIDIA Dynamo on Azure Kubernetes Service (AKS) with H100 GPU nodes for disaggregated LLM inference, including KEDA autoscaling driven by p95 time-to-first-token (TTFT) latency from Azure Managed Prometheus.

## Quick Start

```bash
# 0. Prerequisites: set your NGC API key (https://ngc.nvidia.com/setup/api-key)
cp .envrc.example .envrc   # or edit .envrc directly
export NGC_API_KEY=<your-ngc-api-key>

# 1. Create the AKS cluster (OIDC + Workload Identity + Managed GPU + Azure Monitor)
./setup.sh -x create-cluster

# 2. Install Dynamo Platform (operator + NATS + CRDs + NGC pull secret)
source .envrc
./install-dynamo.sh -x install

# 3. Install KEDA + wire Azure Managed Prometheus autoscaling
./install-keda.sh -x install

# 4. Expose the vllm-agg frontend
kubectl patch svc vllm-agg-frontend -n dynamo-cloud \
  -p '{"spec":{"type":"LoadBalancer","ports":[{"port":8000,"targetPort":8000}]}}'

# 5. Wait for EXTERNAL-IP, then smoke test
EXTERNAL_IP=$(kubectl get svc vllm-agg-frontend -n dynamo-cloud \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl "http://$EXTERNAL_IP:8000/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Hi!"}]}'

# 6. Load test — watch KEDA scale decode workers as TTFT p95 exceeds 300 ms
aiperf --base-url "http://$EXTERNAL_IP:8000" \
  --model Qwen/Qwen3-0.6B \
  --num-prompts 200 --concurrency 32
kubectl get hpa -n dynamo-cloud -w
```
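
If the smoke test hangs, the decode workers are often still pulling the model; a quick sanity check (namespace and service name from the steps above):

```bash
# All vllm-agg pods should be Running/Ready, and the service needs an EXTERNAL-IP.
kubectl get pods -n dynamo-cloud
kubectl get svc vllm-agg-frontend -n dynamo-cloud
```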

## Repository Structure

```
.
├── .envrc                               # Environment defaults (source before running scripts)
├── setup.sh                             # Create AKS cluster (system + H100 GPU pool)
├── install-dynamo.sh                    # Install Dynamo Platform v1.0.2 via Helm
├── install-keda.sh                      # Install KEDA + Azure Workload Identity + apply manifests
├── manifests/
│   ├── agg.yaml                         # DynamoGraphDeployment: Frontend×2 + VllmDecodeWorker×2
│   ├── autoscale_ttf.yaml               # KEDA ScaledObject: TTFT p95 → 2–4 decode workers
│   ├── config_prome.yaml                # AMA ConfigMap: enable pod-annotation scraping (dynamo-cloud)
│   └── quickstart.yaml                  # DynamoGraphDeploymentRequest (Qwen3-0.6B, single GPU)
└── blog/
    └── 2026-05-08/
        └── nvidia-dynamo-on-aks/
            └── index.md                 # Blog post (Docusaurus format)
```

## Cluster Configuration

| Setting | Value |
|---|---|
| Cluster name | `dicasati-dynamo` |
| Resource group | `rg-dynamo` |
| Location | `eastus2` |
| Subscription | ME-MngEnv330367-dicasati-1 |
| Kubernetes version | 1.34.0 |
| System pool VM | `Standard_D4ds_v5` (1 node) |
| GPU pool VM | `Standard_NC40ads_H100_v5` (1–4 nodes) |
| GPU driver | AKS Managed GPU (`--enable-managed-gpu`) |
| OIDC / Workload ID | Enabled |
| Azure Monitor | Managed Prometheus (eastus2) |
| Grafana | `dynamo` in `rg-dynamo` (westus3) |
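
For orientation, here is a minimal sketch of what `./setup.sh -x create-cluster` roughly runs; the script is the source of truth, and the exact flag set below is an assumption reconstructed from the table above:

```bash
# Sketch only; setup.sh is authoritative. --enable-managed-gpu is a preview
# flag (per the table above) and may require the aks-preview CLI extension.
az aks create \
  --name dicasati-dynamo \
  --resource-group rg-dynamo \
  --location eastus2 \
  --kubernetes-version 1.34.0 \
  --node-vm-size Standard_D4ds_v5 \
  --node-count 1 \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --enable-azure-monitor-metrics

# Dedicated H100 pool. Autoscaling to 1-4 nodes is enabled afterwards via
# `az rest` (see Known Issues / Gotchas below).
az aks nodepool add \
  --cluster-name dicasati-dynamo \
  --resource-group rg-dynamo \
  --name gpupool \
  --node-vm-size Standard_NC40ads_H100_v5 \
  --node-count 1 \
  --enable-managed-gpu
```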

## Autoscaling Architecture

```
aiperf load → Frontend (×2) → VllmDecodeWorker (×2–4)
                                      ↑
                              KEDA ScaledObject
                         (TTFT p95 > 300 ms → scale up)
                                      ↑
                        Azure Managed Prometheus (AMP)
                         ← pod annotations scraped by AMA
                                      ↑
                        UAMI keda-prometheus-reader
                         (Monitoring Data Reader on AMW)
                         Federated credential → keda-operator SA
```
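
In Kubernetes terms this wiring is a `TriggerAuthentication` that delegates auth to workload identity plus a `ScaledObject` with a Prometheus trigger. The sketch below is illustrative only: `manifests/autoscale_ttf.yaml` is authoritative, and the resource names, metric, and PromQL query here are assumptions.

```bash
# Illustrative sketch; manifests/autoscale_ttf.yaml holds the real definition.
# Resource names and the TTFT metric/query below are placeholders.
kubectl apply -n dynamo-cloud -f - <<EOF
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-prometheus-auth
spec:
  podIdentity:
    provider: azure-workload          # UAMI via federated credential; no authModes (see Known Issues)
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ttft-autoscale
spec:
  scaleTargetRef:
    name: vllm-agg-vllmdecodeworker   # placeholder decode-worker Deployment name
  minReplicaCount: 2
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: ${PROMETHEUS_ENDPOINT}   # AMP query endpoint from .envrc (expanded by the shell)
        query: histogram_quantile(0.95, sum(rate(ttft_seconds_bucket[2m])) by (le))  # placeholder metric
        threshold: "0.3"              # scale up when TTFT p95 exceeds 300 ms
      authenticationRef:
        name: azure-prometheus-auth
EOF
```

With `provider: azure-workload`, the keda-operator service account exchanges its federated credential for a token of the `keda-prometheus-reader` UAMI, so no bearer secret is stored anywhere.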

## Environment Variables (.envrc)

| Variable | Description |
|---|---|
| `CLUSTER_NAME` | AKS cluster name |
| `RESOURCE_GROUP` | Resource group |
| `LOCATION` | Azure region |
| `KUBECONFIG` | Path to kubeconfig (defaults to `./cluster.config`) |
| `PROMETHEUS_ENDPOINT` | Azure Managed Prometheus query endpoint URL |
| `NGC_API_KEY` | NGC API key for nvcr.io image pulls (required) |
| `HF_TOKEN` | Hugging Face token (optional; needed for gated models) |
| `GRAFANA_NAME` | Azure Managed Grafana resource name |
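
A filled-in `.envrc` might look like this; every value below is a placeholder, and the shipped `.envrc` carries the real defaults:

```bash
# Example values only; the repository's .envrc is authoritative.
export CLUSTER_NAME="dicasati-dynamo"
export RESOURCE_GROUP="rg-dynamo"
export LOCATION="eastus2"
export KUBECONFIG="./cluster.config"
export PROMETHEUS_ENDPOINT="https://<your-amw>.eastus2.prometheus.monitor.azure.com"  # AMP query endpoint
export NGC_API_KEY="<your-ngc-api-key>"   # required for nvcr.io pulls
export HF_TOKEN="<your-hf-token>"         # optional; gated models only
export GRAFANA_NAME="dynamo"
```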

## Known Issues / Gotchas

- `authModes: "bearer"` must NOT be set when using `provider: azure-workload` in a `TriggerAuthentication`. The workload identity provider handles auth automatically; setting it causes KEDA to look for a static secret and fail. The sketch under Autoscaling Architecture above omits it for this reason.

- The aks-preview extension overrides `az aks nodepool update` and injects a `gpuProfile.nvidia.managementMode` change that the API rejects. Use `az rest` with a PUT to the ARM API to enable the cluster autoscaler on GPU pools (first sketch after this list).

- The `dynamo-crds` chart was removed as a standalone chart in Dynamo v1.0.x; the CRDs are now bundled inside `dynamo-platform`. `install-dynamo.sh` handles this gracefully.

- The NGC pull secret must be created after the `dynamo-system` namespace exists; `install-dynamo.sh` now creates the namespace first (second sketch after this list).
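
First sketch, for the GPU-pool autoscaler workaround; the pool name, `api-version`, and counts are assumptions, and `jq` is assumed to be installed:

```bash
# Sketch only: read the pool via ARM, flip the autoscaler fields, PUT it back.
POOL_URL="https://management.azure.com/subscriptions/<sub-id>/resourceGroups/rg-dynamo/providers/Microsoft.ContainerService/managedClusters/dicasati-dynamo/agentPools/gpupool?api-version=2024-05-01"

az rest --method get --url "$POOL_URL" > pool.json
jq '.properties += {enableAutoScaling: true, minCount: 1, maxCount: 4}' pool.json > pool-scaled.json
az rest --method put --url "$POOL_URL" --body @pool-scaled.json
```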

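Second sketch, for the namespace-ordering gotcha; the secret name is an assumption, while `$oauthtoken` is the literal username NGC expects:

```bash
# Sketch only; install-dynamo.sh performs this for you, namespace first.
kubectl create namespace dynamo-system --dry-run=client -o yaml | kubectl apply -f -
kubectl create secret docker-registry ngc-pull-secret \
  --namespace dynamo-system \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
```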
