This reference architecture provides a production-ready framework for orchestrating robotics and AI workloads on Microsoft Azure using NVIDIA technologies such as Isaac Lab, Isaac Sim, and OSMO. It demonstrates end-to-end reinforcement learning workflows, scalable training pipelines, and deployment processes with Azure-native authentication, storage, and ML services.
OSMO handles workflow orchestration and job scheduling, while Azure provides elastic GPU compute, persistent checkpointing, MLflow experiment tracking, and enterprise-grade security.
- Infrastructure as Code - Terraform modules referencing microsoft/edge-ai components for reproducible deployments
- Containerized Workflows - Docker-based Isaac Lab training with NVIDIA GPU support
- CI/CD Integration - Automated deployment pipelines with GitHub Actions
- MLflow Integration - Automatic experiment tracking and model versioning
- Automatic metric logging from SKRL agents to Azure ML
- Comprehensive tracking of episode statistics, losses, optimization metrics, and timing data
- Configurable logging intervals and metric filtering
- See MLflow Integration Guide for details; a minimal logging sketch also appears after this feature list
- Scalable Compute - Auto-scaling GPU nodes based on workload demands
- Cost Optimization - Pay-per-use compute with automatic scaling
- Enterprise Security - Entra ID integration
- Global Deployment - Multi-region support for worldwide teams
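The MLflow integration above means SKRL training metrics land in the Azure ML workspace automatically. For orientation only, the sketch below shows the generic pattern of pointing MLflow at an Azure ML workspace and logging metrics by hand; it assumes the azure-ai-ml and azureml-mlflow packages are installed, uses placeholder workspace identifiers, and is not this repository's bootstrap code (see the MLflow Integration Guide for the supported path).

```python
# Minimal sketch, not this repository's bootstrap utilities: point MLflow at an
# Azure ML workspace and log a few metrics. All identifiers are placeholders.
import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<aml-workspace>",
)
workspace = ml_client.workspaces.get("<aml-workspace>")
mlflow.set_tracking_uri(workspace.mlflow_tracking_uri)
mlflow.set_experiment("isaaclab-example")  # experiment name is an assumption

with mlflow.start_run():
    mlflow.log_param("num_envs", 1024)
    for step in range(0, 10_000, 1_000):
        # In the real pipeline these values come from the SKRL agent's trackers.
        mlflow.log_metric("Episode/total_reward", step * 0.01, step=step)
```

In the actual training scripts under src/training/scripts/, this wiring is handled for you; the sketch only illustrates where the metrics end up.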
This reference architecture integrates:
- NVIDIA OSMO - Workflow orchestration and job scheduling
- Azure Machine Learning - Experiment tracking and model management
- Azure Kubernetes Service - Software in the Loop (SIL) training
- Azure Arc for Kubernetes - Software in the Loop (SIL) and Hardware in the Loop (HIL) training
- Azure Storage - Persistent data and checkpoint storage
- Azure Key Vault - Secure credential management (a secret-access sketch follows this list)
- Azure Monitor - Comprehensive logging and metrics
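As an illustration of the Key Vault and Entra ID items above, the sketch below shows the standard pattern for reading a secret (for example, an NGC API key) with DefaultAzureCredential rather than embedding credentials in configuration. The vault URL and secret name are placeholders, not values defined by this architecture.

```python
# Minimal sketch with assumed names: fetch a secret from Azure Key Vault using
# Entra ID credentials (developer login, CI identity, or AKS workload identity).
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://<key-vault-name>.vault.azure.net",  # placeholder vault
    credential=DefaultAzureCredential(),
)
ngc_api_key = client.get_secret("ngc-api-key").value  # secret name is an assumption
```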
OSMO orchestration on Azure enables production-scale robotics training across industries. Some examples include:
- Warehouse AMRs - Train navigation policies with 1000+ parallel environments on auto-scaling AKS GPU nodes, checkpoint to Azure Storage, track experiments in Azure ML
- Manufacturing Arms - Develop manipulation strategies with physics-accurate simulation, leveraging Azure's global regions for distributed teams and pay-per-use GPU compute
- Legged Robots - Optimize locomotion policies with MLflow experiment tracking for sim-to-real transfer
- Collaborative Robots - Create safe interaction policies with Azure Monitor logging and metrics, enabling compliance auditing and performance diagnostics at scale
See OSMO workflow examples for job configuration templates.
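The checkpointing mentioned in the warehouse AMR example typically amounts to persisting model files in Azure Blob Storage so training runs survive node preemption and scale-down. A minimal sketch with the azure-storage-blob SDK follows; the account URL, container name, and file paths are placeholders, and the workflows under deploy/ may wire storage up differently (for example, via mounts).

```python
# Minimal sketch with assumed names: upload a training checkpoint to Azure Blob
# Storage using Entra ID credentials. Paths and container name are illustrative.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("checkpoints")  # assumed container name
with open("runs/example/model_1000.pt", "rb") as checkpoint:
    container.upload_blob(name="example/model_1000.pt", data=checkpoint, overwrite=True)
```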
- uv - Python package manager and environment tool
- Python 3.11 (required by Isaac Sim 5.X)
- Azure CLI (v2.50+)
- Terraform (v1.5+)
- NVIDIA OSMO CLI (latest)
- Docker with NVIDIA Container Toolkit
- hve-core - Copilot-assisted development workflows (install guide)
- Azure subscription with contributor access
- Sufficient quota for GPU VMs (Standard_NC6s_v3 or higher); a quota-check sketch follows this list
- Azure Machine Learning workspace (or permissions to create one)
- NVIDIA Developer account with OSMO access
- NGC API key for container registry access
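One way to confirm the GPU quota requirement before deploying is to query regional vCPU usage with the azure-mgmt-compute SDK (the Python equivalent of az vm list-usage). The region, family filter, and subscription ID below are assumptions; adjust them to your deployment.

```python
# Minimal sketch: list GPU-family vCPU quota for a region. Region, family filter,
# and subscription ID are assumptions; adjust to match your deployment.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")
for usage in client.usage.list(location="eastus"):
    if "NC" in usage.name.value:  # NC-series families cover Standard_NC6s_v3
        print(f"{usage.name.localized_value}: {usage.current_value}/{usage.limit}")
```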
Run the development setup script:

```bash
./setup-dev.sh
```

The setup script installs Python 3.11 via uv, creates a virtual environment at .venv/, and installs training dependencies.
Install hve-core for Copilot-assisted development workflows. The simplest method is the VS Code Extension.
The workspace is configured with python.analysis.extraPaths pointing to src/, enabling imports like:
```python
from training.utils import AzureMLContext, bootstrap_azure_ml
```

Select the .venv/bin/python interpreter in VS Code for IntelliSense support.
The workspace .vscode/settings.json also configures Copilot Chat to load instructions, prompts, and chat modes from hve-core:
| Setting | hve-core Paths |
|---|---|
| chat.modeFilesLocations | ../hve-core/.github/chatmodes, ../hve-core/copilot/beads/chatmodes |
| chat.instructionsFilesLocations | ../hve-core/.github/instructions, ../hve-core/copilot/beads/instructions |
| chat.promptFilesLocations | ../hve-core/.github/prompts, ../hve-core/copilot/beads/prompts |
These paths resolve when hve-core is installed as a peer directory or via the VS Code Extension. Without hve-core, Copilot still functions but shared conventions, prompts, and chat modes are unavailable.
Once a tests/ directory exists, run the test suite:
```bash
uv run pytest tests/
```

Run tests selectively by category:
```bash
# Unit tests only (fast, no external dependencies)
uv run pytest tests/ -m "not slow and not gpu"
```

See the Testing Requirements section in CONTRIBUTING.md for test organization, markers, and coverage targets.
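To make the marker filter concrete, a hypothetical test module might tag GPU- and time-intensive tests as shown below; file and test names are illustrative, and the slow and gpu markers would normally be registered in the pytest configuration.

```python
# tests/test_example.py -- illustrative only; names are not from this repository.
import pytest

def test_metric_key_format():
    # Plain unit test: selected by `-m "not slow and not gpu"`.
    assert "Episode/total_reward".startswith("Episode/")

@pytest.mark.slow
def test_short_training_run():
    # Longer integration test; excluded by the marker filter above.
    ...

@pytest.mark.gpu
def test_cuda_is_available():
    # Requires an NVIDIA GPU; excluded on CPU-only machines by the marker filter.
    torch = pytest.importorskip("torch")
    assert torch.cuda.is_available()
```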
Reverse the changes made by setup-dev.sh and tear down deployed infrastructure.
```bash
# Remove Python virtual environment
rm -rf .venv

# Remove cloned IsaacLab repository
rm -rf external/IsaacLab

# Remove Node.js linting dependencies (if installed separately via npm install)
rm -rf node_modules

# Remove uv cache (optional, frees disk space)
uv cache clean
```

To tear down the deployed Azure infrastructure:

```bash
cd deploy/001-iac
terraform destroy -var-file=terraform.tfvars
```

Warning: terraform destroy permanently deletes all deployed Azure resources, including storage, AKS clusters, and Key Vault. Ensure training data and model checkpoints are backed up before destroying infrastructure.
See Cost Considerations for details on resource costs and cleanup timing.
```
.
├── deploy/
│   ├── 000-prerequisites/   # Prerequisites validation and setup
│   ├── 001-iac/             # Infrastructure as Code deployment
│   ├── 002-setup/           # Post-infrastructure setup
│   ├── 003-data/            # Data preparation and upload
│   ├── 004-workflow/        # Training workflow execution
│   ├── job-templates/       # Job configuration templates
│   └── osmo/                # OSMO inline workflow submission (see osmo/README.md)
└── src/
    ├── terraform/           # Infrastructure as Code
    │   └── modules/         # Reusable Terraform modules
    └── training/            # Training code and tasks
        ├── common/          # Shared utilities
        ├── scripts/         # Framework-specific training scripts configured for Azure services
        │   ├── rsl_rl/      # RSL_RL training scripts
        │   └── skrl/        # SKRL training scripts
        └── tasks/           # Placeholder for Isaac Lab training tasks
```
This project is licensed under the MIT License. See LICENSE for details.
Review security guidance before deploying this reference architecture:
- SECURITY.md - vulnerability reporting and security considerations for deployers
- Security Guide - detailed security configuration inventory, deployment responsibilities, and checklist
For issues and questions:
- Review microsoft/edge-ai documentation
See the project roadmap for priorities, timelines, and success metrics covering Q1 2026 through Q1 2027.
This reference architecture builds upon:
- microsoft/edge-ai - Edge AI infrastructure components
- NVIDIA Isaac Lab - RL task framework
- NVIDIA Isaac Sim - Physics simulation
- NVIDIA OSMO - Workflow orchestration
- NVIDIA OSMO GitHub - Workflow orchestration