Azure-Samples/azure-nvidia-robotics-reference-architecture
πŸ€– Azure Robotics Reference Architecture with NVIDIA OSMO

This reference architecture provides a production-ready framework for orchestrating robotics and AI workloads on Microsoft Azure using NVIDIA technologies such as Isaac Lab, Isaac Sim, and OSMO. It demonstrates end-to-end reinforcement learning workflows, scalable training pipelines, and deployment processes with Azure-native authentication, storage, and ML services.

πŸš€ Key Features

OSMO handles workflow orchestration and job scheduling, while Azure provides elastic GPU compute, persistent checkpointing, MLflow experiment tracking, and enterprise-grade security.

  • Infrastructure as Code - Terraform modules referencing microsoft/edge-ai components for reproducible deployments
  • Containerized Workflows - Docker-based Isaac Lab training with NVIDIA GPU support
  • CI/CD Integration - Automated deployment pipelines with GitHub Actions
  • MLflow Integration - Automatic experiment tracking and model versioning
    • Automatic metric logging from SKRL agents to Azure ML
    • Comprehensive tracking of episode statistics, losses, optimization metrics, and timing data
    • Configurable logging intervals and metric filtering
    • See MLflow Integration Guide for details
  • Scalable Compute - Auto-scaling GPU nodes based on workload demands
  • Cost Optimization - Pay-per-use compute with automatic scaling
  • Enterprise Security - Entra ID integration
  • Global Deployment - Multi-region support for worldwide teams

πŸ—Ό Architecture Overview

This reference architecture integrates:

  • NVIDIA OSMO - Workflow orchestration and job scheduling
  • Azure Machine Learning - Experiment tracking and model management
  • Azure Kubernetes Service - Software in the Loop (SIL) training
  • Azure Arc for Kubernetes - Software in the Loop (SIL) and Hardware in the Loop (HIL) training
  • Azure Storage - Persistent data and checkpoint storage
  • Azure Key Vault - Secure credential management
  • Azure Monitor - Comprehensive logging and metrics

🌍 Real World Examples

OSMO orchestration on Azure enables production-scale robotics training across industries. Some examples include:

  • Warehouse AMRs - Train navigation policies with 1000+ parallel environments on auto-scaling AKS GPU nodes, checkpoint to Azure Storage, track experiments in Azure ML
  • Manufacturing Arms - Develop manipulation strategies with physics-accurate simulation, leveraging Azure's global regions for distributed teams and pay-per-use GPU compute
  • Legged Robots - Optimize locomotion policies with MLflow experiment tracking for sim-to-real transfer
  • Collaborative Robots - Create safe interaction policies with Azure Monitor logging and metrics, enabling compliance auditing and performance diagnostics at scale

See OSMO workflow examples for job configuration templates.

πŸ§‘πŸ½β€πŸ’» Prerequisites and Requirements

Required Tools

Azure Requirements

  • Azure subscription with contributor access
  • Sufficient quota for GPU VMs (Standard_NC6s_v3 or higher)
  • Azure Machine Learning workspace (or permissions to create one)

NVIDIA Requirements

  • NVIDIA Developer account with OSMO access
  • NGC API key for container registry access

πŸƒβ€βž‘οΈ Quick Start

```shell
./setup-dev.sh
```

The setup script installs Python 3.11 via uv, creates a virtual environment at .venv/, and installs training dependencies.

Install hve-core for Copilot-assisted development workflows. The simplest method is the VS Code Extension.

VS Code Configuration

The workspace is configured with `python.analysis.extraPaths` pointing to `src/`, enabling imports like:

```python
from training.utils import AzureMLContext, bootstrap_azure_ml
```

Select the `.venv/bin/python` interpreter in VS Code for IntelliSense support.

The workspace .vscode/settings.json also configures Copilot Chat to load instructions, prompts, and chat modes from hve-core:

| Setting | hve-core Paths |
| --- | --- |
| `chat.modeFilesLocations` | `../hve-core/.github/chatmodes`, `../hve-core/copilot/beads/chatmodes` |
| `chat.instructionsFilesLocations` | `../hve-core/.github/instructions`, `../hve-core/copilot/beads/instructions` |
| `chat.promptFilesLocations` | `../hve-core/.github/prompts`, `../hve-core/copilot/beads/prompts` |

These paths resolve when hve-core is installed as a peer directory or via the VS Code Extension. Without hve-core, Copilot still functions but shared conventions, prompts, and chat modes are unavailable.
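Translated into `.vscode/settings.json`, those entries would look roughly like this. This is a sketch based on the paths listed above; VS Code's Copilot Chat settings map each location to an enablement flag, and the exact schema may vary by VS Code version:

```json
{
  "chat.modeFilesLocations": {
    "../hve-core/.github/chatmodes": true,
    "../hve-core/copilot/beads/chatmodes": true
  },
  "chat.instructionsFilesLocations": {
    "../hve-core/.github/instructions": true,
    "../hve-core/copilot/beads/instructions": true
  },
  "chat.promptFilesLocations": {
    "../hve-core/.github/prompts": true,
    "../hve-core/copilot/beads/prompts": true
  }
}
```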

πŸ§ͺ Running Tests

Once a tests/ directory exists, run the test suite:

```shell
uv run pytest tests/
```

Run tests selectively by category:

```shell
# Unit tests only (fast, no external dependencies)
uv run pytest tests/ -m "not slow and not gpu"
```

See the Testing Requirements section in CONTRIBUTING.md for test organization, markers, and coverage targets.
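The `slow` and `gpu` markers deselected above would be applied roughly like this. This is a hypothetical test module for illustration; the repository's actual marker set and registration live in its pytest configuration:

```python
# Hypothetical tests/test_example.py illustrating the marker convention.
# Markers would normally also be registered under [tool.pytest.ini_options]
# (or pytest.ini) to avoid "unknown marker" warnings.
import pytest


@pytest.mark.slow  # deselected by: uv run pytest -m "not slow and not gpu"
def test_full_training_run():
    assert True  # placeholder for an end-to-end training check


def test_unit_fast():  # unmarked: always selected, no external dependencies
    assert 1 + 1 == 2
```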

🧹 Cleanup and Uninstall

Reverse the changes made by setup-dev.sh and tear down deployed infrastructure.

Remove Development Environment

```shell
# Remove Python virtual environment
rm -rf .venv

# Remove cloned IsaacLab repository
rm -rf external/IsaacLab

# Remove Node.js linting dependencies (if installed separately via npm install)
rm -rf node_modules

# Remove uv cache (optional, frees disk space)
uv cache clean
```

Destroy Azure Infrastructure

```shell
cd deploy/001-iac
terraform destroy -var-file=terraform.tfvars
```

> [!WARNING]
> `terraform destroy` permanently deletes all deployed Azure resources including storage, AKS clusters, and Key Vault. Ensure training data and model checkpoints are backed up before destroying infrastructure.

See Cost Considerations for details on resource costs and cleanup timing.

🧱 Repository Structure

```text
.
├── deploy/
│   ├── 000-prerequisites/              # Prerequisites validation and setup
│   ├── 001-iac/                        # Infrastructure as Code deployment
│   ├── 002-setup/                      # Post-infrastructure setup
│   ├── 003-data/                       # Data preparation and upload
│   └── 004-workflow/                   # Training workflow execution
│       ├── job-templates/              # Job configuration templates
│       └── osmo/                       # OSMO inline workflow submission (see osmo/README.md)
├── src/
│   ├── terraform/                      # Infrastructure as Code
│   │   └── modules/                    # Reusable Terraform modules
│   └── training/                       # Training code and tasks
│       ├── common/                     # Shared utilities
│       ├── scripts/                    # Framework-specific training scripts configured for Azure services
│       │   ├── rsl_rl/                 # RSL_RL training scripts
│       │   └── skrl/                   # SKRL training scripts
│       └── tasks/                      # Placeholder for Isaac Lab training tasks
```

πŸͺͺ License

This project is licensed under the MIT License. See LICENSE for details.

πŸ”’ Security

Review security guidance before deploying this reference architecture:

  • SECURITY.md - vulnerability reporting and security considerations for deployers
  • Security Guide - detailed security configuration inventory, deployment responsibilities, and checklist

🀝 Support

For issues and questions:

πŸ—ΊοΈ Roadmap

See the project roadmap for priorities, timelines, and success metrics covering Q1 2026 through Q1 2027.

πŸ™ Acknowledgments

This reference architecture builds upon:
