NVIDIA AI Cluster Runtime


AI Cluster Runtime (AICR) makes it easy to stand up GPU-accelerated Kubernetes clusters. It captures known-good combinations of drivers, operators, kernels, and system configurations and publishes them as version-locked recipes — reproducible artifacts for Helm, ArgoCD, and other deployment frameworks.

Why We Built This

Running GPU-accelerated Kubernetes clusters reliably is hard. Small differences in kernel versions, drivers, container runtimes, operators, and Kubernetes releases can cause failures that are difficult to diagnose and expensive to reproduce.

Historically, this knowledge has lived in internal validation pipelines and runbooks. AI Cluster Runtime makes it available to everyone.

Every AICR recipe is:

  • Optimized — Tuned for a specific combination of hardware, cloud, OS, and workload intent.
  • Validated — Passes automated constraint and compatibility checks before publishing.
  • Reproducible — Same inputs produce identical deployments every time.

Quick Start

Install and generate your first recipe in under two minutes:

# Install the CLI (Homebrew)
brew tap NVIDIA/aicr
brew install aicr

# Or use the install script
curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s --


# Capture your cluster's current state
aicr snapshot --output snapshot.yaml

# Generate a validated recipe for your environment
aicr recipe --service eks --accelerator h100 --os ubuntu \
  --intent training --platform kubeflow -o recipe.yaml

# Validate the recipe against your cluster
aicr validate --recipe recipe.yaml --snapshot snapshot.yaml

# Render into deployment-ready Helm charts
aicr bundle --recipe recipe.yaml -o ./bundles

The bundles/ directory contains per-component Helm charts with values files, checksums, and deployer configs. Deploy with helm install, commit to a GitOps repo, or use the built-in ArgoCD deployer.
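As a sketch of that layout and the plain-Helm path, the commands below mock up a single component folder and show the corresponding install step. The component name (gpu-operator) and file names are illustrative assumptions, not guaranteed output of aicr bundle:

```shell
# Hypothetical bundle layout -- the component and file names below are
# assumptions for illustration, not the tool's guaranteed output.
mkdir -p bundles/gpu-operator
touch bundles/gpu-operator/values.yaml \
      bundles/gpu-operator/checksums.txt \
      bundles/gpu-operator/README.md
ls bundles/gpu-operator

# Each component folder is then a standalone Helm release, e.g.:
# helm install gpu-operator ./bundles/gpu-operator \
#   -f bundles/gpu-operator/values.yaml
```

The same folder can instead be committed to a GitOps repository and reconciled by ArgoCD or Flux.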

See the Installation Guide for manual installation, building from source, and container images.

Features

  • aicr CLI — Single binary. Generate recipes, create bundles, capture snapshots, validate configs.
  • API Server (aicrd) — REST API with the same capabilities as the CLI. Run in-cluster for CI/CD integration or air-gapped environments.
  • Snapshot Agent — Kubernetes Job that captures live cluster state (GPU hardware, drivers, OS, operators) into a ConfigMap for validation against recipes.
  • Supply Chain Security — SLSA Level 3 provenance, signed SBOMs, image attestations (cosign), and checksum verification on every release.
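The checksum-verification step can be sketched with standard tools; image signatures are checked separately with cosign. This is a generic illustration of the mechanism, with placeholder file names rather than AICR's actual release artifacts:

```shell
# Generic sketch of release checksum verification -- the file names here
# are placeholders, not AICR's actual release artifacts.
echo "example artifact" > aicr-binary
sha256sum aicr-binary > checksums.txt

# A consumer re-computes the hash and compares it to the published list;
# any tampered artifact fails the check.
sha256sum --check checksums.txt

# Image signatures and attestations would be verified separately, e.g.
# with `cosign verify <image>` (requires network access, omitted here).
```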

Supported Components

This release supports:

  • Kubernetes — Amazon EKS, GKE, self-managed (Kind)
  • GPUs — NVIDIA H100, GB200
  • OS — Ubuntu
  • Workloads — Training (Kubeflow), Inference (Dynamo)
  • Components — GPU Operator, Network Operator, cert-manager, Prometheus stack, etc.

See the full Component Catalog for every component that can appear in a recipe. Don't see what you need? Open an issue — that feedback directly shapes what gets validated next.

How It Works

(Diagram: AICR end-to-end workflow)

A recipe is a version-locked configuration for a specific environment. You describe your target (cloud, GPU, OS, workload intent), and the recipe engine matches it against a library of validated overlays — layered configurations that compose bottom-up from base defaults through cloud, accelerator, OS, and workload-specific tuning.
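Conceptually, the matched overlays collapse into a single version-locked document. The sketch below exists only to illustrate that layering; every field name in it is a hypothetical assumption, not the actual recipe schema:

```yaml
# Hypothetical recipe shape -- all field names are illustrative
# assumptions, not the real aicr schema.
target:
  service: eks          # cloud overlay
  accelerator: h100     # accelerator overlay
  os: ubuntu            # OS overlay
  intent: training      # workload overlay
  platform: kubeflow
components:
  gpu-operator:
    version: "<pinned-version>"   # locked by the recipe engine
    values:
      driver:
        enabled: true
```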

The bundler materializes a recipe into deployment-ready artifacts: one folder per component, each with Helm values, checksums, and a README. The validator compares a recipe against a live cluster snapshot and flags anything out of spec.

This separation means the same validated configuration works whether you deploy with Helm, ArgoCD, Flux, or a custom pipeline.

What AI Cluster Runtime Is Not

  • Not a Kubernetes distribution
  • Not a cluster provisioner or lifecycle management system
  • Not a managed control plane or hosted service
  • Not a replacement for your cloud provider or OEM platform

You bring your cluster and your tools. AI Cluster Runtime tells you what should be installed and how it should be configured.

Documentation

Choose the path that matches how you'll use the project.

  • User — Platform and Infrastructure Operators
  • Contributor — Developers and Maintainers
  • Integrator — Automation and Platform Engineers

Resources

  • Roadmap — Feature priorities and development timeline
  • Security — Supply chain security, vulnerability reporting, and verification
  • Releases — Binaries, SBOMs, and attestations
  • Issues — Bugs, feature requests, and questions

Contributing

AI Cluster Runtime is licensed under Apache 2.0. Contributions are welcome: new recipes for environments we haven't covered (OpenShift, AKS, bare metal), additional bundler formats, validation checks, or bug reports. See CONTRIBUTING.md for development setup and the PR process.
