
Conversation

@sjmiller609 commented Jan 5, 2026

Note

Adds vGPU (SR-IOV/mdev) support with full lifecycle and visibility across the stack.

  • GPU/vGPU core: Detect host mode (vgpu/passthrough), list vGPU profiles, create/destroy mdevs, reconcile orphaned mdevs on startup
  • API changes: POST /instances accepts gpu.profile; instance responses include gpu { profile, mdev_uuid }; /resources now returns gpu status (mode, total/used slots, profiles or devices); OpenAPI/spec and generated types updated
  • Hypervisor: QEMU args support mdev via vfio-pci,sysfsdev=/sys/bus/mdev/devices/<uuid> and standard PCI via host address
  • Instances: Persist GPUProfile/GPUMdevUUID; attach mdev to VM config; clean up mdev on delete; remove HasGPU from config disk
  • System/initrd: Drop NVIDIA driver injection; add kernel headers tarball and setup for DKMS inside guest
  • Tests: New integration/vgpu_test.go; GPU e2e/inference tests updated to install drivers via DKMS and validate GPU visibility
  • Misc: /resources conversion helpers for GPU; simplified installer (removes setcap/user setup); dev runner executes the binary with sudo
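The hypervisor bullet above can be sketched as a small Go helper. The function name and shape are illustrative assumptions; only the two QEMU vfio-pci argument forms come from the PR description.

```go
package main

import "fmt"

// gpuDeviceArg builds the QEMU -device value for a GPU attachment.
// For a vGPU (mdev), QEMU takes the mdev's sysfs node via
// vfio-pci,sysfsdev=...; for whole-GPU passthrough it takes the
// host PCI address via vfio-pci,host=....
// Hypothetical helper, not the PR's actual code.
func gpuDeviceArg(mdevUUID, pciAddr string) string {
	if mdevUUID != "" {
		return fmt.Sprintf("vfio-pci,sysfsdev=/sys/bus/mdev/devices/%s", mdevUUID)
	}
	return fmt.Sprintf("vfio-pci,host=%s", pciAddr)
}

func main() {
	fmt.Println(gpuDeviceArg("a1b2c3d4-0000-0000-0000-000000000000", ""))
	fmt.Println(gpuDeviceArg("", "0000:3b:00.0"))
}
```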

Written by Cursor Bugbot for commit 3b86dfe. This will update automatically on new commits.

github-actions bot commented Jan 5, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Add vGPU support

Edit this comment to update it. It will appear in the SDK's changelogs.

hypeman-go studio · code · diff

Your SDK built successfully.
generate ⚠️ · lint ✅ · test ✅

go get github.com/stainless-sdks/hypeman-go@5fce1464a4feab41cf7c17dce3594cb258b2c76b
hypeman-typescript studio · code · diff

Your SDK built successfully.
generate ⚠️ · build ✅ · lint ✅ · test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/ccd1f652a022ffccc9ab83ee5177ac5550a37eb2/dist.tar.gz
hypeman-cli studio

Unknown conclusion: fatal


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2026-01-15 20:37:26 UTC

Base automatically changed from resources to main January 5, 2026 22:05
Resolved conflicts:
- cmd/api/main.go: Added devices import for mdev reconciliation
- lib/devices/GPU.md: Merged vGPU API docs with driver upgrade docs
- lib/instances/configdisk.go: Updated import paths to kernel
- lib/instances/delete.go: Added devices import for mdev cleanup
- lib/oapi/oapi.go: Regenerated to include GPU types
- lib/system/initrd.go: Updated import paths to kernel
- lib/system/versions.go: Added NVIDIA module/driver URL maps

Also updated import paths from onkernel to kernel in:
- integration/vgpu_test.go
- lib/devices/mdev.go
- lib/resources/gpu.go
@sjmiller609 sjmiller609 marked this pull request as ready for review January 15, 2026 20:04
@tembo bot left a comment

Nice, cohesive pass on vGPU support end-to-end (API → instance metadata → QEMU args → host mdev lifecycle + startup reconciliation), and the integration coverage is a good sanity check.

A few small robustness tweaks I left inline:

  • Make ListMdevDevices resilient to mdevctl JSON format changes by falling back to sysfs if parsing fails.
  • Skip vf.HasMdev when selecting a VF in CreateMdev.
  • Prefer filepath.Base when extracting PCI addresses from sysfs paths in QEMU arg building.
  • Avoid returning mdev_uuid: "" in the instance API response.
  • Log failures when mdev cleanup fails during instance create rollback.
  • Confirm the install script’s move to root:root + 0640 config is intentional (it implies sudo for config reads).
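The filepath.Base suggestion above can be illustrated with a short sketch; the helper name is an assumption, not the PR's actual code. filepath.Base avoids manual string slicing when pulling the PCI address out of a sysfs device path, and filepath.Clean makes it tolerant of trailing separators.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// pciAddrFromSysfsPath extracts the PCI address (the last path
// element) from a sysfs device path such as
// /sys/bus/pci/devices/0000:3b:00.4. Illustrative helper only.
func pciAddrFromSysfsPath(p string) string {
	// Clean first so a trailing "/" does not yield an empty base.
	return filepath.Base(filepath.Clean(p))
}

func main() {
	fmt.Println(pciAddrFromSysfsPath("/sys/bus/pci/devices/0000:3b:00.4"))
	fmt.Println(pciAddrFromSysfsPath("/sys/bus/pci/devices/0000:3b:00.4/"))
}
```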

@sjmiller609 sjmiller609 requested a review from rgarcia January 15, 2026 20:20
continue
}
// Profile is available - count all free VFs on this parent
count += len(parentVFs)
Contributor

potential issue: this assumes if one VF can create a profile, all free VFs on that parent can too. but available_instances reflects remaining GPU resources (VRAM), not per-VF capacity.

example: if you have 16 VFs, 10 free, and 4GB VRAM remaining with L40S-4Q profiles, this would report available: 10 when really only 1 more can be created before VRAM exhaustion.

a more accurate approach might be to sum available_instances across all free VFs (though that might also double-count shared resources). alternatively, just read available_instances once from any VF since it reflects GPU-wide remaining capacity, not per-VF capacity.

not a blocker if this is known/acceptable behavior, but worth documenting or revisiting.
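The reviewer's alternative (read available_instances once, since it reflects GPU-wide remaining capacity) can be sketched as follows. The cap by the number of free VFs is my added assumption, since a new vGPU still needs a free VF to land on; the function name is hypothetical.

```go
package main

import "fmt"

// availableSlots estimates how many more vGPUs of a profile can be
// created on a parent GPU. available_instances is GPU-wide remaining
// capacity (e.g. bounded by VRAM), so the answer is capped both by it
// and by the number of free VFs - not freeVFs alone, and not a sum of
// available_instances across VFs, which would double-count shared VRAM.
func availableSlots(freeVFs, availableInstances int) int {
	if availableInstances < freeVFs {
		return availableInstances
	}
	return freeVFs
}

func main() {
	// Reviewer's example: 10 free VFs, but VRAM for only 1 more L40S-4Q.
	fmt.Println(availableSlots(10, 1)) // 1, not 10
}
```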

@rgarcia left a comment

lgtm!
