
Conversation

@sjmiller609 commented Jan 5, 2026

Note

Adds vGPU (SR-IOV/mdev) support with full lifecycle and visibility across the stack.

  • GPU/vGPU core: Detect host mode (vgpu/passthrough), list vGPU profiles, create/destroy mdevs, reconcile orphaned mdevs on startup
  • API changes: POST /instances accepts gpu.profile; instance responses include gpu { profile, mdev_uuid }; /resources now returns gpu status (mode, total/used slots, profiles or devices); OpenAPI/spec and generated types updated
  • Hypervisor: QEMU args support mdev via vfio-pci,sysfsdev=/sys/bus/mdev/devices/<uuid> and standard PCI via host address
  • Instances: Persist GPUProfile/GPUMdevUUID; attach mdev to VM config; clean up mdev on delete; remove HasGPU from config disk
  • System/initrd: Drop NVIDIA driver injection; add kernel headers tarball and setup for DKMS inside guest
  • Tests: New integration/vgpu_test.go; GPU e2e/inference tests updated to install drivers via DKMS and validate GPU visibility
  • Misc: /resources conversion helpers for GPU; simplified installer (removes setcap/user setup); dev runner executes the binary with sudo
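The hypervisor bullet above can be sketched as a small Go helper. The function name and shape are illustrative assumptions; only the two QEMU vfio-pci argument forms come from the PR description.

```go
package main

import "fmt"

// gpuDeviceArg builds the QEMU -device value for a GPU attachment.
// For a vGPU (mdev), QEMU takes the mdev's sysfs node via
// vfio-pci,sysfsdev=...; for whole-GPU passthrough it takes the
// host PCI address via vfio-pci,host=....
// Hypothetical helper, not the PR's actual code.
func gpuDeviceArg(mdevUUID, pciAddr string) string {
	if mdevUUID != "" {
		return fmt.Sprintf("vfio-pci,sysfsdev=/sys/bus/mdev/devices/%s", mdevUUID)
	}
	return fmt.Sprintf("vfio-pci,host=%s", pciAddr)
}

func main() {
	fmt.Println(gpuDeviceArg("a1b2c3d4-0000-0000-0000-000000000000", ""))
	fmt.Println(gpuDeviceArg("", "0000:3b:00.0"))
}
```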

Written by Cursor Bugbot for commit 3b86dfe. This will update automatically on new commits.

github-actions bot commented Jan 5, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Add vGPU support

Edit this comment to update it. It will appear in the SDK's changelogs.

hypeman-go studio · code · diff

Your SDK built successfully.
generate ⚠️ · lint ✅ · test ✅

go get github.com/stainless-sdks/hypeman-go@5fce1464a4feab41cf7c17dce3594cb258b2c76b
hypeman-typescript studio · code · diff

Your SDK built successfully.
generate ⚠️ · build ✅ · lint ✅ · test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/ccd1f652a022ffccc9ab83ee5177ac5550a37eb2/dist.tar.gz
hypeman-cli studio

Unknown conclusion: fatal


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2026-01-15 20:37:26 UTC

Base automatically changed from resources to main January 5, 2026 22:05
Resolved conflicts:
- cmd/api/main.go: Added devices import for mdev reconciliation
- lib/devices/GPU.md: Merged vGPU API docs with driver upgrade docs
- lib/instances/configdisk.go: Updated import paths to kernel
- lib/instances/delete.go: Added devices import for mdev cleanup
- lib/oapi/oapi.go: Regenerated to include GPU types
- lib/system/initrd.go: Updated import paths to kernel
- lib/system/versions.go: Added NVIDIA module/driver URL maps

Also updated import paths from onkernel to kernel in:
- integration/vgpu_test.go
- lib/devices/mdev.go
- lib/resources/gpu.go
@sjmiller609 sjmiller609 marked this pull request as ready for review January 15, 2026 20:04
@tembo bot left a comment

Nice, cohesive pass on vGPU support end-to-end (API → instance metadata → QEMU args → host mdev lifecycle + startup reconciliation), and the integration coverage is a good sanity check.

A few small robustness tweaks I left inline:

  • Make ListMdevDevices resilient to mdevctl JSON format changes by falling back to sysfs if parsing fails.
  • Skip vf.HasMdev when selecting a VF in CreateMdev.
  • Prefer filepath.Base when extracting PCI addresses from sysfs paths in QEMU arg building.
  • Avoid returning mdev_uuid: "" in the instance API response.
  • Log failures when mdev cleanup fails during instance create rollback.
  • Confirm the install script’s move to root:root + 0640 config is intentional (it implies sudo for config reads).
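The filepath.Base suggestion above can be illustrated with a short sketch; the helper name is an assumption, not the PR's actual code. filepath.Base avoids manual string slicing when pulling the PCI address out of a sysfs device path, and filepath.Clean makes it tolerant of trailing separators.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// pciAddrFromSysfsPath extracts the PCI address (the last path
// element) from a sysfs device path such as
// /sys/bus/pci/devices/0000:3b:00.4. Illustrative helper only.
func pciAddrFromSysfsPath(p string) string {
	// Clean first so a trailing "/" does not yield an empty base.
	return filepath.Base(filepath.Clean(p))
}

func main() {
	fmt.Println(pciAddrFromSysfsPath("/sys/bus/pci/devices/0000:3b:00.4"))
	fmt.Println(pciAddrFromSysfsPath("/sys/bus/pci/devices/0000:3b:00.4/"))
}
```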

@sjmiller609 sjmiller609 requested a review from rgarcia January 15, 2026 20:20
continue
}
// Profile is available - count all free VFs on this parent
count += len(parentVFs)
Contributor

potential issue: this assumes if one VF can create a profile, all free VFs on that parent can too. but available_instances reflects remaining GPU resources (VRAM), not per-VF capacity.

example: if you have 16 VFs, 10 free, and 4GB VRAM remaining with L40S-4Q profiles, this would report available: 10 when really only 1 more can be created before VRAM exhaustion.

a more accurate approach might be to sum available_instances across all free VFs (though that might also double-count shared resources). alternatively, just read available_instances once from any VF since it reflects GPU-wide remaining capacity, not per-VF capacity.

not a blocker if this is known/acceptable behavior, but worth documenting or revisiting.
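The reviewer's alternative (read available_instances once, since it reflects GPU-wide remaining capacity) can be sketched as follows. The cap by the number of free VFs is my added assumption, since a new vGPU still needs a free VF to land on; the function name is hypothetical.

```go
package main

import "fmt"

// availableSlots estimates how many more vGPUs of a profile can be
// created on a parent GPU. available_instances is GPU-wide remaining
// capacity (e.g. bounded by VRAM), so the answer is capped both by it
// and by the number of free VFs - not freeVFs alone, and not a sum of
// available_instances across VFs, which would double-count shared VRAM.
func availableSlots(freeVFs, availableInstances int) int {
	if availableInstances < freeVFs {
		return availableInstances
	}
	return freeVFs
}

func main() {
	// Reviewer's example: 10 free VFs, but VRAM for only 1 more L40S-4Q.
	fmt.Println(availableSlots(10, 1)) // 1, not 10
}
```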

@rgarcia left a comment

lgtm!
