Add vGPU support #52
base: main
✱ Stainless preview builds
- ✅ hypeman-go
- ✅ hypeman-typescript
- ❗ hypeman-cli
This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Resolved conflicts:
- cmd/api/main.go: Added devices import for mdev reconciliation
- lib/devices/GPU.md: Merged vGPU API docs with driver upgrade docs
- lib/instances/configdisk.go: Updated import paths to kernel
- lib/instances/delete.go: Added devices import for mdev cleanup
- lib/oapi/oapi.go: Regenerated to include GPU types
- lib/system/initrd.go: Updated import paths to kernel
- lib/system/versions.go: Added NVIDIA module/driver URL maps

Also updated import paths from onkernel to kernel in:
- integration/vgpu_test.go
- lib/devices/mdev.go
- lib/resources/gpu.go
Nice, cohesive pass on vGPU support end-to-end (API → instance metadata → QEMU args → host mdev lifecycle + startup reconciliation), and the integration coverage is a good sanity check.
A couple small robustness tweaks I left inline:
- Make `ListMdevDevices` resilient to `mdevctl` JSON format changes by falling back to sysfs if parsing fails.
- Skip `vf.HasMdev` when selecting a VF in `CreateMdev`.
- Prefer `filepath.Base` when extracting PCI addresses from sysfs paths in QEMU arg building.
- Avoid returning `mdev_uuid: ""` in the instance API response.
- Log failures when mdev cleanup fails during instance create rollback.
- Confirm the install script’s move to `root:root` + `0640` config is intentional (it implies `sudo` for config reads).
        continue
    }
    // Profile is available - count all free VFs on this parent
    count += len(parentVFs)
potential issue: this assumes if one VF can create a profile, all free VFs on that parent can too. but available_instances reflects remaining GPU resources (VRAM), not per-VF capacity.
example: if you have 16 VFs, 10 free, and 4GB VRAM remaining with L40S-4Q profiles, this would report available: 10 when really only 1 more can be created before VRAM exhaustion.
a more accurate approach might be to sum available_instances across all free VFs (though that might also double-count shared resources). alternatively, just read available_instances once from any VF since it reflects GPU-wide remaining capacity, not per-VF capacity.
not a blocker if this is known/acceptable behavior, but worth documenting or revisiting.
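A minimal sketch of the "read `available_instances` once" alternative, assuming the value really is GPU-wide for a given profile. The helper names `parseAvailableInstances` and `availableSlots` are hypothetical, and capping the result by the number of free VFs is an added assumption rather than something stated in the review:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseAvailableInstances parses the text content of an
// available_instances sysfs file (a decimal number plus a newline).
func parseAvailableInstances(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// availableSlots reports how many more instances of a profile can be
// created on one parent GPU: the smaller of the free VF count and the
// GPU-wide remaining capacity. Assumption: each new mdev needs one free VF.
func availableSlots(freeVFs, availableInstances int) int {
	if availableInstances < 0 {
		return 0
	}
	if availableInstances < freeVFs {
		return availableInstances
	}
	return freeVFs
}

func main() {
	// Scenario from the comment: 10 free VFs, capacity left for 1 instance.
	n, err := parseAvailableInstances("1\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(availableSlots(10, n)) // prints 1, not 10
}
```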
rgarcia left a comment
lgtm!
Note
Adds vGPU (SR-IOV/mdev) support with full lifecycle and visibility across the stack.
- `POST /instances` accepts `gpu.profile`; instance responses include `gpu { profile, mdev_uuid }`; `/resources` now returns `gpu` status (mode, total/used slots, profiles or devices); OpenAPI/spec and generated types updated
- `vfio-pci,sysfsdev=/sys/bus/mdev/devices/<uuid>` and standard PCI via host address
- `GPUProfile`/`GPUMdevUUID`; attach mdev to VM config; clean up mdev on delete; remove `HasGPU` from config disk
- `integration/vgpu_test.go`; GPU e2e/inference tests updated to install drivers via DKMS and validate GPU visibility
- `/resources` conversion helpers for GPU; installer simplifies (removes setcap/user); dev runner executes binary with sudo

Written by Cursor Bugbot for commit 3b86dfe. This will update automatically on new commits.