error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory #591
Comments
@ppetko can you check the logs of
Hi @shivamerla, it looks like it failed.
@ppetko can you get the logs from the
@cdesiniotis there is no such pod
This is the cluster policy I'm using
Is the vGPU manager already installed on the host (e.g. does running …)? Can you also describe your GPU nodes? In particular, I am interested in the value of this node label
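For readers hitting the same error: the checks being asked for here can be sketched roughly as below. These commands are assumptions on my part, not taken from this thread; they require access to the GPU host and the cluster, and the node name is a placeholder.

```
# On the GPU host: check whether any vGPU host modules are loaded and
# whether mediated devices exist. The error in this issue means
# /sys/bus/mdev/devices is missing, i.e. no vGPU manager is active.
lsmod | grep -i vgpu
ls /sys/bus/mdev/devices

# From the cluster: list the nvidia.com/* labels on a GPU node
# ("my-gpu-node" is a placeholder).
oc describe node my-gpu-node | grep nvidia.com
```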
According to the docs, the vGPU manager should be deployed by the NVIDIA operator. In the
These are all of the nvidia labels
Can you
It looks like I don't have the DaemonSet for the vgpu-manager, which explains why I don't see any pods. I have specified this label
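For context on the missing DaemonSet: per the NVIDIA OpenShift virtualization docs linked in this issue, the operator only schedules the vGPU Manager on nodes carrying a workload-type label. A hedged sketch of checking and setting it (the label name is taken from those docs; the namespace and node name are assumptions to verify against your deployment):

```
# List GPU Operator DaemonSets; a vGPU Manager DaemonSet should appear
# once a node is labeled for the vm-vgpu workload type.
oc get daemonset -n nvidia-gpu-operator

# Label a GPU node so the operator deploys the vGPU Manager instead of
# the regular datacenter driver ("my-gpu-node" is a placeholder).
oc label node my-gpu-node nvidia.com/gpu.workload.config=vm-vgpu --overwrite
```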
I have opened this case, but it hasn't gotten much traction: https://forums.developer.nvidia.com/t/rror-getting-vgpu-config-error-getting-all-vgpu-devices-unable-to-read-mdev-devices-directory-open-sys-bus-mdev-devices-no-such-file-or-directory/267696

This is the output of all resources in the namespace
This doesn't seem right; if the node is labelled as
Ah, the section below is wrong.
This should be
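For reference, a well-formed vGPU Manager section of a ClusterPolicy generally looks like the sketch below, based on the NVIDIA docs linked earlier in this issue. The registry, image, and version values are placeholders; the vGPU Manager image is not publicly distributed, so these must point at your own private registry build.

```yaml
# Sketch only: registry path and version are placeholders, not values
# from this thread.
spec:
  vgpuManager:
    enabled: true
    repository: registry.example.com/nvidia
    image: vgpu-manager
    version: "535.104.06"   # example driver branch; use your entitled build
```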
Hm, interesting - this YAML was generated by the ClusterPolicy install using the UI. Look at the logs below... Let me redeploy with the correct YAML file.
A little heads-up in the docs would be nice: once you deploy the ClusterPolicy, the operator rolls the cluster and restarts each node. I see two new MachineConfigs applied, and the cluster is trying to update. The problem is that it's stuck on a node that doesn't have a GPU. I have already loaded the kernel parameters for the GPUs using a MachineConfig targeting only the nodes that contain a GPU. What exactly are the MachineConfigs trying to configure? Are there any docs on this process? The kernel modules are already loaded.
Output of the mcp
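The MachineConfigPool state discussed here can be inspected with standard OpenShift commands; a sketch, with the pool name `worker` assumed:

```
# Overall pool status, including the UPDATED/UPDATING/DEGRADED columns.
oc get mcp

# Details on why a pool is degraded, and its node selector - useful for
# checking whether the selector matches labels set by the GPU Operator.
oc describe mcp worker
```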
On the bright side, I think the deployment is fixed. I confirmed that the ClusterPolicy UI in version 23.6.1 provided by NVIDIA Corporation generates a wrong ClusterPolicy.
@ppetko AFAIK, we don't update MachineConfig at all from our code. What is the actual change being applied through MachineConfig? Maybe some other operator (OSV?) triggered that?
From what I can see, as soon as we applied the correct
Now the worker machine config pool is in degraded state.
I will create a smaller cluster with GPU nodes only and then I will attempt the installation again. Thank you.
@fabiendupont any idea why the MachineConfig got updated in this case?
I don't see an obvious reason. It could be that the MachineConfigPool node selector uses labels created by the NVIDIA GPU Operator. @ppetko, can you describe the MachineConfigPool?
1. Quick Debug Information
2. Issue or feature description
We can't configure the vGPUs using the NVIDIA operator following the docs here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/openshift-virtualization.html
3. Steps to reproduce the issue
This is our cluster policy
4. Debug info
4.1 When we specify this label
nvidia.com/vgpu.config=A100-1-5C
for each node

4.2 When we don't specify any specific GPU labels and let the NVIDIA operator handle the selection
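The per-node vGPU profile selection described above is done with a node label; a sketch of both cases (node name is a placeholder, and the profile value must be one supported by the GPU in that node):

```
# Case 4.1: pin the A100-1-5C vGPU profile on a node
# ("my-gpu-node" is a placeholder).
oc label node my-gpu-node nvidia.com/vgpu.config=A100-1-5C --overwrite

# Case 4.2: remove the label so the operator falls back to its default
# profile selection for that GPU.
oc label node my-gpu-node nvidia.com/vgpu.config-
```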