1. Quick Debug Information
2. Issue or feature description
Container nvidia-peermem-ctr in the nvidia-driver-daemonset pod crashed. As shown in the logs below, RHEL_VERSION was not set. I think the container should have mounted /etc/os-release so that it could inspect RHEL_VERSION like the other containers in the same pod. The failure was at the line

DNF_RELEASEVER="${RHEL_VERSION}"

in /usr/local/bin/nvidia-driver.
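For illustration, here is a minimal sketch (not the shipped script) of the kind of fallback I would expect, assuming the host's /etc/os-release were mounted into this container at /host-etc/os-release as it is for the other containers:

```bash
# Sketch only: fall back to the host's os-release when RHEL_VERSION is not injected.
# Assumes /host-etc/os-release is mounted from the host, as for the other containers.
if [ -z "${RHEL_VERSION:-}" ] && [ -f /host-etc/os-release ]; then
    RHEL_VERSION="$(. /host-etc/os-release && echo "${RHEL_VERSION:-${VERSION_ID}}")"
fi
DNF_RELEASEVER="${RHEL_VERSION}"
```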
3. Steps to reproduce the issue
I just installed the GPU Operator recently with mostly default settings on a RHCOS 4.13 / OpenShift 4.12 cluster (spec.driver.rdma.enabled=true and spec.driver.rdma.useHostMofed=false). My workaround was to pin spec.driver.version in my clusterpolicy to an older driver version (535.104.05) instead of the latest one (535.104.12?).
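For reference, one way to apply such a pin (a sketch; it assumes the ClusterPolicy is named gpu-cluster-policy, which may differ on your cluster):

```bash
# Sketch: pin the driver to the older version via the ClusterPolicy.
# "gpu-cluster-policy" is an assumed name; check yours with "oc get clusterpolicy".
oc patch clusterpolicy/gpu-cluster-policy --type merge \
  -p '{"spec":{"driver":{"version":"535.104.05"}}}'
```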
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
kubectl get po:
oc logs nvidia-driver-daemonset-412.86.202306132230-0 -c nvidia-peermem-ctr
oc describe po nvidia-driver-daemonset-412.86.202306132230-0
In the describe output, Containers.nvidia-peermem-ctr should have had a mount for /host-etc/os-release, but it did not.
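A quick way to confirm this (run in the same project as the commands above; the pod name is the one from my cluster):

```bash
# List the volumeMounts of the nvidia-peermem-ctr container; in my case
# /host-etc/os-release was missing, unlike for the other containers in the pod.
oc get pod nvidia-driver-daemonset-412.86.202306132230-0 \
  -o jsonpath='{.spec.containers[?(@.name=="nvidia-peermem-ctr")].volumeMounts}'
```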
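@takeshi-yoshimura we are aware of this issue and are fixing it as part of the v23.9.1 release later this month. As a workaround, you can edit the nvidia-driver-daemonset and add an env var RHEL_VERSION="" to the nvidia-peermem-ctr container.

For example, a sketch of one way to do that (DRIVER_DAEMONSET_NAME and OPERATOR_NAMESPACE are placeholders for your cluster's values):

```bash
# Sketch of the suggested workaround: inject an empty RHEL_VERSION env var into the
# nvidia-peermem-ctr container. DRIVER_DAEMONSET_NAME and OPERATOR_NAMESPACE are placeholders.
oc set env daemonset/DRIVER_DAEMONSET_NAME -n OPERATOR_NAMESPACE \
  -c nvidia-peermem-ctr RHEL_VERSION=""
```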