Cannot establish GPU-operator with GDRDMA #588

ReyRen · 2023-09-25T02:24:24Z

1. Quick Debug Information

OS/Version(e.g. RHEL8.6, Ubuntu22.04):
CentOS 7.9
Kernel Version:
Linux a800-master 3.10.0-1160.95.1.el7.x86_64 SMP Mon Jul 24 13:59:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):
Docker
K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS):
K8s 1.20.2
GPU Operator Version:
22.9.1

2. Issue or feature description

While trying to use the GPUDirect RDMA (GDRDMA) feature, I followed the deployment instructions in the GPU Operator documentation. The OFED driver is already installed directly on the physical machine (not in containerized form), so I set the parameters "--set driver.rdma.enabled=true --set driver.rdma.useHostMofed=true". However, the driver daemonset pod reports an error:

Here is the pod status:

4. Information to attach (optional if deemed irrelevant)

kubernetes driver pods logs:

kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE

Full debug bundle already sent to [email protected]*
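
For reference, the two parameters quoted in the description would typically be passed to Helm roughly as sketched below. The release name `gpu-operator`, the default namespace, the repo alias `nvidia`, and the chart version string `v22.9.1` (derived from the reported operator version 22.9.1) are assumptions, not values confirmed in this issue.

```shell
# Sketch only: release name, namespace, repo alias, and chart version
# are assumptions; adjust them to match the actual deployment.
install_gpu_operator_rdma() {
  ns="${1:-gpu-operator}"   # assumed namespace, not stated in the issue

  # Assumes the NVIDIA Helm repo was added beforehand under the alias "nvidia":
  #   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
  helm install gpu-operator nvidia/gpu-operator \
    --namespace "$ns" --create-namespace \
    --version v22.9.1 \
    --set driver.rdma.enabled=true \
    --set driver.rdma.useHostMofed=true
}
```

Calling `install_gpu_operator_rdma` with no argument uses the assumed default namespace; pass a namespace as the first argument to override it.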


shivamerla · 2023-10-19T05:28:25Z

@ReyRen, from the debug bundle provided, it looks like the driver pod logs are truncated. Can you get the logs from the "nvidia-driver-ctr" container within the driver pod? It looks like the NVIDIA driver install is not going through. Attaching the dmesg output would also help.
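
One way to collect what is requested here is sketched below. The function name, the output file names, and the namespace and pod name arguments are hypothetical placeholders; only the container name `nvidia-driver-ctr` comes from this thread. The `dmesg` step is meant to be run on the node where the driver pod is failing and may require root.

```shell
# Sketch only: collect_driver_logs and the output file names are
# illustrative. Pass the operator namespace and the failing driver pod name.
collect_driver_logs() {
  ns="$1"            # namespace where the GPU Operator is deployed
  pod="$2"           # failing driver pod, e.g. from: kubectl get pods -n "$ns"
  out="${3:-.}"      # directory to write the log files into

  # Full logs from the nvidia-driver-ctr container only
  kubectl logs -n "$ns" "$pod" -c nvidia-driver-ctr > "$out/nvidia-driver-ctr.log"

  # Logs from the previous container instance, if the container restarted
  kubectl logs -n "$ns" "$pod" -c nvidia-driver-ctr --previous \
    > "$out/nvidia-driver-ctr.previous.log" 2>/dev/null || true

  # Kernel messages from the affected node
  dmesg > "$out/dmesg.log"
}
```

For example, `collect_driver_logs OPERATOR_NAMESPACE DRIVER_POD_NAME /tmp` would write three log files to `/tmp` that can be attached to the issue.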

ruta-04 · 2024-02-20T21:53:39Z

I am also facing a similar issue. In my case, I want to enable RDMA and disable useHostMofed for a Network Operator installation on OpenShift:

[https://docs.nvidia.com/networking/display/cokan10/network+operator#src-39285883_NetworkOperator-DOCP]

Apart from the GPU-operator and monitoring pods, all others are stuck in Init state.
