Cannot establish GPU-operator with GDRDMA #588

Open
2 tasks done
ReyRen opened this issue Sep 25, 2023 · 2 comments

Comments

@ReyRen

ReyRen commented Sep 25, 2023

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04):
    Centos7.9
  • Kernel Version:
    Linux a800-master 3.10.0-1160.95.1.el7.x86_64 SMP Mon Jul 24 13:59:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):
    Docker
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS):
    K8s 1.20.2
  • GPU Operator Version:
    22.9.1

2. Issue or feature description

While trying to use the GDRDMA (GPUDirect RDMA) feature, I followed the deployment instructions in the GPU Operator documentation. The OFED driver is already installed directly on the physical machine (not containerized), so I set the parameters "--set driver.rdma.enabled=true --set driver.rdma.useHostMofed=true". However, the driver daemon pod reports an error:
[image]

Here is the pod status:
[image]
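
For reference, the install described above might look roughly like the sketch below. The two `--set` flags are the ones quoted in this report; the release name, namespace, and the `nvidia` chart repo alias are assumptions for illustration and are not taken from the issue. The script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Assumed values (not stated in the issue): release name, namespace, repo alias.
RELEASE="gpu-operator"
NAMESPACE="gpu-operator"
CHART="nvidia/gpu-operator"

# Print the Helm command rather than executing it.
# $1 is the value for driver.rdma.useHostMofed ("true" when OFED is on the host).
gpu_operator_install_cmd() {
  printf 'helm install %s %s -n %s --create-namespace --set driver.rdma.enabled=true --set driver.rdma.useHostMofed=%s\n' \
    "$RELEASE" "$CHART" "$NAMESPACE" "$1"
}

gpu_operator_install_cmd true
```

Passing `false` instead of `true` would correspond to a setup where the OFED driver is not provided by the host.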

4. Information to attach (optional if deemed irrelevant)

  • Kubernetes driver pod logs:
[image]
  • Kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
[image]

The full debug bundle has already been sent to [email protected]

@shivamerla
Contributor

@ReyRen from the debug bundle you provided, it looks like the driver pod logs are truncated. Can you get the logs from the "nvidia-driver-ctr" container within the driver pod? It looks like the NVIDIA driver install is not going through. Attaching the dmesg output would also help.
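
A minimal sketch of how the requested logs could be collected. The container name "nvidia-driver-ctr" comes from the comment above; the `gpu-operator` namespace and the `DRIVER_POD_NAME` placeholder are assumptions and need to be replaced with the real values from `kubectl get pods`. The script only prints the commands.

```shell
#!/bin/sh
# Assumed namespace; use the one the operator was actually installed into.
NAMESPACE="gpu-operator"

# Print the commands for gathering driver container logs and kernel messages.
# $1 is the driver pod name (hypothetical placeholder by default).
driver_debug_cmds() {
  pod="$1"
  # Logs of the current driver container instance.
  printf 'kubectl logs -n %s %s -c nvidia-driver-ctr\n' "$NAMESPACE" "$pod"
  # Logs of the previous instance, useful if the container is restarting.
  printf 'kubectl logs -n %s %s -c nvidia-driver-ctr --previous\n' "$NAMESPACE" "$pod"
  # Kernel messages with human-readable timestamps, run on the GPU node itself.
  printf 'dmesg -T\n'
}

driver_debug_cmds "DRIVER_POD_NAME"
```

Redirecting each command's output to a file makes it easier to attach the full, untruncated logs to the issue.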

@ruta-04

ruta-04 commented Feb 20, 2024

I am also facing a similar issue. In my case, I want to enable RDMA and disable useHostMofed for a Network Operator installation on OpenShift:

[https://docs.nvidia.com/networking/display/cokan10/network+operator#src-39285883_NetworkOperator-DOCP]

Apart from the GPU-operator and monitoring pods, all others are stuck in Init state.
