1. Quick Debug Checklist
Are you running on an Ubuntu 18.04 node? - No, the nodes run RHEL 7.9
Are you running Kubernetes v1.13+?
Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
Do you have i2c_core and ipmi_msghandler loaded on the nodes?
Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? - See the checks below.
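For reference, checks along these lines cover the items above (assuming oc access to the cluster; ClusterPolicy is cluster-scoped, so no namespace is needed):
lsmod | grep -e i2c_core -e ipmi_msghandler    # run on the GPU node itself
oc get crd clusterpolicies.nvidia.com          # the ClusterPolicy CRD is installed
oc get clusterpolicy                           # the ClusterPolicy instance exists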
1. Issue or feature description
I am trying to install the NVIDIA GPU operator on an OpenShift 4.9 cluster on IBM Cloud.
It's a single-node cluster; the node has 2x P100 cards and ample CPU/RAM/storage.
I was able to install the operator from OperatorHub smoothly (I have tried versions 1.9, 1.10, and 1.11).
When I create the ClusterPolicy, it is created but its status never becomes Ready.
2. Steps to reproduce the issue
Not sure
3. Information to attach (optional if deemed irrelevant)
I tried some of the troubleshooting steps from the NVIDIA docs for this.
On other clusters, after creating the ClusterPolicy I would immediately see a lot of pods being created in the Init state. Here I only see the operator pod in the nvidia-gpu-operator namespace.
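(oc get pods -n nvidia-gpu-operator is a quick way to confirm this; only the gpu-operator pod is listed.)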
I checked the operator logs with oc logs -f -n nvidia-gpu-operator -lapp=gpu-operator
I consistently see this error:
I0701 11:31:15.012773 1 request.go:665] Waited for 1.000296617s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/packages.operators.coreos.com/v1?timeout=32s
1.6566750761031365e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
1.656675076103603e+09 INFO setup starting manager
1.6566750761038709e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.656675076103909e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0701 11:31:16.103963 1 leaderelection.go:248] attempting to acquire leader lease nvidia-gpu-operator/53822513.nvidia.com...
I0701 11:31:31.435254 1 leaderelection.go:258] successfully acquired lease nvidia-gpu-operator/53822513.nvidia.com
1.6566750914353292e+09 DEBUG events Normal {"object": {"kind":"ConfigMap","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"31a99e69-e0f7-42be-99d8-f2b130d5355c","apiVersion":"v1","resourceVersion":"2067460"}, "reason": "LeaderElection", "message": "gpu-operator-776dbc5f44-4fttb_ea480d69-70b5-4528-a7bb-c8264500c94a became leader"}
1.6566750914354377e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"f2f831e0-0afa-47ea-b981-aeff213040be","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2067461"}, "reason": "LeaderElection", "message": "gpu-operator-776dbc5f44-4fttb_ea480d69-70b5-4528-a7bb-c8264500c94a became leader"}
1.6566750914354844e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.ClusterPolicy"}
1.65667509143553e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.Node"}
1.6566750914355376e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.DaemonSet"}
1.6566750914355426e+09 INFO controller.clusterpolicy-controller Starting Controller
1.6566750915367239e+09 INFO controllers.ClusterPolicy Reconciliate ClusterPolicies after node label update {"nb": 0}
1.6566750915367982e+09 INFO controller.clusterpolicy-controller Starting workers {"worker count": 1}
1.6566751728138194e+09 ERROR controllers.ClusterPolicy Failed to initialize ClusterPolicy controller {"error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751728139005e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "gpu-cluster-policy", "namespace": "", "error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751729343596e+09 ERROR controllers.ClusterPolicy Failed to initialize ClusterPolicy controller {"error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751729344525e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "gpu-cluster-policy", "namespace": "", "error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
I am not able to pinpoint why it is stuck on this particular error.
I have also looked at NFD (I guessed the operator might not be getting the cluster version from the node labels, but the labels are the same as on another cluster where everything works).
Googling the error pointed me to the ClusterVersion CRD. I checked it, and it is defined properly in the cluster.
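For reference, checks along these lines confirm the CRD and the cluster-scoped object (named version) are present:
oc get crd clusterversions.config.openshift.io
oc get clusterversion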
Hi @shivamerla
I have the same issue, it looks like below
And I think the installation of the GPU operator from OperatorHub should NOT depend on the OCP cluster version being "Completed".
Once the ClusterVersion exists, whether it is Completed or Partial, the cluster should be treated as OCP.
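A quick way to see which state the cluster version history is actually in (assuming, from the wording of the error, that the operator looks for a Completed entry in ClusterVersion status.history; I have not confirmed this against the operator code):
oc get clusterversion version -o jsonpath='{.status.history[*].state}{"\n"}'
oc get clusterversion version -o jsonpath='{.status.history[?(@.state=="Completed")].version}{"\n"}'
On a managed cluster like this IBM Cloud one, the history may only contain Partial entries, which would match the error.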