
Cluster Policy fails to start on Openshift 4.9 #368

Open · manishdash12 opened this issue Jul 1, 2022 · 3 comments

manishdash12 commented Jul 1, 2022

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node? - RHEL 7.9
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

I am trying to install the NVIDIA GPU operator on an OpenShift 4.9 cluster on IBM Cloud.
It is a single-node cluster; the node has 2x P100 cards and ample CPU/RAM/storage.

  • I was able to install the operator from OperatorHub smoothly (I have tried versions 1.9, 1.10, and 1.11).
  • When I create the ClusterPolicy, it gets created but its status never becomes Ready (a status check is sketched below).

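For reference, a minimal sketch of how the ClusterPolicy status can be checked (assuming the instance is named gpu-cluster-policy, as in the operator logs below; the exact status fields can vary by operator version):

    # ClusterPolicy is cluster-scoped; list it and its reported state
    oc get clusterpolicy
    # Inspect the status field directly
    oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
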
2. Steps to reproduce the issue

Not sure

3. Information to attach (optional if deemed irrelevant)

I tried some of the troubleshooting methods from the NVIDIA docs for this.

  1. After creating the ClusterPolicy, on other clusters I would immediately see a lot of pods being created in the Init state, but here I only see the operator pod in the nvidia-gpu-operator namespace.

  2. I tried to see the operator logs using the command oc logs -f -n nvidia-gpu-operator -lapp=gpu-operator

    I see a consistent error here:

I0701 11:31:15.012773       1 request.go:665] Waited for 1.000296617s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/packages.operators.coreos.com/v1?timeout=32s
1.6566750761031365e+09  INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8080"}
1.656675076103603e+09   INFO    setup   starting manager
1.6566750761038709e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.656675076103909e+09   INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0701 11:31:16.103963       1 leaderelection.go:248] attempting to acquire leader lease nvidia-gpu-operator/53822513.nvidia.com...
I0701 11:31:31.435254       1 leaderelection.go:258] successfully acquired lease nvidia-gpu-operator/53822513.nvidia.com
1.6566750914353292e+09  DEBUG   events  Normal  {"object": {"kind":"ConfigMap","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"31a99e69-e0f7-42be-99d8-f2b130d5355c","apiVersion":"v1","resourceVersion":"2067460"}, "reason": "LeaderElection", "message": "gpu-operator-776dbc5f44-4fttb_ea480d69-70b5-4528-a7bb-c8264500c94a became leader"}
1.6566750914354377e+09  DEBUG   events  Normal  {"object": {"kind":"Lease","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"f2f831e0-0afa-47ea-b981-aeff213040be","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2067461"}, "reason": "LeaderElection", "message": "gpu-operator-776dbc5f44-4fttb_ea480d69-70b5-4528-a7bb-c8264500c94a became leader"}
1.6566750914354844e+09  INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.ClusterPolicy"}
1.65667509143553e+09    INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.Node"}
1.6566750914355376e+09  INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.DaemonSet"}
1.6566750914355426e+09  INFO    controller.clusterpolicy-controller     Starting Controller
1.6566750915367239e+09  INFO    controllers.ClusterPolicy       Reconciliate ClusterPolicies after node label update    {"nb": 0}
1.6566750915367982e+09  INFO    controller.clusterpolicy-controller     Starting workers        {"worker count": 1}
1.6566751728138194e+09  ERROR   controllers.ClusterPolicy       Failed to initialize ClusterPolicy controller   {"error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751728139005e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "gpu-cluster-policy", "namespace": "", "error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751729343596e+09  ERROR   controllers.ClusterPolicy       Failed to initialize ClusterPolicy controller   {"error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6566751729344525e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "gpu-cluster-policy", "namespace": "", "error": "Failed to find Completed Cluster Version"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227

I am not able to pinpoint why this is stuck with this particular error.

  • I have looked at NFD (I guess it is not able to get the cluster version from the node labels, but they are the same as on another cluster where everything works).
  • Googling the error, the ClusterVersion CRD came up. I checked it and it is defined properly in the cluster (the checks are sketched below).
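
A minimal sketch of those checks (assuming the default NFD label prefix and the default ClusterVersion object name, version):

    # NFD-published labels on the node
    oc get nodes --show-labels | tr ',' '\n' | grep feature.node.kubernetes.io
    # The ClusterVersion CRD and its single instance
    oc get crd clusterversions.config.openshift.io
    oc get clusterversion version -o yaml
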
@shivamerla (Contributor)

@manishdash12 Can you get the output of oc get clusterversions -o yaml? We look for the last successfully updated version (i.e. state: Completed), for example:

    history:
    - completionTime: "2022-03-15T14:44:39Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:6a899c54dda6b844bb12a247e324a0f6cde367e880b73ba110c056df6d018032
      startedTime: "2022-03-15T14:19:46Z"
      state: Completed
      verified: false
      version: 4.9.24
    observedGeneration: 2
    versionHash: J4j8PKeiaRA=
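
If the full YAML is long, something along these lines should pull out just the Completed entries (a sketch, assuming the default ClusterVersion object name, version):

    oc get clusterversion version -o jsonpath='{.status.history[?(@.state=="Completed")].version}{"\n"}'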

@shivamerla (Contributor)

@manishdash12 any update on this?


MingZhang-YBPS commented Jan 23, 2024

@manishdash12 Can you get the output of oc get clusterversions -o yaml? We look for the last successfully updated version (i.e. state: Completed), for example:

    history:
    - completionTime: "2022-03-15T14:44:39Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:6a899c54dda6b844bb12a247e324a0f6cde367e880b73ba110c056df6d018032
      startedTime: "2022-03-15T14:19:46Z"
      state: Completed
      verified: false
      version: 4.9.24
    observedGeneration: 2
    versionHash: J4j8PKeiaRA=

Hi @shivamerla,
I have the same issue; it looks like the attached screenshot.
(screenshot attached)

And I think the installation of the GPU operator from OperatorHub should NOT depend on a "Completed" cluster version of OCP.
Once the ClusterVersion resource exists, whether its state is Completed or Partial, the cluster should be treated as OCP (see the sketch below).
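
A minimal sketch of that distinction (assuming the default ClusterVersion object name, version): the resource exists even while the most recent history entry is still Partial:

    # Most recent entry in the update history; can be "Partial" on a cluster whose
    # install or upgrade has not (yet) completed
    oc get clusterversion version -o jsonpath='{.status.history[0].state}{"\n"}'
    # The ClusterVersion object itself exists regardless of that state
    oc get clusterversion version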
