bug: The configuration of relabeling still does not take effect! #605

Open
halohsu opened this issue Nov 6, 2023 · 7 comments

halohsu commented Nov 6, 2023

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04.4 LTS
  • Kernel Version: 5.4.0-147-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd://1.7.0-rc.1
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s, v1.26.2
  • GPU Operator Version: gpu-operator-v23.9.0

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

Relabeling is supported in the values.yaml file of the official repository:

dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.2.6-3.1.9-ubuntu20.04
  imagePullPolicy: IfNotPresent
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
  resources: {}
  serviceMonitor:
    enabled: false
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []
    # - source_labels:
    #     - __meta_kubernetes_pod_node_name
    #   regex: (.*)
    #   target_label: instance
    #   replacement: $1
    #   action: replace
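
For what it's worth, these relabelings map to Prometheus Operator RelabelConfig entries, which use camelCase keys rather than the Prometheus scrape-config snake_case shown in the commented example. A minimal sketch of the same example in camelCase form:

relabelings:
  - action: replace
    sourceLabels:
      - __meta_kubernetes_pod_node_name
    regex: (.*)
    replacement: $1
    targetLabel: instance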

I installed the latest version of the nvidia/gpu-operator chart using Helm and customized the values.yaml file:

cdi:
  enabled: true
  default: true
driver:
  enabled: false
  rdma:
    enabled: true
    useHostMofed: true
toolkit:
  enabled: false
validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "false"
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
    relabelings:
      - action: replace
        sourceLabels:
          - __meta_kubernetes_pod_node_name
        targetLabel: instance

My Helm releases:

$ helm ls --all-namespaces
NAME                 	NAMESPACE    	REVISION	UPDATED                              	STATUS  	CHART                       	APP VERSION
gpu-operator         	gpu-operator 	10      	2023-11-06 16:58:33.967677 +0800 CST 	deployed	gpu-operator-v23.9.0        	v23.9.0

But the relabeling configuration still does not take effect!
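
For reference, one way to check whether the Helm values reach the operator at all is to inspect the rendered ClusterPolicy (a sketch; cluster-policy is the name in a default install):

$ kubectl get clusterpolicies.nvidia.com cluster-policy -o jsonpath='{.spec.dcgmExporter.serviceMonitor}'; echo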

Others:

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

None.

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

halohsu commented Nov 13, 2023

Help!!!

tariq1890 (Contributor) commented:

@bluemiaomiao Can you share the yaml manifest of the rendered dcgm-exporter daemonset?

halohsu commented Nov 20, 2023

➜  ~ kubectl get daemonsets -n gpu-operator nvidia-dcgm-exporter -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "3"
    nvidia.com/last-applied-hash: "80895247"
    openshift.io/scc: nvidia-dcgm-exporter
  creationTimestamp: "2023-09-09T08:40:45Z"
  generation: 3
  labels:
    app: nvidia-dcgm-exporter
    app.kubernetes.io/managed-by: gpu-operator
    helm.sh/chart: gpu-operator-v23.9.0
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: cluster-policy
    uid: f1b90f5e-45ba-4270-a048-ab210729fa91
  resourceVersion: "237868434"
  uid: 475fafee-3427-4b7b-8488-042ea3ef82df
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nvidia-dcgm-exporter
        app.kubernetes.io/managed-by: gpu-operator
        helm.sh/chart: gpu-operator-v23.9.0
    spec:
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: /etc/dcgm-exporter/dcp-metrics-included.csv
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.6-3.1.9-ubuntu20.04
        imagePullPolicy: IfNotPresent
        name: nvidia-dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for
          nvidia container stack to be setup; sleep 5; done
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/nvidia
          mountPropagation: HostToContainer
          name: run-nvidia
      nodeSelector:
        nvidia.com/gpu.deploy.dcgm-exporter: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      runtimeClassName: nvidia
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: nvidia-dcgm-exporter
      serviceAccountName: nvidia-dcgm-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /run/nvidia
          type: ""
        name: run-nvidia
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 26
  desiredNumberScheduled: 26
  numberAvailable: 26
  numberMisscheduled: 0
  numberReady: 26
  observedGeneration: 3
  updatedNumberScheduled: 26
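
(Note: the relabelings would not show up in the DaemonSet in any case; they are rendered into the ServiceMonitor, so that is probably the more relevant object to inspect, e.g.:)

$ kubectl get servicemonitors.monitoring.coreos.com -n gpu-operator nvidia-dcgm-exporter -o yaml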

@tariq1890

halohsu commented Nov 20, 2023

When I manually add the configuration via kubectl edit -n gpu-operator servicemonitors.monitoring.coreos.com nvidia-dcgm-exporter, my relabeling configuration is removed again after a while.
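
This is expected for an operator-managed object: the GPU operator reconciles the ServiceMonitor from the ClusterPolicy, so direct edits to the child resource get overwritten. In theory the relabelings have to come in through the ClusterPolicy instead (via the Helm values above, or by editing it directly), roughly:

$ kubectl edit clusterpolicies.nvidia.com cluster-policy
# then, under spec.dcgmExporter.serviceMonitor, add e.g.:
#   relabelings:
#     - action: replace
#       sourceLabels:
#         - __meta_kubernetes_pod_node_name
#       targetLabel: instance

Of course, if the operator drops relabelings when rendering the ServiceMonitor (which is what this issue suggests), editing the ClusterPolicy will not help either.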

halohsu commented Nov 20, 2023

This is a serious bug that affects our production monitoring and alerting; the Pod IP that ends up in the instance label has no practical value for us.

halohsu commented Nov 20, 2023

At present, any change made via Lens or kubectl edit is lost again after a few minutes. A workaround is to remove the built-in ServiceMonitor:

kubectl delete -n gpu-operator servicemonitors.monitoring.coreos.com nvidia-dcgm-exporter
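
A custom ServiceMonitor with the needed relabelings can then be managed outside the operator. A minimal sketch (the selector label and port name below are assumptions, verify them against kubectl get svc -n gpu-operator nvidia-dcgm-exporter -o yaml; you may also want dcgmExporter.serviceMonitor.enabled: false so the operator does not recreate the built-in one):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter-custom         # any name not managed by the operator
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter      # assumed Service label; verify on the Service
  endpoints:
    - port: gpu-metrics              # assumed port name; verify on the Service
      interval: 15s
      relabelings:
        - action: replace
          sourceLabels:
            - __meta_kubernetes_pod_node_name
          targetLabel: instance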

derselbst commented:

I just ran into the same problem and tried a similar relabeling for the ServiceMonitor. In my case, the Helm chart failed to install when using the example config:

error validating data: [ValidationError(ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings[0]): unknown field "source_labels" in com.nvidia.v1.ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings, ValidationError(ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings[0]): unknown field "target_label" in com.nvidia.v1.ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings]

I had to change them to targetLabel and sourceLabels to make the installation work. Yet I don't see any relabeling taking effect:

        relabelings:
          - sourceLabels: [ __meta_kubernetes_pod_node_name ]
            action: replace
            targetLabel: kubernetes_node
          - sourceLabels: [ __meta_kubernetes_pod_container_name ]
            action: replace
            targetLabel: container
          - sourceLabels: [ __meta_kubernetes_namespace ]
            action: replace
            targetLabel: namespace
          - sourceLabels: [ __meta_kubernetes_pod_name ]
            action: replace
            targetLabel: pod
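
To narrow down where the relabelings get lost, it may help to compare what the ClusterPolicy schema accepts (kubectl explain should work here, since the CRD publishes the schema that produced the validation error above) with what actually lands on the generated ServiceMonitor:

$ kubectl explain clusterpolicy.spec.dcgmExporter.serviceMonitor.relabelings
$ kubectl get servicemonitors.monitoring.coreos.com -n gpu-operator nvidia-dcgm-exporter -o jsonpath='{.spec.endpoints[*].relabelings}'; echo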
