bug: The configuration of relabeling still does not take effect! #605

Open
halohsu opened this issue Nov 6, 2023 · 7 comments

halohsu commented Nov 6, 2023

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04.4 LTS
  • Kernel Version: 5.4.0-147-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd://1.7.0-rc.1
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s, v1.26.2
  • GPU Operator Version: gpu-operator-v23.9.0

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

Relabeling is supported in the values.yaml file of the official repository:

dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.2.6-3.1.9-ubuntu20.04
  imagePullPolicy: IfNotPresent
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
  resources: {}
  serviceMonitor:
    enabled: false
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []
    # - source_labels:
    #     - __meta_kubernetes_pod_node_name
    #   regex: (.*)
    #   target_label: instance
    #   replacement: $1
    #   action: replace
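
For what it's worth, these relabelings map to Prometheus Operator RelabelConfig entries, which use camelCase keys rather than the Prometheus scrape-config snake_case shown in the commented example. A minimal sketch of the same example in camelCase form:

relabelings:
  - action: replace
    sourceLabels:
      - __meta_kubernetes_pod_node_name
    regex: (.*)
    replacement: $1
    targetLabel: instance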

I installed the latest version of the nvidia/gpu-operator chart using Helm and customized the values.yaml file:

cdi:
  enabled: true
  default: true
driver:
  enabled: false
  rdma:
    enabled: true
    useHostMofed: true
toolkit:
  enabled: false
validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "false"
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
    relabelings:
      - action: replace
        sourceLabels:
          - __meta_kubernetes_pod_node_name
        targetLabel: instance

My Helm releases:

$ helm ls --all-namespaces
NAME                 	NAMESPACE    	REVISION	UPDATED                              	STATUS  	CHART                       	APP VERSION
gpu-operator         	gpu-operator 	10      	2023-11-06 16:58:33.967677 +0800 CST 	deployed	gpu-operator-v23.9.0        	v23.9.0

But the relabeling configuration still does not take effect!
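
For reference, one way to check whether the Helm values reach the operator at all is to inspect the rendered ClusterPolicy (a sketch; cluster-policy is the name in a default install):

$ kubectl get clusterpolicies.nvidia.com cluster-policy -o jsonpath='{.spec.dcgmExporter.serviceMonitor}'; echo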

Others:

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

None.

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

halohsu commented Nov 13, 2023

Help!!!

tariq1890 (Contributor) commented:

@bluemiaomiao Can you share the yaml manifest of the rendered dcgm-exporter daemonset?

halohsu commented Nov 20, 2023

➜  ~ kubectl get daemonsets -n gpu-operator nvidia-dcgm-exporter -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "3"
    nvidia.com/last-applied-hash: "80895247"
    openshift.io/scc: nvidia-dcgm-exporter
  creationTimestamp: "2023-09-09T08:40:45Z"
  generation: 3
  labels:
    app: nvidia-dcgm-exporter
    app.kubernetes.io/managed-by: gpu-operator
    helm.sh/chart: gpu-operator-v23.9.0
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: cluster-policy
    uid: f1b90f5e-45ba-4270-a048-ab210729fa91
  resourceVersion: "237868434"
  uid: 475fafee-3427-4b7b-8488-042ea3ef82df
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nvidia-dcgm-exporter
        app.kubernetes.io/managed-by: gpu-operator
        helm.sh/chart: gpu-operator-v23.9.0
    spec:
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: /etc/dcgm-exporter/dcp-metrics-included.csv
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.6-3.1.9-ubuntu20.04
        imagePullPolicy: IfNotPresent
        name: nvidia-dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for
          nvidia container stack to be setup; sleep 5; done
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/nvidia
          mountPropagation: HostToContainer
          name: run-nvidia
      nodeSelector:
        nvidia.com/gpu.deploy.dcgm-exporter: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      runtimeClassName: nvidia
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: nvidia-dcgm-exporter
      serviceAccountName: nvidia-dcgm-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /run/nvidia
          type: ""
        name: run-nvidia
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 26
  desiredNumberScheduled: 26
  numberAvailable: 26
  numberMisscheduled: 0
  numberReady: 26
  observedGeneration: 3
  updatedNumberScheduled: 26
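
(Note: the relabelings would not show up in the DaemonSet in any case; they are rendered into the ServiceMonitor, so that is probably the more relevant object to inspect, e.g.:)

$ kubectl get servicemonitors.monitoring.coreos.com -n gpu-operator nvidia-dcgm-exporter -o yaml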

@tariq1890

halohsu commented Nov 20, 2023

When I manually add the configuration via kubectl edit -n gpu-operator servicemonitors.monitoring.coreos.com nvidia-dcgm-exporter, my relabeling configuration is removed again after a while.
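
This is expected for an operator-managed object: the GPU operator reconciles the ServiceMonitor from the ClusterPolicy, so direct edits to the child resource get overwritten. In theory the relabelings have to come in through the ClusterPolicy instead (via the Helm values above, or by editing it directly), roughly:

$ kubectl edit clusterpolicies.nvidia.com cluster-policy
# then, under spec.dcgmExporter.serviceMonitor, add e.g.:
#   relabelings:
#     - action: replace
#       sourceLabels:
#         - __meta_kubernetes_pod_node_name
#       targetLabel: instance

Of course, if the operator drops relabelings when rendering the ServiceMonitor (which is what this issue suggests), editing the ClusterPolicy will not help either.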

halohsu commented Nov 20, 2023

This is a serious bug that affects our production monitoring and alerting; the Pod IP that ends up in the instance label has no practical value for us.

halohsu commented Nov 20, 2023

At present, any change made via Lens or kubectl edit is lost again after a few minutes. A workaround is to remove the built-in ServiceMonitor:

kubectl delete -n gpu-operator servicemonitors.monitoring.coreos.com nvidia-dcgm-exporter
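
A custom ServiceMonitor with the needed relabelings can then be managed outside the operator. A minimal sketch (the selector label and port name below are assumptions, verify them against kubectl get svc -n gpu-operator nvidia-dcgm-exporter -o yaml; you may also want dcgmExporter.serviceMonitor.enabled: false so the operator does not recreate the built-in one):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter-custom         # any name not managed by the operator
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter      # assumed Service label; verify on the Service
  endpoints:
    - port: gpu-metrics              # assumed port name; verify on the Service
      interval: 15s
      relabelings:
        - action: replace
          sourceLabels:
            - __meta_kubernetes_pod_node_name
          targetLabel: instance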

derselbst commented:

I just ran into the same problem and tried a similar relabeling for the ServiceMonitor. In my case, the Helm chart failed to install when using the example config:

error validating data: [ValidationError(ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings[0]): unknown field "source_labels" in com.nvidia.v1.ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings, ValidationError(ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings[0]): unknown field "target_label" in com.nvidia.v1.ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings]

I had to change them to targetLabel and sourceLabels to make the installation work. Yet I don't see any relabeling taking effect:

        relabelings:
          - sourceLabels: [ __meta_kubernetes_pod_node_name ]
            action: replace
            targetLabel: kubernetes_node
          - sourceLabels: [ __meta_kubernetes_pod_container_name ]
            action: replace
            targetLabel: container
          - sourceLabels: [ __meta_kubernetes_namespace ]
            action: replace
            targetLabel: namespace
          - sourceLabels: [ __meta_kubernetes_pod_name ]
            action: replace
            targetLabel: pod
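
To narrow down where the relabelings get lost, it may help to compare what the ClusterPolicy schema accepts (kubectl explain should work here, since the CRD publishes the schema that produced the validation error above) with what actually lands on the generated ServiceMonitor:

$ kubectl explain clusterpolicy.spec.dcgmExporter.serviceMonitor.relabelings
$ kubectl get servicemonitors.monitoring.coreos.com -n gpu-operator nvidia-dcgm-exporter -o jsonpath='{.spec.endpoints[*].relabelings}'; echo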
