bug: The configuration of relabeling still does not take effect! #605
Comments
Help!!!
@bluemiaomiao Can you share the yaml manifest of the rendered
```
➜ ~ kubectl get daemonsets -n gpu-operator nvidia-dcgm-exporter -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "3"
    nvidia.com/last-applied-hash: "80895247"
    openshift.io/scc: nvidia-dcgm-exporter
  creationTimestamp: "2023-09-09T08:40:45Z"
  generation: 3
  labels:
    app: nvidia-dcgm-exporter
    app.kubernetes.io/managed-by: gpu-operator
    helm.sh/chart: gpu-operator-v23.9.0
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: cluster-policy
    uid: f1b90f5e-45ba-4270-a048-ab210729fa91
  resourceVersion: "237868434"
  uid: 475fafee-3427-4b7b-8488-042ea3ef82df
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nvidia-dcgm-exporter
        app.kubernetes.io/managed-by: gpu-operator
        helm.sh/chart: gpu-operator-v23.9.0
    spec:
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: /etc/dcgm-exporter/dcp-metrics-included.csv
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.6-3.1.9-ubuntu20.04
        imagePullPolicy: IfNotPresent
        name: nvidia-dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for
          nvidia container stack to be setup; sleep 5; done
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/nvidia
          mountPropagation: HostToContainer
          name: run-nvidia
      nodeSelector:
        nvidia.com/gpu.deploy.dcgm-exporter: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      runtimeClassName: nvidia
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: nvidia-dcgm-exporter
      serviceAccountName: nvidia-dcgm-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /run/nvidia
          type: ""
        name: run-nvidia
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 26
  desiredNumberScheduled: 26
  numberAvailable: 26
  numberMisscheduled: 0
  numberReady: 26
  observedGeneration: 3
  updatedNumberScheduled: 26
```
When I manually add the configuration through
This is a serious bug that has affected our production monitoring and alerting; getting only the Pod's IP has no practical value.
At present, even after editing with Lens or kubectl edit, the configuration is lost again after a few minutes. A workaround is to delete the built-in ServiceMonitor: kubectl delete -n gpu-operator servicemonitors.monitoring.coreos.com nvidia-dcgm-exporter
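If you take that route, one option is to manage your own ServiceMonitor outside the operator's control. Below is a minimal sketch, assuming the operator-created Service keeps the `app: nvidia-dcgm-exporter` label and exposes a port named `metrics` as in the DaemonSet above; both of those, plus whatever labels your Prometheus instance needs for discovery, should be verified against your cluster.

```yaml
# Hypothetical replacement ServiceMonitor (not rendered by the operator).
# Assumes the operator-created Service keeps the app: nvidia-dcgm-exporter
# label and names its metrics port "metrics"; check with
#   kubectl get svc -n gpu-operator nvidia-dcgm-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter-custom
  namespace: gpu-operator
  labels:
    release: kube-prometheus-stack   # whatever label your Prometheus selects on
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: metrics          # containerPort 9400 in the DaemonSet above
    interval: 15s
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      action: replace
      targetLabel: kubernetes_node
```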
I just ran into the same problem. I tried using a similar relabeling for the ServiceMonitor, but in my case the Helm chart failed to install when using the example config. I had to change it to:

```yaml
relabelings:
- sourceLabels: [ __meta_kubernetes_pod_node_name ]
  action: replace
  targetLabel: kubernetes_node
- sourceLabels: [ __meta_kubernetes_pod_container_name ]
  action: replace
  targetLabel: container
- sourceLabels: [ __meta_kubernetes_namespace ]
  action: replace
  targetLabel: namespace
- sourceLabels: [ __meta_kubernetes_pod_name ]
  action: replace
  targetLabel: pod
```
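For anyone wiring this through Helm rather than a standalone ServiceMonitor, here is a sketch of where such a list might sit in a custom values.yaml. The `dcgmExporter.serviceMonitor` keys below are assumptions based on recent gpu-operator charts, so verify them against the values.yaml of the chart version you actually install.

```yaml
# Sketch only: key names under dcgmExporter.serviceMonitor are assumed from
# recent gpu-operator charts and may differ in your chart version.
dcgmExporter:
  serviceMonitor:
    enabled: true
    interval: 15s
    relabelings:
    - sourceLabels: [ __meta_kubernetes_pod_node_name ]
      action: replace
      targetLabel: kubernetes_node
```

Applied with something like `helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator -f values.yaml`.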
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
Relabeling is supported in the values.yaml file in the official repository:
Also: I installed the latest version of nvidia/gpu-operator using Helm and customized the values.yaml file:
My Helm releases:
But the configuration of relabeling still does not take effect!
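One quick way to see whether the operator rendered the relabelings at all is to dump the ServiceMonitor it owns and look under spec.endpoints (assuming the default object names shown in this thread):

```sh
kubectl get servicemonitor -n gpu-operator nvidia-dcgm-exporter -o yaml
```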
Others:
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
None.
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
nvidia-smi
from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]