
Cruise Control metrics reporter failing on k8s cluster with cgroup v2 #1041

Open
robinvanderstraeten-klarrio opened this issue Aug 16, 2023 · 3 comments
Labels: community, help wanted, triaged

Comments

@robinvanderstraeten-klarrio

Description

Cruise Control currently does not support running on a cluster with cgroup v2 when the configuration cruise.control.metrics.reporter.kubernetes.mode is set to true (see linkedin/cruise-control#1873).
Koperator always sets this to true (https://github.com/banzaicloud/koperator/blob/v0.25.1/pkg/resources/kafka/configmap.go#L105) and, as far as I know, there is currently no way to override this configuration.
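
For reference, the linked configmap.go line effectively hard-codes the setting into every generated broker configuration. The following is only a conceptual Go sketch of that effect; the helper name is hypothetical, and the real code uses Koperator's own config-building utilities rather than a plain map:

```go
package main

import "fmt"

// cruiseControlReporterConfig is a hypothetical helper illustrating the
// hard-coded setting; the actual logic lives in
// pkg/resources/kafka/configmap.go (v0.25.1, line 105).
func cruiseControlReporterConfig() map[string]string {
	return map[string]string{
		// Always "true" today; the KafkaCluster CRD exposes no field to
		// override or drop this entry.
		"cruise.control.metrics.reporter.kubernetes.mode": "true",
	}
}

func main() {
	for k, v := range cruiseControlReporterConfig() {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```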

Expected Behavior

The Cruise Control metrics collector should collect and publish metrics about the Kafka brokers.

Actual Behavior

The Cruise Control metrics collector crashes. The following appears once per minute in the logs of every broker:

[2023-08-16 14:28:31,040] WARN Failed reporting CPU util. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
java.io.FileNotFoundException: /sys/fs/cgroup/cpu/cpu.cfs_quota_us (No such file or directory)
        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:112)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.ContainerMetricUtils.readFile(ContainerMetricUtils.java:62)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.ContainerMetricUtils.getCpuQuota(ContainerMetricUtils.java:42)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.ContainerMetricUtils.getContainerProcessCpuLoad(ContainerMetricUtils.java:92)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.MetricsUtils.getCpuMetric(MetricsUtils.java:409)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.reportCpuUtils(CruiseControlMetricsReporter.java:449)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.run(CruiseControlMetricsReporter.java:367)
        at java.base/java.lang.Thread.run(Thread.java:829)

This also has a side effect: Cruise Control doesn't seem to be able to cope with the missing metrics. Its memory usage grows until it is eventually OOM-killed.
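
For context on the root cause: /sys/fs/cgroup/cpu/cpu.cfs_quota_us only exists on the cgroup v1 hierarchy. Under cgroup v2 the quota and period are published together in a single /sys/fs/cgroup/cpu.max file, so the reporter's lookup fails with the FileNotFoundException above. A minimal Go sketch, purely illustrative (this is neither Koperator nor Cruise Control code), that reads the CPU quota from either hierarchy:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readCPUQuotaCores returns the container CPU quota in cores, or -1 when
// no limit is set. It tries the cgroup v2 unified file first and falls back
// to the cgroup v1 files that the Cruise Control reporter currently reads.
func readCPUQuotaCores() (float64, error) {
	// cgroup v2: /sys/fs/cgroup/cpu.max contains "<quota> <period>",
	// where quota is the literal string "max" when unlimited.
	if data, err := os.ReadFile("/sys/fs/cgroup/cpu.max"); err == nil {
		fields := strings.Fields(string(data))
		if len(fields) == 2 {
			if fields[0] == "max" {
				return -1, nil
			}
			quota, qErr := strconv.ParseFloat(fields[0], 64)
			period, pErr := strconv.ParseFloat(fields[1], 64)
			if qErr != nil || pErr != nil {
				return 0, fmt.Errorf("parsing cpu.max: %v %v", qErr, pErr)
			}
			return quota / period, nil
		}
	}

	// cgroup v1: separate cpu.cfs_quota_us and cpu.cfs_period_us files;
	// a quota of -1 means unlimited.
	quotaData, err := os.ReadFile("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	if err != nil {
		return 0, err
	}
	periodData, err := os.ReadFile("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if err != nil {
		return 0, err
	}
	quota, err := strconv.ParseFloat(strings.TrimSpace(string(quotaData)), 64)
	if err != nil {
		return 0, err
	}
	if quota < 0 {
		return -1, nil
	}
	period, err := strconv.ParseFloat(strings.TrimSpace(string(periodData)), 64)
	if err != nil {
		return 0, err
	}
	return quota / period, nil
}

func main() {
	cores, err := readCPUQuotaCores()
	if err != nil {
		fmt.Println("failed to read CPU quota:", err)
		return
	}
	fmt.Println("CPU quota in cores (-1 = unlimited):", cores)
}
```

An equivalent fallback would be needed on the Java side in Cruise Control's ContainerMetricUtils, which is roughly what the upstream issue (linkedin/cruise-control#1873) asks for.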

Affected Version

Seen on version 0.24.1, though this will be a problem on any version where cruise.control.metrics.reporter.kubernetes.mode gets set to true.

Steps to Reproduce

  1. Deploy a Kubernetes cluster with nodes that have cgroup v2 (a quick check for this is sketched after this list).
  2. Deploy koperator.
  3. Deploy a basic KafkaCluster. Any configuration that also causes Cruise Control to be deployed should work.
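
A quick way to confirm which cgroup version a node (or the broker container) is running is to check for the unified-hierarchy marker file; a minimal sketch:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// The cgroup v2 (unified) hierarchy exposes cgroup.controllers at its
	// root; the file does not exist on a cgroup v1 mount.
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		fmt.Println("cgroup v2 (unified hierarchy)")
	} else {
		fmt.Println("cgroup v1 (legacy hierarchy)")
	}
}
```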


@panyuenlau (Member)

Thanks for reporting this, @robinvanderstraeten-klarrio! We've seen this behavior internally but didn't get the chance to create a dedicated GitHub issue.

@robinvanderstraeten-klarrio (Author)

Reading through the Cruise Control issue, it seems that simply removing the cruise.control.metrics.reporter.kubernetes.mode setting would fix this, but I'm not too knowledgeable about Cruise Control in general or about the impact this would have on a production deployment.
If this would be a good solution, I'd be happy to contribute it.

@panyuenlau (Member)

I don't think we should remove the cruise.control.metrics.reporter.kubernetes.mode configuration; it was added to resolve a CPU utilization reporting issue, see #463.

Perhaps the best way is to wait for upstream Cruise Control to fix its issue with cgroup v2 so we can adapt in Koperator.

@panyuenlau added the triaged label and removed the need-to-be-triaged label on Aug 17, 2023.