
Simple strategy with OOMKill data #257

Merged: 8 commits merged into main from oom-kill-data-usage on Apr 29, 2024
Conversation

LeaveMyYard
Contributor

From the testing I did, I decided to create a separate strategy that might replace the default one in the future. The motivation is:

  • We need to write a Prometheus query that does exactly what we need: return the maximum memory limit that was exceeded when a container was OOMKilled (a rough sketch of such a query is included below).
  • kube-state-metrics has a metric called kube_pod_container_status_terminated_reason that should capture the point in time when a container is terminated, but because an OOMKilled termination is usually very short-lived, it does not end up in that metric. We therefore need to use kube_pod_container_status_last_terminated_reason, which introduces a new edge case: a user might change the memory limit after the OOMKill (note: changing limits should also trigger a new termination, but we need to test whether there is something we are missing).

One possibility is to use Prometheus time windows, but I feel that would mostly add more bugs without necessarily fixing anything (there are also performance considerations).

From my current tests, the current OOMKill metric is doing its job. Still, I propose that we first give it some time and ask people to test how it works.
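As a rough sketch of the kind of query I mean (illustrative only, not necessarily the exact query this PR adds; the metric names, label selectors, and the 7d/5m window are assumptions based on standard cAdvisor and kube-state-metrics setups):

# Sketch only: fetch the memory limit of containers whose last termination reason
# was OOMKilled, taking the maximum value seen over the inspected window.
# Selectors and the window are placeholders, not krr's actual values.
from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://localhost:9090", disable_ssl=True)

query = """
max_over_time(
  (
    max(container_spec_memory_limit_bytes{namespace="default"}) by (pod, container)
    * on (pod, container)
    max(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) by (pod, container)
  )[7d:5m]
)
"""

# Each result is the largest memory limit that was in effect while the container
# was marked as last-terminated with reason=OOMKilled during the window.
for series in prom.custom_query(query=query):
    print(series["metric"], series["value"])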

@LeaveMyYard LeaveMyYard self-assigned this Apr 4, 2024
@aantn
Contributor

aantn commented Apr 15, 2024

Thanks, nice work finding the right metric. Two major comments so far:

  1. I'm getting the following error when running this on an OOMKilled workflow:
[12:59:15] ERROR    An unexpected error occurred                     runner.py:336
Traceback (most recent call last):
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/runner.py", line 329, in run
    result = await self._collect_result()
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/runner.py", line 283, in _collect_result
    scans = await asyncio.gather(*scans_tasks)
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/runner.py", line 243, in _gather_object_allocations
    recommendation = await self._calculate_object_recommendations(k8s_object)
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/runner.py", line 188, in _calculate_object_recommendations
    metrics = await prometheus_loader.gather_data(
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/loader.py", line 113, in gather_data
    return {
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/loader.py", line 114, in <dictcomp>
    MetricLoader.__name__: await self.loader.gather_data(object, MetricLoader, period, step)
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py", line 192, in gather_data
    data = await metric_loader.load_data(object, period, step)
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/metrics/base.py", line 182, in load_data
    result = await self.query_prometheus(
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/metrics/base.py", line 148, in query_prometheus
    return await loop.run_in_executor(self.executor, lambda: self._query_prometheus_sync(data))
  File "/opt/homebrew/Cellar/[email protected]/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/metrics/base.py", line 148, in <lambda>
    return await loop.run_in_executor(self.executor, lambda: self._query_prometheus_sync(data))
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/metrics/base.py", line 129, in _query_prometheus_sync
    response = self.prometheus.safe_custom_query(query=data.query)
  File "/opt/homebrew/lib/python3.9/site-packages/prometrix/connect/custom_connect.py", line 89, in safe_custom_query
    raise PrometheusApiClientException(
prometheus_api_client.exceptions.PrometheusApiClientException: HTTP Status Code 422
(b'{"status":"error","errorType":"execution","error":"found duplicate series for the match group
{container=\\"container-exclude\\", pod=\\"supabase-b78d6fdbc-qtnd5\\"} on the right hand-side of the operation:
[{__name__=\\"kube_pod_container_status_last_terminated_reason\\", container=\\"container-exclude\\", endpoint=\\"http\\",
instance=\\"10.44.3.5:8080\\", job=\\"kube-state-metrics\\", namespace=\\"default\\", pod=\\"supabase-b78d6fdbc-qtnd5\\",
reason=\\"OOMKilled\\", service=\\"robusta-kube-state-metrics\\", uid=\\"75d75a1e-b8ef-4eab-9c35-1eb3ceb73441\\"},
{__name__=\\"kube_pod_container_status_last_terminated_reason\\", container=\\"container-exclude\\", endpoint=\\"http\\",
instance=\\"10.44.2.25:8080\\", job=\\"kube-state-metrics\\", namespace=\\"default\\", pod=\\"supabase-b78d6fdbc-qtnd5\\",
reason=\\"OOMKilled\\", service=\\"robusta-kube-state-metrics\\", uid=\\"75d75a1e-b8ef-4eab-9c35-1eb3ceb73441\\"}];
many-to-many matching not allowed: matching labels must be unique on one side"}')
  2. I am OK with the decision to disable this by default, but let's put it behind a CLI flag inside the existing simple strategy instead of cloning it into a new strategy (a minimal sketch follows this list). This lets us accomplish the same goal of rolling the feature out gradually without duplicating the whole strategy.
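A minimal illustration of what I have in mind, purely as a sketch (the class and field names below are placeholders, not krr's actual settings classes or flag names):

# Hypothetical sketch only: a boolean setting on the existing simple strategy's
# settings model gates the OOMKill-based logic, so the behaviour stays opt-in.
import pydantic


class SimpleStrategySettings(pydantic.BaseModel):
    # Placeholder name; the real flag would live alongside the strategy's other settings.
    use_oomkill_data: bool = pydantic.Field(
        default=False,
        description="Whether to also take OOMKill data into account for memory recommendations.",
    )


settings = SimpleStrategySettings(use_oomkill_data=True)
if settings.use_oomkill_data:
    print("OOMKill-aware memory recommendations enabled")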

I will complete a more detailed review once we fix those two items.
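On the 422 above: Prometheus is refusing a many-to-many vector match because two kube-state-metrics instances export the same kube_pod_container_status_last_terminated_reason series for that pod. The usual way to make such a join unambiguous is to aggregate the duplicated side before matching; here is a hedged sketch of that pattern (not necessarily the fix this PR ends up using, and the queries are illustrative placeholders rather than krr's actual ones):

# Sketch of the usual fix for "many-to-many matching not allowed": aggregate the
# duplicated side so each (pod, container) pair appears only once before the join.

# Problematic shape: two kube-state-metrics replicas can export the same series
# (same pod/container/uid, different instance label), so the right-hand side is
# ambiguous for the match group {container, pod}.
ambiguous = """
container_spec_memory_limit_bytes
* on (pod, container) group_left ()
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
"""

# Deduplicated shape: max() by (pod, container) collapses the replicas into a
# single series per match group, so the join becomes one-to-one.
deduplicated = """
max(container_spec_memory_limit_bytes) by (pod, container)
* on (pod, container)
max(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) by (pod, container)
"""

print(deduplicated)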

@aantn
Contributor

aantn commented Apr 18, 2024

I'm still getting the above error, even on the latest commit. Let me know when it is fixed and I will test again.

@LeaveMyYard LeaveMyYard requested a review from aantn April 22, 2024 12:32
Contributor

@aantn aantn left a comment


Extremely nice work. Left minor comments.

robusta_krr/strategies/simple.py: 5 review comment threads (4 outdated, all resolved)
@aantn
Contributor

aantn commented Apr 29, 2024 via email

@LeaveMyYard LeaveMyYard merged commit dd13dee into main Apr 29, 2024
2 checks passed
@LeaveMyYard LeaveMyYard deleted the oom-kill-data-usage branch April 29, 2024 14:08