
Simple strategy with OOMKill data #257

Merged: 8 commits merged into main from oom-kill-data-usage on Apr 29, 2024
Conversation

LeaveMyYard
Contributor

From the testing I did, I decided to create a separate strategy that might replace the default one in the future. The motivation is:

  • We need to write a Prometheus query that does exactly what we need: return the maximum memory limit that was exceeded when a container was OOMKilled (a rough sketch of such a query is included below).
  • kube-state-metrics has a metric called kube_pod_container_status_terminated_reason that should capture the point in time when a container is terminated, but because an OOMKilled termination is usually very short-lived, it does not end up in that metric. We therefore need to use kube_pod_container_status_last_terminated_reason, which introduces a new edge case: a user might change the memory limit after the OOMKill (note: changing limits should also trigger a new termination, but we need to test whether there is something we are missing).

One possibility is to use Prometheus time windows, but I feel that would mostly add more bugs without necessarily fixing anything (there are also performance considerations).

From my current tests, the current OOMKill metric is doing its job. Still, I propose that we first give it some time and ask people to test how it works.
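As a rough sketch of the kind of query I mean (illustrative only, not necessarily the exact query this PR adds; the metric names, label selectors, and the 7d/5m window are assumptions based on standard cAdvisor and kube-state-metrics setups):

# Sketch only: fetch the memory limit of containers whose last termination reason
# was OOMKilled, taking the maximum value seen over the inspected window.
# Selectors and the window are placeholders, not krr's actual values.
from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://localhost:9090", disable_ssl=True)

query = """
max_over_time(
  (
    max(container_spec_memory_limit_bytes{namespace="default"}) by (pod, container)
    * on (pod, container)
    max(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) by (pod, container)
  )[7d:5m]
)
"""

# Each result is the largest memory limit that was in effect while the container
# was marked as last-terminated with reason=OOMKilled during the window.
for series in prom.custom_query(query=query):
    print(series["metric"], series["value"])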

@LeaveMyYard LeaveMyYard self-assigned this Apr 4, 2024
@aantn
Contributor

aantn commented Apr 15, 2024

Thanks, nice work finding the right metric. Two major comments so far:

  1. I'm getting the following error when running this on an OOMKilled workflow:
[12:59:15] ERROR    An unexpected error occurred                     runner.py:336
Traceback (most recent call last):
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/runner.py", line 329, in run
    result = await self._collect_result()
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/runner.py", line 283, in _collect_result
    scans = await asyncio.gather(*scans_tasks)
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/runner.py", line 243, in _gather_object_allocations
    recommendation = await self._calculate_object_recommendations(k8s_object)
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/runner.py", line 188, in _calculate_object_recommendations
    metrics = await prometheus_loader.gather_data(
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/loader.py", line 113, in gather_data
    return {
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/loader.py", line 114, in <dictcomp>
    MetricLoader.__name__: await self.loader.gather_data(object, MetricLoader, period, step)
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py", line 192, in gather_data
    data = await metric_loader.load_data(object, period, step)
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/metrics/base.py", line 182, in load_data
    result = await self.query_prometheus(
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/metrics/base.py", line 148, in query_prometheus
    return await loop.run_in_executor(self.executor, lambda: self._query_prometheus_sync(data))
  File "/opt/homebrew/Cellar/[email protected]/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/metrics/base.py", line 148, in <lambda>
    return await loop.run_in_executor(self.executor, lambda: self._query_prometheus_sync(data))
  File "/Users/natanyellin/Projects/krr/robusta_krr/core/integrations/prometheus/metrics/base.py", line 129, in _query_prometheus_sync
    response = self.prometheus.safe_custom_query(query=data.query)
  File "/opt/homebrew/lib/python3.9/site-packages/prometrix/connect/custom_connect.py", line 89, in safe_custom_query
    raise PrometheusApiClientException(
prometheus_api_client.exceptions.PrometheusApiClientException: HTTP Status Code 422
(b'{"status":"error","errorType":"execution","error":"found duplicate series for the match group
{container=\\"container-exclude\\", pod=\\"supabase-b78d6fdbc-qtnd5\\"} on the right hand-side of the operation:
[{__name__=\\"kube_pod_container_status_last_terminated_reason\\", container=\\"container-exclude\\", endpoint=\\"http\\",
instance=\\"10.44.3.5:8080\\", job=\\"kube-state-metrics\\", namespace=\\"default\\", pod=\\"supabase-b78d6fdbc-qtnd5\\",
reason=\\"OOMKilled\\", service=\\"robusta-kube-state-metrics\\", uid=\\"75d75a1e-b8ef-4eab-9c35-1eb3ceb73441\\"},
{__name__=\\"kube_pod_container_status_last_terminated_reason\\", container=\\"container-exclude\\", endpoint=\\"http\\",
instance=\\"10.44.2.25:8080\\", job=\\"kube-state-metrics\\", namespace=\\"default\\", pod=\\"supabase-b78d6fdbc-qtnd5\\",
reason=\\"OOMKilled\\", service=\\"robusta-kube-state-metrics\\", uid=\\"75d75a1e-b8ef-4eab-9c35-1eb3ceb73441\\"}];
many-to-many matching not allowed: matching labels must be unique on one side"}')
  2. I am OK with the decision to disable this by default, but let's put it behind a CLI flag inside the existing simple strategy instead of cloning it into a new strategy (a minimal sketch follows this list). This lets us accomplish the same goal of rolling the feature out gradually without duplicating the whole strategy.
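A minimal illustration of what I have in mind, purely as a sketch (the class and field names below are placeholders, not krr's actual settings classes or flag names):

# Hypothetical sketch only: a boolean setting on the existing simple strategy's
# settings model gates the OOMKill-based logic, so the behaviour stays opt-in.
import pydantic


class SimpleStrategySettings(pydantic.BaseModel):
    # Placeholder name; the real flag would live alongside the strategy's other settings.
    use_oomkill_data: bool = pydantic.Field(
        default=False,
        description="Whether to also take OOMKill data into account for memory recommendations.",
    )


settings = SimpleStrategySettings(use_oomkill_data=True)
if settings.use_oomkill_data:
    print("OOMKill-aware memory recommendations enabled")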

I will complete a more detailed review once we fix those two items.
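On the 422 above: Prometheus is refusing a many-to-many vector match because two kube-state-metrics instances export the same kube_pod_container_status_last_terminated_reason series for that pod. The usual way to make such a join unambiguous is to aggregate the duplicated side before matching; here is a hedged sketch of that pattern (not necessarily the fix this PR ends up using, and the queries are illustrative placeholders rather than krr's actual ones):

# Sketch of the usual fix for "many-to-many matching not allowed": aggregate the
# duplicated side so each (pod, container) pair appears only once before the join.

# Problematic shape: two kube-state-metrics replicas can export the same series
# (same pod/container/uid, different instance label), so the right-hand side is
# ambiguous for the match group {container, pod}.
ambiguous = """
container_spec_memory_limit_bytes
* on (pod, container) group_left ()
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
"""

# Deduplicated shape: max() by (pod, container) collapses the replicas into a
# single series per match group, so the join becomes one-to-one.
deduplicated = """
max(container_spec_memory_limit_bytes) by (pod, container)
* on (pod, container)
max(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) by (pod, container)
"""

print(deduplicated)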

@aantn
Contributor

aantn commented Apr 18, 2024

I'm still getting the above error, even on the latest commit. Let me know when it is fixed and I will test again.

@LeaveMyYard LeaveMyYard requested a review from aantn April 22, 2024 12:32
Contributor

@aantn aantn left a comment


Extremely nice work. Left minor comments.

robusta_krr/strategies/simple.py: 5 review comment threads (4 outdated, all resolved)
@aantn
Contributor

aantn commented Apr 29, 2024 via email

@LeaveMyYard LeaveMyYard merged commit dd13dee into main Apr 29, 2024
2 checks passed
@LeaveMyYard LeaveMyYard deleted the oom-kill-data-usage branch April 29, 2024 14:08