containerOOMEventsDelta not capturing OOMKill on container exit #858

Open
danielgblanco opened this issue Aug 23, 2023 · 1 comment
Labels
bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

danielgblanco commented Aug 23, 2023

Description

We're trying to create dashboards and alerts that capture transient states of Kubernetes containers. In particular, we're interested in tracking Error and OOMKilled termination states. AFAICT the New Relic integration is not always able to capture OOMKills correctly when the container restarts (compared to kube_pod_container_status_last_terminated_reason). By the time it scrapes the Kubelet, the container has already been restarted. Even though the status changed to Terminated and the reason to OOMKilled at some point between scrapes, that is no longer the current state, so it never gets reported.
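To illustrate the timing problem described above, here is a minimal simulation of a scraper that only samples the current container state. The timeline, the 15-second interval, and the function names are all hypothetical and chosen for illustration; this is not how nri-kubernetes is implemented.

```python
from bisect import bisect_right

# Hypothetical timeline of (start_second, state) transitions for one container.
TIMELINE = [
    (0, "Running"),
    (42, "Terminated:OOMKilled"),  # main process is OOM killed
    (44, "Running"),               # the container is restarted
]


def state_at(timeline, t):
    """Return the state in effect at second t."""
    starts = [start for start, _ in timeline]
    idx = bisect_right(starts, t) - 1
    return timeline[idx][1]


def scrape(timeline, interval, until):
    """Sample only the current state every `interval` seconds (assumed interval)."""
    return [(t, state_at(timeline, t)) for t in range(0, until + 1, interval)]


# Scrapes happen at 0, 15, 30, 45 and 60 seconds.
samples = scrape(TIMELINE, interval=15, until=60)
observed = {state for _, state in samples}
```

In this sketch the Terminated state lasts from second 42 to second 44, which falls between the scrapes at 30 and 45, so `observed` only ever contains `"Running"`.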

My hope with the new containerOOMEventsDelta attribute was that the NRI integration would be able to capture those states and return the number of times containers had been OOM killed between scrapes. Instead, I'm seeing the following:

  1. Main container process is OOM Killed
  2. If the NRI integration manages to scrape the Kubelet when the container is in Terminated state, it produces a K8sContainerSample with state = 'Terminated' and reason = 'OOMKilled'. If the NRI integration does not catch the container in Terminated state, that information is lost.
  3. containerOOMEventsDelta remains at 0

I should mention that containerOOMEventsDelta works as expected when the killed process is a child process rather than the main container process. This is a great addition, and something we'd been waiting for (as mentioned in https://www.netice9.com/blog/guide-to-oomkill-alerting-in-kubernetes-clusters, OOM kills in child processes can sometimes go unnoticed). I just hoped that containerOOMEventsDelta would also include kills of the main container process.

Expected Behavior

  1. Main container process is OOM Killed
  2. If the NRI integration manages to scrape the Kubelet when the container is in Terminated state, it produces a K8sContainerSample with state = 'Terminated' and reason = 'OOMKilled'. If the NRI integration does not catch the container in Terminated state, that information is lost.
  3. containerOOMEventsDelta is reported as 1
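One way the expected delta could be derived without catching the Terminated state is to compare two consecutive container statuses. The sketch below assumes statuses shaped like the Kubernetes `ContainerStatus` object (`restartCount` and `lastState.terminated.reason`); the function names are hypothetical, and this is not a description of how containerOOMEventsDelta is actually computed.

```python
from typing import Optional


def last_terminated_reason(status: dict) -> Optional[str]:
    """Read lastState.terminated.reason, which survives the restart."""
    terminated = (status.get("lastState") or {}).get("terminated") or {}
    return terminated.get("reason")


def oom_kills_between(prev: dict, curr: dict) -> int:
    """Approximate the number of main-process OOM kills between two scrapes.

    Only the most recent termination reason is visible, so every restart in
    the window is attributed to that reason (an approximation).
    """
    restarts = curr.get("restartCount", 0) - prev.get("restartCount", 0)
    if restarts <= 0:
        return 0
    if last_terminated_reason(curr) != "OOMKilled":
        return 0
    return restarts


# Both scrapes see a Running container, but the restart is still detectable.
prev_scrape = {"restartCount": 0, "state": {"running": {}}, "lastState": {}}
curr_scrape = {
    "restartCount": 1,
    "state": {"running": {}},
    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
}
delta = oom_kills_between(prev_scrape, curr_scrape)
```

With these two example statuses `delta` is 1, matching step 3 of the expected behavior. If several restarts with different reasons happen within one scrape interval, this approach would miscount.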

Troubleshooting or NR Diag results

Provide any other relevant log data.
TIP: Scrub logs and diagnostic information for sensitive information

Steps to Reproduce

  1. Saturate memory on main container
  2. Wait for OOM kill
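As a rough sketch of step 1, a script like the following could run as the container's main process to saturate memory. The function name and parameters are hypothetical, and it assumes the container has a memory limit set so that the kernel OOM killer terminates the process.

```python
def saturate_memory(chunk_mib=64, max_chunks=None):
    """Allocate memory in chunks until `max_chunks` is reached.

    With max_chunks=None the loop never ends on its own; inside a container
    with a memory limit the process is expected to be OOM killed instead.
    """
    chunk_size = chunk_mib * 1024 * 1024
    chunks = []
    while max_chunks is None or len(chunks) < max_chunks:
        # Fill with a non-zero byte so the pages are actually written to.
        chunks.append(b"x" * chunk_size)
    return chunks
```

Calling `saturate_memory()` with no cap as the container's entrypoint should reproduce the main-process OOM kill; the `max_chunks` argument exists only so the function can be tried safely outside a container.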

Your Environment

Kubernetes 1.24
nri-kubernetes v3.15.1

Additional context

Add any other context about the problem here. For example, relevant community posts or support tickets.

For Maintainers Only or Hero Triaging this bug

Suggested Priority (P1,P2,P3,P4,P5):
Suggested T-Shirt size (S, M, L, XL, Unknown):

@danielgblanco danielgblanco added the bug Categorizes issue or PR as related to a bug. label Aug 23, 2023
@davidgit davidgit added bug Categorizes issue or PR as related to a bug. and removed bug Categorizes issue or PR as related to a bug. labels Sep 5, 2023
@svetlanabrennan svetlanabrennan added triage/pending Issue or PR is pending for triage and prioritization. triage/in-progress Issue or PR is in the process of being triaged. and removed triage/pending Issue or PR is pending for triage and prioritization. labels Sep 6, 2023
@davidgit davidgit added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed triage/in-progress Issue or PR is in the process of being triaged. labels Sep 19, 2023