containerOOMEventsDelta not capturing OOMKill on container exit
#858
Labels
bug
Categorizes issue or PR as related to a bug.
triage/accepted
Indicates an issue or PR is ready to be actively worked on.
Description
We're trying to create dashboards and alerts that capture transient states of Kubernetes containers. In particular, we're interested in tracking the Error and OOMKilled termination states. AFAICT the New Relic integration is not always able to capture OOMKills correctly when the container restarts (compared to kube_pod_container_status_last_terminated_reason). By the time it scrapes the Kubelet, the container has already been restarted. Even though the status changed to Terminated and the reason to OOMKilled at some point between scrapes, that is no longer the current state, so it never gets reported.
My hope with the new containerOOMEventsDelta attribute was that the NRI integration would be able to capture those states and return the number of times containers had been OOM killed between scrapes. What I'm seeing is that the following occurs:
- If the NRI integration catches the container in the Terminated state, it produces a K8sContainerSample with state = 'Terminated' and reason = 'OOMKilled'. If the NRI integration does not catch the container in the Terminated state, that information is lost.
- containerOOMEventsDelta remains at 0.
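To illustrate why a poller that only reads the current state can miss the kill, here is a minimal simulation. The timeline, the 15-second interval, and the helper names (state_at, scrape) are all hypothetical and chosen for illustration; this is not the integration's code.

```python
# Hypothetical timeline of one container: (second, state, reason).
# Entries must be sorted by time.
TIMELINE = [
    (0, "Running", None),
    (12, "Terminated", "OOMKilled"),  # main process is OOM killed
    (13, "Running", None),            # kubelet restarts the container
]


def state_at(timeline, t):
    """Return the (state, reason) pair in effect at time t."""
    current = (timeline[0][1], timeline[0][2])
    for ts, state, reason in timeline:
        if ts <= t:
            current = (state, reason)
    return current


def scrape(timeline, interval, until):
    """Sample only the current state every `interval` seconds."""
    return [state_at(timeline, t) for t in range(0, until + 1, interval)]


# Assumed 15-second scrape interval: samples are taken at t = 0, 15, 30.
samples = scrape(TIMELINE, interval=15, until=30)
observed_oom = sum(1 for _, reason in samples if reason == "OOMKilled")
print(samples)
print(observed_oom)
```

With these assumed numbers the Terminated state lasts one second and falls between two scrapes, so no sample ever carries reason = 'OOMKilled'.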
I should mention that containerOOMEventsDelta is working as expected when it's a child process that gets killed, not the main container process. This is a great addition, and something we'd been waiting for (as mentioned in https://www.netice9.com/blog/guide-to-oomkill-alerting-in-kubernetes-clusters, OOM kills in child processes can sometimes go unnoticed). I just hoped that containerOOMEventsDelta would also include kills on the main container.
Expected Behavior
- If the NRI integration catches the container in the Terminated state, it produces a K8sContainerSample with state = 'Terminated' and reason = 'OOMKilled'. If the NRI integration does not catch the container in the Terminated state, that information is lost.
- containerOOMEventsDelta is reported as 1.
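One possible way to get the expected value, sketched under assumptions: compare restartCount between two scrapes and read lastState.terminated.reason, which the Kubernetes ContainerStatus keeps after a restart. The function name oom_kills_between and the sample dicts are hypothetical, and this is not a description of how nri-kubernetes computes containerOOMEventsDelta.

```python
def oom_kills_between(prev, curr):
    """Lower-bound count of main-container OOM kills between two scrapes.

    `prev` and `curr` are dicts shaped like Kubernetes ContainerStatus
    objects. Only the most recent termination is visible in lastState, so
    several restarts between scrapes are still counted as at most 1.
    """
    restarted = curr.get("restartCount", 0) > prev.get("restartCount", 0)
    last = (curr.get("lastState") or {}).get("terminated") or {}
    return 1 if restarted and last.get("reason") == "OOMKilled" else 0


# Hypothetical statuses from two consecutive scrapes.
before = {"restartCount": 0, "state": {"running": {}}, "lastState": {}}
after = {
    "restartCount": 1,
    "state": {"running": {}},  # already restarted when scraped
    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
}

print(oom_kills_between(before, after))
```

The container is Running in both scrapes here, yet the kill is still counted because the check does not depend on catching the Terminated state.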
Troubleshooting or NR Diag results
Provide any other relevant log data.
TIP: Scrub logs and diagnostic information for sensitive information
Steps to Reproduce
Your Environment
Kubernetes 1.24
nri-kubernetes v3.15.1
Additional context
Add any other context about the problem here. For example, relevant community posts or support tickets.
For Maintainers Only or Hero Triaging this bug
Suggested Priority (P1,P2,P3,P4,P5):
Suggested T-Shirt size (S, M, L, XL, Unknown):