
[newrelic-logging] Default resource limits cause out of memory errors #1500

Open
hero-david opened this issue Oct 8, 2024 · 3 comments

@hero-david

Description

An issue about this was opened before, and the reporter was instructed to make sure they had upgraded their chart so that the memory limit config on the Fluent Bit input was present.

We have been struggling with OOM errors and restarts on our pods despite having this config present and increasing the pod's memory allowance. We have about 50 pods per node.
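
For context, the input-level memory limit referred to above is Fluent Bit's Mem_Buf_Limit option on the tail input. A minimal sketch of overriding it through the chart values is below; the fluentBit.config.inputs key path and the parameter values are assumptions on my part, so verify them against the values.yaml of the chart version in use:

newrelic-logging:
  fluentBit:
    config:
      inputs: |
        # Hypothetical tail input override; parameter values are illustrative only
        [INPUT]
            Name              tail
            Tag               kube.*
            Path              /var/log/containers/*.log
            multiline.parser  cri
            DB                /var/log/flb_kube.db
            Mem_Buf_Limit     7MB
            Skip_Long_Lines   On
            Refresh_Interval  10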

(Screenshots attached: "fluentbit oom", "oom")

The helm config provided for this was:

newrelic-logging:
  enabled: true
  fluentBit:
    criEnabled: true
  lowDataMode: false
  resources:
    limits:
      memory: 256Mi
  tolerations:
  - effect: NoSchedule
    key: role
    operator: Exists
Date Message
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1360652 (flb-pipeline) total-vm:1307336kB, anon-rss:259736kB, file-rss:19648kB, shmem-rss:0kB, UID:0 pgtables:1104kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1400772 (fluent-bit) total-vm:1311176kB, anon-rss:259508kB, file-rss:19084kB, shmem-rss:0kB, UID:0 pgtables:1028kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1400790 (flb-pipeline) total-vm:1311176kB, anon-rss:259652kB, file-rss:19468kB, shmem-rss:0kB, UID:0 pgtables:1028kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1360626 (fluent-bit) total-vm:1307336kB, anon-rss:259624kB, file-rss:19264kB, shmem-rss:0kB, UID:0 pgtables:1104kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1201131 (flb-pipeline) total-vm:1483464kB, anon-rss:259504kB, file-rss:19828kB, shmem-rss:0kB, UID:0 pgtables:1324kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1201113 (fluent-bit) total-vm:1483464kB, anon-rss:259392kB, file-rss:19444kB, shmem-rss:0kB, UID:0 pgtables:1324kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1266468 (flb-pipeline) total-vm:1487560kB, anon-rss:259188kB, file-rss:19628kB, shmem-rss:0kB, UID:0 pgtables:1344kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1324063 (fluent-bit) total-vm:1487560kB, anon-rss:259368kB, file-rss:19368kB, shmem-rss:0kB, UID:0 pgtables:1348kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1324081 (flb-pipeline) total-vm:1487560kB, anon-rss:259476kB, file-rss:19752kB, shmem-rss:0kB, UID:0 pgtables:1348kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1266420 (fluent-bit) total-vm:1487560kB, anon-rss:259084kB, file-rss:19244kB, shmem-rss:0kB, UID:0 pgtables:1344kB oom_score_adj:996
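
For reference, below is a sketch of the same values with a larger memory limit and an explicit request; the 512Mi and 128Mi figures are illustrative assumptions, not tested recommendations:

newrelic-logging:
  enabled: true
  fluentBit:
    criEnabled: true
  lowDataMode: false
  resources:
    requests:
      memory: 128Mi   # illustrative request so the scheduler reserves headroom
    limits:
      memory: 512Mi   # illustrative bump over the 256Mi limit being hit above
  tolerations:
  - effect: NoSchedule
    key: role
    operator: Exists

Setting a request alongside the limit makes the scheduler account for the memory, though it does not by itself stop Fluent Bit from growing past the limit.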

Versions

Helm v3.14.4
Kubernetes (AKS) 1.29.2
Chart: nri-bundle-5.0.81
FluentBit: newrelic/newrelic-fluentbit-output:2.0.0

What happened?

The Fluent Bit pods were repeatedly killed for using more memory than their limit, which is set very low. Their CPU was never highly utilised, which does not suggest that the memory increase was due to throttling or to being unable to keep up.

What you expected to happen?

The Fluent Bit pods should have few or no restarts, and they should never reach 1.5 GB of memory used per container.

How to reproduce it?

Using the same versions listed above and the same Helm values.yaml, deploy an AKS cluster with 50 production workloads per node (2 vCPU, 8 GB) and observe whether there are memory issues.

@hero-david added the bug and triage/pending labels on Oct 8, 2024
@JS-Jake commented Nov 7, 2024

@hero-david Did you have any luck resolving this? We're seeing the same problem with AKS

@hero-david (Author)

@hero-david Did you have any luck resolving this? We're seeing the same problem with AKS

No, we have simply upped our VM SKU to 16 GB (required for some of our workloads going forward anyway).
