
Excessive memory leak due to uncontrolled accumulation of health.log entries in Podman 5.x #25473

Open
gaurangomar opened this issue Mar 5, 2025 · 15 comments · May be fixed by #25520
Labels: jira, kind/bug (Categorizes issue or PR as related to a bug)

Comments


gaurangomar commented Mar 5, 2025

Issue Description

When using healthchecks in Podman 5.x, we’ve observed that the internal health log grows continuously (into the thousands of entries) and never prunes older records. In our tests, the health.log field in the container’s inspect output eventually contains over 12,000 records and keeps growing over time. This contrasts with Podman 4.x, which typically keeps only ~5 log entries. Furthermore, running top on the host shows unusually high memory usage by the /usr/bin/podman healthcheck process over time. These symptoms suggest a memory leak tied to Podman’s healthcheck mechanism in version 5.x.

[screenshot omitted]

Steps to reproduce the issue

Steps to Reproduce:

  • Healthcheck Configuration:
    Use a healthcheck configuration identical to the one that worked in Podman 4.x. For example:
"Healthcheck": {
    "Test": [
        "CMD",
        "curl",
        "-f",
        "http://agent:8080/health"
    ],
    "Interval": 30000000000,
    "Timeout": 10000000000,
    "Retries": 5
}
  • Run the Container:
    Start a container with this configuration on Podman 5.x.

  • Monitor Health Log:
    After the container runs for a while, run podman inspect and check the State.Health.Log field (a quick way to do this is shown right after these steps). In Podman 5.x it continuously accumulates records (e.g., over 12,000 entries) rather than being capped, whereas Podman 4.x only shows about 5 entries.

  • Observe Memory Usage:
    Use monitoring tools (e.g., top) to observe the memory usage. There is a significant and continuous increase in memory consumption, particularly in kernel memory (kmalloc-2k and kmalloc-4k slabs).

The high memory usage of the /usr/bin/podman healthcheck process in top shows up only intermittently; we are running 8 containers.
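
For reference, a quick way to watch the log growth described in the "Monitor Health Log" step (a minimal sketch; it assumes the container is named agent and that jq is installed, neither of which is given above):

    # Count health log entries via podman inspect (Go template); this number grows unbounded on affected versions
    podman inspect agent --format '{{len .State.Health.Log}}'

    # Or read the same field over the libpod socket that was used to create the container
    curl -s --unix-socket /run/user/$UID/podman/podman.sock \
        http://d/v4.1.1/libpod/containers/agent/json | jq '.State.Health.Log | length'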

Describe the results you received

When using healthchecks in Podman 5.x, we’ve observed that the internal health log continuously grows instead of being capped at a few entries (as seen in Podman 4.x). In our tests, the health.log field in the container’s inspect output eventually contains over 12,000 records compared to the expected ~5 entries in version 4.x. This uncontrolled log growth correlates with a continuous increase in memory usage.

Describe the results you expected

Memory usage should not keep increasing, and the health log should be capped at a limited number of entries.

podman info output

host:
  arch: amd64
  buildahVersion: 1.37.5
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.12-1.el9.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: b3f4044f63d830049366c05304a1d5d558571e85'
  cpuUtilization:
    idlePercent: 76.81
    systemPercent: 6.73
    userPercent: 16.46
  cpus: 2
  databaseBackend: sqlite
  distribution:
    distribution: ol
    variant: server
    version: "9.5"
  eventLogger: file
  freeLocks: 2026
  hostname: k-jambunatha-tf64-ecp-edge-multi-int-openstack-perf-1771036--ed
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 2001
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 2002
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.15.0-304.171.4.1.el9uek.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 809750528
  memTotal: 3803951104
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.12.2-1.el9_5.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.12.2
    package: netavark-1.12.2-1.el9.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.12.2
  ociRuntime:
    name: crun
    package: crun-1.16.1-1.el9.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.16.1
      commit: afa829ca0122bd5e1d67f1f38e6cc348027e3c32
      rundir: /run/user/2002/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20240806.gee36266-2.el9.x86_64
    version: |
      pasta 0^20240806.gee36266-2.el9.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/2002/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.3.1-1.el9.x86_64
    version: |-
      slirp4netns version 1.3.1
      commit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 2469085184
  swapTotal: 4194299904
  uptime: 312h 40m 36.00s (Approximately 13.00 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - container-registry.oracle.com
store:
  configFile: /home/user/.config/containers/storage.conf
  containerStore:
    number: 10
    paused: 0
    running: 10
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/user/.local/share/containers/storage
  graphRootAllocated: 40961572864
  graphRootUsed: 2026479616
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 10
  runRoot: /run/user/2002/containers
  transientStore: false
  volumePath: /home/user/.local/share/containers/storage/volumes
version:
  APIVersion: 5.2.2
  Built: 1735903242
  BuiltTime: Fri Jan  3 06:20:42 2025
  GitCommit: ""
  GoVersion: go1.22.9 (Red Hat 1.22.9-2.el9_5)
  Os: linux
  OsArch: linux/amd64
  Version: 5.2.2

Podman in a container

No

Privileged Or Rootless

None

Upstream Latest Release

No

Additional environment details

podman --version

podman version 5.2.2

Additional information

Additional information like issue happens only occasionally or issue happens with a particular architecture or on a particular setting

gaurangomar added the kind/bug label Mar 5, 2025
giuseppe (Member) commented Mar 5, 2025

Please share the exact command you used to create the container, and which memory usage you are looking at.

gaurangomar (Author):

We are using the libpod APIs to create the containers; these containers run for 4-5 days.

curl -X POST -H "Content-Type: application/json" --unix-socket /run/user/$UID/podman/podman.sock http://d/v4.1.1/libpod/containers/create -d @<json>

Apart from the rest of the container configuration, we have the health check config below.

...
"Healthcheck": {
    "Test": [
        "CMD",
        "curl",
        "-f",
        "http://agent:8080/health"
    ],
    "Interval": 30000000000,
    "Timeout": 10000000000,
    "Retries": 5
}
...

When I run the same container on different machines with different Podman versions:

In Podman v5.2.2
[screenshot omitted]

in Podman v4.4
[screenshot omitted]

mheon (Member) commented Mar 5, 2025

This is API specific. We added a new field to the REST API in 5.4, healthMaxLogCount, controlling the maximum number of log entries allowed, with 0 being infinite. As such, taking working JSON from a previous version without this field and putting it into 5.4 will cause unbounded log growth. We probably should've made this one a pointer so we could tell if it was actually set, but at this point that'd be a breaking change. The good news is that containers created from the CLI are getting sane defaults.
@Luap99 @Honny1 What do y'all think about this one... It feels like a bug, but I'm not sure it's fixable without an API-breaking change?

mheon (Member) commented Mar 5, 2025

In the meantime, setting healthMaxLogCount to 5 in your container create JSON will work around the problem and restore the expected behavior.
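
For illustration, the healthcheck portion of the create payload from the reproduction above would then look roughly like this (a sketch only; healthMaxLogCount is the field name mheon gives, and since Go's JSON decoding matches field names case-insensitively, spellings such as HealthMaxLogCount are generally accepted as well):

    "Healthcheck": {
        "Test": [
            "CMD",
            "curl",
            "-f",
            "http://agent:8080/health"
        ],
        "Interval": 30000000000,
        "Timeout": 10000000000,
        "Retries": 5
    },
    "healthMaxLogCount": 5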

mheon (Member) commented Mar 5, 2025

(I did not verify if the Docker API is producing containers with the same issue)

Luap99 (Member) commented Mar 6, 2025

(Quoting mheon's comment above: "This is API specific. We added a new field to the REST API in 5.4, healthMaxLogCount, controlling the maximum number of log entries allowed, with 0 being infinite. [...] It feels like a bug, but I'm not sure it's fixable without an API breaking change?")

Well, that is just the typical bug I have been complaining about for a while. Defaults should not be set in the CLI. They must be set on the server side so API users actually get the expected default.

It doesn't need to be a pointer for the server to set a default, so we can fix it. But ideally it should be a pointer so the client doesn't always overwrite it. That is for specgen at least.

But looking at this, the bug is even worse... This is part of the actual container config in libpod, which means all containers created on previous versions will be set to unlimited.

Also, healthMaxLogCount was merged for v5.3, so I don't get why this is reported against 5.2.2? I guess we did end up backporting this into RHEL?

If this had been reported right after release rather than months later, I would have just changed the types to pointers and broken the specgen API. Given it has been out for a while, that no longer seems to be an option...

gaurangomar (Author) commented Mar 6, 2025

@Luap99 yes, we were running Podman 4.4, and a few days back we upgraded one of our servers to the new Podman (i.e. 5.2.2) and started seeing memory issues; after some investigation we tracked it down to this issue.

@mheon I tried the above configuration with podman v5.2.1, and with it I am able to restrict the maximum number of log entries allowed, but the container memory is still growing. Could that be due to some other settings we can change, like HealthLogDestination or HealthMaxLogSize?
This is my health config

               "Healthcheck": {
                    "Test": [
                         "CMD-SHELL",
                         "curl -f --noproxy '*' http://127.0.0.1:8282/health"
                    ],
                    "Interval": 30000000000,
                    "Timeout": 10000000000,
                    "Retries": 5
               },
               "HealthcheckOnFailureAction": "none",
               "HealthLogDestination": "local",
               "HealthcheckMaxLogCount": 2,

Additionally, this works when creating a new container; can we update existing containers? I tried the command below, but I am not able to update the config: curl -X POST -H "Content-Type: application/json" --unix-socket /run/user/$UID/podman/podman.sock http://d/v4.1.1/libpod/containers/test/update -d @update.json

update.json content

{
  "HealthMaxLogCount": 2
}

Honny1 (Member) commented Mar 6, 2025

(I did not verify if the Docker API is producing containers with the same issue)

The Compat API uses fixed values from constants. So it should be fine.

Honny1 (Member) commented Mar 6, 2025

(Quoting gaurangomar's comment above about podman v5.2.1 still showing memory growth despite a restricted log count, and asking about HealthLogDestination and HealthMaxLogSize.)

I think HealthMaxLogSize is also unlimited. Each log entry contains the entire output of the healthcheck command, so that might be part of the problem as well.
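
For completeness, a create payload that caps both knobs might look like this (a sketch only; the field names mirror the ones used elsewhere in this thread, and the 500 here is just an illustrative cap on each entry's stored output, not a documented default):

    "HealthLogDestination": "local",
    "HealthMaxLogCount": 2,
    "HealthMaxLogSize": 500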

(Quoting the follow-up question above about updating existing containers with update.json.)

The option to update the Healthcheck configuration is unfortunately only available in podman v5.4.0.

Honny1 (Member) commented Mar 6, 2025

(Quoting mheon's comment above about the healthMaxLogCount field added to the REST API.)

I think I have an idea how to solve this without changing the API. I need to run tests.

Honny1 self-assigned this Mar 6, 2025
Luap99 (Member) commented Mar 6, 2025

@Honny1 I know how to fix it: for the specgen part you must set the default before we decode the JSON.

// we have to set the default before we decode to make sure the correct default is set when the field is unset

For the container config in libpod we must break the API and change the field to a pointer, so that we know the field was unset for containers created by older versions. Since libpod is internal anyway, we can just do that.
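
As a rough illustration of the set-default-before-decode approach (not Podman's actual code; the struct and JSON tag below are stand-ins for specgen.SpecGenerator and its health log count field):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // spec stands in for specgen.SpecGenerator; the field name and JSON tag are assumed.
    type spec struct {
        HealthMaxLogCount uint `json:"healthMaxLogCount,omitempty"`
    }

    func decodeSpec(body []byte) (*spec, error) {
        // Set the default BEFORE decoding: if the client's JSON omits the field,
        // json.Unmarshal leaves this preset value untouched instead of forcing
        // the zero value (0 == unlimited).
        s := &spec{HealthMaxLogCount: 5}
        if err := json.Unmarshal(body, s); err != nil {
            return nil, err
        }
        return s, nil
    }

    func main() {
        unset, _ := decodeSpec([]byte(`{}`))                          // field omitted -> keeps default 5
        explicit, _ := decodeSpec([]byte(`{"healthMaxLogCount": 0}`)) // caller explicitly requests unlimited
        fmt.Println(unset.HealthMaxLogCount, explicit.HealthMaxLogCount) // prints: 5 0
    }

That still leaves the client-side problem Luap99 mentions: with a plain integer the Go bindings either always send some value or, with omitempty, can never send an explicit 0, which is why a pointer would be the cleaner type.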

Honny1 (Member) commented Mar 6, 2025

Yes, that's exactly what I want to do, but I forgot about the libpod part. I will prepare a PR.

gaurangomar (Author):

Can we make some change in the config (i.e. ~/.config/containers/containers.conf) so that all newly created containers use the same settings and we do not have to pass them in the libpod APIs?

mheon (Member) commented Mar 6, 2025

If that isn't possible right now, it seems like a very reasonable feature request

Luap99 (Member) commented Mar 6, 2025

Note: adding a containers.conf field for this will not work properly with podman-remote as long as specgen has no pointers for these values, because the client must be able to pass an unset value so the server can actually look up the real default; the only way to express "unset" is a pointer.

Honny1 added jira and removed jira labels Mar 10, 2025
Honny1 added commits to Honny1/podman that referenced this issue (Mar 10-12, 2025):

… HealthCheckLogSize

GoLang sets unset values to the default value of the type. This means that the destination of the log is an empty string and the count and size are set to 0. However, this means that size and count are unbounded, and this is not the default behavior.

Fixes: containers#25473
Fixes: https://issues.redhat.com/browse/RHEL-83262

Signed-off-by: Jan Rodák <[email protected]>