High Memory usage after upgrade to version 1.15.6 #28170

Open · kaleshag opened this issue Aug 23, 2024 · 26 comments
Labels: bug (Used to indicate a potential bug), performance, secret/pki

@kaleshag (Author)
We have upgraded Vault to 1.15.6, and on our high-load cluster we are suddenly seeing a huge spike in memory utilization; there is no difference on low-usage clusters.

The same cluster previously used around 500Mi to 1000Mi of memory, but after the upgrade we are seeing spikes of around 2120Mi. We want to understand why this version shows this behaviour.

Regards
Kalesha Gagguturu

heatherezell added the performance and bug labels on Aug 23, 2024
@kaleshag (Author) commented Aug 24, 2024

NAME      CPU(cores)   MEMORY(bytes)
vault-1   39m          2677Mi
vault-2   19m          507Mi
vault-3   38m          513Mi

It was showing 2120Mi, and when I checked again after 19h the memory utilization had increased to 2677Mi; something is off in this version.
The cache metrics show:

"Name": "vault.cache.hit",
      "Count": 3601,
      "Rate": 360.1,
      "Sum": 3601,
      "Min": 1,
      "Max": 1,
      "Mean": 1,
      "Stddev": 0,
      "Labels": {}

@kaleshag (Author) commented Aug 24, 2024

And it keeps increasing:

NAME      CPU(cores)   MEMORY(bytes)
vault-1   52m          2807Mi
vault-2   21m          507Mi
vault-3   35m          513Mi

@divyaac (Contributor) commented Aug 26, 2024

Hi @kaleshag, thank you for reporting this! Are there any logs and server configuration that you could provide so that we can investigate?

@kaleshag (Author)

@divyaac I am not finding any relevant logs on Vault; I can only see the sudden spike right after upgrading to 1.15.6.
I am attaching a screenshot of the spike that happened immediately after the upgrade. Please let me know exactly which logs you are looking for.
[screenshot: memory utilization spike immediately after the Vault upgrade]

@kaleshag (Author)

Here is the memory information for the pod that shows the higher memory usage:

cat /proc/meminfo
MemTotal:       32655324 kB
MemFree:         7766008 kB
MemAvailable:   20372544 kB
Buffers:         1663800 kB
Cached:          9826560 kB
SwapCached:            0 kB
Active:         14121316 kB
Inactive:        5485656 kB
Active(anon):    8391364 kB
Inactive(anon):     2896 kB
Active(file):    5729952 kB
Inactive(file):  5482760 kB
Unevictable:     2648232 kB
Mlocked:         2648232 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              1316 kB
Writeback:             0 kB
AnonPages:      10764944 kB
Mapped:          1019440 kB
Shmem:              6028 kB
KReclaimable:    1855532 kB
Slab:            2251776 kB
SReclaimable:    1855532 kB
SUnreclaim:       396244 kB
KernelStack:       21552 kB
PageTables:        35588 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    16327660 kB
Committed_AS:   15790976 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      104748 kB
VmallocChunk:          0 kB
Percpu:           156160 kB
HardwareCorrupted:     0 kB
AnonHugePages:     30720 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     1128448 kB
DirectMap2M:    24033280 kB
DirectMap1G:     9437184 kB
/ # free -g
              total        used        free      shared  buff/cache   available
Mem:             31          11           7           0          13          19
Swap:             0           0           0
/ #

@kaleshag (Author)

Are there any changes in Vault from 1.14.x onwards that would cause higher memory utilization?

@kaleshag (Author)

I just collected some data with hcdiag and am attaching it for your reference; please let me know if it helps with the investigation.
allocs-1.pdf
allocs.pdf
heap-1.pdf
heap.pdf

@kaleshag (Author)

Hello team,

Is there any update on this issue? It is causing us a lot of problems.

@divyaac (Contributor) commented Aug 29, 2024

Hi @kaleshag; it seems that the change is related to extra metrics logging. Is the high memory utilization consistent, or does it only occur during startup?

Also, what version of Vault were you using before?
Thank you for the pprofs. Would you be able to provide some server logs from during the spike, along with your Vault configuration? A copy of the goroutine.pprof would also be helpful.
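
For reference, a goroutine profile can be captured either with the debug command or straight from the pprof endpoint (this assumes a token with access to sys/pprof; file names are just examples):

vault debug -target=pprof -duration=1m -output=vault-pprof.tar.gz
# or hit the endpoint directly:
curl -s -H "X-Vault-Token: $VAULT_TOKEN" "$VAULT_ADDR/v1/sys/pprof/goroutine" > goroutine.prof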

@kaleshag (Author)

Hi @divyaac,
Yes, the memory utilization is consistent: at startup it is around 2Gi, and after some hours it grows to 2.5Gi–2.8Gi.
Previously we were using Vault 1.12.9 and we have now upgraded to 1.15.6.
Today I enabled debug-level logging and found the entries below repeating continuously in the log:

2024-08-29T10:14:17.423Z [DEBUG] secrets.pki.pki_eb1c25a3: Starting ACME challenge validation engine
2024-08-29T10:14:17.541Z [DEBUG] secrets.pki.pki_34f9f6c4: Loading existing challenge validations on disk
2024-08-29T10:14:17.641Z [DEBUG] secrets.pki.pki_34f9f6c4: Starting ACME challenge validation engine
2024-08-29T10:14:17.860Z [DEBUG] secrets.pki.pki_4b645ccf: Loading existing challenge validations on disk
2024-08-29T10:14:17.861Z [DEBUG] secrets.pki.pki_87a065c4: Loading existing challenge validations on disk
2024-08-29T10:14:17.861Z [DEBUG] secrets.pki.pki_f8308c58: Loading existing challenge validations on disk
2024-08-29T10:14:17.861Z [DEBUG] secrets.pki.pki_c209d38d: Loading existing challenge validations on disk
2024-08-29T10:14:17.862Z [DEBUG] secrets.pki.pki_a4e1fd98: Loading existing challenge validations on disk
2024-08-29T10:14:17.862Z [DEBUG] secrets.pki.pki_6eee6d4a: Loading existing challenge validations on disk
2024-08-29T10:14:17.863Z [DEBUG] secrets.pki.pki_796473ca: Loading existing challenge validations on disk
2024-08-29T10:14:17.901Z [DEBUG] secrets.pki.pki_87a065c4: Starting ACME challenge validation engine
2024-08-29T10:14:17.901Z [DEBUG] secrets.pki.pki_6eee6d4a: Starting ACME challenge validation engine
2024-08-29T10:14:17.901Z [DEBUG] secrets.pki.pki_c209d38d: Starting ACME challenge validation engine
2024-08-29T10:14:17.901Z [DEBUG] secrets.pki.pki_796473ca: Starting ACME challenge validation engine
2024-08-29T10:14:17.901Z [DEBUG] secrets.pki.pki_a4e1fd98: Starting ACME challenge validation engine
2024-08-29T10:14:17.901Z [DEBUG] secrets.pki.pki_f8308c58: Starting ACME challenge validation engine
2024-08-29T10:14:17.901Z [DEBUG] secrets.pki.pki_4b645ccf: Starting ACME challenge validation engine
..................
2024-08-29T10:14:32.320Z [DEBUG] secrets.pki.pki_f80706c7: Loading existing challenge validations on disk
2024-08-29T10:14:32.343Z [INFO]  core: post-unseal setup complete
2024-08-29T10:14:32.344Z [DEBUG] secrets.pki.pki_f80706c7: Starting ACME challenge validation engine
2024-08-29T10:14:32.346Z [DEBUG] secrets.pki.pki_3c9ed076: starting to process revocation requests
2024-08-29T10:14:32.346Z [DEBUG] secrets.pki.pki_3c9ed076: gathering first time existing revocations
2024-08-29T10:14:32.347Z [DEBUG] secrets.pki.pki_d901dbd4: starting to process revocation requests
2024-08-29T10:14:32.347Z [DEBUG] secrets.pki.pki_d901dbd4: gathering first time existing revocations
2024-08-29T10:14:32.383Z [DEBUG] secrets.pki.pki_3ecacdf6: starting to process revocation requests
2024-08-29T10:14:32.383Z [DEBUG] secrets.pki.pki_3ecacdf6: gathering first time existing revocations
2024-08-29T10:14:32.383Z [DEBUG] secrets.pki.pki_a6b0be7a: starting to process revocation requests
2024-08-29T10:14:32.383Z [DEBUG] secrets.pki.pki_a6b0be7a: gathering first time existing revocations
2024-08-29T10:14:32.387Z [DEBUG] secrets.pki.pki_a8a121ac: starting to process revocation requests
.................
2024-08-29T10:14:34.000Z [DEBUG] secrets.pki.pki_173e07de: found 0 clusters: []
2024-08-29T10:14:34.000Z [DEBUG] secrets.pki.pki_173e07de: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:14:34.001Z [DEBUG] secrets.pki.pki_67f1e2b6: found 0 clusters: []
2024-08-29T10:14:34.001Z [DEBUG] secrets.pki.pki_67f1e2b6: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:14:34.001Z [DEBUG] secrets.pki.pki_d8a694c6: found 0 clusters: []
2024-08-29T10:14:34.001Z [DEBUG] secrets.pki.pki_d8a694c6: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:14:34.001Z [DEBUG] secrets.pki.pki_74c65d58: found 0 clusters: []
2024-08-29T10:14:34.001Z [DEBUG] secrets.pki.pki_74c65d58: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:14:34.002Z [DEBUG] secrets.pki.pki_1b12c1d8: found 0 clusters: []
2024-08-29T10:14:34.002Z [DEBUG] secrets.pki.pki_1b12c1d8: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:14:34.002Z [DEBUG] secrets.pki.pki_1105995f: found 0 clusters: []
2024-08-29T10:14:34.002Z [DEBUG] secrets.pki.pki_1105995f: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:14:34.002Z [DEBUG] secrets.pki.pki_16f33ac7: found 0 clusters: []
2024-08-29T10:14:34.002Z [DEBUG] secrets.pki.pki_16f33ac7: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:14:34.041Z [DEBUG] secrets.pki.pki_0c48abf8: starting to process revocation requests
2024-08-29T10:14:34.041Z [DEBUG] secrets.pki.pki_0c48abf8: gathering first time existing revocations
2024-08-29T10:14:34.042Z [DEBUG] secrets.pki.pki_e8e3b428: starting to process revocation requests
2024-08-29T10:14:34.042Z [DEBUG] secrets.pki.pki_e8e3b428: gathering first time existing revocations
2024-08-29T10:14:34.042Z [DEBUG] secrets.pki.pki_ef097e45: found 0 clusters: []
2024-08-29T10:14:34.042Z [DEBUG] secrets.pki.pki_ef097e45: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:14:34.062Z [DEBUG] secrets.pki.pki_08eeca8b: starting to process revocation requests
2024-08-29T10:14:34.062Z [DEBUG] secrets.pki.pki_08eeca8b: gathering first time existing revocations
2024-08-29T10:14:34.062Z [DEBUG] secrets.pki.pki_f5c7ae44: found 0 clusters: []
.................
2024-08-29T10:15:04.682Z [DEBUG] secrets.pki.pki_f69d2cbb: gathering first time existing revocations
2024-08-29T10:15:04.704Z [DEBUG] secrets.pki.pki_f47a93b7: starting to process revocation requests
{"time":"2024-08-29T10:15:04.722230239Z","type":"request","auth":{"client_token":.....................................
....................................................
2024-08-29T10:20:49.420Z [DEBUG] secrets.pki.pki_17403c74: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:20:49.420Z [DEBUG] secrets.pki.pki_d01074f5: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:20:49.420Z [DEBUG] secrets.pki.pki_03b85ea7: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:20:49.420Z [DEBUG] secrets.pki.pki_7168a51f: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:20:49.420Z [DEBUG] secrets.pki.pki_6ac52b1a: gathered 0 revocations and 0 confirmation entries
2024-08-29T10:20:49.440Z [DEBUG] secrets.pki.pki_f56d979e: starting to process unified revocations
2024-08-29T10:20:49.440Z [DEBUG] secrets.pki.pki_0aa24143: starting to process unified revocations
2024-08-29T10:20:49.440Z [DEBUG] secrets.pki.pki_132db0c9: starting to process unified revocations
2024-08-29T10:20:49.440Z [DEBUG] secrets.pki.pki_d1aeb307: starting to process unified revocations
2024-08-29T10:20:49.440Z [DEBUG] secrets.pki.pki_8d400199: starting to process unified revocations
2024-08-29T10:20:49.440Z [DEBUG] secrets.pki.pki_956ae8ea: starting to process unified revocations
2024-08-29T10:20:49.440Z [DEBUG] secrets.pki.pki_9f1d286c: starting to process unified revocations
2024-08-29T10:20:49.440Z [DEBUG] secrets.pki.pki_37779112: starting to process unified revocations
2024-08-29T10:20:49.440Z [DEBUG] secrets.pki.pki_747f8a06: starting to process unified revocations
2024-08-29T10:20:49.441Z [DEBUG] secrets.pki.pki_ed6a508b: starting to process unified revocations
2024-08-29T10:20:49.441Z [DEBUG] secrets.pki.pki_8409a55d: starting to process unified revocations
2024-08-29T10:20:49.441Z [DEBUG] secrets.pki.pki_e4ab0e10: starting to process unified revocations
2024-08-29T10:20:49.441Z [DEBUG] secrets.pki.pki_d145f4f9: starting to process unified revocations
2024-08-29T10:20:49.441Z [DEBUG] secrets.pki.pki_a3d9cf40: starting to process unified revocations
2024-08-29T10:20:49.441Z [DEBUG] secrets.pki.pki_483486dd: starting to process unified revocations
2024-08-29T10:20:49.441Z [DEBUG] secrets.pki.pki_a6c6f6d3: starting to process unified revocations
............................

And this cluster has a lot of PKI tenants:

 gkalesha:10:53 PM IST :pki% vault secrets list | grep pki | wc -l
    1189
 gkalesha:10:53 PM IST :pki%

For almost all tenants, the ACME config shows:

Key                         Value
---                         -----
allow_role_ext_key_usage    false
allowed_issuers             [*]
allowed_roles               [*]
default_directory_policy    sign-verbatim
dns_resolver                n/a
eab_policy                  not-required
enabled                     false

@divyaac (Contributor) commented Aug 29, 2024

Hi @kaleshag, would you be able to provide us with a debug package? https://developer.hashicorp.com/vault/tutorials/monitoring/troubleshooting-vault#vault-debug-tool
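
For reference, a typical invocation looks like this (the duration, interval, and output name are just example values):

vault debug -duration=2m -interval=30s -output=vault-debug.tar.gz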

@kaleshag (Author)

@divyaac I am attaching the vault debug bundle as requested:
vault-debug-2024-08-30T03-08-21Z.tar.gz

@kaleshag (Author)

@divyaac is there any update on this request? Many of our clusters are blocked because of this issue.

@kaleshag (Author)

@divyaac I also wanted to know: even though we have not enabled ACME, why are we seeing ACME engine validations in the debug log?

@divyaac (Contributor) commented Aug 30, 2024

Hi @kaleshag, thank you for sending the debug package over! Let me do some investigation and get back to you.

@divyaac (Contributor) commented Aug 30, 2024

> Also wanted to know even though we have not enabled ACME, why are we seeing ACME engine validations in the debug log.

As of 1.14.0, ACME is supported on PKI engines. As a result, ACME paths are added to the PKI backend during startup.

We are still investigating the high memory usage.

@divyaac (Contributor) commented Aug 30, 2024

Hi @kaleshag, the issue might be that the current version adds all of the new ACME paths for every one of your PKI mounts. The high memory usage comes from compiling the regexes for those new paths while they are being added.
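
As a rough way to see this per-mount overhead on a disposable test cluster, you could enable a batch of empty PKI mounts and watch resident memory grow as each one registers its paths (the mount names, the count, and the namespace below are placeholders):

# enable a batch of throwaway PKI mounts
for i in $(seq 1 200); do
  vault secrets enable -path="pki-test-$i" pki
done

# then watch pod memory climb
kubectl top pods -n vault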

Is the cluster unusable?

@kaleshag (Author) commented Aug 31, 2024

@divyaac Yes, the Vault pods are getting restarted with OOMKilled. Is there a way to remove those ACME path validations? We are not using any domain certificates here, so we would like to avoid the unnecessary load on the cluster.

In short, we want to know how to disable that option to avoid this memory spike on our clusters.

@kaleshag (Author) commented Sep 1, 2024

In my old version of Vault (1.12.9), I don't see any value for ACME:

vault read <pki_path>/config/acme
No value found at <pki_path>/config/acme

Whereas in my new version of Vault (1.15.6), I can see the ACME configuration, but it is in the disabled state. Is this what is causing the issues? If so, how can we remove it in our Vault version?

vault read pki_path/config/acme 
Key                         Value
---                         -----
allow_role_ext_key_usage    false
allowed_issuers             [*]
allowed_roles               [*]
default_directory_policy    sign-verbatim
dns_resolver                n/a
eab_policy                  not-required
enabled                     false

@divyaac Even though ACME shows as disabled on all PKI paths, why is ACME still adding those paths to the backend?

@kaleshag (Author) commented Sep 3, 2024

@divyaac Any update on this issue?

@divyaac (Contributor) commented Sep 3, 2024

@kaleshag We cannot conditionally add paths at this time, so the ACME paths are registered even when ACME is not enabled.
The easiest next step would be to increase the amount of RAM the pod is allowed. This will allow the pod to start up after the first rollback successfully initializes, and the RAM usage should go down after startup.
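
If you are running the official Helm chart, the limit can be raised via the server.resources values; the release name, repo alias, and sizes below are illustrative, not a recommendation:

helm upgrade vault hashicorp/vault \
  --set server.resources.requests.memory=2Gi \
  --set server.resources.limits.memory=4Gi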

@kaleshag (Author) commented Sep 3, 2024

@divyaac we already did that, and the pods come up with the increased limits. But the utilization is not going down. The graphs I uploaded earlier to this case show the scenario: utilization stays at roughly the startup level, or increases a little. That is our concern and why we opened this issue for your support.

Is that expected, or is there anything we can do to bring the memory utilization down after startup?

@kaleshag (Author) commented Sep 4, 2024

@divyaac
As of now I have two questions to clarify.

  1. Contrary to your previous comment, I am not seeing memory utilization reduce even after a successful Vault startup.

  2. Why is an ACME challenge validation engine allocated for every PKI secrets engine created on a new path, even though our default policy has ACME disabled?

vault secrets enable -path="echebao-pki" pki
2024-08-29T09:06:00.776Z [DEBUG] secrets.pki.pki_9e2efc76: Loading existing challenge validations on disk
2024-08-29T09:06:00.777Z [DEBUG] secrets.pki.pki_66b6e7f6: Starting ACME challenge validation engine

Key                         Value
---                         -----
allow_role_ext_key_usage    false
allowed_issuers             [*]
allowed_roles               [*]
default_directory_policy    sign-verbatim
dns_resolver                n/a
eab_policy                  not-required
enabled                     false

@1337Seeker

@kaleshag, as a precaution, upgrade to version 1.15.4 (instead of 1.15.6) and monitor the performance of your Vault cluster.
Back in March we already identified a bug affecting versions 1.15.5 and later, reported under #26036.

@kaleshag (Author) commented Sep 5, 2024

@1337Seeker
Is it the same for all later versions like 1.16.x and 1.17.x? Also, everywhere in that thread it is described as a CPU issue; memory is not mentioned anywhere. How is that issue related to this one? We are not seeing high CPU utilization here, only high memory utilization. Can you please confirm?

@kaleshag (Author) commented Sep 9, 2024

Hi team,
@1337Seeker or @divyaac, can you please provide an update on this issue?
