High CPU usage after upgrade to version 1.15.5 #26036

Open
vmaletic opened this issue Mar 19, 2024 · 17 comments

@vmaletic

Describe the bug
After upgrading from Vault version 1.15.4 to 1.15.5, there is high CPU usage on Vault servers when transit operations are called, even with a relatively small number of requests per second (RPS), causing CPU core usage to reach 100%.

To Reproduce
Steps to reproduce the behavior:

  1. Execute HTTP API calls to transit/encrypt/my-key and transit/decrypt/my-key
  2. Monitor CPU usage of Vault primary node
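
The reproduction steps above can be sketched as a small load generator. This is a minimal sketch, not the reporter's actual test harness: it assumes the transit engine is mounted at `transit/`, a key named `my-key`, and an address and token supplied by the caller. The helper names `build_encrypt_request` and `measure_rps` are hypothetical.

```python
import base64
import json
import time
import urllib.request


def build_encrypt_request(addr, token, key_name, plaintext, mount="transit"):
    """Build (but do not send) a POST to <mount>/encrypt/<key_name>.

    The transit API expects the plaintext to be base64-encoded.
    """
    body = {"plaintext": base64.b64encode(plaintext).decode("ascii")}
    return urllib.request.Request(
        url=f"{addr}/v1/{mount}/encrypt/{key_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def measure_rps(send, request, duration_s=5.0, clock=time.monotonic):
    """Call send(request) in a loop for duration_s seconds.

    Returns the achieved requests per second for this single client.
    """
    start = clock()
    count = 0
    while clock() - start < duration_s:
        send(request)
        count += 1
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else 0.0
```

A call such as `measure_rps(lambda r: urllib.request.urlopen(r).read(), req)` would drive one sequential client. Reaching hundreds of RPS, as in the reports here, would likely need several such clients in parallel while CPU usage is watched on the Vault primary node.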

Expected behavior
After upgrading from Vault version 1.15.4 to 1.15.5, the CPU usage during transit operations should remain within acceptable limits. Specifically, the CPU core usage should not spike to 100% under small RPS.

Environment:

  • Vault Server Version (retrieve with vault status): Vault v1.15.5
  • Vault CLI Version (retrieve with vault version): Vault v1.15.5 (0d8b67e), built 2024-01-26T14:53:40Z
  • Server Operating System/Architecture: Linux x86_64

Vault server configuration file(s):

backend "consul" {
    address="127.0.0.1:8500"
    path="vault-uat01"
    ha_enabled="true"
}

listener "tcp" {
    address="xxx:8200"
    tls_disable=0
    tls_min_version="tls12"
    tls_cert_file="/etc/vault/ssl/vault.crt"
    tls_key_file="/etc/vault/ssl/vault.key"
    tls_cipher_suites = "TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"

}


telemetry {
  prometheus_retention_time = "30s"
  disable_hostname = true
}

max_lease_ttl = "1500h" 

Additional context
Vault telemetry for version 1.15.5, with a maximum of 300 RPS to transit backends during a 5-minute test window

[image: CPU usage]

[image: Transit usage]

Vault telemetry for version 1.15.4, with a maximum of 2000 RPS to transit backends during a 45-minute test window

[image: CPU usage]

[image: Transit usage]

@vmaletic
Author

The same behaviour is observable in version 1.15.6

@cleclefibanity

We have the same problem. Did you find the cause? We moved from 1.12 to 1.16.

@vmaletic
Author

vmaletic commented Jun 7, 2024

Unfortunately, no. We are sticking with version 1.15.4. We tested all subsequent versions (1.15.5 and later, including 1.16.x) and observed the same behavior.

@heatherezell
Contributor

Thank you for testing this on 1.16 as well. I'll bring it up to our engineers. :)

@cleclefibanity

Weird stuff: we rotated the transit key, and it solved the issue. We don't understand what the difference could be, as the old and the new keys both work. Only the old one causes high CPU usage.

@1337Seeker

Yeah, this is really strange behavior for sure, but that's pretty good news, and it's something we will test and report back on.

@1337Seeker

Yesterday, we rotated the transit keys on all our transit secret engines. We then upgraded to the latest version of Vault (1.17.0) and ran our standard load test. Unfortunately, we encountered the significant performance degradation that we had previously reported. Specifically:

  • CPU load reached 90%, and the goroutine count doubled compared to load testing against Vault 1.15.4.
  • Over 5 minutes of load testing, we achieved only about 120 requests per second (RPS).

Interestingly, reverting to Vault 1.15.4 resolved the issue entirely. With this version, performance is good: CPU load stays at 50-60% at 1000 RPS.

We are keen to understand why this performance discrepancy appears in versions from 1.15.5 through 1.17.0. Any insights would be greatly appreciated.

@cleclefibanity

Could you rotate your key again to see if that fixes the problem? That's how we solved it.

@cleclefibanity

No other info? We're about to rotate our key to solve the problem, but that's a pretty odd solution without a clear root cause.

@1337Seeker

Out of interest, did you rotate your transit keys while running on the latest version of Vault, or did you complete the transit key rotation on a specific version of Vault and then upgrade to the latest version?

Please provide more detail on what worked for you so that we can test whether we can replicate your result.
Thank you in advance!

@cleclefibanity

We upgraded first. Then we realised that there was an issue, and decided to rotate the keys (still on the newest version). Then the problem was solved
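
For anyone wanting to test the rotation workaround described above, a minimal sketch follows. It assumes the transit engine is mounted at `transit/`; the key name and helper names are hypothetical and not part of any Vault client library. It builds the rotate request and reads the key version from a ciphertext's `vault:vN:` prefix, which helps check whether traffic still involves ciphertexts produced by an older key version.

```python
import re
import urllib.request


def build_rotate_request(addr, token, key_name, mount="transit"):
    """Build (but do not send) a POST to <mount>/keys/<key_name>/rotate."""
    return urllib.request.Request(
        url=f"{addr}/v1/{mount}/keys/{key_name}/rotate",
        data=b"",
        headers={"X-Vault-Token": token},
        method="POST",
    )


# Transit ciphertexts carry the key version as a prefix, e.g. "vault:v3:...".
_CIPHERTEXT_PREFIX = re.compile(r"^vault:v(\d+):")


def ciphertext_key_version(ciphertext):
    """Return the key version encoded in a transit ciphertext prefix."""
    match = _CIPHERTEXT_PREFIX.match(ciphertext)
    if match is None:
        raise ValueError("not a transit ciphertext: missing 'vault:vN:' prefix")
    return int(match.group(1))
```

Comparing load-test results for ciphertexts of the pre-rotation version against those of the new version might show whether the slowdown follows the key version or the Vault version.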

@stevendpclark
Contributor

Hello all,

Unfortunately, I've been trying to reproduce this issue without success. Would it be possible to provide additional information, such as:

  • key types, key arguments enabled
  • arguments being used during the encryption/decryption calls
  • is auditing enabled
  • what backend is being used
  • Is key auto-rotation enabled
  • are there many different key versions on the key in question
  • are any other calls besides encrypt/decrypt problematic

Please note that a locking issue was resolved in 1.15.6 via #25336, but that doesn't sound like the issue you are reporting (and it should already be resolved in the later versions you have tested).

Thanks!

@vmaletic
Author

Hello, this is the setup on our side:

  • key types, key arguments enabled
    • convergent_encryption=false derived=false exportable=true type=aes256-gcm96 allow_plaintext_backup=true
  • arguments being used during the encryption/decryption calls
    • plaintext argument for encryption, ciphertext for decryption
  • is auditing enabled
    • Yes, it is enabled
  • what backend is being used
    • Consul is used as the storage backend
  • Is key auto-rotation enabled
    • No, it is not
  • are there many different key versions on the key in question
    • Up to 5 versions, but mainly the latest version is used for encryption and decryption
  • are any other calls besides encrypt/decrypt problematic
    • Not observed so far
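
To reproduce with a key matching the options reported above, the key could be created along these lines. This is a sketch under assumptions: the mount path `transit/`, the key name, and the helper name `build_create_key_request` are hypothetical. Only the option values come from the list above.

```python
import json
import urllib.request

# Key options as reported in this comment.
REPORTED_KEY_OPTIONS = {
    "type": "aes256-gcm96",
    "convergent_encryption": False,
    "derived": False,
    "exportable": True,
    "allow_plaintext_backup": True,
}


def build_create_key_request(addr, token, key_name, options=None, mount="transit"):
    """Build (but do not send) a POST to <mount>/keys/<key_name>."""
    body = dict(REPORTED_KEY_OPTIONS if options is None else options)
    return urllib.request.Request(
        url=f"{addr}/v1/{mount}/keys/{key_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Rotating the resulting key a few times before load testing would approximate the "up to 5 versions" state described above.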

@vmaletic
Author

vmaletic commented Jul 19, 2024

After audit devices were disabled, we managed to reach 800+ RPS, so it seems the audit is a culprit of high CPU usage after upgrade to version 1.15.5.
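
The audit-on versus audit-off comparison described above could be scripted roughly as follows. This is a sketch under assumptions: `file` is a hypothetical audit device path and the helper names are illustrative. It uses the `sys/audit` HTTP endpoints (GET to list devices, DELETE on `sys/audit/<path>` to disable one) and is meant for a test environment only.

```python
import urllib.request


def build_list_audit_request(addr, token):
    """Build (but do not send) a GET to sys/audit, which lists audit devices."""
    return urllib.request.Request(
        url=f"{addr}/v1/sys/audit",
        headers={"X-Vault-Token": token},
        method="GET",
    )


def build_disable_audit_request(addr, token, device_path):
    """Build (but do not send) a DELETE to sys/audit/<device_path>."""
    path = device_path.strip("/")  # accept "file" as well as "file/"
    return urllib.request.Request(
        url=f"{addr}/v1/sys/audit/{path}",
        headers={"X-Vault-Token": token},
        method="DELETE",
    )
```

Re-enabling a device requires a POST to the same path with its original type and options, so it is worth saving the list output before disabling anything.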

@heatherezell
Contributor

> After audit devices were disabled, we managed to reach 800+ RPS, so it seems the audit is a culprit of high CPU usage after upgrade to version 1.15.5.

Thank you for that! Very helpful to know. I've taken it back to our engineers and they're brainstorming about possible culprits.

@divyaac
Contributor

divyaac commented Aug 19, 2024

> After audit devices were disabled, we managed to reach 800+ RPS, so it seems the audit is a culprit of high CPU usage after upgrade to version 1.15.5.

Thank you so much! Do you have any logs that we could use to help narrow down the audit issue even further?

@divyaac
Contributor

divyaac commented Aug 29, 2024

This issue might be related: #28170
