High "mapped" memory usage and disk IO when tail-based sampling is enabled #13463
Upon investigation, this is likely related to the prefetch behavior of the local TBS badger database iterator. The prefetch is triggered by ReadTraceEvents, which is called on every sampling decision received (both local and remote). ReadTraceEvents cannot use the table's bloom filter: it looks up events by trace ID alone, while a full key consists of both the trace ID and the txn/span ID, so it has to use an iterator with a prefix. Prefetch is enabled by default with a size of 100 values, and it fetches values from the vlog whenever an iterator is used. Unfortunately, the prefetch implementation does not respect the prefix: even when the prefix does not match, it still fetches 100 values from the vlog. This mostly affects setups with multiple apm-servers because, for example, apm-server A receives sampling decisions made by a remote apm-server B, and such a decision is likely for a trace that A does not know about and does not store. The right thing to do here is to scan the in-memory LSM tree for a prefix match, but in the current implementation, due to prefetch, it still reads 100 values of irrelevant keys from the vlog. As vlog files are |
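To make the mechanism concrete, here is a toy model in Go of a prefix lookup with a prefetch window. It is not badger's code and does not use badger's API; `toyStore`, `readTraceEvents`, the key layout, and the `respectPrefix` switch are all hypothetical names invented for this sketch. It only counts how many simulated vlog reads a lookup for an unknown trace ID causes when the prefetch ignores the prefix versus when it honors it.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toyStore is a hypothetical stand-in for the local TBS database:
// a sorted key list (the "LSM tree") plus a counter of value-log reads.
type toyStore struct {
	keys      []string
	vlogReads int
}

// newToyStore fills the store with keys of the form "<traceID>:<spanID>".
func newToyStore(traces, spansPerTrace int) *toyStore {
	s := &toyStore{}
	for t := 0; t < traces; t++ {
		for sp := 0; sp < spansPerTrace; sp++ {
			s.keys = append(s.keys, fmt.Sprintf("trace-%04d:span-%02d", t, sp))
		}
	}
	sort.Strings(s.keys)
	return s
}

// readTraceEvents seeks to the first key >= prefix and returns matching keys.
// With respectPrefix=false it models the behavior described above: every key
// inside the prefetch window costs a vlog read, whether or not it matches.
func (s *toyStore) readTraceEvents(prefix string, prefetch int, respectPrefix bool) []string {
	start := sort.SearchStrings(s.keys, prefix)
	var events []string
	for i := start; i < len(s.keys); i++ {
		matches := strings.HasPrefix(s.keys[i], prefix)
		inWindow := i < start+prefetch
		if matches || (inWindow && !respectPrefix) {
			s.vlogReads++ // value fetched from the simulated vlog
		}
		if matches {
			events = append(events, s.keys[i])
		}
		if !matches && (!inWindow || respectPrefix) {
			break // nothing left to return or prefetch
		}
	}
	return events
}

func main() {
	// "trace-0050a" is a trace ID this store has never seen, as when
	// apm-server A receives a decision made by apm-server B.
	for _, respect := range []bool{false, true} {
		s := newToyStore(100, 10)
		events := s.readTraceEvents("trace-0050a:", 100, respect)
		fmt.Printf("respectPrefix=%v events=%d vlogReads=%d\n", respect, len(events), s.vlogReads)
	}
}
```

In this model the lookup returns no events either way, but only the prefix-aware variant avoids touching the simulated vlog. The `prefetch` argument of 100 mirrors the default prefetch size mentioned above.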
Here's a minimal reproducible example of the issue, with memory and disk IO measurements: https://github.com/carsonip/tbs-badger-playground/tree/main/prefetch |
APM Server version (apm-server version): confirmed on 8.13.2, but affects all versions including the latest 8.14.1

Description of the problem including expected versus actual behavior:
When tail-based sampling (TBS) is enabled, memory usage can grow as high as the local TBS database storage size. In /proc/meminfo, most of this memory usage shows up as "Mapped". This is particularly noticeable in setups that consist of multiple apm-servers and receive high load.

Steps to reproduce:
Please include a minimal but complete recreation of the problem,
including server configuration, agent(s) used, etc. The easier you make it
for us to reproduce it, the more likely that somebody will take the time to
look at it.
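To observe the symptom described above while reproducing, one option is to read the "Mapped" line from /proc/meminfo on the host running apm-server. The following Go sketch does that; `mappedKB` is a hypothetical helper written for this example, and it assumes the usual Linux /proc/meminfo layout of `Name: value kB` lines.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// mappedKB returns the value of the "Mapped:" line (in kB) from text in
// /proc/meminfo format. The second result is false if the line is missing
// or its value cannot be parsed.
func mappedKB(meminfo string) (int64, bool) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "Mapped:" {
			kb, err := strconv.ParseInt(fields[1], 10, 64)
			if err != nil {
				return 0, false
			}
			return kb, true
		}
	}
	return 0, false
}

func main() {
	data, err := os.ReadFile("/proc/meminfo")
	if err != nil {
		fmt.Println("cannot read /proc/meminfo (Linux only):", err)
		return
	}
	if kb, ok := mappedKB(string(data)); ok {
		fmt.Printf("Mapped: %d kB\n", kb)
	} else {
		fmt.Println("Mapped field not found")
	}
}
```

Running this periodically under load and comparing the value with the size of the local TBS storage directory should show whether the two track each other.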