-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[8.15] output: Retry document-level 429
s by default (backport #13620)
#13621
Conversation
Updates the APM Server to automatically retry document-level `429`s from Elasticsearch to avoid dropping data. It can be configured/overwritten by `output.elasticsearch.max_retries`, and defaults to `3`. It uses the default backoff configuration, which could wait up to 1m if enough retries are configured, but can be overwritten as well. --------- Signed-off-by: Marc Lopez Rubio <[email protected]> Co-authored-by: Carson Ip <[email protected]> (cherry picked from commit 28436dc) # Conflicts: # changelogs/head.asciidoc
Cherry-pick of 28436dc has failed:
To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
Signed-off-by: Marc Lopez Rubio <[email protected]>
Unfortunately I haven't had the chance to test this out today, nor to investigate further how the change could be validated. One good thing was mentioned in our weekly by @marclop: the daily benchmarks would give us decent feedback, so I think one interesting way to see if the change affects overall performance would be to run the benchmarks on this PR's branch? |
// Default to 1mib flushes, which is the default for go-docappender. | ||
esConfig.FlushBytes = "1 mib" | ||
esConfig.FlushInterval = time.Second | ||
esConfig.Config = elasticsearch.DefaultConfig() | ||
esConfig.MaxIdleConnsPerHost = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not moving MaxIdleConnsPerHost
into structure initialization as other fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's part of Elasticsearch.Config
which is set already with elasticsearch.DefaultConfig()
.
I tried running the benchmarks against the smallest instance (1GB) in https://github.com/elastic/apm-server/actions/runs/9872740538 |
benchmarks are broken due to Kibana (Kyung Eun had mentioned that in the standup, and now I recalled it 😓 ) |
Here's a methodology that I think can get us some feedback.
The goal will be to detect if ES has higher load or if the number of total ingested events differs between the 2 version. Thoughts? |
Testing methodology:
Results 8.15 BC 18.15 + this PRNotes: in the ES screenshots for the 8.15 + this PR case there is double load being ingested into APM server. Conclusions: given the double load on ES, the peak of 429s that also doubled is somewhat expected. APM server did get an increase in memory used, with the average going from ~350 MiB to ~500 MiB. I can't explain how this happens. With the data presented I think there is not a significant overhead imposed to APM server or ES by this PR, so I am inclined to merge it. |
Looking at the apm-server RSS before vs after, I do not see cause for concern, particularly since the load was doubled in the after benchmark. @inge4pres correct me if I'm wrong, but that seems to be the case. BeforeAfter |
I agree 👍🏼 |
Motivation/summary
Updates the APM Server to automatically retry document-level
429
s from Elasticsearch to avoid dropping data. It can be configured/overwritten byoutput.elasticsearch.max_retries
, and defaults to3
.It uses the default backoff configuration, which could wait up to 1m if enough retries are configured, but can be overwritten as well.
This reduces data loss whenever Elasticsearch is overhelmed and reduces the chances of data loss due to ES push-back.
Checklist
- [ ] Documentation has been updatedHow to test these changes
This may need a bit of setup. It's thoroughly tested in the
go-docappender
, but we may want to verify that retries happen when document-level 429s happen.I'd suggest writing a small "proxy" that returns 429s randomly and test that things are indexed (as duplicates).
Related issues
Closes #13619
This is an automatic backport of pull request #13620 done by Mergify.