Avoid retry on non-recoverable errors. #101
base: master
Conversation
func (b *sampleBuilder) seriesGetWithRetry(ctx context.Context, sample tsdb.RefSample) (*seriesCacheEntry, bool) {
    backoff := time.Duration(0)
    entry, ok, err := b.series.get(ctx, sample.Ref)
    for {
This doesn't seem to honor ctx.Done anymore. Is that intended?
Now ctx.Done() is captured in this function. If the context is cancelled, it will at least be caught here (though if ctx.Err() is returned from inside series.get(), it won't be handled immediately; it has to wait until the next iteration's ctx.Done() check).
I didn't re-check ctx.Done() after calling series.get() because I think checking ctx.Done() at the top of the for-loop is better for simplicity and readability.
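To make the discussion concrete, here is a minimal sketch of a retry loop that checks ctx.Done() once at the top of every iteration, as described above. It is not the sidecar's actual code: the getWithRetry name, the fn callback, the placeholder seriesCacheEntry type, and the backoff constants are all illustrative.

package retrieval

import (
    "context"
    "time"
)

// seriesCacheEntry stands in for the sidecar's cache entry type in this sketch.
type seriesCacheEntry struct{}

// getWithRetry retries fn with exponential backoff and honors cancellation
// by checking ctx.Done() at the top of each iteration.
func getWithRetry(ctx context.Context, fn func(context.Context) (*seriesCacheEntry, bool, error)) (*seriesCacheEntry, bool, error) {
    backoff := 100 * time.Millisecond // illustrative starting backoff
    for {
        // Honor cancellation before every attempt, including the first.
        select {
        case <-ctx.Done():
            return nil, false, ctx.Err()
        default:
        }
        entry, ok, err := fn(ctx)
        if err == nil {
            return entry, ok, nil
        }
        // Back off before the next attempt, doubling up to a cap.
        time.Sleep(backoff)
        if backoff < 10*time.Second {
            backoff *= 2
        }
    }
}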
Can you include the rationale for this change?
Still work in progress, as I haven't provided the rationale yet.
@jkohen ready for review. Please take a look at the new change. The rationale is updated in the first comment of the PR.
retrieval/transform.go (Outdated)
if err == nil {
    break
}
if _, ok := err.(unknownMetricError); ok {
I'm not sure about adding such a specific error taxonomy. We may end up with too many conditionals. I would ask what could lead to different handling: is it just about recoverable and permanent? Could we call it recoverable, as in other parts of the code?
Only retry when calling seriesCache.get().
For errors caused by an unexpected metric name suffix or an unexpected metric type, it should return a non-recoverable error, because retrying doesn't help the situation.
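As a sketch of the single recoverable/permanent split suggested here, the classification could be a single wrapper type plus one helper; the recoverableError and isRecoverable names below are hypothetical, not the sidecar's actual identifiers.

package retrieval

// recoverableError marks errors that are worth retrying; any other error
// is treated as permanent. Hypothetical type for illustration only.
type recoverableError struct {
    error
}

// isRecoverable reports whether err was wrapped as recoverable.
func isRecoverable(err error) bool {
    _, ok := err.(recoverableError)
    return ok
}

The retry loop then needs only one conditional: keep retrying while isRecoverable(err) is true, and return immediately otherwise.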
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
8d5cbec to 027c2bf
CLAs look good, thanks!
Rationale behind this change:
In retrieval.manager(), retry with exponential backoff is used when building the Stackdriver sample. However, there are two types of errors which will always be thrown again, so retrying doesn't help:
stackdriver-prometheus-sidecar/retrieval/transform.go, line 96 in bfe3938
stackdriver-prometheus-sidecar/retrieval/transform.go, line 114 in bfe3938
The goal of this change is to avoid retrying on these two types of errors. The implementation is to retry only on recoverable errors.
While implementing this, I also found that seriesCache.refresh() can throw errors which are mostly not recoverable, so the change also makes sure we don't retry on those.
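A minimal sketch of how those two failure modes could be surfaced as the unknownMetricError type checked in the diff above, so the caller can give up instead of retrying; the constructor names and message formats are hypothetical, not the sidecar's actual code.

package retrieval

import "fmt"

// unknownMetricError marks failures that retrying cannot fix, such as an
// unexpected metric name suffix or an unexpected metric type.
type unknownMetricError struct {
    msg string
}

func (e unknownMetricError) Error() string { return e.msg }

// errUnexpectedSuffix and errUnexpectedType illustrate how the two failure
// modes from transform.go could be reported as non-recoverable errors.
func errUnexpectedSuffix(metric, suffix string) error {
    return unknownMetricError{msg: fmt.Sprintf("unexpected suffix %q for metric %q", suffix, metric)}
}

func errUnexpectedType(metric, typ string) error {
    return unknownMetricError{msg: fmt.Sprintf("unexpected metric type %q for %q", typ, metric)}
}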