Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid retry on non-recoverable errors. #101

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

StevenYCChou
Copy link
Contributor

@StevenYCChou StevenYCChou commented Mar 7, 2019

Rationale behind this change:

in retrieval.manager(), retry with exponential backoff is used when building Stackdriver sample. However, there are 2 types of errors which will always be thrown, and retrying wouldn't help:

  1. Unknown metric suffix:
    return nil, 0, samples[1:], errors.Errorf("unexpected metric name suffix %q", entry.suffix)
  2. Unknown metric type:
    return nil, 0, samples[1:], errors.Errorf("unexpected metric type %s", entry.metadata.Type)

This goal of this change is to avoid retrying on these 2 types of errors. The implementation is to only retry on recoverable errors.

When I implemented this, I also found that seriesCache.refresh() will potentially throw errors which is mostly not recoverable, hence making sure we wouldn't retry on non-recoverable errors.

@StevenYCChou StevenYCChou requested a review from jkohen March 7, 2019 03:16
retrieval/transform.go Outdated Show resolved Hide resolved
retrieval/transform.go Outdated Show resolved Hide resolved
func (b *sampleBuilder) seriesGetWithRetry(ctx context.Context, sample tsdb.RefSample) (*seriesCacheEntry, bool) {
backoff := time.Duration(0)
entry, ok, err := b.series.get(ctx, sample.Ref)
for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to honor ctx.Done anymore. Is that intended?

Copy link
Contributor Author

@StevenYCChou StevenYCChou Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now ctx.Done() is captured in this function. If context cancellation happens, it will at least be caught here (though if ctx.Err() is thrown inside series.get(), it won't immediately handle it. It needs to wait until next iteration to check ctx.Done()).

I didn't re-check ctx.Done() after calling series.geT() is because I think checking ctx.Done() is better at the top of the for-loop for simplicity and readability.

@jkohen
Copy link
Contributor

jkohen commented Mar 7, 2019

Can you include the rationale for this change?

@StevenYCChou
Copy link
Contributor Author

Still work in progress - as I haven't provided rationale yet.

@StevenYCChou
Copy link
Contributor Author

@jkohen ready for review. please take a look on the new change. The rationale is updated in 1st comment of the PR.

if err == nil {
break
}
if _, ok := err.(unknownMetricError); ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about adding such specific error taxonomy. We may end up with too many conditionals. I would ask what could lead to different handling. Is it just about recoverable and permanent? Could we call it recoverable, as in other parts of the code?

StevenYCChou and others added 6 commits August 18, 2019 02:47
Only retry when calling  seriesCache.get().
For error caused by unexpected metric name suffix or by unexpected
metric type, it should return as unrecoverable error because retrying
doesn't help the situation.
@googlebot
Copy link

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@StevenYCChou StevenYCChou force-pushed the use-exponential-backoff-only-when-cache-get branch from 8d5cbec to 027c2bf Compare August 18, 2019 05:23
@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants