Add adaptive retry logic in RCI call for non-terminal errors #4499
base: dev
Conversation
Force-pushed from 99cedbd to 34b63af
defer cancel()
err := retry.RetryWithBackoffCtx(ctx, backoff,
func() error {
containerInstanceARN, availabilityZone, errFromRCI = client.registerContainerInstance(
Question: Don't we already use the default retryer from the AWS SDK under the hood in this call? That would mean that we perform additional retries on the actual network call.
Is this intentional?
Synced offline. Adding a summary here: the AWS SDK retryer only retries 3 times, typically within about 10 seconds, and is meant to address transient network issues. This change adds more control over the overall initialization workflow. Without it, after the default 3 retries the agent exits and restarts, and after roughly 3 seconds it repeats the same process, so we are effectively retrying 3 times every 15 seconds. We want more control here: the backoff added in this change can grow to as much as 3 minutes, which helps alleviate systematic account-level throttling.
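To make the timing concrete, here is a small, runnable sketch of what an exponential backoff schedule capped at ~3 minutes looks like. The starting delay and multiplier are illustrative assumptions, not the PR's actual constants; only the ~3 minute cap comes from the discussion above.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		backoffMin      = 5 * time.Second // assumed starting delay
		backoffMax      = 3 * time.Minute // cap mentioned in the discussion above
		backoffMultiple = 2               // assumed growth factor
	)
	delay := backoffMin
	for attempt := 1; attempt <= 8; attempt++ {
		fmt.Printf("attempt %d: wait %v before retrying\n", attempt, delay)
		if delay *= backoffMultiple; delay > backoffMax {
			delay = backoffMax
		}
	}
	// Prints 5s, 10s, 20s, 40s, 1m20s, 2m40s, 3m0s, 3m0s.
}
```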
@@ -63,6 +63,14 @@ const (
setInstanceIdRetryBackoffMax = 5 * time.Second
setInstanceIdRetryBackoffJitter = 0.2
setInstanceIdRetryBackoffMultiple = 2
// Below constants are used for RegisterContainerInstance retry with exponential backoff when receiving non-termianl errors. |
nit: typo in "non-termianl"
Addressing this in the next revision
// Using errors.As to unwrap as opposed to errors.Is.
if errors.As(err, &awsErr) {
switch awsErr.Code() {
case ecsmodel.ErrCodeServerException, "ThrottlingException": |
Is there no constant exposed by the SDK for the "ThrottlingException" string?
Unfortunately no, so I had to manually add the string here.
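For illustration, here is a minimal sketch of this classification in isolation. The isRCIRetryable helper name is hypothetical (the PR performs the check inline), the import shown is the public aws-sdk-go ECS package rather than the agent's ecsmodel alias, and "ThrottlingException" stays a string literal because, per the comment above, no SDK constant exists for it.

```go
package example

import (
	"errors"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// isRCIRetryable is a hypothetical helper illustrating the check the PR does
// inline: ServerException and ThrottlingException are treated as non-terminal
// (retryable); everything else (e.g. ClientException) is terminal.
func isRCIRetryable(err error) bool {
	var awsErr awserr.Error
	// errors.As (rather than errors.Is) lets us unwrap to the awserr.Error
	// interface so the error code can be inspected.
	if errors.As(err, &awsErr) {
		switch awsErr.Code() {
		// No SDK constant exists for "ThrottlingException", hence the string literal.
		case ecs.ErrCodeServerException, "ThrottlingException":
			return true
		}
	}
	return false
}
```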
// In the happy test, the last RCI call will succeed, and we will verify that the expected attributes are present.
// In the unhappy test, the last RCI call will fail with ClientException, and an appropriate error will be returned.
// For both test cases, the last RCI call should effectively terminate the retry loop.
func TestRegisterContainerInstanceWithRetryNonTerminalError(t *testing.T) { |
How long does this test take to execute? If it takes ~10 seconds then I suggest we override the backoff settings so that the test is fast.
This can take up to 20 seconds. Thanks for the callout; I'll address this in the next revision. Added an override that brings the execution time down to about 0.5 seconds.
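For illustration, here is a self-contained example (not the PR's actual test) showing why shrinking the backoff window matters: with millisecond-scale settings, five failing attempts complete in well under a second. The retryWithBackoff helper below is a stand-in for the agent's retry utility.

```go
package example

import (
	"errors"
	"testing"
	"time"
)

// retryWithBackoff is a tiny stand-in for the agent's retry helper, used only
// to illustrate the timing argument.
func retryWithBackoff(minDelay, maxDelay time.Duration, attempts int, fn func() error) error {
	delay := minDelay
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return err
}

// With millisecond-scale backoff settings, five failing attempts still finish
// in a fraction of a second, keeping the unit test fast.
func TestRetryUsesOverriddenBackoff(t *testing.T) {
	start := time.Now()
	calls := 0
	_ = retryWithBackoff(10*time.Millisecond, 50*time.Millisecond, 5, func() error {
		calls++
		return errors.New("transient")
	})
	if calls != 5 {
		t.Fatalf("expected 5 attempts, got %d", calls)
	}
	if elapsed := time.Since(start); elapsed > time.Second {
		t.Fatalf("retries took too long: %v", elapsed)
	}
}
```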
Summary
Add retry with exponential backoff when receiving a non-terminal error from RCI calls, to prevent retry storms.
Implementation details
A wrapper, registerContainerInstanceWithRetry, is added around the original registerContainerInstance method. It uses the RetryWithBackoffCtx function from the retry package. Upon receiving a failure from RCI, we examine the error type to determine whether it is a terminal error; if it is, we break out of the retry loop, otherwise we continue retrying with an increasing backoff. The maximum backoff is capped at ~3 minutes so we don't wait too long between retries. A sketch of the control flow is shown below.
Testing
A new test, TestRegisterContainerInstanceWithRetryNonTerminalError, has been added to cover both the happy and unhappy cases.
New tests cover the changes: yes
Description for the changelog
Add adaptive retry logic in RCI call for non-terminal errors.
Additional Information
Does this PR include breaking model changes? If so, have you added transformation functions?
Does this PR include the addition of new environment variables in the README?
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.