
Add adaptive retry logic in RCI call for non-terminal errors #4499

Open · wants to merge 1 commit into base: dev

Conversation

@tshan2001 (Contributor) commented Feb 11, 2025

Summary

Add retry with exponential backoff when receiving non-terminal errors from RCI calls, to prevent retry storms.

Implementation details

A wrapper, registerContainerInstanceWithRetry, is added around the original registerContainerInstance method. It uses the RetryWithBackoffCtx function from the retry package. When an RCI call fails, we examine the error type to determine whether it is a terminal error: if it is, we break out of the retry loop; otherwise, we retry with an increased backoff. The maximum backoff is capped at roughly 3 minutes so we never wait too long between retries.
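
As a rough illustration of the approach (not the PR's actual code), below is a self-contained Go sketch of retrying with exponential backoff that stops immediately on terminal errors; the delay values and helper names are assumptions made for this example only.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// terminalError marks failures that should not be retried (e.g. ClientException).
type terminalError struct{ msg string }

func (e terminalError) Error() string { return e.msg }

// retryWithBackoff keeps calling fn until it succeeds, returns a terminal
// error, or the context is cancelled. The wait between attempts doubles each
// time and is capped at three minutes, mirroring the behaviour described above.
func retryWithBackoff(ctx context.Context, fn func() error) error {
	const (
		initialDelay = 100 * time.Millisecond // illustrative; the agent would use longer delays
		maxDelay     = 3 * time.Minute        // cap, per the PR description
	)
	delay := initialDelay
	for {
		err := fn()
		if err == nil {
			return nil
		}
		var terminal terminalError
		if errors.As(err, &terminal) {
			return err // terminal error: break the retry loop immediately
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	attempts := 0
	err := retryWithBackoff(context.Background(), func() error {
		attempts++
		if attempts < 3 {
			return errors.New("ThrottlingException: rate exceeded") // non-terminal
		}
		return nil // simulated RCI success
	})
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}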

Testing

A new test, TestRegisterContainerInstanceWithRetryNonTerminalError, has been added to cover both the happy and unhappy cases.

New tests cover the changes: yes

Description for the changelog

Add adaptive retry logic in RCI call for non-terminal errors.

Additional Information

Does this PR include breaking model changes? If so, have you added transformation functions?

Does this PR include the addition of new environment variables in the README?

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tshan2001 requested a review from a team as a code owner on February 11, 2025 19:57
@tshan2001 force-pushed the master branch 2 times, most recently from 99cedbd to 34b63af on February 11, 2025 20:48
@tshan2001 requested a review from amogh09 on February 11, 2025 21:53
defer cancel()
err := retry.RetryWithBackoffCtx(ctx, backoff,
func() error {
containerInstanceARN, availabilityZone, errFromRCI = client.registerContainerInstance(


Question: Don't we already use the default retryer from the AWS SDK under the hood in this call? That would mean that we perform additional retries on the actual network call.

Is this intentional?

Contributor Author

Synced offline; adding a summary here. The AWS SDK retryer only retries 3 times, probably within 10 seconds, to cover transient network issues. This change adds more control over the overall initialization workflow. Without it, after the default 3 retries the agent exits and restarts, and around 3 seconds later it repeats the same process, so we are essentially retrying 3 times every ~15 seconds. We want more control over this process: the backoff added here can grow to 3 minutes, which helps alleviate systematic account-level throttling.
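
For intuition, here is a small runnable Go sketch of such a backoff schedule; the 5-second initial delay and 2x multiplier are assumed, illustrative values rather than the PR's actual constants.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative exponential backoff schedule: 5s initial delay, doubling
	// each attempt, capped at 3 minutes.
	delay := 5 * time.Second
	for attempt := 1; attempt <= 8; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, delay)
		delay *= 2
		if delay > 3*time.Minute {
			delay = 3 * time.Minute
		}
	}
	// Output: 5s, 10s, 20s, 40s, 1m20s, 2m40s, then 3m0s thereafter.
}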

xxx0624 previously approved these changes Feb 13, 2025
@@ -63,6 +63,14 @@ const (
setInstanceIdRetryBackoffMax = 5 * time.Second
setInstanceIdRetryBackoffJitter = 0.2
setInstanceIdRetryBackoffMultiple = 2
// Below constants are used for RegisterContainerInstance retry with exponential backoff when receiving non-termianl errors.
Contributor

nit: typo in "non-termianl"

Contributor Author

Addressing this in the next revision

// Using errors.As to unwrap as opposed to errors.Is.
if errors.As(err, &awsErr) {
switch awsErr.Code() {
case ecsmodel.ErrCodeServerException, "ThrottlingException":
Contributor

Is there no constant exposed by the SDK for "ThrottlingException" string?

Contributor Author

Unfortunately no, so I had to manually add the string here.
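
A minimal sketch of how such a classification helper might look, assuming aws-sdk-go v1's awserr interface and ECS error-code constants; the helper name and the exact set of retryable codes are illustrative, not the PR's literal code.

package rciretry

import (
	"errors"

	"github.com/aws/aws-sdk-go/aws/awserr"
	ecsmodel "github.com/aws/aws-sdk-go/service/ecs"
)

// isTerminalRCIError reports whether a RegisterContainerInstance failure
// should stop the retry loop. Server-side and throttling errors are treated
// as retryable; anything else (for example ClientException) is terminal.
func isTerminalRCIError(err error) bool {
	var awsErr awserr.Error
	// errors.As (rather than errors.Is) unwraps to the awserr.Error interface
	// so the error code can be inspected.
	if errors.As(err, &awsErr) {
		switch awsErr.Code() {
		// The SDK exposes no constant for "ThrottlingException", so the
		// string literal is used directly (as noted in the thread above).
		case ecsmodel.ErrCodeServerException, "ThrottlingException":
			return false
		}
	}
	return true
}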

// In the happy test, the last RCI call will succeed, and we will verify that the expected attributes are present.
// In the unhappy test, the last RCI call will fail with ClientException, and an appropriate error will be returned.
// For both test cases, the last RCI call should effectively terminate the retry loop.
func TestRegisterContainerInstanceWithRetryNonTerminalError(t *testing.T) {
Contributor

How long does this test take to execute? If it takes ~10 seconds then I suggest we override the backoff settings so that the test is fast.

Contributor Author

This can take up to 20 seconds. Thanks for the callout; I will address this in the next revision. Added an override to bring the execution down to about 0.5 seconds.
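
A common pattern for such an override is to shrink the backoff bounds inside the test. The sketch below is hypothetical: it assumes the bounds are exposed as package-level variables, and none of the identifiers are the PR's actual names.

package ecsclient_test

import (
	"testing"
	"time"
)

// Hypothetical backoff bounds; in the real change these would live alongside
// the other retry constants in the production package.
var (
	rciRetryBackoffMin = 5 * time.Second
	rciRetryBackoffMax = 3 * time.Minute
)

func TestRegisterContainerInstanceWithRetryNonTerminalError(t *testing.T) {
	// Shrink the backoff so several retry iterations finish in ~0.5s instead
	// of taking production-scale delays.
	origMin, origMax := rciRetryBackoffMin, rciRetryBackoffMax
	rciRetryBackoffMin, rciRetryBackoffMax = 10*time.Millisecond, 50*time.Millisecond
	t.Cleanup(func() {
		rciRetryBackoffMin, rciRetryBackoffMax = origMin, origMax
	})

	// ... set up a mock ECS client that returns a few non-terminal errors
	// (ServerException / ThrottlingException) and then either succeeds
	// (happy case) or returns ClientException (unhappy case) ...
}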
