Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry on failed docker push #246

Merged
merged 6 commits into from
Aug 30, 2024
Merged

Retry on failed docker push #246

merged 6 commits into from
Aug 30, 2024

Conversation

shonfeder
Copy link
Contributor

Contributes to ocurrent/docker-base-images#211
This is a retargeted refinement of ocurrent/ocurrent#451

The approach proposed here provides three small, composable, Lwt combinators to offer retry logic in a flexible way. The base function, on_error produces a (potentially) unlimited sequence of attempts. Adding sleeps between attempts and stopping after a set number of retries is provided by small functions that consume the stream. Additionally functionality can easily be added in the similar fashion.

There is a clear recurring need for lwt-level retry logic, as documented in ocurrent/ocurrent#450, which is why a general approach is proposed here. However, we also have a pressing need to get retries in place for the ephemeral failures from docker pushes that are eating up maintenance time. As a compromise, this utility as being added initially at the point of use, but as a relocatable library component, which we can easily promote to a shareable location when needed.

Many thanks to @mtelvers who has been a big help in thinking through what fix is need and where it should go.

@shonfeder shonfeder requested a review from mtelvers August 29, 2024 01:59
@mtelvers
Copy link
Member

This is great to me. I have tested this with base image builder and a private registry where I can simulate failures.

First test. The registry fails completely and never recovers.

Pushing "sha256:00ee024b71956601bc1f200582ce14884ad54810d0e7cf855f7f3f77c3d957bb" to "docker.tunbury.org.uk/opam-staging:ubuntu-24.04-opam-arm64" as user "ocurrentbuilder"
Login Succeeded
The push refers to repository [docker.tunbury.org.uk/opam-staging]
a41fedb71b6b: Preparing
dial tcp 5.102.169.187:443: connect: connection refused
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
docker-push failed with exit-code 1
2024-08-29 11:21.33: Job failed: Failed: Build failed

Second test: Registry recovers after a few attempts.

Pushing "sha256:963ff6ccf841860bbaf9ca7fb76f62f91ed76ada172e528ba9125ae3910f41d9" to "docker.tunbury.org.uk/opam-staging:ubuntu-24.04-opam-arm64" as user "ocurrentbuilder"
Login Succeeded
The push refers to repository [docker.tunbury.org.uk/opam-staging]
e049cd353618: Preparing
dial tcp 5.102.169.187:443: connect: connection refused
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
The push refers to repository [docker.tunbury.org.uk/opam-staging]
e049cd353618: Preparing
e049cd353618: Pushed
ubuntu-24.04-opam-arm64: digest: sha256:b38348c93811ae63892446949a7a32bf9555de0be6debf9c1363f3653c3ac5f6 size: 530
Job succeeded
2024-08-29 11:28.04: Job succeeded

It retries five times. This means it actually tried six times, the initial try plus five more. This is noted in the code as well.

It would be great to see some kind of logging. Perhaps as simple as just Retrying or something fancy like failed with exit-code 1. Retrying in 5 seconds., then after the sleep, Retry 1/5.

I wonder that the delay period is too short. Using five retries, all the retries are exhausted in 72 seconds. Perhaps use 2 ^ n rather than 1.5 ^ n, which would give 196 seconds in total (0, 4, 16, 48, 128), and/or skip the retry at 0 seconds. This would result in 148 seconds with 1.5 or 516 seconds with 2.

@shonfeder
Copy link
Contributor Author

Thanks you for the careful review and testing! It is super helpful. Both of your suggestions sound good to me and I'll update accordingly.

@shonfeder shonfeder force-pushed the retry-on-failed-docker-push branch from 9acfa87 to 6cae66d Compare August 29, 2024 18:27
@shonfeder
Copy link
Contributor Author

shonfeder commented Aug 29, 2024

Perhaps use 2 ^ n rather than 1.5 ^ n, which
would give 196 seconds in total (0, 4, 16, 48, 128), and/or skip the retry at 0
seconds. This would result in 148 seconds with 1.5 or 516 seconds with 2.

I've increased the base of the backoff to 2 in 8d9003d.

Since the attempts in the stream is supplied on demand, and the backoff just
adds a delay between the request and fulfillment of the demand, the 0 ends up
applying to first attempt (aka the "try") so we try right away, and start
applying delays to each subsequent attempt (aka the "retries").

To support the logging, this reworks the `Lwt_retry` logic so that
erroneous attempts include an int representing which number of attempt
it was. This allows users of the library to easily add logging that
tracks attempt numbers, and tracks how many attempts where made leading
up to a retry exhaustion or fatal error.

We were already producing this this data structure for the sleep logic,
so this change moves that structure into the public API, since it has
general utility.
@shonfeder
Copy link
Contributor Author

The logging is added in 8c6f9ec. Thanks again
for the review and suggestions! I think this has ended up in a nicer API and
with more useful logging as a result :)

@shonfeder shonfeder self-assigned this Aug 30, 2024
Copy link
Member

@mtelvers mtelvers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shonfeder. This looks great and worked nicely in testing.

Test 1: Times out completely

Pushing "sha256:f187dcc55329dbe2463d18dacfad6ce5447506f6acbe8790c5af04bc2eed4593" to "docker.tunbury.org.uk/opam-staging:ubuntu-24.04-opam-arm64" as user "ocurrentbuilder"
Login Succeeded
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 1/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 4.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 2/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 16.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 3/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 48.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 4/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 128.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 5/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 320.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 6/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 768.000000 seconds...
`Retry Push exhausted retries after 6 attempts.
docker-push failed with exit-code 1
2024-08-30 14:05.32: Job failed: Failed: Build failed

Test 2: Works after a new retries

Pushing "sha256:a716b0b908eb08e7af80f3a16dcb09c68b99710716e5fecb401ab45225a87197" to "docker.tunbury.org.uk/opam-staging:ubuntu-24.04-opam-arm64" as user "ocurrentbuilder"
Login Succeeded
The push refers to repository [docker.tunbury.org.uk/opam-staging]
e257cdec393a: Preparing
dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 1/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 4.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 2/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 16.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
e257cdec393a: Preparing
e257cdec393a: Pushed
ubuntu-24.04-opam-arm64: digest: sha256:de20ef87323c280699a5977b5f65ed0540032f074ded81fb880693603d1878fe size: 530
The push refers to repository [docker.tunbury.org.uk/opam-staging]
e257cdec393a: Preparing
e257cdec393a: Layer already exists
ubuntu-24.04-opam-arm64: digest: sha256:de20ef87323c280699a5977b5f65ed0540032f074ded81fb880693603d1878fe size: 530
Job succeeded
2024-08-30 14:29.15: Job succeeded

worker/cluster_worker.ml Outdated Show resolved Hide resolved
Co-authored-by: Mark Elvers <[email protected]>
@shonfeder
Copy link
Contributor Author

Looks great! Thank you :)

@mtelvers mtelvers merged commit a27ef61 into master Aug 30, 2024
1 of 2 checks passed
@punchagan punchagan deleted the retry-on-failed-docker-push branch September 3, 2024 10:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants