Retry on failed docker push #246
Conversation
This looks great to me. I have tested this with the base image builder and a private registry where I can simulate failures. First test: the registry fails completely and never recovers.
Second test: the registry recovers after a few attempts.
It retries five times. This means it actually tried six times: the initial try plus five more. This is noted in the code as well. It would be great to see some kind of logging, perhaps as simple as just …. I also wonder whether the delay period is too short. Using five retries, all the retries are exhausted in 72 seconds. Perhaps use …
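The "exhausted in 72 seconds" observation is just the sum of the per-retry delays. The exact delay formula under review is not shown in this thread, so the schedule below is a hypothetical illustration (base-2 exponential backoff), not the library's actual defaults:

```python
# Sketch: cumulative wait time across an exponential backoff schedule.
# The delay function here is an illustrative assumption, not the
# library's actual schedule.

def total_wait(retries, delay):
    """Sum the per-retry delays for `retries` retries."""
    return sum(delay(n) for n in range(1, retries + 1))

# Hypothetical schedule: delay(n) = 2^n seconds.
print(total_wait(5, lambda n: 2 ** n))  # 2 + 4 + 8 + 16 + 32 = 62
```

Increasing the base of the backoff stretches this total substantially, since the last delay dominates the sum.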
Thank you for the careful review and testing! It is super helpful. Both of your suggestions sound good to me and I'll update accordingly.
Making it easier to see and manage the dependencies.

Force-pushed from 9acfa87 to 6cae66d.
I've increased the base of the backoff to …. Since the attempts in the stream are supplied on demand, the backoff just …
To support the logging, this reworks the `Lwt_retry` logic so that erroneous attempts include an int recording which attempt it was. This allows users of the library to easily add logging that tracks attempt numbers, and to report how many attempts were made leading up to retry exhaustion or a fatal error. We were already producing this data structure for the sleep logic, so this change moves it into the public API, since it has general utility.
The logging is added in 8c6f9ec. Thanks again!
Thanks @shonfeder. This looks great and worked nicely in testing.
Test 1: Times out completely
Pushing "sha256:f187dcc55329dbe2463d18dacfad6ce5447506f6acbe8790c5af04bc2eed4593" to "docker.tunbury.org.uk/opam-staging:ubuntu-24.04-opam-arm64" as user "ocurrentbuilder"
Login Succeeded
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 1/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 4.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 2/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 16.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 3/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 48.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 4/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 128.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 5/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 320.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 6/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 768.000000 seconds...
Retry Push exhausted retries after 6 attempts.
docker-push failed with exit-code 1
2024-08-30 14:05.32: Job failed: Failed: Build failed
Test 2: Works after a few retries
Pushing "sha256:a716b0b908eb08e7af80f3a16dcb09c68b99710716e5fecb401ab45225a87197" to "docker.tunbury.org.uk/opam-staging:ubuntu-24.04-opam-arm64" as user "ocurrentbuilder"
Login Succeeded
The push refers to repository [docker.tunbury.org.uk/opam-staging]
e257cdec393a: Preparing
dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 1/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 4.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
Get "https://docker.tunbury.org.uk/v2/": dial tcp 5.102.169.187:443: connect: connection refused
Push attempt 2/6 failed with retryable error "docker-push failed with exit-code 1". Retrying in 16.000000 seconds...
The push refers to repository [docker.tunbury.org.uk/opam-staging]
e257cdec393a: Preparing
e257cdec393a: Pushed
ubuntu-24.04-opam-arm64: digest: sha256:de20ef87323c280699a5977b5f65ed0540032f074ded81fb880693603d1878fe size: 530
The push refers to repository [docker.tunbury.org.uk/opam-staging]
e257cdec393a: Preparing
e257cdec393a: Layer already exists
ubuntu-24.04-opam-arm64: digest: sha256:de20ef87323c280699a5977b5f65ed0540032f074ded81fb880693603d1878fe size: 530
Job succeeded
2024-08-30 14:29.15: Job succeeded
Co-authored-by: Mark Elvers <[email protected]>
Looks great! Thank you :)
Contributes to ocurrent/docker-base-images#211
This is a retargeted refinement of ocurrent/ocurrent#451
The approach proposed here provides three small, composable Lwt combinators to offer retry logic in a flexible way. The base function, `on_error`, produces a (potentially) unlimited sequence of attempts. Adding sleeps between attempts and stopping after a set number of retries is provided by small functions that consume the stream. Additional functionality can easily be added in a similar fashion.

There is a clear recurring need for Lwt-level retry logic, as documented in ocurrent/ocurrent#450, which is why a general approach is proposed here. However, we also have a pressing need to get retries in place for the ephemeral failures from docker pushes that are eating up maintenance time. As a compromise, this utility is being added initially at the point of use, but as a relocatable library component, which we can easily promote to a shareable location when needed.
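The three-combinator shape can be sketched generically (Python generators standing in for Lwt streams; the names mirror the description above, but the signatures are illustrative, not the library's actual API):

```python
# Generic sketch of the composable-retry design: a base producer of
# failed attempts, plus small consumers that add sleeping and a limit.
# Signatures are illustrative assumptions, not the OCaml library's API.
import itertools
import time

def on_error(action):
    """Base combinator: an unlimited stream of failed attempts.
    The stream ends when the action succeeds."""
    while True:
        try:
            action()
            return
        except Exception as err:
            yield err

def with_sleep(attempts, delay=0.0):
    """Consume the stream, sleeping between attempts."""
    for err in attempts:
        yield err
        time.sleep(delay)

def n_times(attempts, retries):
    """Stop after the initial attempt plus `retries` more."""
    return itertools.islice(attempts, retries + 1)
```

For example, `list(n_times(with_sleep(on_error(act)), 5))` makes at most six attempts, and further consumers (e.g. logging) compose in the same way.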
Many thanks to @mtelvers, who has been a big help in thinking through what fix is needed and where it should go.