Issue with tags removed from ECR #632

Closed
tcyran opened this issue Dec 14, 2023 · 2 comments · Fixed by #669
tcyran commented Dec 14, 2023

What we did:
We introduced a lifecycle policy on our ECR repositories to avoid keeping really old cached images. We configured it to retain only the 3 most recent tags, which removed many tags, including old ones and some that were still in use. After the cleanup, k8s-image-swapper still recognized the deleted images as existing in ECR and mutated pods to start with the ECR-cached image, which ends in ImagePullBackOff.

What is the issue:
There seems to be a cache around the skopeo lookups that still sees the image even though it no longer exists. After deleting/recreating the k8s-image-swapper pod, the situation returns to normal.

Steps to reproduce:

  1. Start a deployment with nginx:1.14.2
  2. Wait until k8s-image-swapper has cached the image
  3. Restart the nginx deployment - it will start with the cached image
  4. Remove the image tag from ECR
  5. Restart the nginx deployment - it will fall into ImagePullBackOff

Logs:

2023-12-14T12:07:18+01:00 11:07AM DBG github.com/estahn/[email protected]/pkg/webhook/image_swapper.go:285 > jmespath search results filter="obj.metadata.namespace == 'kube-system'" results=false
2023-12-14T12:07:18+01:00 11:07AM TRC github.com/estahn/[email protected]/pkg/registry/ecr.go:239 > found in cache kind="/v1, Kind=Pod" name= namespace=tcn-personal-1 ref=000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 uid=fe091146-17ba-42af-863f-f937b757365d
2023-12-14T12:07:18+01:00 11:07AM DBG github.com/estahn/[email protected]/pkg/webhook/image_swapper.go:251 > set new container image image=000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 kind="/v1, Kind=Pod" name= namespace=tcn-personal-1 uid=fe091146-17ba-42af-863f-f937b757365d
2023-12-14T12:07:18+01:00 11:07AM TRC github.com/estahn/[email protected]/pkg/registry/ecr.go:239 > found in cache kind="/v1, Kind=Pod" name= namespace=tcn-personal-1 ref=000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 source-image=docker.io/library/nginx:1.14.2 target-image=000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 uid=fe091146-17ba-42af-863f-f937b757365d
2023-12-14T12:07:18+01:00 11:07AM TRC github.com/estahn/[email protected]/pkg/webhook/image_copier.go:71 > image copy aborted: image already present in target registry kind="/v1, Kind=Pod" name= namespace=tcn-personal-1 source-image=docker.io/library/nginx:1.14.2 target-image=000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 uid=fe091146-17ba-42af-863f-f937b757365d

Additional info:
Proof that the image tag is missing:

aws ecr list-images --repository-name docker.io/library/nginx --filter '{ "tagStatus": "TAGGED" }'
{
    "imageIds": [
        {
            "imageDigest": "sha256:644a70516a26004c97d0d85c7fe1d0c3a67ea8ab7ddf4aff193d9f301670cf36",
            "imageTag": "1.21.3"
        },
        {
            "imageDigest": "sha256:08bc36ad52474e528cc1ea3426b5e3f4bad8a130318e3140d6cfe29c8892c7ef",
            "imageTag": "latest"
        }
    ]
}

Also, running skopeo inspect from the k8s-image-swapper pod:

skopeo inspect --retry-times 3 docker://000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 --creds $TOKEN
FATA[0000] Error parsing image name "docker://000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2": reading manifest 1.14.2 in 000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx: manifest unknown: Requested image not found 

Workaround:
Restart the k8s-image-swapper Deployment.

estahn added the bug label Feb 10, 2024
estahn (Owner) commented Feb 12, 2024

An internal cache is used to reduce the number of requests to the AWS API by keeping track of existing images. The cache does not expire items by time but by item count and total cache size, i.e. if the cache grows too large, items are purged.

The cache allows setting a TTL per item, which I think would be useful in this case. I will set the TTL to 24h by default.
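
For illustration, here is a minimal Go sketch of such a per-item TTL cache (hypothetical names and structure, not the actual k8s-image-swapper implementation):

package cache

import (
	"sync"
	"time"
)

// entry pairs a cached lookup result with its expiry deadline.
type entry struct {
	exists    bool // e.g. "image is present in ECR"
	expiresAt time.Time
}

// TTLCache is a minimal map-based cache whose items expire
// individually once their TTL elapses.
type TTLCache struct {
	mu    sync.Mutex
	items map[string]entry
}

func NewTTLCache() *TTLCache {
	return &TTLCache{items: make(map[string]entry)}
}

// Set stores a lookup result together with a per-item TTL.
func (c *TTLCache) Set(key string, exists bool, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{exists: exists, expiresAt: time.Now().Add(ttl)}
}

// Get returns the cached result only while it is still fresh; an
// expired item is dropped, so the next lookup hits the AWS API again.
func (c *TTLCache) Get(key string) (exists, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, found := c.items[key]
	if !found || time.Now().After(e.expiresAt) {
		delete(c.items, key)
		return false, false
	}
	return e.exists, true
}

With an expiry like this, a tag deleted from ECR would be re-checked after at most one TTL instead of being served from the cache indefinitely.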

estahn self-assigned this Feb 12, 2024
estahn added a commit that referenced this issue Feb 12, 2024
Introduces TTL to cache items to prevent incorrect swaps. The TTL is 24h + random 180 minutes to prevent a cache stampede.

fixes #632
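
For illustration only, the jittered TTL described in the commit message could be computed like this (hypothetical helper, not the actual code):

package cache

import (
	"math/rand"
	"time"
)

// jitteredTTL returns 24h plus a random offset of up to 180 minutes,
// so items written at the same time do not all expire, and re-query
// the AWS API, at the same instant (a cache stampede).
func jitteredTTL() time.Duration {
	return 24*time.Hour + time.Duration(rand.Intn(180))*time.Minute
}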
vholer commented Oct 18, 2024

NOTE: The TTL is nice, but it doesn't solve the problem completely. Making the TTL configurable (even as low as 1 hour or less) would be nice, and checking whether the final pod suffers from pull errors might also help.
