Issue with tags removed from ECR #632

Closed
tcyran opened this issue Dec 14, 2023 · 2 comments · Fixed by #669
tcyran commented Dec 14, 2023

What we did:
We introduced a lifecycle policy on our ECR repositories to avoid keeping really old cached images. We configured it to retain only the 3 most recent tags, which removed many tags, including old ones and some that were still in use. After the cleanup, k8s-image-swapper still recognized the deleted images as existing in ECR and mutated pods to start with the ECR-cached image, which ends in ImagePullBackOff.

What is the issue:
There seems to be a cache around the skopeo lookups that still sees the image even though it no longer exists. After deleting/recreating the k8s-image-swapper pod, the situation returns to normal.

Steps to reproduce:

  1. Start a deployment with nginx:1.14.2
  2. Wait until k8s-image-swapper has cached the image
  3. Restart the nginx deployment - it will start with the cached image
  4. Remove the image tag from ECR
  5. Restart the nginx deployment - it will fall into ImagePullBackOff

Logs:

2023-12-14T12:07:18+01:00 11:07AM DBG github.com/estahn/[email protected]/pkg/webhook/image_swapper.go:285 > jmespath search results filter="obj.metadata.namespace == 'kube-system'" results=false
2023-12-14T12:07:18+01:00 11:07AM TRC github.com/estahn/[email protected]/pkg/registry/ecr.go:239 > found in cache kind="/v1, Kind=Pod" name= namespace=tcn-personal-1 ref=000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 uid=fe091146-17ba-42af-863f-f937b757365d
2023-12-14T12:07:18+01:00 11:07AM DBG github.com/estahn/[email protected]/pkg/webhook/image_swapper.go:251 > set new container image image=000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 kind="/v1, Kind=Pod" name= namespace=tcn-personal-1 uid=fe091146-17ba-42af-863f-f937b757365d
2023-12-14T12:07:18+01:00 11:07AM TRC github.com/estahn/[email protected]/pkg/registry/ecr.go:239 > found in cache kind="/v1, Kind=Pod" name= namespace=tcn-personal-1 ref=000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 source-image=docker.io/library/nginx:1.14.2 target-image=000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 uid=fe091146-17ba-42af-863f-f937b757365d
2023-12-14T12:07:18+01:00 11:07AM TRC github.com/estahn/[email protected]/pkg/webhook/image_copier.go:71 > image copy aborted: image already present in target registry kind="/v1, Kind=Pod" name= namespace=tcn-personal-1 source-image=docker.io/library/nginx:1.14.2 target-image=000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 uid=fe091146-17ba-42af-863f-f937b757365d

Additional info:
Proof that the image tag is missing:

aws ecr list-images --repository-name docker.io/library/nginx --filter '{ "tagStatus": "TAGGED" }'
{
    "imageIds": [
        {
            "imageDigest": "sha256:644a70516a26004c97d0d85c7fe1d0c3a67ea8ab7ddf4aff193d9f301670cf36",
            "imageTag": "1.21.3"
        },
        {
            "imageDigest": "sha256:08bc36ad52474e528cc1ea3426b5e3f4bad8a130318e3140d6cfe29c8892c7ef",
            "imageTag": "latest"
        }
    ]
}

Also, running skopeo inspect from the k8s-image-swapper pod:

skopeo inspect --retry-times 3 docker://000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2 --creds $TOKEN
FATA[0000] Error parsing image name "docker://000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx:1.14.2": reading manifest 1.14.2 in 000000000000.dkr.ecr.eu-west-1.amazonaws.com/docker.io/library/nginx: manifest unknown: Requested image not found 

Workaround:
Restart the k8s-image-swapper Deployment.

estahn added the bug label Feb 10, 2024
estahn (Owner) commented Feb 12, 2024

An internal cache is used to reduce the number of requests to the AWS API by keeping track of existing images. The cache does not expire items by time but by item count and total cache size, i.e. if the cache grows too large, items are purged.

The cache allows setting a TTL per item, which I think would be useful in this case. I will set the TTL to 24h by default.
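
For illustration, here is a minimal Go sketch of such a per-item TTL cache (hypothetical names and structure, not the actual k8s-image-swapper implementation):

package cache

import (
	"sync"
	"time"
)

// entry pairs a cached lookup result with its expiry deadline.
type entry struct {
	exists    bool // e.g. "image is present in ECR"
	expiresAt time.Time
}

// TTLCache is a minimal map-based cache whose items expire
// individually once their TTL elapses.
type TTLCache struct {
	mu    sync.Mutex
	items map[string]entry
}

func NewTTLCache() *TTLCache {
	return &TTLCache{items: make(map[string]entry)}
}

// Set stores a lookup result together with a per-item TTL.
func (c *TTLCache) Set(key string, exists bool, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{exists: exists, expiresAt: time.Now().Add(ttl)}
}

// Get returns the cached result only while it is still fresh; an
// expired item is dropped, so the next lookup hits the AWS API again.
func (c *TTLCache) Get(key string) (exists, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, found := c.items[key]
	if !found || time.Now().After(e.expiresAt) {
		delete(c.items, key)
		return false, false
	}
	return e.exists, true
}

With an expiry like this, a tag deleted from ECR would be re-checked after at most one TTL instead of being served from the cache indefinitely.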

estahn self-assigned this Feb 12, 2024
estahn added a commit that referenced this issue Feb 12, 2024
Introduces TTL to cache items to prevent incorrect swaps. The TTL is 24h + random 180 minutes to prevent a cache stampede.

fixes #632
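
For illustration only, the jittered TTL described in the commit message could be computed like this (hypothetical helper, not the actual code):

package cache

import (
	"math/rand"
	"time"
)

// jitteredTTL returns 24h plus a random offset of up to 180 minutes,
// so items written at the same time do not all expire, and re-query
// the AWS API, at the same instant (a cache stampede).
func jitteredTTL() time.Duration {
	return 24*time.Hour + time.Duration(rand.Intn(180))*time.Minute
}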
vholer commented Oct 18, 2024

NOTE: The TTL is nice, but it doesn't solve the problem completely. Making the TTL configurable (even as low as 1 hour or less) would be nice, and checking whether the final pod suffers from pull errors might also help.
