Fix deployment status propagation when scaling from zero #15550
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main   #15550      +/-   ##
==========================================
- Coverage   80.77%   80.63%   -0.14%
==========================================
  Files         222      222
  Lines       18035    18102      +67
==========================================
+ Hits        14567    14597      +30
- Misses       3094     3128      +34
- Partials      374      377       +3
==========================================
```

☔ View full report in Codecov by Sentry.
```go
// Mark the resource unavailable if we are scaling back to zero but we never
// achieved the required scale and the deployment status was not updated
// properly by K8s, for example due to an image pull error.
if ps.ScaleTargetNotScaled() {
	condScaled := ps.GetCondition(autoscalingv1alpha1.PodAutoscalerConditionScaleTargetScaled)
```
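The excerpt above is truncated in the diff view. As a rough illustration of the pattern it follows — check whether the target ever reached the desired scale, and if not, mark a condition False so the failure propagates instead of the resource staying Ready — here is a minimal self-contained Go sketch. The types and helpers (`Condition`, `PodScaler`, `MarkScaleTargetNotScaled`) are hypothetical stand-ins, not the Knative API:

```go
package main

import "fmt"

// Condition mirrors the shape of a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// PodScaler is a hypothetical stand-in for a PodAutoscaler's status.
type PodScaler struct {
	DesiredScale int32
	ActualScale  int32
	Conditions   map[string]Condition
}

// ScaleTargetNotScaled reports whether the target never reached the desired scale.
func (ps *PodScaler) ScaleTargetNotScaled() bool {
	return ps.ActualScale < ps.DesiredScale
}

// GetCondition returns the condition with the given type, or nil if unset.
func (ps *PodScaler) GetCondition(t string) *Condition {
	if c, ok := ps.Conditions[t]; ok {
		return &c
	}
	return nil
}

// MarkScaleTargetNotScaled records the failure so it can propagate upward.
func (ps *PodScaler) MarkScaleTargetNotScaled(reason, msg string) {
	ps.Conditions["ScaleTargetScaled"] = Condition{
		Type: "ScaleTargetScaled", Status: "False", Reason: reason, Message: msg,
	}
}

func main() {
	ps := &PodScaler{DesiredScale: 1, ActualScale: 0, Conditions: map[string]Condition{}}
	if ps.ScaleTargetNotScaled() {
		// Only mark the condition if it is not already False, so an earlier,
		// more specific failure reason is preserved.
		if c := ps.GetCondition("ScaleTargetScaled"); c == nil || c.Status != "False" {
			ps.MarkScaleTargetNotScaled("NotScaled", "pods never became ready, e.g. image pull error")
		}
	}
	fmt.Printf("%+v\n", ps.Conditions["ScaleTargetScaled"])
}
```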
We could set `ContainerHealthyFalse` here too, but we need #15503.
This Pull Request is stale because it has been open for 90 days with no activity.
/remove-lifecycle stale
Force-pushed from 5ba5209 to 39037b8
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: skonto. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Fixes #14157
Proposed Changes
Introduces a new PA condition (`PodAutoscalerConditionScaleTargetScaled`) that detects failures during scaling to zero, covering the K8s gaps where the deployment status is not updated correctly. The condition is set to `false` just before we scale down to zero (before the deployment update happens) and if pods are crashing. We set it back to `true` when we scale from zero and have enough ready pods.

Previously, when the deployment was scaled down to zero, the revision's ready status would be `true` (and stay that way); with this patch the pod error is detected and propagated, along the lines of the illustrative status below:
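The example output that followed here is not preserved; the snippet below is a hypothetical sketch of what the propagated Revision status could look like under this change. The `reason` and `message` values are invented for illustration, not taken from the PR:

```yaml
# Hypothetical Revision status after this patch: the scaling failure surfaces
# as Ready=False instead of the Revision silently staying Ready.
status:
  conditions:
  - type: Active
    status: "False"
    reason: NoTraffic
  - type: Ready
    status: "False"
    reason: NotScaled
    message: 'pod failure: back-off pulling image "example.com/app:latest"'
```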