Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add generation to all component statuses #5684

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ari-e
Copy link
Contributor

@ari-e ari-e commented Jul 16, 2024

What problem does this PR solve?

tidb-operator currently does not propagate the CR object meta's generation to the status. This is problematic because it means external systems cannot always tell whether the status corresponds to the current version of the CRD spec, or if it is an old status. For example consider the series of events

process A updates CR spec
process B reads CR status
tidb-operator reconciliation triggers on the object and updates status

Here, process B will think that the status corresponds to the updated spec as written by process A, when in reality it is a stale status belonging to a previous version of the CR. This is exacerbated if the operator has high latency or is briefly unavailable due to deploys or other reasons.

What is changed and how does it work?

Each component status within the overarching TidbClusterStatus now has a generation field in addition to its other fields like phase. Whenever the phase is updated by the respective controllers, the generation is also set to be equal to the current CR's generation. Kubernetes increments this generation anytime the spec for the CR is updated, so this provides a consistent way to indicate the status reflects the current version of the CR spec.

Note this practice of passing the generation through to the status is standard practice and is how Kubernetes deployments work, and Kubernetes source code even indicates that all custom operators should do this.

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

Side effects

  • Breaking backward compatibility
    TidbCluster now registers status as a subresource, meaning changes to status do not increment the CR's generation number.
  • Other side effects:
    CRD changes need to be deployed to clusters

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

TidbCluster CRD now registers `status` as a subresource of the CR, so updates to status no longer update the `generation` meta field of the CR.

TiDB Operator now sets `observedGeneration` field on each component status anytime a component's `phase` is updated.

Copy link
Contributor

ti-chi-bot bot commented Jul 16, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bornchanger for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-bot
Copy link
Contributor

sre-bot commented Jul 16, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

ti-chi-bot bot commented Jul 16, 2024

Welcome @ari-e! It looks like this is your first PR to pingcap/tidb-operator 🎉

@ti-chi-bot ti-chi-bot bot added the size/L label Jul 16, 2024
@ari-e ari-e force-pushed the ari--add-generation-to-status branch from 0008610 to 44c6e1a Compare July 16, 2024 23:58
@csuzhangxc
Copy link
Member

Thanks for your contribution, but as our status is not a sub-resources of the CR now. when we updating the status, the generation in the metadata will also be updated. then this observed generation in status may be stale again. does this work in your case?

@ari-e
Copy link
Contributor Author

ari-e commented Jul 17, 2024

Hi @csuzhangxc thanks for the comment. Is not registering status as a subresource intentional? I can easily convert it to a subresource but then that would of course affect backwards compatibility. I still think that's the right thing to do though as it's more canonical k8s.

@csuzhangxc
Copy link
Member

How about using the name of observedGeneration instead of generation?

@ari-e ari-e force-pushed the ari--add-generation-to-status branch from 44c6e1a to 37657f4 Compare July 17, 2024 03:26
@ti-chi-bot ti-chi-bot bot added size/XL and removed size/L labels Jul 17, 2024
@ari-e
Copy link
Contributor Author

ari-e commented Jul 17, 2024

@csuzhangxc sure I updated it to observedGeneration and I also changed the annotations on TidbCluster to make it register status as a subresource.

@csuzhangxc
Copy link
Member

will it break any existing clusters after chaning it to subresources?

@ari-e
Copy link
Contributor Author

ari-e commented Jul 17, 2024

will it break any existing clusters after changing it to subresources?

The only behavior change should be this difference with the generation number. So if someone is depending on generation being incremented whenever the status changes that will break. But that seems like depending on undefined behavior to me. Generation is meant for spec changes.

This could be included in the next major version release if that's a concern.

@ari-e ari-e force-pushed the ari--add-generation-to-status branch from 37657f4 to 04e87ce Compare July 17, 2024 23:41
@codecov-commenter
Copy link

codecov-commenter commented Jul 19, 2024

Codecov Report

Attention: Patch coverage is 88.46154% with 3 lines in your changes missing coverage. Please review.

Project coverage is 61.48%. Comparing base (9ef26f8) to head (f7b56af).
Report is 15 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5684      +/-   ##
==========================================
+ Coverage   61.47%   61.48%   +0.01%     
==========================================
  Files         235      235              
  Lines       30653    30741      +88     
==========================================
+ Hits        18843    18902      +59     
- Misses       9920     9942      +22     
- Partials     1890     1897       +7     
Flag Coverage Δ
unittest 61.48% <88.46%> (+0.01%) ⬆️

@csuzhangxc
Copy link
Member

will it break any existing clusters after changing it to subresources?

The only behavior change should be this difference with the generation number. So if someone is depending on generation being incremented whenever the status changes that will break. But that seems like depending on undefined behavior to me. Generation is meant for spec changes.

This could be included in the next major version release if that's a concern.

I still think the change of subresource is too critical. Can you make this change only to one CRD and do more tests to ensure it works correctly, especially for upgrading from a previous version of TiDB Operator.

BTW, the e2e-example case failed.

Copy link
Contributor

ti-chi-bot bot commented Jul 25, 2024

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ari-e
Copy link
Contributor Author

ari-e commented Aug 1, 2024

@csuzhangxc thanks for the feedback. I'll do some more testing and fix the e2e. And again this can go in the next major version release if the compatibility change is a big concern. I'm busy with other projects for the next week or so but will pick this up after.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants