nodeReuse and automatedCleaningMode implementations: support for removal of a Metal3MachineTemplate #1319
/cc @hardys
/cc @furkatgofurov7
I am unsure what tool is being used, but this is a breaking point of the node-reuse workflow in your use case. The feature was implemented as part of a "bigger" feature where we have the following use case:
Are you using the scale-in feature, by any chance, and would that help?
In our case, a Helm chart defines the CAPI/Metal3 manifests (https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-capi-cluster), and it is instantiated by a FluxCD HelmRelease. But my understanding is that the issue discussed here is quite generic: any tool doing garbage collection would have the same issue (e.g. kustomize used with a GitOps tool like FluxCD or Argo, or a CRD/operator generating CAPI/Metal3 manifests for a cluster, etc.). Also, I would tend to think that any tool not doing garbage collection on unused resource templates would be problematic: garbage collection is needed to avoid leaking resources in the long run.
Yes, this is how we intend to use nodeReuse: along with disabling automated cleaning. The use case is explicitly to preserve PV data.
Well, yes, the intent definitely is to use it that way. Using maxSurge zero helps, yes, in the sense that it is necessary for nodeReuse to make any sense, but it does not address the problem raised in this issue: we need a way to preserve the Metal3MachineTemplates for nodeReuse to be effective.
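For reference, a minimal sketch of what "maxSurge zero" means here (names and versions are illustrative; the strategy fields are from the CAPI MachineDeployment API): the old Machine is deleted first, freeing its BareMetalHost, before the replacement is created, which is what gives nodeReuse a host to reuse.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-cluster-workers            # illustrative name
spec:
  clusterName: my-cluster
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # never create a new Machine before an old one is deleted
      maxUnavailable: 1  # roll one Machine at a time
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: my-cluster
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: my-cluster
    spec:
      clusterName: my-cluster
      version: v1.28.3
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: my-cluster-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: Metal3MachineTemplate
        name: my-cluster-workers-m3mt-1
```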
/triage accepted
@Rozzii: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I think this could be doable (finalizers, not ownerRefs). However, I have a few things to note first.
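To make the "finalizers, not ownerRefs" idea concrete, here is a rough sketch of the intended effect. The finalizer name is hypothetical (it does not exist in CAPM3 today); the controller would add it while any Metal3Machine still references the template, and remove it once the template is no longer needed, so that a deletion requested by a GitOps tool only completes at that point.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: m3mt-1
  finalizers:
    # hypothetical finalizer: deletion can be requested at any time, but the
    # object is only actually removed once the controller clears this entry
    - metal3machinetemplate.infrastructure.cluster.x-k8s.io/in-use
spec:
  nodeReuse: true
  template:
    spec:
      image:
        url: http://example.com/image.qcow2
        checksum: http://example.com/image.qcow2.sha256sum
```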
👍
I had the same reaction. I'm actually wondering whether a solution would be to ensure that the Metal3Machine keeps track of the intention of having nodeReuse, as expressed in the template, for instance by having a spec.nodeReuse field (?).
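For illustration, a hedged sketch of what that could look like. Today the intent only exists on the template (Metal3MachineTemplate.spec.nodeReuse); the spec.nodeReuse field on the Metal3Machine below is hypothetical and does not exist in the current API.

```yaml
# Today: the nodeReuse intent lives only on the template
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: m3mt-1
spec:
  nodeReuse: true
  template:
    spec:
      image:
        url: http://example.com/image.qcow2
        checksum: http://example.com/image.qcow2.sha256sum
---
# Idea discussed here: copy the intent onto the Metal3Machine at creation time,
# so the controller no longer needs to look up the template on deletion
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Machine
metadata:
  name: m3m-abc123                    # illustrative name
spec:
  nodeReuse: true                     # hypothetical field, not in the current API
  image:
    url: http://example.com/image.qcow2
    checksum: http://example.com/image.qcow2.sha256sum
```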
Yes, this is the idea (or until m3mt-1 is not used by any Machine). My use of the term "garbage collection" was only about deleting m3mt-1 once it stopped being used.
I would definitely not propose to delete m3mt-1 before the rollout with m3mt-2 is done.
I guess this would be one option. The tooling we built for Sylva was designed with a few things in mind: the short-term need to fully support bare-metal scenarios with Metal3, the fact that no work had been started for ClusterClass support in Metal3, and the feeling that the kind of manifest customization ClusterClass offers would require us to write runtime extensions for many, many things. We thus opted not to try ClusterClass.
This sounds worth investigating. It would be good to hear from those who designed the current API whether this was considered already, and if it was, why it was not implemented this way.
I'm confused. I thought this was what you already do and why you created the issue? If you can avoid doing this, then what is the issue?
Sorry for being confusing, I mixed what I would like to be possible (being able to delete m3mt-1) with what we are constrained to do today (we have to preserve it), and some confusion comes from the distinction between "an API request is made to delete the object" and "the object is actually removed" (all finalizers are cleared). Our goal would be the following: on an upgrade, create m3mt-2, from which new m3m will be built, and delete m3mt-1 (*); we're assuming here that we have a tool (for instance a GitOps tool, or Helm) that is able to produce the earlier versions of all resources, and if a rollback is wanted, it will be handled by that tool, which will produce a Metal3MachineTemplate with the same spec as m3mt-1. (*): what does "delete m3mt-1" actually do? ... that would depend on how Metal3 evolves to allow it...
Sorry again for making this more complex than it has to be ;)
Makes sense! Thanks for explaining!
Interesting, how are you using the disable automated cleaning feature then? AFAICT it also uses the M3MT to sync the value to the M3M.
I don't agree with that; then what is the use of cluster-api-provider-metal3/baremetal/metal3machinetemplate_manager.go (lines 90 to 100 in 321b813) in the CAPM3 code? I mean, we already have that unusual pattern you are describing in other places in the code base.
I don't recall the exact specifics, but I believe the M3MT was considered the source of truth since it is the object that is created first and cloned to create the M3M afterward; the intention was to have the user's intent (the nodeReuse field) set right at the higher-level object.
To be honest, to me this feels like abusing the API. We are treating the M3MT more like a Deployment, which it is not intended to be. Good that you brought it up though! I think we should consider carefully if this is something that should be changed in the future.
Still sounds strange to me, since the "normal flow" would anyway take the source of truth (M3MT) and use it when creating the M3M. The only reason to go back to the template would be if the value changes on either the M3M or the M3MT. These kinds of changes are generally not allowed though (instead, a new M3MT would be needed), so it is strange that we have implemented it this way.
Well, we are only at a stage where we want to ensure that all the pieces are aligned to allow using nodeReuse and disabling automated cleaning. I hadn't identified that. (For automated cleaning, my understanding is that we can also benefit from it by setting it directly in the BMH resource, although this comes with some implications.)
That was a design decision agreed on by the community in the past, and of course it might not be ideal, since it is hard to consider all use cases in advance; node reuse had to take a lot of them into account due to upgrade scenarios.
strange, but why? That is the case with all
Yes, that is right. For that feature, you can directly set it in the BMH and it should take precedence over other options.
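For completeness, a minimal sketch of setting it on the host directly (host name, BMC address, and credentials are illustrative; automatedCleaningMode on a BareMetalHost accepts "metadata" or "disabled"):

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
spec:
  automatedCleaningMode: disabled   # skip disk cleaning on deprovisioning, preserving PV data
  online: true
  bootMACAddress: "00:11:22:33:44:55"
  bmc:
    address: ipmi://192.168.111.1:6230
    credentialsName: worker-0-bmc-secret
```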
I understand, and I don't suggest changing it without thinking things through. All I'm saying is that there is a risk when we do things in "weird" ways. I think it would be worth considering alternatives, especially since there were unforeseen consequences that came from this design. Perhaps we could find a much better long term implementation now that we know more 🙂
In general, changes cannot be applied without a rolling update of the Machines. This is why the general rule is that the templates are immutable and should be updated by adding a new template and rolling over to it. However, CAPI does allow exceptions, so we are not directly breaking the contract here. It does complicate things though, and makes it much harder to reason about the state of the system, which can easily lead to more bugs. Ref: https://cluster-api.sigs.k8s.io/tasks/updating-machine-templates
Yes. Note that https://cluster-api.sigs.k8s.io/tasks/updating-machine-templates gives a kind of framework for the exceptions to this loose rule:
I think the idea is that the things that are relevant for being mutable in the template are the things that the infra provider can implement on existing machines without a rolling update (like memory or CPU allocation for a virtual machine).
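For reference, the rotation flow that the linked CAPI page describes, sketched with illustrative names: rather than mutating the existing template, a new one is created and the MachineDeployment is pointed at it, which triggers the rollout.

```yaml
# Step 1: create a new template with the changed content
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: my-cluster-workers-m3mt-2
spec:
  nodeReuse: true
  template:
    spec:
      image:
        url: http://example.com/new-image.qcow2
        checksum: http://example.com/new-image.qcow2.sha256sum
---
# Step 2: update the MachineDeployment to reference it
# (only the relevant fragment is shown here)
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-cluster-workers
spec:
  template:
    spec:
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: Metal3MachineTemplate
        name: my-cluster-workers-m3mt-2   # was ...-m3mt-1
```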
Let me push an update here. In the context of Sylva, while working on rolling updates, we realized that:
The conclusion for us is that we're going to ensure that all these resources (infra machine template and bootstrap provider config resource) are not deleted when we update the CAPI+providers resources. Thorough inventory work would be required to see how the different infra providers behave, and Kubeadm... but overall, with this kind of MachineSet controller behavior in mind, it seems like the wise path is to assume that XXXMachineTemplates need to be around for quite a while, and maybe not to try to have a "spot fix" here in Metal3 to hold off the actual m3mt deletion with a finalizer.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale |
/remove-lifecycle stale
This is going back and forth between stale and active, so I will just label it as frozen. Feel free to continue the discussion; when an implementation PR is pushed for the ticket, I will remove the label.
Context: controller needs the Metal3MachineTemplate to determine nodeReuse on Metal3Machine deletion

Today, when a Metal3Machine is deleted, the Metal3Machine controller tries to fetch the Metal3MachineTemplate that was used to create it, and if it finds it, uses it to determine if node reuse is desired (reading Metal3MachineTemplate.spec.nodeReuse).

cluster-api-provider-metal3/baremetal/metal3machine_manager.go, lines 588 to 606 in 9e5cf69
Implications on how cluster manifests are managed
If the tool used to manage Metal3Machine and Metal3MachineTemplate resources does garbage collection, it is possible that it would remove a Metal3MachineTemplate before the Metal3Machines using it are removed. The result is that nodeReuse can't reliably be used.

We've hit this issue in the context of the Sylva project, where we use a Helm chart to produce both Metal3Machine and Metal3MachineTemplate resources (along with other CAPI resources), and we use dynamic names for Metal3MachineTemplates to compensate for their immutability and ensure that a rolling upgrade is triggered when the content of the template changes. What happens is that, on a change of Metal3MachineTemplate content, the old Metal3MachineTemplate is removed prior to the rolling upgrade, so on Metal3Machine deletion that template no longer exists, and nodeReuse is considered disabled.
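To make the "dynamic names" pattern concrete, here is a generic sketch (this is not the actual sylva-capi-cluster chart code; the .Values path and names are made up): the template name embeds a hash of its rendered spec, so any content change produces a new object and triggers the rolling upgrade.

```yaml
{{- $machineSpec := .Values.workers.metal3MachineSpec | toYaml }}
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  # the name changes whenever the rendered spec changes
  name: {{ printf "%s-workers-%s" .Release.Name (sha256sum $machineSpec | trunc 8) }}
spec:
  nodeReuse: true
  template:
    spec:
{{ $machineSpec | indent 6 }}
```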
Discussion
We could have a workaround by disabling automatic garbage collection and adding our own GC tool, aware of this use of Metal3MachineTemplates by Metal3Machines, to do smarter garbage collection after a rolling upgrade.
This is however not elegant, requires additional tooling, and Metal3 users need to know about this issue to be able to avoid it.
Would it be possible to improve the controller behavior to avoid that?
In particular: could the controller rely on a "nodeReuse" annotation set directly on a Metal3Machine at creation time if it can't find the Metal3MachineTemplate?

Same for automatedCleaningMode
(Added on 2023-11-28)
During the discussion around the initial description of this issue, it came up that there is the same design (and possible code evolution) question about automatedCleaningMode.