Only respect controller refs for resources #503
base: master
Conversation
What I am mainly concerned about is allowing Argo (i.e. gitops-engine) to delete resources that have owner references, as long as none of those references have controller: true. So I am struggling to see how I can get the behaviour I want at a point in the code where the sync options would be available.
Codecov Report. Base: 55.75% // Head: 55.67% // Decreases project coverage by 0.09%.
Additional details and impacted files
@@ Coverage Diff @@
## master #503 +/- ##
==========================================
- Coverage 55.75% 55.67% -0.09%
==========================================
Files 41 41
Lines 4525 4532 +7
==========================================
Hits 2523 2523
- Misses 1808 1814 +6
- Partials 194 195 +1
☔ View full report at Codecov.
Force-pushed from 8dd5d6d to 7d406a1
I have added a new sync option, but it is only respected on a per-resource annotation basis. This works perfectly for my use case with Cluster API, but I am happy to make adjustments if required.
@jannfis @spjmurray @jaideepr97 Just flagging that I believe this patch addresses argoproj/argo-cd#11972 and argoproj/argo-cd#12210.

I have been running a custom build of Argo CD that includes this patch and can confirm that it completely fixes my issue. The build is here: https://github.com/stackhpc/argo-cd/pkgs/container/argocd/67404745?tag=v2.6.0-stackhpc.2

Like I said on the issue, this is similar to, but subtly different from, @spjmurray's solution. In short, I don't believe we should ever delete anything with a reference that has controller: true.

I'm happy to do whatever tidying up is required to get this merged.
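To make the intended semantics concrete, here is a minimal Go sketch (not the actual gitops-engine code from this PR) of the predicate being discussed: a resource would only count as owned, and therefore be protected from pruning, when one of its ownerReferences has controller: true; non-controller references alone would no longer block deletion.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// hasControllerOwner reports whether any ownerReference on obj has
// controller: true. Under the proposed semantics, only such resources
// would be protected from pruning; non-controller references alone would not.
func hasControllerOwner(obj *unstructured.Unstructured) bool {
	for _, ref := range obj.GetOwnerReferences() {
		if ref.Controller != nil && *ref.Controller {
			return true
		}
	}
	return false
}

func main() {
	// Example: a machine template carrying only the non-controller owner
	// reference that Cluster API adds. It would remain prunable.
	isController := false
	obj := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
		"kind":       "AWSMachineTemplate",
		"metadata":   map[string]interface{}{"name": "workers"},
	}}
	obj.SetOwnerReferences([]metav1.OwnerReference{{
		APIVersion: "cluster.x-k8s.io/v1beta1",
		Kind:       "Cluster",
		Name:       "demo",
		Controller: &isController,
	}})
	fmt.Println("protected from pruning:", hasControllerOwner(obj))
}
```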
Signed-off-by: Matt Pryor <[email protected]>
Force-pushed from 7d406a1 to b0e28e3
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
@mkjpryor, how does this handle node rollouts? We produce new AWSMachineTemplates, for instance, and when we do we no longer produce the old ones. Our MachineDeployments are then updated to reference the new templates. This triggers the creation of new MachineSets, after which the replicas are scaled over from the old MachineSets to the new ones.

The problem is that this change would immediately clobber the old AWSMachineTemplates, making the old MachineSets invalid. While scaling them down wouldn't be an issue, scaling them up would be. Unfortunately, scale up during a node rollout can happen, and it does so by scaling up the old MachineSet. That will fail to spawn new nodes because of the missing template.

Have you run into this problem, and if so, what do you do about it?
The key problem this solves is that if a resource was generated by a Helm template and is no longer generated by that template, it gets deleted. Period, do as I command! At present it's impossible to remove a MachineDeployment without some 3rd-party automation that goes around deleting things manually, due to the owner references that get added by CAPI. I, for one, hate this code because it's a) a nasty hack and b) flaky as hell (for reasons I shall not go into).

I'm guessing that if you want old machine templates to live on, you keep them alive in the Helm chart and only remove them once your "upgrade" has completed. That would be the simple solution. It's not perfect, however: you'd either need to keep all templates alive forever (which happens implicitly at present), which would probably bloat your charts, or have some tooling to remove stale ones once the referencing MachineDeployments get killed. Hard problem however you look at it 😸
You are right - that is a major issue. In fact, it is so major that we never delete machine templates. When you delete the cluster, they are cleaned up via cascading deletion, because Cluster API puts non-controller owner references on them.
When you say ... which specific piece of the puzzle are you referring to here?
Thank you, @mkjpryor. That's pretty much what I expected. Sadly, the Argo UI becomes cluttered and slow unless we clean up our old templates regularly, because we have many different MachineDeployments. All of these require unique machine templates, which we update monthly in order to provide software updates via new machine images. Needless to say, the number of machine templates grows rapidly in our case, and Argo wants to track them all even though the vast majority are no longer needed.

IMO, this problem is caused by CAPI applying the ownerReferences incorrectly. Even if Argo would delete resources with non-controller owners set, it would prematurely clobber the machine templates in our case. Instead, CAPI should set the ownerReferences for machine templates to the MachineSet resource(s) that directly depend on them. When the MachineDeployment controller deletes unneeded MachineSets, the Kubernetes GC would naturally kick in. Machine templates used by KubeadmControlPlane resources would need to be handled a little differently: in that case, the ownerReference needs to be removed once the machine template is replaced. Argo should then be able to prune the machine template resource normally, since it would no longer have any ownerReferences and would no longer be desired by gitops.

I've opened a thread in the CAPI Slack about this, so hopefully we can figure something out. Please drop in if you would like to contribute to the discussion.
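As a rough illustration of the ownership shape suggested here (purely hypothetical: this is not how CAPI behaves today, nor part of this PR), the machine template would carry a non-controller ownerReference to the MachineSet that uses it, so the Kubernetes garbage collector cleans it up once the MachineDeployment controller deletes that MachineSet. Names and values below are invented for the example.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/types"
)

// ownMachineTemplate appends a non-controller ownerReference pointing at the
// MachineSet that uses the template, so the template follows the MachineSet's
// lifecycle instead of the Cluster's.
func ownMachineTemplate(template *unstructured.Unstructured, msName string, msUID types.UID) {
	isController := false
	template.SetOwnerReferences(append(template.GetOwnerReferences(), metav1.OwnerReference{
		APIVersion: "cluster.x-k8s.io/v1beta1",
		Kind:       "MachineSet",
		Name:       msName,
		UID:        msUID,
		Controller: &isController,
	}))
}

func main() {
	tmpl := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
		"kind":       "AWSMachineTemplate",
		"metadata":   map[string]interface{}{"name": "workers-2023-01"},
	}}
	// Hypothetical MachineSet name and UID for illustration only.
	ownMachineTemplate(tmpl, "workers-7d9f8", types.UID("00000000-0000-0000-0000-000000000000"))
	fmt.Printf("%+v\n", tmpl.GetOwnerReferences())
}
```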
This bit... We essentially have to:

I could probably do it more intelligently by adding an annotation to all the resources so we tie them to the Helm inputs, rather than following the links, but it's still suboptimal when Argo should be capable of deleting them itself.
@javanthropus I did try to open a discussion with CAPI directly here: kubernetes-sigs/cluster-api#7913. It's definitely on the radar, but it will take time.
@spjmurray, thanks for that pointer. I piled onto that issue, and if the maintainers agree to my suggestion, I should be able to submit a PR.
@mkjpryor do you have any news from the maintainers on this?
Not at present
@crenshaw-dev can you take a look at this PR?
Subscribed...
I added this issue/PR to the ArgoCD monthly community meeting. You can see the notes here: https://docs.google.com/document/d/1ttgw98MO45Dq7ZUHpIiOIEfbyeitKHNfMjbY5dLLMKQ/edit?tab=t.0#heading=h.qamcy9ybyb10
I won't be able to attend (I'm on vacation). If someone (maybe @mkjpryor?) could attend on Wed Dec 4th at 10am PST to surface this PR in a discussion, that would be great.
This code implements the semantics raised in #501, mainly so I could see what the effects would be. I am actually pretty happy that the result suits my use case with Cluster API nicely.

However, I am well aware that this code will break a lot of tests (mostly because none of the mock objects have controller: true on their owner references, even when the real objects do, e.g. deployment/rs/pod) and that, because it changes the delete semantics, it will never be accepted as-is.

What I would really like is an additional sync option that controls this behaviour, and I would appreciate any advice on how to achieve a similar effect at a place in the code where I am able to use sync options.
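As a rough, self-contained sketch of what a per-resource opt-in might look like: the annotation key below is Argo CD's existing per-resource sync-options annotation, but the option name IgnoreNonControllerOwners is invented for illustration and is not necessarily what this PR adds.

```go
package main

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// syncOptionsAnnotation is Argo CD's per-resource sync-options annotation key.
const syncOptionsAnnotation = "argocd.argoproj.io/sync-options"

// hasSyncOption reports whether the comma-separated sync-options annotation on
// obj contains the given option string.
func hasSyncOption(obj *unstructured.Unstructured, option string) bool {
	for _, o := range strings.Split(obj.GetAnnotations()[syncOptionsAnnotation], ",") {
		if strings.TrimSpace(o) == option {
			return true
		}
	}
	return false
}

func main() {
	obj := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
		"kind":       "AWSMachineTemplate",
		"metadata": map[string]interface{}{
			"name": "workers-2023-01",
			"annotations": map[string]interface{}{
				// "IgnoreNonControllerOwners" is a hypothetical option name.
				syncOptionsAnnotation: "IgnoreNonControllerOwners=true",
			},
		},
	}}
	fmt.Println(hasSyncOption(obj, "IgnoreNonControllerOwners=true"))
}
```

The pruning logic could then consult such an option together with a controller-reference check like the one sketched earlier in this thread.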