Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only respect controller refs for resources #503

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mkjpryor
Copy link

This code implements the semantics raised in #501, mainly so I could see what the effects would be. I am actually pretty happy that the result suits my use case with Cluster API quite happily.

However I am well aware that this code will break a lot of tests (mostly because none of the mock objects have controller: true on their owner references even when the real objects do, e.g. deployment/rs/pod) and that, because it changes the delete semantics it will never be accepted as-is.

What I would really like to have is an additional sync option that controls this behaviour, and would appreciate any advice on how to achieve a similar effect at a place in the code where I am able to use sync options.

@mkjpryor
Copy link
Author

What I am mainly concerned about is allowing Argo (i.e. gitops-engine) to delete resources that have owner references as long as none of the references have controller: true. However my understanding of the code is that gitops-engine identifies "top-level" resources (i.e. those without any owner references, as currently implemented) as it deploys the resource, and will only act on those resources at delete time but without checking the owner references again.

So I am struggling to see how I can get the behaviour I want at a point in the code where the sync options would be available.

@mkjpryor mkjpryor changed the title WIP: Only respect controller refs for resources WIP: Only respect controller refs for resources (advice wanted!) Jan 30, 2023
@codecov
Copy link

codecov bot commented Feb 1, 2023

Codecov Report

Base: 55.75% // Head: 55.67% // Decreases project coverage by -0.09% ⚠️

Coverage data is based on head (b0e28e3) compared to base (ed70eac).
Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #503      +/-   ##
==========================================
- Coverage   55.75%   55.67%   -0.09%     
==========================================
  Files          41       41              
  Lines        4525     4532       +7     
==========================================
  Hits         2523     2523              
- Misses       1808     1814       +6     
- Partials      194      195       +1     
Impacted Files Coverage Δ
pkg/cache/references.go 61.53% <0.00%> (-6.07%) ⬇️
pkg/sync/common/types.go 54.16% <ø> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@mkjpryor mkjpryor force-pushed the feature/controller-refs branch from 8dd5d6d to 7d406a1 Compare February 1, 2023 12:55
@mkjpryor mkjpryor changed the title WIP: Only respect controller refs for resources (advice wanted!) Only respect controller refs for resources (advice wanted!) Feb 2, 2023
@mkjpryor mkjpryor changed the title Only respect controller refs for resources (advice wanted!) Only respect controller refs for resources Feb 2, 2023
@mkjpryor
Copy link
Author

mkjpryor commented Feb 2, 2023

I have added a new sync option, but it is only respected on a per-resource annotation basis.

This works perfectly for my use case with Cluster API, but I am happy to make adjustments if required.

@mkjpryor
Copy link
Author

mkjpryor commented Feb 6, 2023

@jannfis @spjmurray @jaideepr97

Just flagging that I believe this patch addresses argoproj/argo-cd#11972 and argoproj/argo-cd#12210.

I have been running a custom build of Argo CD that includes this patch and can confirm that it completely fixes my issue. Build is here: https://github.com/stackhpc/argo-cd/pkgs/container/argocd/67404745?tag=v2.6.0-stackhpc.2

Like I said on the issue, this is similar but subtly different to @spjmurray's solution. In short, I don't believe we should ever delete anything with a reference that has controller: true as we will probably end up fighting another controller. @spjmurray - this actually turns out to be critical for the Cluster API use case because CAPI propagates annotations from MachineDeployments to the MachineSets that they manage, resulting in the owner reference being ignored for the MachineSet as well unless controller references are always respected.

I'm happy to do whatever tidying up is required to get this merged.

@mkjpryor mkjpryor force-pushed the feature/controller-refs branch from 7d406a1 to b0e28e3 Compare February 24, 2023 09:31
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@javanthropus
Copy link

@mkjpryor, how does this handle node rollouts? We produce new AWSMachineTemplates, for instance, and when we do we no longer produce the old ones. Our MachineDeployments are then updated to reference the new templates. This triggers the creation of new MachineSets after which the replicas are scaled over from the old MachineSets to the new ones.

The problem is that this change would immediately clobber the old AWSMachineTemplates, making the old MachineSets invalid. While scaling them down wouldn't be an issue, scaling them up would be. Unfortunately, scale up during a node rollout can happen and does so by scaling up the old MachineSet. That will fail to spawn new nodes because of the missing template.

Have you run into this problem, and if so what do you do about it?

@spjmurray
Copy link

The key problem this is solving is if it was generated by a helm template, and is no longer generated by the helm template, it gets deleted, period, do as I command! At present it's impossible to remove a machine deployment without some 3rd party automation that goes around deleting them manually due to owner references that get added by CAPI. I, for one, hate this code because it's a) a nasty hack b) flaky as hell (for reasons I shall not go into).

I'm guessing if you want old machine templates to live on then you keep them alive in the helm chart, once your "upgrade" has been completed, then remove them. That would be the simple solution. Not perfect however as you'd either need to keep all templates alive forever (which happens implicitly at present) which would probably bloat your charts, and have some tooling to remove stale ones once the referencing machine deployments get killed.

Hard problem however you look at it 😸

@mkjpryor
Copy link
Author

mkjpryor commented Jul 31, 2023

@javanthropus

You are right - that is a major issue. In fact, it is so major that we never delete machine templates (OpenStackMachineTemaplates in our case) by using the helm.sh/resource-policy: keep annotation on them.

See https://github.com/stackhpc/capi-helm-charts/blob/main/charts/openstack-cluster/templates/node-group/openstack-machine-template.yaml#L61.

When you delete the cluster, they are cleaned up via cascading deletion because Cluster API puts non-controller owner references on them.

@mkjpryor
Copy link
Author

@spjmurray

When you say

I, for one, hate this code because it's a) a nasty hack b) flaky as hell (for reasons I shall not go into).

which specific piece of the puzzle are you referring to here?

@javanthropus
Copy link

Thank you, @mkjpryor. That's pretty much what I expected. Sadly, the Argo UI becomes cluttered and slow unless we clean up our old templates regularly because we have many different MachineDeployments. All of these require unique machine templates, which we update monthly in order to provide software updates via new machine images. Needless to say, the number of machine templates grows rapidly in our case, and Argo wants to track them all even though the vast majority are no longer needed.

IMO, this problem is caused by CAPI applying the resourceOwners incorrectly. Even if Argo would delete resources with non-controller owners set, it would prematurely clobber the machine templates in our case. Instead, CAPI should set the resourceOwners for machine templates to the MachineSet resource(s) that directly depend on them. When the MachineDeployment controller deletes unneeded MachineSets, the k8s GC would naturally kick in.

Machine templates used by KubeadmControlPlane resources would need to be handled a little differently. In that case, the ownerResources needs to be removed once the machine template is replaced. Argo should then be able to prune the machine template resource normally since it would no longer have any ownerReferences and would no longer be desired by gitops.

I've opened a thread in the CAPI Slack about this, so hopefully we can figure something out. Please drop in if you would like to contribute to the discussion.

@spjmurray
Copy link

@mkjpryor

This bit... We essentially have to:

  • Derive what MDs we expect to be generated by the chart, which in itself is a pain as we have to know how the chart generates resource names
  • Find all the KCTs referenced by those MDs
  • Final all the OSMTs referenced by those MDs
  • Delete all the resources that aren't part of those sets

I could probably do it more intelligently by adding an annotation to all the resources so we tie them to the helm inputs, rather than following the links, but it's still sub optimal when Argo should be capable of deleting them itself.

@spjmurray
Copy link

@javanthropus I did try to open a discussion with CAPI directly here kubernetes-sigs/cluster-api#7913 it's definitely on the radar, but will take time.

@javanthropus
Copy link

@spjmurray, thanks for that pointer. I piled onto that issue, and if the maintainers agree to my suggestion, I should be able to submit a PR.

@dntosas
Copy link

dntosas commented Feb 17, 2024

@mkjpryor do you have any news from maintainers for this?

@mkjpryor
Copy link
Author

mkjpryor commented Mar 7, 2024

Not at present

@dprotaso
Copy link

@crenshaw-dev can you take a look at this PR?

@todaywasawesome
Copy link

Subscribed...

@dprotaso
Copy link

I added this issue/PR to the ArgoCD monthly community meeting. You can see the notes here: https://docs.google.com/document/d/1ttgw98MO45Dq7ZUHpIiOIEfbyeitKHNfMjbY5dLLMKQ/edit?tab=t.0#heading=h.qamcy9ybyb10

I won't be able to attend (on vacation). If someone (maybe @mkjpryor?) could attend Wed Dec 4th 10am PST to surface this PR in a discussion that would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants