Only respect controller refs for resources #503

mkjpryor · 2023-01-30T21:35:59Z

This code implements the semantics raised in #501, mainly so I could see what the effects would be. I am actually pretty happy that the result suits my use case with Cluster API quite happily.

However I am well aware that this code will break a lot of tests (mostly because none of the mock objects have controller: true on their owner references even when the real objects do, e.g. deployment/rs/pod) and that, because it changes the delete semantics it will never be accepted as-is.

What I would really like to have is an additional sync option that controls this behaviour, and would appreciate any advice on how to achieve a similar effect at a place in the code where I am able to use sync options.

Fixes argoproj/argo-cd#4764

mkjpryor · 2023-01-30T21:46:07Z

What I am mainly concerned about is allowing Argo (i.e. gitops-engine) to delete resources that have owner references as long as none of the references have controller: true. However my understanding of the code is that gitops-engine identifies "top-level" resources (i.e. those without any owner references, as currently implemented) as it deploys the resource, and will only act on those resources at delete time but without checking the owner references again.

So I am struggling to see how I can get the behaviour I want at a point in the code where the sync options would be available.

codecov · 2023-02-01T12:50:44Z

Codecov Report

❌ Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.67%. Comparing base (ed70eac) to head (b0e28e3).
⚠️ Report is 105 commits behind head on master.

Files with missing lines	Patch %	Lines
pkg/cache/references.go	0.00%	6 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #503      +/-   ##
==========================================
- Coverage   55.75%   55.67%   -0.09%     
==========================================
  Files          41       41              
  Lines        4525     4532       +7     
==========================================
  Hits         2523     2523              
- Misses       1808     1814       +6     
- Partials      194      195       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

mkjpryor · 2023-02-02T15:40:46Z

I have added a new sync option, but it is only respected on a per-resource annotation basis.

This works perfectly for my use case with Cluster API, but I am happy to make adjustments if required.

mkjpryor · 2023-02-06T21:53:41Z

@jannfis @spjmurray @jaideepr97

Just flagging that I believe this patch addresses argoproj/argo-cd#11972 and argoproj/argo-cd#12210.

I have been running a custom build of Argo CD that includes this patch and can confirm that it completely fixes my issue. Build is here: https://github.com/stackhpc/argo-cd/pkgs/container/argocd/67404745?tag=v2.6.0-stackhpc.2

Like I said on the issue, this is similar but subtly different to @spjmurray's solution. In short, I don't believe we should ever delete anything with a reference that has controller: true as we will probably end up fighting another controller. @spjmurray - this actually turns out to be critical for the Cluster API use case because CAPI propagates annotations from MachineDeployments to the MachineSets that they manage, resulting in the owner reference being ignored for the MachineSet as well unless controller references are always respected.

I'm happy to do whatever tidying up is required to get this merged.

Signed-off-by: Matt Pryor <[email protected]>

sonarqubecloud · 2023-02-24T09:32:13Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

No Coverage information
0.0% Duplication

javanthropus · 2023-07-28T15:44:26Z

@mkjpryor, how does this handle node rollouts? We produce new AWSMachineTemplates, for instance, and when we do we no longer produce the old ones. Our MachineDeployments are then updated to reference the new templates. This triggers the creation of new MachineSets after which the replicas are scaled over from the old MachineSets to the new ones.

The problem is that this change would immediately clobber the old AWSMachineTemplates, making the old MachineSets invalid. While scaling them down wouldn't be an issue, scaling them up would be. Unfortunately, scale up during a node rollout can happen and does so by scaling up the old MachineSet. That will fail to spawn new nodes because of the missing template.

Have you run into this problem, and if so what do you do about it?

spjmurray · 2023-07-31T09:49:50Z

The key problem this is solving is if it was generated by a helm template, and is no longer generated by the helm template, it gets deleted, period, do as I command! At present it's impossible to remove a machine deployment without some 3rd party automation that goes around deleting them manually due to owner references that get added by CAPI. I, for one, hate this code because it's a) a nasty hack b) flaky as hell (for reasons I shall not go into).

I'm guessing if you want old machine templates to live on then you keep them alive in the helm chart, once your "upgrade" has been completed, then remove them. That would be the simple solution. Not perfect however as you'd either need to keep all templates alive forever (which happens implicitly at present) which would probably bloat your charts, and have some tooling to remove stale ones once the referencing machine deployments get killed.

Hard problem however you look at it 😸

mkjpryor · 2023-07-31T09:54:45Z

@javanthropus

You are right - that is a major issue. In fact, it is so major that we never delete machine templates (OpenStackMachineTemaplates in our case) by using the helm.sh/resource-policy: keep annotation on them.

See https://github.com/stackhpc/capi-helm-charts/blob/main/charts/openstack-cluster/templates/node-group/openstack-machine-template.yaml#L61.

When you delete the cluster, they are cleaned up via cascading deletion because Cluster API puts non-controller owner references on them.

mkjpryor · 2023-07-31T09:58:27Z

@spjmurray

When you say

I, for one, hate this code because it's a) a nasty hack b) flaky as hell (for reasons I shall not go into).

which specific piece of the puzzle are you referring to here?

javanthropus · 2023-07-31T13:16:38Z

Thank you, @mkjpryor. That's pretty much what I expected. Sadly, the Argo UI becomes cluttered and slow unless we clean up our old templates regularly because we have many different MachineDeployments. All of these require unique machine templates, which we update monthly in order to provide software updates via new machine images. Needless to say, the number of machine templates grows rapidly in our case, and Argo wants to track them all even though the vast majority are no longer needed.

IMO, this problem is caused by CAPI applying the resourceOwners incorrectly. Even if Argo would delete resources with non-controller owners set, it would prematurely clobber the machine templates in our case. Instead, CAPI should set the resourceOwners for machine templates to the MachineSet resource(s) that directly depend on them. When the MachineDeployment controller deletes unneeded MachineSets, the k8s GC would naturally kick in.

Machine templates used by KubeadmControlPlane resources would need to be handled a little differently. In that case, the ownerResources needs to be removed once the machine template is replaced. Argo should then be able to prune the machine template resource normally since it would no longer have any ownerReferences and would no longer be desired by gitops.

I've opened a thread in the CAPI Slack about this, so hopefully we can figure something out. Please drop in if you would like to contribute to the discussion.

spjmurray · 2023-08-01T08:23:02Z

@mkjpryor

This bit... We essentially have to:

Derive what MDs we expect to be generated by the chart, which in itself is a pain as we have to know how the chart generates resource names
Find all the KCTs referenced by those MDs
Final all the OSMTs referenced by those MDs
Delete all the resources that aren't part of those sets

I could probably do it more intelligently by adding an annotation to all the resources so we tie them to the helm inputs, rather than following the links, but it's still sub optimal when Argo should be capable of deleting them itself.

spjmurray · 2023-08-01T08:26:58Z

@javanthropus I did try to open a discussion with CAPI directly here kubernetes-sigs/cluster-api#7913 it's definitely on the radar, but will take time.

javanthropus · 2023-08-01T20:11:29Z

@spjmurray, thanks for that pointer. I piled onto that issue, and if the maintainers agree to my suggestion, I should be able to submit a PR.

dntosas · 2024-02-17T09:59:51Z

@mkjpryor do you have any news from maintainers for this?

mkjpryor · 2024-03-07T12:03:30Z

Not at present

dprotaso · 2024-10-17T16:54:03Z

@crenshaw-dev can you take a look at this PR?

todaywasawesome · 2024-11-14T18:17:06Z

Subscribed...

dprotaso · 2024-11-20T21:59:57Z

I added this issue/PR to the ArgoCD monthly community meeting. You can see the notes here: https://docs.google.com/document/d/1ttgw98MO45Dq7ZUHpIiOIEfbyeitKHNfMjbY5dLLMKQ/edit?tab=t.0#heading=h.qamcy9ybyb10

I won't be able to attend (on vacation). If someone (maybe @mkjpryor?) could attend Wed Dec 4th 10am PST to surface this PR in a discussion that would be great.

dprotaso · 2025-06-16T23:24:14Z

Hey @mkjpryor I followed up with some of the maintainers in the original slack thread here:

https://cloud-native.slack.com/archives/C01TSERG0KZ/p1750115707909679?thread_ts=1675116027.741979&cid=C01TSERG0KZ

It seems like the approach in the PR is good - are you able to resolve the merge conflicts?

dmosesson · 2025-08-19T13:33:10Z

At what point does someone else have the ability to resolve these merge conflicts?

mkjpryor · 2025-08-21T11:11:10Z

I'm afraid I have moved on from the company whose organisation the fork is in, and am not working on the product in which this was an issue any more.

As such I'm unable to commit to a time when I might get to rebasing this. If somebody else has time then I'm happy for them to go ahead and do it in another PR.

dprotaso · 2025-09-02T16:46:17Z

rebase is here: #771

crenshaw-dev · 2025-09-02T18:40:05Z

Closing in favor of the rebase PR. Thanks @mkjpryor for the original PR and @dprotaso for picking it up!

mkjpryor changed the title ~~WIP: Only respect controller refs for resources~~ WIP: Only respect controller refs for resources (advice wanted!) Jan 30, 2023

mkjpryor force-pushed the feature/controller-refs branch from 8dd5d6d to 7d406a1 Compare February 1, 2023 12:55

mkjpryor changed the title ~~WIP: Only respect controller refs for resources (advice wanted!)~~ Only respect controller refs for resources (advice wanted!) Feb 2, 2023

mkjpryor changed the title ~~Only respect controller refs for resources (advice wanted!)~~ Only respect controller refs for resources Feb 2, 2023

mkjpryor mentioned this pull request Feb 3, 2023

Ignore Owner References argoproj/argo-cd#11972

Open

Matt Pryor added 2 commits February 24, 2023 09:30

Only respect controller refs for resources

7f11a50

Signed-off-by: Matt Pryor <[email protected]>

Move code behind a resource-specific annotation

b0e28e3

Signed-off-by: Matt Pryor <[email protected]>

mkjpryor force-pushed the feature/controller-refs branch from 7d406a1 to b0e28e3 Compare February 24, 2023 09:31

mkjpryor mentioned this pull request Mar 7, 2024

Respect controller flag in owner references argoproj/argo-cd#12210

Closed

dprotaso mentioned this pull request Oct 17, 2024

fix using KNative with ArgoCD (don't set ownerReferences for webhooks) knative/serving#15483

Closed

sd109 mentioned this pull request Apr 29, 2025

What is ControllerReferencesOnly=true? Argocd does not mention it. azimuth-cloud/capi-helm-charts#454

Open

dprotaso mentioned this pull request Sep 2, 2025

Rebase of #503 #771

Open

crenshaw-dev closed this Sep 2, 2025

Only respect controller refs for resources #503

Only respect controller refs for resources #503

Uh oh!

Conversation

mkjpryor commented Jan 30, 2023 • edited by crenshaw-dev Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkjpryor commented Jan 30, 2023

Uh oh!

codecov bot commented Feb 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mkjpryor commented Feb 2, 2023

Uh oh!

mkjpryor commented Feb 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sonarqubecloud bot commented Feb 24, 2023

Uh oh!

javanthropus commented Jul 28, 2023

Uh oh!

spjmurray commented Jul 31, 2023

Uh oh!

mkjpryor commented Jul 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkjpryor commented Jul 31, 2023

Uh oh!

javanthropus commented Jul 31, 2023

Uh oh!

spjmurray commented Aug 1, 2023

Uh oh!

spjmurray commented Aug 1, 2023

Uh oh!

javanthropus commented Aug 1, 2023

Uh oh!

dntosas commented Feb 17, 2024

Uh oh!

mkjpryor commented Mar 7, 2024

Uh oh!

dprotaso commented Oct 17, 2024

Uh oh!

todaywasawesome commented Nov 14, 2024

Uh oh!

dprotaso commented Nov 20, 2024

Uh oh!

dprotaso commented Jun 16, 2025

Uh oh!

dmosesson commented Aug 19, 2025

Uh oh!

mkjpryor commented Aug 21, 2025

Uh oh!

dprotaso commented Sep 2, 2025

Uh oh!

crenshaw-dev commented Sep 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

mkjpryor commented Jan 30, 2023 •

edited by crenshaw-dev

Loading

codecov bot commented Feb 1, 2023 •

edited

Loading

mkjpryor commented Feb 6, 2023 •

edited

Loading

mkjpryor commented Jul 31, 2023 •

edited

Loading