
waiting for completion of hook and hook never succeeds #6880

Open
rajivml opened this issue Aug 2, 2021 · 84 comments · Fixed by argoproj/argo-helm#2861
Labels
bug (Something isn't working) · component:core (Syncing, diffing, cluster state cache) · type:bug

Comments

@rajivml

rajivml commented Aug 2, 2021

Hi,

We are seeing this issue quite often: app syncs get stuck in "waiting for completion of hook" and the hooks never complete.

As you can see below, the application got stuck in the secret creation phase, and somehow that secret never got created.

[screenshot]

I've stripped out all unnecessary details. This is how the secret is created and used by the job:

apiVersion: v1
kind: Secret
metadata:
  name: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
    helm.sh/hook-weight: "-5"
type: Opaque
data:
  xxxx

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    helm.sh/hook-weight: "-4"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      # containers and other details stripped
      volumes:
        - name: app-settings
          configMap:
            name: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}
        - name: app-secrets
          secret:
            secretName: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}

kubectl -n argocd logs argocd-server-768f46f469-j98h6 | grep xxx-migrations - No matching logs
kubectl -n argocd logs argocd-repo-server-57bdbf899c-9lxhr | grep xxx-migrations - No matching logs
kubectl -n argocd logs argocd-repo-server-57bdbf899c-7xvs7 | grep xxx-migrations - No matching logs
kubectl -n argocd logs argocd-server-768f46f469-tqp8p | grep xxx-migrations - No matching logs

[testadmin@server0 ~]$ kubectl -n argocd logs argocd-application-controller-0 | grep orchestrator-migrations
time="2021-08-02T02:16:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:16:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:19:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:19:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:17Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:17Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:25:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:25:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:28:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:28:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:31:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:31:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx

Environment:

  • 3 Node RKE2 Cluster
  • OS: RHEL 8.4
  • K8s set up on Azure VMs

ArgoCD Version: 2.0.1

Please let me know if any other info is required.

@rajivml rajivml added the bug (Something isn't working) label Aug 2, 2021
@rajivml
Author

rajivml commented Aug 2, 2021

I terminated the app sync and re-synced it, and the sync is successful now, but this can't keep happening: if it does, CI/CD runs and the automation we have built to install apps via the Argo CD CLI would fail.

@alexmt
Collaborator

alexmt commented Aug 2, 2021

I suspect this is fixed by #6294 . The fix is available in https://github.com/argoproj/argo-cd/releases/tag/v2.0.3 . Can you try upgrading please?

@rajivml
Author

rajivml commented Aug 3, 2021

Sure, thanks. We recently upgraded our develop branch to 2.0.5, and this happened on our prod build, which is on 2.0.1. I will see if this reproduces on our dev branch. Thanks!

@om3171991

@alexmt - We are using the below version of ArgoCD and seeing the same issue with the Contour Helm chart. The application is waiting for a PreSync Job to complete, whereas on the cluster I can see the job has completed.

{
"Version": "v2.1.3+d855831",
"BuildDate": "2021-09-29T21:51:21Z",
"GitCommit": "d855831540e51d8a90b1006d2eb9f49ab1b088af",
"GitTreeState": "clean",
"GoVersion": "go1.16.5",
"Compiler": "gc",
"Platform": "linux/amd64",
"KsonnetVersion": "v0.13.1",
"KustomizeVersion": "v4.2.0 2021-06-30T22:49:26Z",
"HelmVersion": "v3.6.0+g7f2df64",
"KubectlVersion": "v0.21.0",
"JsonnetVersion": "v0.17.0"
}

@illagrenan

I have the same problem in version 2.2.0.

@pseymournutanix

I have the same problem on the 2.3.0 RC1 as well

@jaydipdave

The PreSync hook, PostSync hook, and "Syncing" (while no operation is running) are the only long-pending major issues in ArgoCD at the moment.

@aslamkhan-dremio

Hello. I am still seeing this in v2.2.4. The PreSync hook is scheduled, the Job starts and runs to completion, and Argo sits there spinning "Progressing" until terminated. To work around it, we are terminating the op, using 'sync --strategy=apply' (disabling the hook), and running our job out of band.

Kube events during the sync confirm the job success. I no longer see the job/pod (per those events) if I check the namespace directly.

LAST SEEN   TYPE     REASON             OBJECT                                                       MESSAGE
22m         Normal   Scheduled          pod/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l   Successfully assigned dcs-prodemea-ns/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l to gke-service-nap-e2-standard-8-1oj503q-5bf9adda-f9t6
22m         Normal   Pulling            pod/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l   Pulling image "gcr.io/dremio-1093/accept-release:v3"
21m         Normal   Pulled             pod/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l   Successfully pulled image "gcr.io/dremio-1093/accept-release:v3" in 48.040095979s
21m         Normal   Created            pod/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l   Created container dcs-artifact
21m         Normal   Started            pod/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l   Started container dcs-artifact
22m         Normal   SuccessfulCreate   job/dcs-artifact-promoter0ba458b-presync-1645132298         Created pod: dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l

[screenshot]

Let me know if I can provide any diagnostics to help.

@MariaJohny

We face the same issue in 2.2.5 as well.

@MariaJohny

I suspect this is fixed by #6294 . The fix is available in https://github.com/argoproj/argo-cd/releases/tag/v2.0.3 . Can you try upgrading please?

Does it work with 2.0.3 or 2.2.2?

@ceguimaraes

I can confirm the error was fixed on 2.0.3. We recently upgraded to 2.3.3 and we are experiencing the error again.

@yuha0

yuha0 commented Jun 1, 2022

We started experiencing this issue after upgrading to 2.3.3. Before that we were on 2.2.3. I am not 100% sure but I do not recall we had any issue with 2.2.3.

@warmfusion
Contributor

warmfusion commented Jun 16, 2022

We're seeing a similar issue on the SyncFail hook, which means we can't actually terminate the sync action.

The job doesn't exist in the target namespace, and we've tried to trick Argo by creating a job with the same name, namespace, and annotations as we'd expect to see, with a simple echo "done" action, but nothing is helping.

[screenshot]

ArgoCD Version;

{"Version":"v2.3.4+ac8b7df","BuildDate":"2022-05-18T11:41:37Z","GitCommit":"ac8b7df9467ffcc0920b826c62c4b603a7bfed24","GitTreeState":"clean","GoVersion":"go1.17.10","Compiler":"gc","Platform":"linux/amd64","KsonnetVersion":"v0.13.1","KustomizeVersion":"v4.4.1 2021-11-11T23:36:27Z","HelmVersion":"v3.8.0+gd141386","KubectlVersion":"v0.23.1","JsonnetVersion":"v0.18.0"}

@margueritepd
Contributor

margueritepd commented Sep 27, 2022

To add some information here, we are running into the same issue ("waiting for completion of hook" when the hook has already completed), and it happens when we are attempting to sync to a revision that is not the targetRevision for the app. When we sync an app with hooks to the same revision as the targetRevision, we do not run into this.

Argo version:
{
    "Version": "v2.3.4+ac8b7df",
    "BuildDate": "2022-05-18T11:41:37Z",
    "GitCommit": "ac8b7df9467ffcc0920b826c62c4b603a7bfed24",
    "GitTreeState": "clean",
    "GoVersion": "go1.17.10",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KsonnetVersion": "v0.13.1",
    "KustomizeVersion": "v4.4.1 2021-11-11T23:36:27Z",
    "HelmVersion": "v3.8.0+gd141386",
    "KubectlVersion": "v0.23.1",
    "JsonnetVersion": "v0.18.0"
}

We are running 2 application-controller replicas in an HA setup as per https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/. I have verified we do not have a leftover instance of Argo from before it used StatefulSets.

@lacarvalho91
Contributor

lacarvalho91 commented Oct 4, 2022

I had a similar problem when I was configuring resource inclusions; I wrote down what happened here: #10756 (comment)

@pseymournutanix

I am still seeing this with 2.5.0-rc1

@cscorley

cscorley commented Oct 7, 2022

We resolved this symptom on v2.4.12+41f54aa, for Apps that had many Pods, by adding a resource exclusion along these lines to our argocd-cm ConfigMap:

data:
  resource.exclusions: |
    - apiGroups:
        - '*'
      kinds:
        - 'Pod'
      clusters:
        - '*'

Prior to this, we would have pre-sync job hooks that never completed in the ArgoCD UI but had actually completed in Kubernetes. Sometimes invalidating the cluster cache would help Argo recognize the job was completed, but most of the time it would not.

We believe the timeouts were related to needing to enumerate an excessive number of entities, which simply could never finish before the next status refresh occurred. We do not use the ArgoCD UI to view the status of Pods, so this solution is fine for us. A bonus for us is that the UI is much more robust now as well 🙂

@DasJayYa

We had this issue and it was related to a customer's Job failing to initialise due to a bad secret mount. You can validate this by checking the events in the namespace the job is spun up in to see if it's failing to create.

@dejanzele

Hello Argo community :)

I am fairly familiar with ArgoCD codebase and API, and I'd happily try to repay you for building such an awesome project by trying to have a stab at this issue, if there are no objections?

@pritam-acquia

Hello Argo community :)

I am fairly familiar with ArgoCD codebase and API, and I'd happily try to repay you for building such an awesome project by trying to have a stab at this issue, if there are no objections?

I would highly appreciate it!

@williamcodes

I would also highly appreciate that!

@vumdao

vumdao commented Feb 11, 2023

I'm seeing this issue with v2.6.1+3f143c9

@linuxbsdfreak

I am also seeing this issue when installing KubeVela with Argo CD version v2.6.1+3f143c9:

    message: >-
      waiting for completion of hook
      /ServiceAccount/kube-vela-vela-core-admission and 3 more hooks

@micke
Contributor

micke commented Feb 13, 2023

We also had this issue and it was resolved once we set ARGOCD_CONTROLLER_REPLICAS.

Instructions here: https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-application-controller

If the controller is managing too many clusters and uses too much memory then you can shard clusters across multiple controller replicas. To enable sharding increase the number of replicas in argocd-application-controller StatefulSet and repeat number of replicas in ARGOCD_CONTROLLER_REPLICAS environment variable. The strategic merge patch below demonstrates changes required to configure two controller replicas.
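For reference, the strategic merge patch from the HA docs is roughly shaped like the following sketch (two replicas used here only as an example; the ARGOCD_CONTROLLER_REPLICAS value must match spec.replicas):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "2"   # keep in sync with spec.replicas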

@boedy

boedy commented Feb 13, 2023

I rolled back from 1.6.1 to 1.5.10. Both versions keep waiting for completion of the hook, which has already completed successfully.

I also tried @micke's recommendation (changing the ARGOCD_CONTROLLER_REPLICAS from 1 to 3). Doesn't make a difference unfortunately.

@linuxbsdfreak

In my case I am only installing the application on a single cluster. That is the only application that is failing:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-vela
  annotations:
    argocd.argoproj.io/sync-wave: "10"
  finalizers:
  - resources-finalizer.argocd.argoproj.io
  namespace: argocd
spec:
  destination:
    namespace: vela-system
    name: in-cluster
  project: default
  source:
    chart: vela-core
    repoURL: https://kubevelacharts.oss-accelerate.aliyuncs.com/core
    targetRevision: 1.7.3
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
     - ApplyOutOfSyncOnly=true
     - CreateNamespace=true
     - PruneLast=true
     - ServerSideApply=true
     - Validate=true
     - Replace=true
    retry:
      limit: 30
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m0s

@boedy

boedy commented Feb 13, 2023

I just figured out what was causing Argo to freeze on the hook. In my case the specific hook had ttlSecondsAfterFinished: 0 defined in the spec. Through Kustomize I removed this field:

# kustomization.yaml
patches:
  - target:
      name: pre-hook
      kind: Job
    path: patches/hook.yaml

# patches/hook.yaml
- op: remove
  path: "/spec/ttlSecondsAfterFinished"

Afterwards the chart finally went through! It's still a bug that should be addressed; I'm just sharing this so others can work around it.
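For charts you do control, a related sketch (untested here) is to patch the TTL to a positive value instead of removing it, since later comments in this thread report that a nonzero ttlSecondsAfterFinished also avoids the hang; pre-hook is the same hypothetical Job name as above:

# kustomization.yaml
patches:
  - target:
      kind: Job
      name: pre-hook
    patch: |-
      - op: replace
        path: /spec/ttlSecondsAfterFinished
        value: 300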

@zfrhv

zfrhv commented Feb 27, 2023

I had this problem when I had a CR whose CRD was not created yet, plus a job with a Sync hook.

So ArgoCD couldn't apply the custom resource because there was no CRD yet, and the hook started and then disappeared. I guess Argo retries syncing the CR and somehow also restarts the hook.
(btw, I was using SkipDryRunOnMissingResource for the CR)

So I just made the hook PostSync.
The CR kept retrying until the CRD was created, and only after the CR was successfully created did the PostSync hook start and complete successfully.
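For anyone trying the same workaround, a minimal sketch of a Job declared as a PostSync hook might look like this (the name and image are hypothetical placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-post-sync-job               # hypothetical name
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox               # hypothetical image
          command: ["sh", "-c", "echo done"]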

@adlnc

adlnc commented Mar 3, 2023

Encountered behavior similar to what is described in this issue while upgrading from v2.5.12+9cd67b1 to v2.6.3+e05298b.
Pre-upgrade hooks on different applications, with various numbers of pods and jobs, had the same symptoms.
The sync operation runs forever.
I have the feeling this random event appears more frequently when using the Argo CLI.

@slashr

slashr commented May 2, 2024

Issue has been fixed

prometheus-community/helm-charts#4510

@j809

j809 commented May 2, 2024

The job job-createSecret.yaml does not complete syncing in 58.3.2. Setting prometheusOperator.admissionWebhooks.patch.ttlSecondsAfterFinished to 30s helped me solve the problem.

@prashant0085

Still facing the issue when trying to deploy the kube-prometheus-stack Helm chart version 58.6.1, where prometheusOperator.admissionWebhooks.patch.ttlSecondsAfterFinished is set to 60.
ArgoCD version: v2.9.0+9cf0c69
[screenshots]

@jsantosa-minsait

jsantosa-minsait commented May 30, 2024

Still facing the issue when trying to deploy kube prom stack helm chart version 58.6.1 where prometheusOperator.admissionWebhooks.patch.ttlSecondsAfterFinished is set to 60. Argocd version: v2.9.0+9cf0c69

I have the same issue; same configuration used for ttlSecondsAfterFinished, tested with 60 and 30 seconds.

prometheusOperator:
  enabled: true
  admissionWebhooks:
    patch:
      enabled: true
      ttlSecondsAfterFinished: 30

@prashant0085

@jsantosa-minsait Have you by chance enabled Istio sidecar injection? That causes the pod to complete the patching and creation, but the pod keeps on running.

@jsantosa-minsait

@jsantosa-minsait Have you by chance enabled Istio sidecar injection? That causes the pod to complete the patching and creation, but the pod keeps on running.

Hi @prashant0085, no, I don't. I have Cilium installed, and Kyverno with admission controller hooks that may alter or patch the resource. However, that is not the case here.

@ilabrovic

ilabrovic commented Jun 25, 2024

Hi, experiencing this issue as well.
Environment: OpenShift 4.14, ArgoCD v2.10.10+9b3d0c0, PostSync hook Job.
Just a simple kustomization.yaml with 2 resources and a PostSync hook Job.
No Helm.
The job actually takes about 2 minutes to complete, but ArgoCD only marks the job as finished after approx 10 minutes.
Tried different settings (0, 60, 120) for ttlSecondsAfterFinished in the job spec, but no change in behaviour.
Also monitored the memory and CPU usage of the ArgoCD pods (controller, repo, applicationset controller, etc.); no pod even comes close to its CPU or memory limit, so no issue there...

@alexmt alexmt added the component:core (Syncing, diffing, cluster state cache) and type:bug labels Jun 25, 2024
@azorahai3724

Hi, experiencing this issue as well. Environment: Openshift 4.14, Argocd v2.10.10+9b3d0c0, Postsynchook job. Just a simple kustomization.yaml with 2 resources and a postsynchookjob No Helm. The job actually takes about 2 minutes to complete, but argocd only sets the job as finished after approx 10 minutes. Tried different settings (0, 60, 120) on ttlSecondsAfterFinished in the job spec, but no change in behaviour Also monitored the memory and cpu usage of the argocd pods (controller, repo, applicationsetcontroller, etc), no pod even comes close to limitcpu or limitmemory, so no issue there...

We seem to experience a similar situation with pre- and post-sync hook pods: ArgoCD only marks them finished after circa 10 minutes, even though they actually complete earlier.

@travis-jorge

This happens for us at least once a day on our DB jobs that are pre-sync hooks (v2.11.5). Any suggestions on how to determine what is causing this?

@jkleinlercher
Contributor

jkleinlercher commented Aug 1, 2024

Since the ArgoCD Helm chart now also has a hook for Redis, we get this problem with ArgoCD managing ArgoCD as well. It stays OutOfSync with:

waiting for completion of hook batch/Job/argocd-redis-secret-init

        - group: batch
          hookPhase: Running
          hookType: PreSync
          kind: Job
          message: job.batch/sx-argocd-redis-secret-init created
          name: sx-argocd-redis-secret-init
          namespace: argocd
          syncPhase: PreSync
          version: v1

that is really a problem …

@asaf400

asaf400 commented Aug 1, 2024

@jkleinlercher Every now and then this thread makes me laugh out loud.
Thanks for this one ♥

It's funny because (this bug) it's sad and/or disappointing.. 😞

#6880 (comment)

@asaf400

asaf400 commented Aug 7, 2024

@tico24 Why is the issue closed? It is not resolved.

The fix only implements, for the ArgoCD chart, the workaround discovered here:

I just figured out what was causing Argo to freeze on the hook. In my case the specific hook had ttlSecondsAfterFinished: 0 defined in the spec.

This issue was opened for all jobs, from other Helm charts and/or plain manifests, which cannot be modified to have ttlSecondsAfterFinished > 0.

@tico24
Member

tico24 commented Aug 8, 2024

Because I merged a PR that someone else wrote which has the magic words in it, so GitHub closed this issue.

@tico24 tico24 reopened this Aug 8, 2024
@asaf400

asaf400 commented Aug 8, 2024

@tico24 Sorry, I didn't know it was automated; I thought it was intentional. Thanks for reopening 🙏

@mqxter

mqxter commented Aug 11, 2024

Still an issue. Hoping for a fix.

@petrlebedev

petrlebedev commented Aug 21, 2024

Also seeing the same issue with the Karpenter Helm chart 3.7.1: stuck on waiting for completion of hook batch/Job/karpenter-staging-post-install-hook (https://github.dev/aws/karpenter-provider-aws/tree/main/charts/karpenter).
Sad that I can't change the chart.
[screenshot]

any workarounds you can share?

@tico24
Member

tico24 commented Aug 21, 2024

While I agree that this ArgoCD issue should be addressed, this is primarily due to Karpenter's (frankly rubbish) implementation. I have been battling errors in the 1.0.0 upgrade for the past two days. Warning: removing Argo from the mix doesn't make things much better.

There are workarounds documented in the issues in the Karpenter repo.

msvechla added a commit to msvechla/karpenter-provider-aws that referenced this issue Aug 21, 2024
Fixes an argocd issue where helm hooks never finish syncing when they
have ttlSecondsAfterFinished set to 0.

See related argocd issue: argoproj/argo-cd#6880

Suggested workaround as implemented by argocd team: argoproj/argo-helm#2861
@RevealOscar

Had the same issue, although for me it was the ServiceAccount, carrying the same hooks, that was not being deleted. My Job's ttlSecondsAfterFinished was > 0. Unsure if this is the actual root cause, but I added this annotation to my Job/ServiceAccount:
argocd.argoproj.io/sync-options: PruneLast=false
My reasoning being that, under the default, pruning of the resource was going to happen last. The chart needs the deletion to happen before any resources are upgraded. I have not noticed the issue on installation of charts. Per the ArgoCD docs, PruneLast would occur after the other resources have been deployed and become healthy, so my assumption is Argo holds on to the resource, causing the sync to be delayed indefinitely. Been testing for a week without issue.
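For illustration, the annotation described above would sit on the hook resource roughly like this sketch (the Job name and the Helm hook annotations are placeholders following the pattern earlier in this thread):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-hook-job                    # hypothetical name
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    argocd.argoproj.io/sync-options: PruneLast=false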
