
Workflow continues running after error occurred #13986

Open
2 of 4 tasks
jingkkkkai opened this issue Dec 11, 2024 · 3 comments
Labels: area/controller (Controller issues, panics), type/bug

Comments

@jingkkkkai (Contributor) commented Dec 11, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

In our environment, we encountered an issue where the workflow continues running after an error occurred.

[Screenshot, 2024-12-11 7:25 PM: the affected workflow next to a normal error case on the right]

As shown in the screenshot, the pod node encountered an error, but its parent step node is shown as Succeeded.
The workflow therefore continues running and produces an unexpected result.
(The workflow on the right shows a normal error case for comparison.)

After some investigation, we found that when the workflow-controller is busy, its informer cache becomes inconsistent with the cluster.
This produces many 409 Conflict warning logs:

Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io "my-another-workflow0vcmln": the object has been modified; please apply your changes to the latest version and try again Conflict

It also causes the controller to process workflow events out of order, which looks like the root cause of the issue we encountered.
It is difficult to reproduce; can anyone help with this issue?
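
For context, the usual client-go pattern for resolving this kind of 409 is to re-read the object and retry the update against the latest resourceVersion. The sketch below is illustrative only; the Workflow type and WorkflowInterface are placeholders standing in for the real Argo types, not the controller's actual code:

```go
// Minimal sketch (placeholder types, not the controller's actual code) of the
// standard client-go conflict-retry pattern: re-read the object so the update
// is applied against the latest resourceVersion.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"
)

// Workflow stands in for the real Argo Workflow type.
type Workflow struct {
	// spec/status omitted for brevity
}

// WorkflowInterface stands in for a typed client with Get/Update on Workflows.
type WorkflowInterface interface {
	Get(ctx context.Context, name string, opts metav1.GetOptions) (*Workflow, error)
	Update(ctx context.Context, wf *Workflow, opts metav1.UpdateOptions) (*Workflow, error)
}

// updateWorkflowWithRetry re-fetches the workflow on every attempt, applies the
// mutation, and retries whenever the API server answers 409 Conflict
// ("the object has been modified").
func updateWorkflowWithRetry(ctx context.Context, wfs WorkflowInterface, name string, mutate func(*Workflow)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		latest, err := wfs.Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		mutate(latest)
		_, err = wfs.Update(ctx, latest, metav1.UpdateOptions{})
		return err
	})
}
```

This only mitigates the 409s themselves; if the informer cache is stale, the re-read also has to come from the API server rather than the cache.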

Timeline:

A: the step that encountered the error
B: the step after A

09:35:39.062 -> A init Pending
09:35:39.073 -> A create pod
09:35:42.109 -> A update to phase Pending (PodInitializing)
09:35:48.943 -> A update to phase Running
09:36:46.040 -> A update to phase Succeeded
09:36:49.260 -> A update to phase Succeeded
09:36:49.279 -> B init Pending
09:36:49.493 -> B create pod
09:36:52.382 -> B update to phase Pending (PodInitializing)
09:36:59.301 -> A update to phase Pending (PodInitializing)
09:37:02.301 -> A update to phase Running
09:37:05.430 -> A update to phase Running
09:37:08.673 -> A update to phase Failed
09:38:15.793 -> A update to phase Failed
09:38:15.793 -> B update to phase Succeeded

workflow-controller related logs:

""log"":""time=\""2024-11-22T09:35:39.062Z\"" level=info msg=\""Pod node my-workflowddvp9-920151589 initialized Pending\"" namespace=my-namespace workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:35:39.073Z\"" level=info msg=\""Created pod: my-workflowddvp9[5].prepare-s3-table[0].from-parameter[0].run-from-artifact(0) (my-workflowddvp9-920151589)\"" namespace=my-namespace workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:35:42.109Z\"" level=info msg=\""node changed\"" namespace=my-namespace new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=my-workflowddvp9-920151589 old.message= old.phase=Pending old.progress=0/1 workflow=my-workflowddvp9""}"
""pod-template-hash"":""64974ffdb6""},""master_url"":""https://10.222.0.1:443/api""},""log"":""time=\""2024-11-22T09:35:48.943Z\"" level=info msg=\""node changed\"" namespace=my-namespace new.message= new.phase=Running new.progress=0/1 nodeID=my-workflowddvp9-920151589 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:36:46.040Z\"" level=info msg=\""node changed\"" namespace=my-namespace new.message= new.phase=Succeeded new.progress=0/1 nodeID=my-workflowddvp9-920151589 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:36:49.260Z\"" level=info msg=\""node changed\"" namespace=my-namespace new.message= new.phase=Succeeded new.progress=0/1 nodeID=my-workflowddvp9-920151589 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:36:49.279Z\"" level=info msg=\""Pod node my-workflowddvp9-2390334677 initialized Pending\"" namespace=my-namespace workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:36:49.493Z\"" level=info msg=\""Created pod: my-workflowddvp9[6].metric-csv-file[0].prepare-sql[1].execute-script (my-workflowddvp9-2390334677)\"" namespace=my-namespace workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:36:52.382Z\"" level=info msg=\""node changed\"" namespace=my-namespace new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=my-workflowddvp9-2390334677 old.message= old.phase=Pending old.progress=0/1 workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:36:59.301Z\"" level=info msg=\""node changed\"" namespace=my-namespace new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=my-workflowddvp9-920151589 old.message= old.phase=Running old.progress=0/1 workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:37:02.301Z\"" level=info msg=\""node changed\"" namespace=my-namespace new.message= new.phase=Running new.progress=0/1 nodeID=my-workflowddvp9-920151589 old.message= old.phase=Running old.progress=0/1 workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:37:05.430Z\"" level=info msg=\""node changed\"" namespace=my-namespace new.message= new.phase=Running new.progress=0/1 nodeID=my-workflowddvp9-920151589 old.message= old.phase=Running old.progress=0/1 workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:37:08.673Z\"" level=info msg=\""Pod failed: Error (exit code 3)\"" displayName=\""run-from-artifact(0)\"" namespace=my-namespace pod=my-workflowddvp9-920151589 templateName=from-artifact workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:37:08.673Z\"" level=info msg=\""node changed\"" namespace=my-namespace new.message=\""Error (exit code 3)\"" new.phase=Failed new.progress=0/1 nodeID=my-workflowddvp9-920151589 old.message= old.phase=Running old.progress=0/1 workflow=my-workflowddvp9""}"
""log"":""time=\""2024-11-22T09:38:15.793Z\"" level=info msg=\""node changed\"" namespace=my-namespace new.message= new.phase=Succeeded new.progress=0/1 nodeID=my-workflowddvp9-2390334677 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=my-workflowddvp9""}"

Version(s)

v3.4.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

N/A

Logs from the workflow controller

N/A

Logs from your workflow's wait container

N/A
@isubasinghe (Member)

@jingkkkkai Given that we only support the last two minor versions (3.6.x and 3.5.y), could you please confirm if this bug exists on 3.5 or 3.6? Thanks

@tooptoop4 (Contributor)

Any update, @jingkkkkai?

@jingkkkkai (Contributor, Author)

We will need to discuss internally whether it is appropriate to upgrade Argo Workflows in our environment to a newer version.

@shuangkun added the area/controller (Controller issues, panics) label on Dec 19, 2024