[CELEBORN-1904] Cancel the stage running tasks on stage rerun #3144

turboFei · 2025-03-11T04:46:40Z

What changes were proposed in this pull request?

On the SparkListenerStageCompleted event, check whether a shuffle fetch failure was reported in the stage. If so, cancel the stage's running tasks, because the Celeborn client will rerun the whole stage.
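
The flow described above can be sketched in plain Scala without a Spark dependency. This is only an illustrative model, not the code in this PR: `FakeScheduler`, `RunningTask`, and `CancelOnRerunListener` are hypothetical stand-ins, and the sketch keys fetch failures by stage ID for simplicity, whereas the actual listener tracks the task IDs that reported a fetch failure.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for scheduler state; none of these are Spark or Celeborn APIs.
object StageRerunSketch {
  final case class RunningTask(taskId: Long, stageId: Int)

  final class FakeScheduler {
    private val running = mutable.Map.empty[Long, RunningTask]
    val killed = mutable.ArrayBuffer.empty[Long]

    def launch(task: RunningTask): Unit = running.update(task.taskId, task)

    // Running tasks of one stage, ordered by task ID for deterministic output.
    def runningTasks(stageId: Int): Seq[RunningTask] =
      running.values.filter(_.stageId == stageId).toSeq.sortBy(_.taskId)

    // Killing a task that is no longer running is a no-op.
    def kill(taskId: Long, reason: String): Unit =
      if (running.remove(taskId).isDefined) killed += taskId
  }

  final class CancelOnRerunListener(scheduler: FakeScheduler) {
    // Stage IDs for which a shuffle fetch failure has been reported (assumed bookkeeping).
    private val stagesWithFetchFailure = mutable.Set.empty[Int]

    def reportFetchFailure(stageId: Int): Unit = synchronized {
      stagesWithFetchFailure += stageId
    }

    // Models the stage-completed callback: cancel only if a fetch failure was reported.
    def onStageCompleted(stageId: Int): Unit = synchronized {
      if (stagesWithFetchFailure.remove(stageId)) {
        scheduler.runningTasks(stageId).foreach { t =>
          scheduler.kill(t.taskId, s"stage $stageId will be rerun after a shuffle fetch failure")
        }
      }
    }
  }
}
```

In this model, a stage that completes without a reported fetch failure keeps its running tasks untouched, and only the tasks of the affected stage are killed.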

Why are the changes needed?

If a task fails due to FetchFailed, the DAG scheduler calls markStageAsFinished:
https://github.com/apache/spark/blob/3a872b7ca11faa128a2667de55f6dca13807057a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2022-L2060

But it does not cancel the tasks that are still running in the stage.

For example, in one stage a task failed due to a fetch failure, and the stage duration was 39s.

However, the running tasks were not cancelled: the 2496 launched tasks kept running, and the maximum task duration was 31 minutes.

This wastes a lot of compute resources.

On a Celeborn shuffle fetch failure, the whole stage is rerun, so it is fine to cancel all the running tasks.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT & Cluster testing.

The stage terminated quickly.

turboFei · 2025-03-11T05:17:33Z

...ain/scala/org/apache/spark/shuffle/celeborn/ShuffleFetchFailureReportTaskCleanListener.scala

+    if (shuffleFetchFailureTaskIds != null) {
+      shuffleFetchFailureTaskIds.asScala.headOption.foreach { case taskId =>
+        val taskSetManager = SparkUtils.getTaskSetManager(taskId)
+        if (taskSetManager != null && taskSetManager.runningTasks > 0) {


We do not know whether the task referenced by shuffleFetchFailureTaskIds eventually triggered the FetchFailedException (that depends on whether another task attempt is running or has already finished, see #2921).

Even so, it should be safe to cancel the running tasks after checking taskSetManager.runningTasks > 0.
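
As a rough illustration of the guard, the null checks in the diff can be rewritten with `Option` so that cancellation is issued only when a task set manager is found and it still has running tasks. `FakeTaskSetManager`, `cancelRunning`, and `cancelIfRunning` are hypothetical names for this sketch and are not part of Spark or Celeborn.

```scala
object GuardSketch {
  // Hypothetical stand-in for a task set manager: only the running-task count is modeled.
  final class FakeTaskSetManager(var runningTasks: Int) {
    var cancelCalls = 0
    def cancelRunning(): Unit = { cancelCalls += 1; runningTasks = 0 }
  }

  // Returns true only when a cancellation was actually issued.
  def cancelIfRunning(
      reportedTaskIds: Seq[Long],
      lookup: Long => Option[FakeTaskSetManager]): Boolean =
    reportedTaskIds.headOption        // first reported task ID, if any
      .flatMap(lookup)                // its task set manager, if one is found
      .filter(_.runningTasks > 0)     // skip when nothing is running
      .map(_.cancelRunning())
      .isDefined
}
```

With this shape, a second call for the same task set is a no-op, because the running-task count is already zero.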

turboFei · 2025-03-11T17:20:46Z

cc @FMX @RexXiong @pan3793

turboFei · 2025-03-12T04:39:44Z

cc @SteNicholas

turboFei force-pushed the cancel_tasks branch from f09fbd5 to 58a64ac Compare March 11, 2025 04:52

turboFei marked this pull request as draft March 11, 2025 04:53

cancel running tasks

bf26016

turboFei force-pushed the cancel_tasks branch from 58a64ac to bf26016 Compare March 11, 2025 04:56

turboFei changed the title ~~cancel running tasks~~ [CELEBORN-1904] Cancel the running tasks if the stage is marked as failed due to shuffle fetch failure Mar 11, 2025

kill task

e54f94f

turboFei force-pushed the cancel_tasks branch from 1922279 to e54f94f Compare March 11, 2025 06:47

turboFei added 3 commits March 11, 2025 01:14

fix

7560f83

UT

133d138

fix ut

1d3d341

turboFei marked this pull request as ready for review March 11, 2025 10:05

synchronized

a78cb4e

turboFei changed the title ~~[CELEBORN-1904] Cancel the running tasks if the stage is marked as failed due to shuffle fetch failure~~ [CELEBORN-1904] Cancel the running tasks if the stage need to be rerun Mar 12, 2025

turboFei changed the title ~~[CELEBORN-1904] Cancel the running tasks if the stage need to be rerun~~ [CELEBORN-1904] Cancel the stage running tasks on stage rerun Mar 12, 2025

turboFei requested a review from mridulm March 12, 2025 04:35

turboFei requested review from onebox-li and cxzl25 March 12, 2025 04:55
