[CELEBORN-1904] Cancel the stage running tasks on stage rerun #3144
if (shuffleFetchFailureTaskIds != null) {
  shuffleFetchFailureTaskIds.asScala.headOption.foreach { case taskId =>
    val taskSetManager = SparkUtils.getTaskSetManager(taskId)
    if (taskSetManager != null && taskSetManager.runningTasks > 0) {
We do not know whether the task associated with shuffleFetchFailureTaskIds eventually triggered the FetchFailedException (that depends on whether another attempt of the task is running or has already finished, see #2921).
Even so, it should be safe to cancel the running tasks after checking taskSetManager.runningTasks > 0.
cc @SteNicholas
What changes were proposed in this pull request?
On the SparkListenerStageCompleted event, check whether a shuffle fetch failure was reported in the stage. If so, cancel the running tasks, because the Celeborn client will rerun the whole stage.
Why are the changes needed?
If a task fails due to FetchFailed, the DAGScheduler calls markStageAsFinished:
https://github.com/apache/spark/blob/3a872b7ca11faa128a2667de55f6dca13807057a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2022-L2060
However, it does not cancel the tasks that are still running in the stage.
For example, in one stage a task failed due to a fetch failure, and the stage duration was 39s.
However, the running tasks were not cancelled: the 2496 launched tasks kept running, and the maximum task duration was 31 minutes.
This wastes a lot of compute resources.
For a Celeborn shuffle fetch failure, the whole stage is rerun, so it is safe to cancel all of its running tasks.
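The decision described above can be modeled with a small, dependency-free Scala sketch. It is not the code in this PR: `RunningTask`, `StageRerunCanceller`, and the `killTask` callback are hypothetical names introduced only for illustration, and the real change works through Spark's TaskSetManager (obtained via SparkUtils.getTaskSetManager) rather than a hand-maintained map. The sketch only shows the rule: cancel a stage attempt's running tasks on stage completion if, and only if, a shuffle fetch failure was reported for that attempt.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a running task; not a Spark or Celeborn class.
final case class RunningTask(taskId: Long, stageId: Int, stageAttemptId: Int)

// Hypothetical model of the cancellation rule. `killTask` is an injected
// callback (an assumption) that would kill one task attempt in a real cluster.
final class StageRerunCanceller(killTask: Long => Unit) {
  private val running = mutable.Map.empty[Long, RunningTask]
  private val fetchFailedStages = mutable.Set.empty[(Int, Int)]

  def onTaskStart(task: RunningTask): Unit = running.update(task.taskId, task)

  def onTaskEnd(taskId: Long): Unit = running.remove(taskId)

  // Record that a shuffle fetch failure was reported for this stage attempt.
  def onShuffleFetchFailure(stageId: Int, stageAttemptId: Int): Unit =
    fetchFailedStages.add((stageId, stageAttemptId))

  // On stage completion, cancel the still-running tasks of the stage attempt
  // only if a fetch failure was reported for it. Returns the cancelled task ids.
  def onStageCompleted(stageId: Int, stageAttemptId: Int): Seq[Long] = {
    if (!fetchFailedStages.remove((stageId, stageAttemptId))) {
      Seq.empty
    } else {
      // Materialize the ids first so the map is not mutated while iterating it.
      val toCancel = running.values
        .filter(t => t.stageId == stageId && t.stageAttemptId == stageAttemptId)
        .map(_.taskId)
        .toSeq
        .sorted
      toCancel.foreach { id =>
        killTask(id)
        running.remove(id)
      }
      toCancel
    }
  }
}
```

In this model a stage that completes without a reported fetch failure leaves its running tasks alone, and tasks belonging to other stages or other stage attempts are never touched. The callback is injected so that the rule can be exercised without a cluster; how the tasks are actually killed is left to the caller.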
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT & Cluster testing.
The stage terminated quickly.
