[CELEBORN-1896] delete data from failed to fetch shuffles #3109

Open
wants to merge 39 commits into
base: main

Conversation

CodingCat
Contributor

@CodingCat CodingCat commented Feb 19, 2025

What changes were proposed in this pull request?

Currently, disk space occupied by Celeborn shuffles is reclaimed only when the corresponding Spark shuffle object is garbage-collected.

As a result, if a shuffle fails to fetch and is retried, the disk space occupied by the failed attempt cannot be reclaimed. We hit this issue internally when dealing with shuffles at the scale of hundreds of TB in a single Spark application: any hiccup in the servers can double or even triple the disk usage.

This PR implements a mechanism to delete files from failed-to-fetch shuffles.

The main idea is simple: cleanup is triggered in LifecycleManager when it applies for a new Celeborn shuffle id for a retried shuffle write stage.

The tricky part is to avoid deleting shuffle files that are still referenced by multiple downstream stages. The PR introduces RunningStageManager to track the dependencies between stages.
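The idea described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: the class `FailedShuffleCleanupSketch`, its methods, and the `unregistered` list (a stand-in for sending an unregister shuffle request to the workers) are all hypothetical names chosen for illustration. The sketch assumes that a retry is detected when a second Celeborn shuffle id is requested for the same Spark shuffle id, and that deletion of the previous attempt is deferred while any running stage still reads it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and structure are assumptions, not the PR's classes.
class FailedShuffleCleanupSketch {
  // Celeborn shuffle id -> ids of running stages that still read it.
  private final Map<Integer, Set<Integer>> readers = new HashMap<>();
  // Spark shuffle id -> latest Celeborn shuffle id allocated for it.
  private final Map<Integer, Integer> latestAttempt = new HashMap<>();
  // Old attempts whose deletion waits for their remaining readers.
  private final Set<Integer> pendingDeletion = new HashSet<>();
  // Stand-in for "unregister shuffle request sent to workers".
  final List<Integer> unregistered = new ArrayList<>();
  private int nextCelebornShuffleId = 0;

  // Called when a shuffle write stage (first run or retry) asks for an id.
  int allocateShuffleId(int sparkShuffleId) {
    int newId = nextCelebornShuffleId++;
    Integer previous = latestAttempt.put(sparkShuffleId, newId);
    if (previous != null) {
      // An earlier attempt exists, so this is a retry: clean up its data.
      deleteOrDefer(previous);
    }
    return newId;
  }

  // Records that a running stage reads the given Celeborn shuffle.
  void stageStartedReading(int stageId, int celebornShuffleId) {
    readers.computeIfAbsent(celebornShuffleId, k -> new HashSet<>()).add(stageId);
  }

  // Called when a stage completes or fails; releases its references.
  void stageEnded(int stageId) {
    for (Set<Integer> stages : readers.values()) {
      stages.remove(stageId);
    }
    Iterator<Integer> it = pendingDeletion.iterator();
    while (it.hasNext()) {
      int id = it.next();
      if (!hasReaders(id)) {
        it.remove();
        unregister(id);
      }
    }
  }

  private void deleteOrDefer(int celebornShuffleId) {
    if (hasReaders(celebornShuffleId)) {
      pendingDeletion.add(celebornShuffleId);
    } else {
      unregister(celebornShuffleId);
    }
  }

  private boolean hasReaders(int celebornShuffleId) {
    Set<Integer> stages = readers.get(celebornShuffleId);
    return stages != null && !stages.isEmpty();
  }

  private void unregister(int celebornShuffleId) {
    readers.remove(celebornShuffleId);
    unregistered.add(celebornShuffleId);
  }
}
```

In this sketch, requesting ids for Spark shuffles 0 and 1 and then again for 0 yields Celeborn shuffle ids 0, 1 and 2, and id 0 is unregistered as soon as id 2 is allocated, provided no running stage still reads it. If another stage is still reading id 0, the unregister is postponed until that stage ends.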

Why are the changes needed?

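To save disk space: data from failed shuffle attempts is deleted instead of lingering until the Spark shuffle object is garbage-collected.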

Does this PR introduce any user-facing change?

Yes, a new config.

How was this patch tested?

We tested manually by deleting some files.

image

From the above screenshot we can see that we originally have shuffles 0 and 1. After shuffle 1 hits a chunk fetch failure, it triggers a retry of shuffle 0 (as shuffle 2); by this moment, shuffle 0 has already been deleted from the workers.

image

In the logs, we can see that, in the middle of the application, the unregister shuffle request was sent for shuffle 0.

@CodingCat changed the title from "[WIP] clean failed shuffle disk" to "[CELEBORN-1896] delete data from failed to fetch shuffles" Mar 7, 2025

codecov bot commented Mar 9, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 32.62%. Comparing base (3a83ac7) to head (1246b4e).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...cala/org/apache/celeborn/common/CelebornConf.scala | 85.72% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3109      +/-   ##
==========================================
+ Coverage   32.58%   32.62%   +0.04%     
==========================================
  Files         340      340              
  Lines       20390    20405      +15     
  Branches     1820     1820              
==========================================
+ Hits         6642     6655      +13     
- Misses      13376    13378       +2     
  Partials      372      372              
