Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1822] Respond to RegisterShuffle with max epoch PartitionLocation to avoid revive #3135

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

zaynt4606
Copy link
Contributor

@zaynt4606 zaynt4606 commented Mar 5, 2025

What changes were proposed in this pull request?

LifecycleManager respond to RegisterShuffle with max epoch PartitionLocation.

Why are the changes needed?

Newly spun up executors in a Spark job will still get the partitionLocations with the minEpoch of the celeborn lost worker.

These executors will fail to connect to the lost worker and then call into revive to get the latest PartitionLocation for a given partitionId in ChangePartitionManager.getLatestPartition().

Return with max epoch can reduce such revive requests.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT.

@zaynt4606 zaynt4606 marked this pull request as draft March 6, 2025 01:41
@zaynt4606 zaynt4606 marked this pull request as ready for review March 7, 2025 02:26
Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, except a nit

@@ -145,6 +145,7 @@ class LifecycleManager(val appUniqueId: String, val conf: CelebornConf) extends
locations: util.List[PartitionLocation]): Unit = {
val map = latestPartitionLocation.computeIfAbsent(shuffleId, newMapFunc)
locations.asScala.foreach(location => map.put(location.getId, location))
invalidateLatestMaxLocsCache(shuffleId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since cache invalidate when location updated, so we can consider extending rpcCacheExpireTime to a longer duration, such as 60s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants