[CELEBORN-1902] Read client throws PartitionConnectionException #3147

Austinfjq · 2025-03-11T23:34:35Z

What changes were proposed in this pull request?

This PR makes RemoteBufferStreamReader throw org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException when the exception it encounters is caused by a connection failure.

Why are the changes needed?

When org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException is thrown for a connection failure, Flink becomes aware that Celeborn server-side nodes were lost and can re-compute the affected data. Otherwise, endless retries could cause the Flink job to fail.

This PR targets exceptions such as:

java.io.IOException: org.apache.celeborn.common.exception.CelebornIOException: Failed to connect to ltx1-app10154.prod.linkedin.com/10.88.105.20:23924
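To illustrate the idea, here is a minimal, Flink-free sketch of mapping such an error to a connection exception. It is not the code in this PR: the class `ConnectionFailureMapping`, the methods `isConnectionFailure` and `translate`, the string-based detection, and the nested `PartitionConnectionException` stand-in are all hypothetical and exist only so the example compiles on its own. The real Flink class takes different constructor arguments.

```java
import java.io.IOException;

class ConnectionFailureMapping {

    // Hypothetical stand-in for Flink's PartitionConnectionException,
    // declared here only so the sketch compiles without Flink on the classpath.
    static class PartitionConnectionException extends IOException {
        PartitionConnectionException(String partitionId, Throwable cause) {
            super("Connection for partition '" + partitionId + "' not reachable.", cause);
        }
    }

    // Assumption: connection failures are recognized by the message text
    // shown in the stack trace above. The actual PR may detect them differently.
    private static final String CONNECT_FAILURE_MARKER = "Failed to connect to";

    // Bound the walk over the cause chain so a cyclic chain cannot loop forever.
    private static final int MAX_CAUSE_DEPTH = 16;

    // Returns true if the error or any of its causes carries the marker text.
    static boolean isConnectionFailure(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            String message = current.getMessage();
            if (message != null && message.contains(CONNECT_FAILURE_MARKER)) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    // Wraps connection failures in the stand-in exception; other errors
    // are passed through (wrapped in IOException only if they are not one already).
    static IOException translate(String partitionId, Throwable error) {
        if (isConnectionFailure(error)) {
            return new PartitionConnectionException(partitionId, error);
        }
        return error instanceof IOException ? (IOException) error : new IOException(error);
    }

    public static void main(String[] args) {
        IOException wrapped =
                new IOException(new IOException("Failed to connect to worker-1/10.0.0.1:1234"));
        System.out.println(translate("partition-0", wrapped).getClass().getSimpleName());
    }
}
```

Walking the cause chain matters here because, as in the trace above, the connection error arrives wrapped inside an outer java.io.IOException.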

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested in a Flink batch job with Celeborn.

SteNicholas · 2025-03-12T02:42:57Z

@Austinfjq, could you explain why this throws PartitionConnectionException? IMO, PartitionConnectionException is used to retry upstream task instead of downstream task for PartitionNotFoundException?

Austinfjq · 2025-03-12T20:24:23Z

> @Austinfjq, could you explain why this throws PartitionConnectionException?

I think that for the following error, PartitionConnectionException better conveys that the problem is a connection failure rather than a partition not being found.

java.io.IOException: org.apache.celeborn.common.exception.CelebornIOException: Failed to connect to ltx1-app10154.prod.linkedin.com/10.88.105.20:23924

> IMO, PartitionConnectionException is used to retry upstream task instead of downstream task for PartitionNotFoundException?

I might be missing something. Can the downstream task be retried? I assume it hasn't started.

Throw PartitionConnectionException

66d7c6f
