[CELEBORN-1902] Read client throws PartitionConnectionException #3147

Open: wants to merge 1 commit into base: main

Austinfjq

What changes were proposed in this pull request?

RemoteBufferStreamReader now throws org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException when the exception it encounters indicates a connection failure.

Why are the changes needed?

If org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException is thrown for connection failures, Flink becomes aware of the lost Celeborn server-side nodes and can recompute the affected data. Otherwise, endless retries could cause the Flink job to fail.

This PR handles exceptions such as:

java.io.IOException: org.apache.celeborn.common.exception.CelebornIOException: Failed to connect to ltx1-app10154.prod.linkedin.com/10.88.105.20:23924
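
As an illustration only, and not the code in this PR, a check of this kind could be sketched as below. It assumes that a connection failure can be recognized by walking the cause chain and looking for either a java.net.ConnectException or the "Failed to connect to" text seen in the error above. The names ConnectionFailureClassifier, isConnectionFailure, translate, and StandInPartitionConnectionException are hypothetical; the stand-in exception replaces Flink's PartitionConnectionException so the sketch compiles without Flink or Celeborn on the classpath.

```java
import java.io.IOException;
import java.net.ConnectException;

// Hypothetical stand-in for Flink's PartitionConnectionException, used only so
// this sketch compiles with the JDK alone. It is not the real Flink class.
class StandInPartitionConnectionException extends IOException {
    StandInPartitionConnectionException(String message, Throwable cause) {
        super(message, cause);
    }
}

class ConnectionFailureClassifier {
    // Assumption: this message fragment, taken from the example error, marks a connection failure.
    private static final String CONNECT_FAILURE_TEXT = "Failed to connect to";

    // Bound on how many causes are inspected, as a guard against cyclic cause chains.
    private static final int MAX_DEPTH = 32;

    // Walks the cause chain and reports whether any link looks like a connection failure.
    static boolean isConnectionFailure(Throwable error) {
        int depth = 0;
        for (Throwable t = error; t != null && depth < MAX_DEPTH; t = t.getCause(), depth++) {
            if (t instanceof ConnectException) {
                return true;
            }
            String message = t.getMessage();
            if (message != null && message.contains(CONNECT_FAILURE_TEXT)) {
                return true;
            }
        }
        return false;
    }

    // Wraps connection failures in the stand-in exception; other errors are returned unchanged.
    static Throwable translate(Throwable error) {
        if (isConnectionFailure(error)) {
            return new StandInPartitionConnectionException("Connection to shuffle worker failed", error);
        }
        return error;
    }

    public static void main(String[] args) {
        // Nested the same way as the example error: an IOException wrapping the real cause.
        Throwable nested = new IOException(
                new RuntimeException("Failed to connect to example-host/10.0.0.1:1234"));
        System.out.println(translate(nested).getClass().getSimpleName());
        System.out.println(translate(new IOException("stream closed")).getClass().getSimpleName());
    }
}
```

Matching on message text is fragile, so a dedicated exception type or error code on the Celeborn side would be a sturdier signal if one is available.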

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested in a Flink batch job with Celeborn.

@SteNicholas
Member

@Austinfjq, could you explain why this throws PartitionConnectionException? IMO, PartitionConnectionException is used to retry upstream task instead of downstream task for PartitionNotFoundException?

@Austinfjq
Author

> @Austinfjq, could you explain why this throws PartitionConnectionException?

I think that for the following error, PartitionConnectionException better describes the problem, because the error is a connection failure rather than a partition not being found.

java.io.IOException: org.apache.celeborn.common.exception.CelebornIOException: Failed to connect to ltx1-app10154.prod.linkedin.com/10.88.105.20:23924

> IMO, PartitionConnectionException is used to retry upstream task instead of downstream task for PartitionNotFoundException?

I might be missing something. Can the downstream task be retried? I assume it hasn't started yet.
