Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#291] feat(client): Introduce PrefetchableClientReadHandler to support async read #2365

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

zuston
Copy link
Member

@zuston zuston commented Feb 8, 2025

What changes were proposed in this pull request?

  1. Introduce PrefetchableClientReadHandler to support async read. And this will be disabled by default.
  2. Apply for the memory/localfile/hdfs read handler

Why are the changes needed?

Recently I found some important spark jobs are slow due to the lots of shuffle read operations. If we could support async read, the job's performance will be improved.

So this PR is the callback for #291. almost 3 years ago!

Does this PR introduce any user-facing change?

Yes. Some configs are introduced

  1. rss.client.read.prefetch.enabled
  2. rss.client.read.prefetch.capacity
  3. rss.client.read.prefetch.timeoutSec

How was this patch tested?

  1. Unit tests

@zuston
Copy link
Member Author

zuston commented Feb 8, 2025

cc @jerqi

@jerqi
Copy link
Contributor

jerqi commented Feb 8, 2025

  1. What will it happen if the operation of read fails? Will it break the order of the data? Because it maybe ordered if mr sort data in the server side.

@zuston
Copy link
Member Author

zuston commented Feb 8, 2025

  1. What will it happen if the operation of read fails? Will it break the order of the data? Because it maybe ordered if mr sort data in the server side.

No, the real concurrency number is 1.

Copy link

github-actions bot commented Feb 8, 2025

Test Results

 2 683 files   -   298   2 683 suites   - 298   5h 52m 50s ⏱️ - 35m 34s
 1 105 tests +    4   1 101 ✅ +    2   2 💤 ±0  1 ❌ +1  1 🔥 +1 
12 202 runs   - 1 587  12 165 ✅  - 1 594  30 💤 ±0  2 ❌ +2  5 🔥 +5 

For more details on these failures and errors, see this check.

Results for commit 0e3729a. ± Comparison against base commit e5cfc4a.

♻️ This comment has been updated with latest results.

@jerqi
Copy link
Contributor

jerqi commented Feb 8, 2025

Could you add document for these config options?
Do we have read timeout option? What's the relationship between it and prefetchTimeout?

@zuston
Copy link
Member Author

zuston commented Feb 8, 2025

What's the relationship between it and prefetchTimeout?

If the async fetch costs too much time, I hope the sync wait could recognize this case and then fast fail

Junfan Zhang added 2 commits February 8, 2025 15:37
@jerqi
Copy link
Contributor

jerqi commented Feb 8, 2025

What's the relationship between it and prefetchTimeout?

If the async fetch costs too much time, I hope the sync wait could recognize this case and then fast fail

PrefetchTimeout may be equal to read time out. Isn't right? They are all the one time read max time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants