switch from async to rayon [v3] #2173

Open: wants to merge 1 commit into base release/v3

Stebalien
Member

@Stebalien Stebalien commented Apr 22, 2025

This switches the conformance tests from async Rust to rayon for parallelism. I'm making this PR against FVM v3 because that's where we currently have conformance tests.

Motivation:

  • Primary: Remove the need for maintainers to understand/work with complex async code.
  • Secondary: Remove async-std (deprecated).

Performance:

  • Startup performance for the conformance tests is significantly slower (something to do with locking when we compile the built-in actors from multiple threads?).
  • Runtime performance appears to be the same.

Overall, the conformance tests go from 6 to 12 seconds, which isn't great (2x), but that extra time appears to be entirely "startup" cost and shouldn't scale with the number of tests.

fixes #2144

@github-project-automation github-project-automation bot moved this to 📌 Triage in FilOz Apr 22, 2025
@Stebalien Stebalien requested a review from rvagg April 22, 2025 14:51
@Stebalien Stebalien force-pushed the steb/remove-async-v3 branch from 5396eee to 7b718b8 Compare April 22, 2025 14:53
@Stebalien
Member Author

@rvagg if this isn't easier to understand, it's not worth it. I thought it was going to be simpler, but error handling with rayon was surprisingly difficult. It might be better to retry this with channels and a thread pool, although error handling will still be tricky.

My issue with async-await is that it comes with a bunch of sharp edges around moving things and the error messages can be cryptic.

@BigLep BigLep moved this from 📌 Triage to 🔎 Awaiting Review in FilOz Apr 29, 2025
let file = File::open(&path)?;
let reader = BufReader::new(file);

// Test vectors have the form:
Member

I don't understand how all the code in this function managed to get deleted in this refactor. What is rayon giving us that makes all of this go away? Or does the deletion come mainly from basic cleanup in the process?

Member Author

A lot of this code was duplicated from MessageVector::from_file (not sure why). There's also some stuff here that makes it possible to configure test-vector runs with environment variables, but we've never really used them.

@rvagg
Member

rvagg commented May 8, 2025

This all makes sense to me and is much simpler, but I'm not seeing the error handling difficulty this introduces. Can you highlight that for me? It looks like normal Rust shenanigans to my eyes. Is it the need for Counters that's the hassle?

The biggest delta is that, looking at the new code, I'm having to take for granted that using a MultiEngine with the specified parallelism actually delivers the parallelism we want, whereas with async/await you can see it because it's local. And the performance hit is a concern, but you're verifying somehow that it's all in startup? Is that from logging output that you're seeing that? If we can confirm that it's startup cost, then maybe we're not properly memoising multi-threaded compile, and that sounds like an issue we can deal with separately from this. In that case I'm happy with this change.

@Stebalien
Member Author

> This all makes sense to me and is much simpler, but I'm not seeing the error handling difficulty this introduces. Can you highlight that for me? It looks like normal Rust shenanigans to my eyes. Is it the need for Counters that's the hassle?

It's normal Rust shenanigans. My hope was that rayon would have "try" versions of all its operations that could fail the pipeline early, but no such luck.

The other way to do this is to manually spin up a worker pool and feed the workers with channels (go style). But I'm not sure if that'll be any better.
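For reference, here is a minimal sketch of that worker-pool shape using only the standard library. `run_pool`, its `String` error type, and the closure signature are hypothetical and are not code from this PR; the sketch only illustrates how the first error can stop result collection early.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical helper (not part of the FVM conformance runner): runs `job`
/// over every input on `workers` threads. Returns the number of jobs that
/// passed, or the first error received from any worker.
pub fn run_pool<T, F>(inputs: Vec<T>, workers: usize, job: F) -> Result<usize, String>
where
    T: Send + 'static,
    F: Fn(T) -> Result<(), String> + Send + Sync + 'static,
{
    let (work_tx, work_rx) = mpsc::channel::<T>();
    let (result_tx, result_rx) = mpsc::channel::<Result<(), String>>();
    // std's Receiver is single-consumer, so the workers share it behind a mutex.
    let work_rx = Arc::new(Mutex::new(work_rx));
    let job = Arc::new(job);

    let handles: Vec<thread::JoinHandle<()>> = (0..workers.max(1))
        .map(|_| {
            let work_rx = Arc::clone(&work_rx);
            let result_tx = result_tx.clone();
            let job = Arc::clone(&job);
            thread::spawn(move || loop {
                // The lock is held only while receiving, not while running the job.
                let next = work_rx.lock().unwrap().recv();
                match next {
                    Ok(item) => {
                        // A failed send means the collector stopped listening.
                        if result_tx.send((*job)(item)).is_err() {
                            break;
                        }
                    }
                    // Work channel is closed and drained.
                    Err(_) => break,
                }
            })
        })
        .collect();
    // Drop our copy so the result channel closes once every worker exits.
    drop(result_tx);

    for item in inputs {
        if work_tx.send(item).is_err() {
            break;
        }
    }
    drop(work_tx);

    let mut passed = 0;
    let mut first_err = None;
    for res in result_rx {
        match res {
            Ok(()) => passed += 1,
            Err(e) => {
                // Leaving the loop drops the receiver, so each worker stops
                // after at most one more job.
                first_err = Some(e);
                break;
            }
        }
    }

    for handle in handles {
        handle.join().map_err(|_| "worker panicked".to_string())?;
    }
    match first_err {
        Some(e) => Err(e),
        None => Ok(passed),
    }
}
```

Calling `run_pool(vectors, 4, run_one)` with a hypothetical `run_one` function would return `Ok(n)` when all `n` vectors pass. The error path still needs explicit plumbing (a result channel plus a decision about when to stop), which is the "still tricky" part.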

> The biggest delta is that, looking at the new code, I'm having to take for granted that using a MultiEngine with the specified parallelism actually delivers the parallelism we want, whereas with async/await you can see it because it's local.

I'm not sure I understand:

  1. Before, we constructed a global multi-engine with a specified parallelism.
  2. Now we construct a local multi-engine with the parallelism auto-detected by rayon (usually the number of CPU threads).
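To make the "auto-detected" part concrete, here is a small sketch of picking a worker count from the standard library. `worker_count` is a hypothetical helper, not something in this PR. It assumes that `std::thread::available_parallelism` is a reasonable stand-in for rayon's default, which rayon documents as the number of logical CPUs unless `RAYON_NUM_THREADS` is set.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical helper: honors an explicit, non-zero request and otherwise
/// falls back to the parallelism the standard library detects.
pub fn worker_count(requested: Option<usize>) -> usize {
    match requested {
        Some(n) if n > 0 => n,
        // Detection can fail on some platforms; fall back to a single worker.
        _ => thread::available_parallelism()
            .map(NonZeroUsize::get)
            .unwrap_or(1),
    }
}
```

A local engine could then be sized with `worker_count(None)`, while tests that want a fixed degree of parallelism pass `Some(n)`.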

> And the performance hit is a concern, but you're verifying somehow that it's all in startup? Is that from logging output that you're seeing that?

Yeah, I verified it by logging. We get stuck compiling wasm modules then blow past once we're done.

> If we can confirm that it's startup cost, then maybe we're not properly memoising multi-threaded compile, and that sounds like an issue we can deal with separately from this. In that case I'm happy with this change.

I dug into this more and the issue is that:

  1. Wasmtime is also using rayon to parallelize compilation of single modules.
  2. We're blocking all of the threads in the rayon thread pool waiting on wasmtime to finish compiling the wasm modules.

This is apparently a well-known rayon issue with no good fix. I've tried some nasty yield hacks, but I only managed to shave off 2 seconds when I should be able to shave off 6.

Let's chat about this in person. I think we may just need to shelve this for now.
