ARM support #31

Open
sysedwinistrator opened this issue May 10, 2024 · 4 comments

Comments

@sysedwinistrator

Hi,
this project looks very interesting.
I'm currently using Waypipe to forward Wayland apps running on a NixOS aarch64 host to a NixOS x86-64 client. I guess the more common use case would be the reverse: forwarding apps running on a more powerful x86-64 host to a less powerful ARM SBC.

As I found out while packaging this project with Nix and trying to build it for aarch64-linux, the code only builds for x86-64-v3 as it uses instructions specific to that architecture (SSE2/AVX2).

The SIMD / Architecture-specific instruction stuff is frankly a bit above my paygrade.
But I know that ARM has its own instructions for SIMD. Unfortunately, just like for x86, the set of instructions varies by device and vendor. So for example, an Apple M1 chip (ARMv8.5-A) supports newer SIMD instructions that a Raspberry Pi 5 (ARMv8.2-A) doesn't.

Do you think it would be possible to reimplement the code specific to x86-64-v3 for ARM using some of the ARM SIMD extensions? Alternatively, would it make sense to have a non-optimized, generic implementation that does not rely on any architecture-specific instructions as a fallback?

@nicolasavru
Collaborator

nicolasavru commented May 10, 2024

Sorry, I had been meaning to document that and kept forgetting to.

The main culprits are src/prefix_sum.rs and src/transpose.rs. An ARM Neon version of them could definitely be implemented, but it is unlikely to ever be a priority for us. Patches welcome though!
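For anyone considering a patch, here is a rough sketch of what a Neon port might look like. It assumes the operation is an in-place inclusive prefix sum over bytes with wrapping addition; the actual functions in src/prefix_sum.rs may have different signatures and element types, and every name below is hypothetical. The Neon path is only compiled on aarch64, with a scalar reference alongside it for comparison.

```rust
/// Portable reference: in-place inclusive prefix sum with wrapping u8 addition.
pub fn prefix_sum_scalar(data: &mut [u8]) {
    let mut acc = 0u8;
    for x in data.iter_mut() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}

/// Hypothetical Neon variant, 16 bytes per step. Only compiled for aarch64.
#[cfg(target_arch = "aarch64")]
pub fn prefix_sum_neon(data: &mut [u8]) {
    use core::arch::aarch64::*;

    let mut acc = 0u8; // running total carried across chunks
    let mut chunks = data.chunks_exact_mut(16);
    // SAFETY: Neon is part of the baseline for Rust's aarch64 Linux targets,
    // and every load/store touches exactly the 16 bytes of the current chunk.
    unsafe {
        let zero = vdupq_n_u8(0);
        for chunk in &mut chunks {
            let mut v = vld1q_u8(chunk.as_ptr());
            // Log-step scan: add the vector shifted up by 1, 2, 4 and 8 lanes.
            v = vaddq_u8(v, vextq_u8::<15>(zero, v));
            v = vaddq_u8(v, vextq_u8::<14>(zero, v));
            v = vaddq_u8(v, vextq_u8::<12>(zero, v));
            v = vaddq_u8(v, vextq_u8::<8>(zero, v));
            // Add the total of all previous chunks to every lane.
            v = vaddq_u8(v, vdupq_n_u8(acc));
            vst1q_u8(chunk.as_mut_ptr(), v);
            acc = chunk[15];
        }
    }
    // Scalar tail for the last (len % 16) bytes.
    for x in chunks.into_remainder() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}

/// Picks an implementation at compile time based on the target architecture.
pub fn prefix_sum(data: &mut [u8]) {
    #[cfg(target_arch = "aarch64")]
    {
        prefix_sum_neon(data);
    }
    #[cfg(not(target_arch = "aarch64"))]
    {
        prefix_sum_scalar(data);
    }
}
```

Comparing `prefix_sum` against `prefix_sum_scalar` on random buffers of odd lengths would be a cheap way to validate a real port.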

There's also rustflags = ["-C", "target-cpu=x86-64-v3"] in .cargo/config.toml to let the compiler actually emit SIMD instructions for the rest of the code. Some other mechanism of specifying that would be needed if we wanted an ARM version.
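One possible alternative to a global target-cpu flag is per-function target features with runtime detection. The sketch below is not what wprs does today; `add_offset` and its helpers are made-up names used only to show the pattern. The hot loop is written once, and the AVX2 wrapper lets the compiler auto-vectorize it with AVX2 on x86-64, while other architectures get the default code generation for their target.

```rust
/// Shared loop body. `#[inline(always)]` lets it be compiled inside each
/// caller, using that caller's enabled target features.
#[inline(always)]
fn add_offset_impl(data: &mut [u8], offset: u8) {
    for x in data.iter_mut() {
        *x = x.wrapping_add(offset);
    }
}

/// Same loop, but compiled with AVX2 enabled for this function only.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_offset_avx2(data: &mut [u8], offset: u8) {
    add_offset_impl(data, offset)
}

/// Public entry point: checks CPU support at runtime on x86-64 and falls
/// back to the default build of the loop everywhere else.
pub fn add_offset(data: &mut [u8], offset: u8) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified at runtime.
            return unsafe { add_offset_avx2(data, offset) };
        }
    }
    add_offset_impl(data, offset)
}
```

The trade-off is that every hot function needs its own wrapper, whereas the rustflags approach covers the whole crate at once.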

Unfortunately a non-SIMD version is unusably slow.

@nicolasavru
Collaborator

Another option is adding an option to disable compression and/or to only use zstd directly, without the transpose/prefix_sum transformations to improve the compression (both ratio and speed). That would result in significantly increased bandwidth usage, but might still be worth it to be able to use wprs on ARM at all (until ARM SIMD versions of the functions are implemented).
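To illustrate why such transformations can help a general-purpose compressor, here is a small sketch. It assumes, based only on the file names, that the transpose step separates interleaved 4-byte pixels into per-channel planes and that the prefix sum is the inverse of a delta filter; the real wprs code may differ, and all names here are hypothetical.

```rust
/// Split interleaved 4-byte pixels into four channel planes (array-of-structs
/// to struct-of-arrays). Trailing bytes that do not fill a pixel are ignored.
pub fn deinterleave4(pixels: &[u8]) -> [Vec<u8>; 4] {
    let mut planes: [Vec<u8>; 4] = Default::default();
    for px in pixels.chunks_exact(4) {
        for (plane, &b) in planes.iter_mut().zip(px) {
            plane.push(b);
        }
    }
    planes
}

/// Replace each byte with its wrapping difference from the previous byte.
pub fn delta_encode(data: &mut [u8]) {
    let mut prev = 0u8;
    for x in data.iter_mut() {
        let cur = *x;
        *x = cur.wrapping_sub(prev);
        prev = cur;
    }
}

/// Inverse of `delta_encode`: a wrapping inclusive prefix sum.
pub fn delta_decode(data: &mut [u8]) {
    let mut acc = 0u8;
    for x in data.iter_mut() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}
```

On a smooth gradient, each plane turns into one starting value followed by small, repetitive differences, which is the kind of input zstd handles well. Skipping these steps keeps the data valid but gives the compressor less redundancy to exploit.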

@nicolasavru
Collaborator

Oh, there may also be some endianness issues, but those should be easy to resolve.

@sysedwinistrator
Author

Thanks for your quick reply!

> An ARM Neon version of them could definitely be implemented, but it is unlikely to ever be a priority for us. Patches welcome though!

I'm probably not up for this task, unfortunately. But it's good to know you'd welcome external contributions should someone more capable implement it.

> There's also rustflags = ["-C", "target-cpu=x86-64-v3"] in .cargo/config.toml to let the compiler actually emit SIMD instructions for the rest of the code. Some other mechanism of specifying that would be needed if we wanted an ARM version.

Would generating .cargo/config.toml via a shell script be an acceptable mechanism? It should definitely work for the Nix package, but I don't know how this would affect other packaging systems.
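A possibly simpler route than generating the file: Cargo's config format supports `[target.'cfg(...)']` tables, so the flag could be scoped to x86-64 in a static file. This is a sketch under the assumption that the existing rustflags entry can simply be moved into such a table; I haven't checked how the current file is laid out.

```toml
# Apply the x86-64-v3 baseline only when building for x86-64.
# Other architectures (e.g. aarch64) fall back to the compiler defaults.
[target.'cfg(target_arch = "x86_64")']
rustflags = ["-C", "target-cpu=x86-64-v3"]
```

Note that a RUSTFLAGS environment variable takes precedence over rustflags set in the config file, so packaging systems that set it would still need to pass the flag themselves.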

> Another option is adding an option to disable compression and/or to only use zstd directly, without the transpose/prefix_sum transformations to improve the compression (both ratio and speed). That would result in significantly increased bandwidth usage, but might still be worth it to be able to use wprs on ARM at all (until ARM SIMD versions of the functions are implemented).

Do you have a rough idea what the bandwidth usage would be? Especially in comparison to Waypipe?

> Oh, there may also be some endianness issues, but those should be easy to resolve.

That shouldn't be an issue. Waypipe also has the limitation that both systems need to have the same endianness, and it works for my use case. Even though ARM CPUs are bi-endian, in practice Linux on ARM is almost always little-endian, like x86.
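If mixed-endian setups ever did matter, the usual fix is to pin the byte order of multi-byte fields on the wire instead of relying on the host's native order. A minimal sketch with a made-up length-prefixed frame (this is not the wprs wire format, and the function names are hypothetical):

```rust
use std::convert::{TryFrom, TryInto};

/// Encode a payload with an explicit little-endian u32 length prefix.
pub fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let len = u32::try_from(payload.len()).expect("payload too large for u32");
    let mut out = Vec::with_capacity(4 + payload.len());
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Decode a frame produced by `encode_frame`. Returns `None` if the frame is
/// shorter than its header or than the length the header declares.
pub fn decode_frame(frame: &[u8]) -> Option<&[u8]> {
    let len_bytes: [u8; 4] = frame.get(..4)?.try_into().ok()?;
    let len = u32::from_le_bytes(len_bytes) as usize;
    frame.get(4..)?.get(..len)
}
```

Because `to_le_bytes` and `from_le_bytes` behave the same on every host, the two ends agree regardless of their native byte order.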


2 participants