Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Runtime detection, take 2 #86

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

aumetra
Copy link
Contributor

@aumetra aumetra commented May 25, 2024

What type of PR is this?

feat: A new feature

Check the PR title.

  • This PR title match the format: <type>(optional scope): <description>
  • The description of this PR title is user-oriented and clear enough for others to understand.
  • Attach the PR updating the user documentation if the current PR requires user awareness at the usage level. User docs repo

(Optional) More detailed description for this PR(en: English/zh: Chinese).

en:

This PR adds runtime detection of SIMD features but, unlike in #55, not on the level of SIMD instructions, but instead implements enum dispatch over multiple inner parsers that each either use AVX2, SSE2, or NEON (or the scalar fallback).

(Optional) Which issue(s) this PR fixes:

Closes #14

(optional) The PR that updates user documentation:

@aumetra
Copy link
Contributor Author

aumetra commented May 25, 2024

This isn't conditionally doing the improvements for the NEON backend via the NeonBits struct yet. Thinking about also adding support for that.

@liuq19
Copy link
Collaborator

liuq19 commented May 30, 2024

Thanks a lot, i need time to review this

@aumetra
Copy link
Contributor Author

aumetra commented Jun 1, 2024

That commit should get rid of a bunch of compile issues related to adding generic types to structs that don't take them.
I was testing this on an x86 system, not passing any compiler options that would enable architecture-specific optimizations, meaning it only compiled with SSE2 support. I didn't even get to see the errors.

@aumetra
Copy link
Contributor Author

aumetra commented Jun 3, 2024

I'm still trying to figure out what the best way forward is to make to_bitmask64 on NEON CPUs work.

@liuq19
Copy link
Collaborator

liuq19 commented Jun 25, 2024

Thanks a lot, I will review the PR this week. I will benchmark the performance at first~

@aumetra aumetra force-pushed the runtime-detection branch from 9e6e4a6 to 7a3a864 Compare July 12, 2024 13:23
@aumetra
Copy link
Contributor Author

aumetra commented Aug 20, 2024

@liuq19 Would you be up to benching the speed of an implementation for NEON that runtime dispatches the bitmask creation? We could technically cache the result whether NEON (or any other feature, really) is supported in globals. That way the performance loss shouldn't be too bad.

Because hacking in runtime dispatch for bitmask creation otherwise is really tricky.

@aumetra
Copy link
Contributor Author

aumetra commented Aug 20, 2024

Hacked in the version that dispatches on each bitmask call. Maybe the performance hit is too severe to justify..

@aumetra
Copy link
Contributor Author

aumetra commented Aug 21, 2024

Okay, I'm not sure why this is broken? It's on ARM64, right?
I guess I'll have to whip out the cross-compilation for now. I don't own a suitable ARM machine for testing.

@aumetra
Copy link
Contributor Author

aumetra commented Aug 21, 2024

Now I just need to find a way to properly express this in trait form, preferrably very generic.

@liuq19
Copy link
Collaborator

liuq19 commented Aug 21, 2024

I benched in x86 and maybe the simd is not work in runtime-detect. There maybe some function missed and not working in simd

     Running benches/deserialize_struct.rs (target/release/deps/deserialize_struct-016910da72a3ea13)
twitter/sonic_rs::from_slice_unchecked
                        time:   [861.50 µs 862.23 µs 863.02 µs]
                        change: [+76.254% +76.499% +76.735%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
twitter/sonic_rs::from_slice
                        time:   [885.21 µs 885.86 µs 886.62 µs]
                        change: [+73.328% +73.607% +73.868%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

citm_catalog/sonic_rs::from_slice_unchecked
                        time:   [1.6269 ms 1.6298 ms 1.6327 ms]
                        change: [+71.170% +71.493% +71.800%] (p = 0.00 < 0.05)
                        Performance has regressed.
citm_catalog/sonic_rs::from_slice
                        time:   [1.6559 ms 1.6572 ms 1.6587 ms]
                        change: [+67.728% +67.938% +68.127%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

canada/sonic_rs::from_slice_unchecked
                        time:   [4.5318 ms 4.5332 ms 4.5349 ms]
                        change: [+20.602% +20.659% +20.716%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
canada/sonic_rs::from_slice
                        time:   [4.5934 ms 4.5951 ms 4.5970 ms]
                        change: [+20.546% +20.602% +20.659%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

@aumetra
Copy link
Contributor Author

aumetra commented Aug 21, 2024

That's weird. On my local machine, the change is somewhat in the ballpark of ~3-4%, which is acceptable (I'd need to profile it to get a closer idea of what's going on; where the performance is lost. Maybe some optimization opportunities that are too opaque for the compiler with all the generics)

     Running benches/deserialize_struct.rs (target/release/deps/deserialize_struct-b15a6d4de21d32b1)
Gnuplot not found, using plotters backend
twitter/sonic_rs::from_slice_unchecked
                        time:   [438.48 µs 440.11 µs 442.07 µs]
                        change: [+4.0017% +4.6062% +5.2032%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe
twitter/sonic_rs::from_slice
                        time:   [445.74 µs 447.52 µs 449.94 µs]
                        change: [+3.0931% +3.6056% +4.2171%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

citm_catalog/sonic_rs::from_slice_unchecked
                        time:   [856.18 µs 856.77 µs 857.50 µs]
                        change: [+1.2295% +1.3723% +1.5158%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

@liuq19
Copy link
Collaborator

liuq19 commented Aug 21, 2024

could you remove or comment the config in .cargo/config.toml

@aumetra
Copy link
Contributor Author

aumetra commented Aug 21, 2024

Already wasn't active due to my global Cargo config. But for the benches above I set -C target-cpu=native. I can somewhat reproduce your findings when I toggle -C target-cpu=native for the main branch, and disable it for the runtime detection branch. But that ignores that all the optimizations the Rust compiler by default doesn't do if it isn't aware of the target CPU model.

I added debug statements and the runtime correctly detects that my CPU supports AVX2, with and without target-cpu=native.

@aumetra
Copy link
Contributor Author

aumetra commented Aug 21, 2024

So it is much slower without target-cpu=native but that is to be expected with all the CPU-specific optimizations LLVM can do. But the runtime detection only sets performance back by under 5%, which is IMO acceptable as an opt-in feature.

@aumetra
Copy link
Contributor Author

aumetra commented Aug 21, 2024

Never mind, I get what you mean. Let me look into it.

@liuq19
Copy link
Collaborator

liuq19 commented Aug 22, 2024

maybe we can try to compare more benchmarks

};

use super::{Mask, Simd};
use crate::impl_lanes;

#[inline]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need the optimization. The generated asm from std::arch::is_x86_feature_detected has optimizations.
https://rust.godbolt.org/z/sdqefTPxW

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! That's nice to know. I'll be reverting that then

@@ -81,7 +81,8 @@ name = "value_operator"
harness = false

[features]
default = []
default = ["runtime-detection"]
Copy link
Collaborator

@liuq19 liuq19 Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the runtime detection always has fewer overheads, I think it is better not to enable the feature in the default

@liuq19
Copy link
Collaborator

liuq19 commented Sep 3, 2024

any updates?

@aumetra
Copy link
Contributor Author

aumetra commented Sep 3, 2024

Sorry, I've been busy the last two weeks, but I can hopefully do some work today at the airport

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

support CPU feature detection and dispatch in runtime
2 participants