Possible missed vectorization in unrolled_dot #3218
Draft
pdogr wants to merge 15 commits into unicode-org:main from
Conversation
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot
Contributor (Author)
Benchmarks for Intel Mac: main's dynamic feature detection vs. compile-time feature detection
Member
nice!
Leads to a regression. This reverts commit e9d4bd3.
The possible cases for the dot implementation:
- "--target-feature=+avx,+fma" on x86/x86_64: compiles AVX versions of dot_1 and dot_2 [compile time]
- "--target-feature=+neon" on little-endian aarch64: compiles NEON versions of dot_1 and dot_2 [compile time]
- None of the above features enabled, no_std: defaults to the unrolled dot versions, as runtime feature detection requires "std" [compile time]
- std enabled: the fastest implementation is assigned to DOT_1_PTR and DOT_2_PTR during initialization, depending on the feature detected, defaulting to the unrolled versions. We incur the penalty of accessing the once_cell::sync::Lazy each time dot is called.
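The std-enabled case above can be sketched as a cached function pointer. This is not the PR's actual code: it uses `std::sync::OnceLock` in place of `once_cell::sync::Lazy`, and the names `DotFn`, `DOT_PTR`, `select_dot`, and `dot_unrolled` are illustrative. The per-call cost the comment above mentions shows up as the `get_or_init` lookup on every call.

```rust
use std::sync::OnceLock;

type DotFn = fn(&[f32], &[f32]) -> f32;

// Portable fallback; stands in for the unrolled versions in the PR.
fn dot_unrolled(xs: &[f32], ys: &[f32]) -> f32 {
    xs.iter().zip(ys).map(|(x, y)| x * y).sum()
}

// Function pointer selected once, on first use.
static DOT_PTR: OnceLock<DotFn> = OnceLock::new();

fn select_dot() -> DotFn {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            // The real PR would return the AVX+FMA implementation here;
            // this sketch only provides the fallback.
        }
    }
    dot_unrolled
}

fn dot(xs: &[f32], ys: &[f32]) -> f32 {
    // Every call pays the cost of fetching the cached pointer.
    DOT_PTR.get_or_init(select_dot)(xs, ys)
}
```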
The dot routine used in
math_helper (before ZeroSlice) does not vectorize: https://godbolt.org/z/v6bdroEPr. There are a bunch of vmulss (multiply scalar single precision) instructions in the asm. LLVM complains that the loop cannot be vectorized because floating-point operations are not commutative.
A similar thing occurs with a naive dot product impl https://godbolt.org/z/5G9hMvP63, which also fails to vectorize.
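The reason the naive loop stays scalar is that its single accumulator forms a sequential floating-point dependency chain, and LLVM will not reassociate FP additions by default. Below is a minimal sketch (not the crate's actual `unrolled_dot`) of the multiple-accumulator trick: four independent partial sums break the chain, which is what lets the compiler vectorize without fast-math flags.

```rust
// Sketch of the multiple-accumulator ("unrolled") dot product.
// Each of the four accumulators is independent of the others, so the
// compiler is free to compute them in SIMD lanes.
fn unrolled_dot(xs: &[f32], ys: &[f32]) -> f32 {
    let n = xs.len().min(ys.len());
    let mut acc = [0.0f32; 4];
    let chunks = n / 4;
    for i in 0..chunks {
        for lane in 0..4 {
            // Each lane accumulates independently of the other three.
            acc[lane] += xs[i * 4 + lane] * ys[i * 4 + lane];
        }
    }
    // Combine the partial sums, then handle the leftover tail scalarly.
    let mut sum: f32 = acc.iter().sum();
    for i in chunks * 4..n {
        sum += xs[i] * ys[i];
    }
    sum
}
```

Note that this changes the order in which the additions happen, so the result can differ from the naive loop in the last bits; that reordering is exactly what the vectorizer is otherwise forbidden to do on its own.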
This PR adds an AVX dot product using
fmadd (fused multiply-add) instructions, which leads to a performance improvement on my Mac Pro (x86-64), comparing with c7567d46b (HEAD -> main, origin/main) Bump webpack in /ffi/diplomat/js/examples/wasm-demo (#3199). The test suite run with
RUSTFLAGS="-C opt-level=2 -C target-cpu=native" cargo test --all-features also passes under experimental/segmenter.
Edit:
Reran the benchmarks:
HEAD: using command cargo bench --all-features -- "lstm" (it seems compiling at HEAD with -C target-cpu=native led to a performance regression?)
PR: using command RUSTFLAGS="-C opt-level=2 -C target-cpu=native" cargo bench --all-features -- "lstm"
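For reference, an AVX+FMA dot product of the kind the PR describes can be sketched as below. This is illustrative, not the PR's code: `dot_avx` and `dot_scalar` are made-up names, it assumes the two slices have equal length, and it falls back to a scalar loop when AVX/FMA are not detected (or on non-x86 targets), mirroring the runtime-dispatch scheme discussed above.

```rust
// Portable fallback, also used for the tail that doesn't fill a full register.
fn dot_scalar(xs: &[f32], ys: &[f32]) -> f32 {
    xs.iter().zip(ys).map(|(x, y)| x * y).sum()
}

// AVX+FMA path: 8 f32 lanes per iteration, accumulated with vfmadd.
// Safety: caller must ensure AVX and FMA are available; assumes
// xs.len() == ys.len().
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx,fma")]
unsafe fn dot_avx(xs: &[f32], ys: &[f32]) -> f32 {
    #[cfg(target_arch = "x86")]
    use core::arch::x86::*;
    #[cfg(target_arch = "x86_64")]
    use core::arch::x86_64::*;

    let mut acc = _mm256_setzero_ps();
    let chunks = xs.len() / 8;
    for i in 0..chunks {
        let a = _mm256_loadu_ps(xs.as_ptr().add(i * 8));
        let b = _mm256_loadu_ps(ys.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(a, b, acc); // acc += a * b, fused
    }
    // Horizontal sum of the 8 lanes, then the scalar tail.
    let mut buf = [0.0f32; 8];
    _mm256_storeu_ps(buf.as_mut_ptr(), acc);
    let mut sum: f32 = buf.iter().sum();
    sum += dot_scalar(&xs[chunks * 8..], &ys[chunks * 8..]);
    sum
}

fn dot(xs: &[f32], ys: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            // Safe: the required features were just detected.
            return unsafe { dot_avx(xs, ys) };
        }
    }
    dot_scalar(xs, ys)
}
```

As with the unrolled version, the SIMD accumulation reorders the additions, so results may differ from the naive loop in the low bits.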