Skip to content

Possible missed vectorization in unrolled_dot#3218

Draft
pdogr wants to merge 15 commits intounicode-org:mainfrom
pdogr:sse-dot
Draft

Possible missed vectorization in unrolled_dot#3218
pdogr wants to merge 15 commits intounicode-org:mainfrom
pdogr:sse-dot

Conversation

@pdogr
Copy link
Copy Markdown
Contributor

@pdogr pdogr commented Mar 23, 2023

The dot routine used in math_helper (before ZeroSlice) does not vectorize https://godbolt.org/z/v6bdroEPr. There are a bunch of vmulss(multiply scalar single precision) instructions in the asm.
llvm complains that the loop cannot be vectorized as floating-point operations are not commtative.

A similar thing occurs with a naive dot product impl https://godbolt.org/z/5G9hMvP63, which also fails to vectorize.

This pr adds an avx dot product using fmadd(fused multiply add) instructions that leads to a performance improvement on my Mac pro (x86-64) (comparing with c7567d46b (HEAD -> main, origin/main) Bump webpack in /ffi/diplomat/js/examples/wasm-demo (#3199))

Line Break/UTF8/Th/lstm time:   [451.82 µs 454.34 µs 457.04 µs]
                        change: [-4.9073% -3.7435% -2.4420%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Line Break/UTF16/Th/lstm
                        time:   [452.50 µs 456.03 µs 459.75 µs]
                        change: [-4.8695% -3.6713% -2.5002%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

The test suite run with RUSTFLAGS="-C opt-level=2 -C target-cpu=native" cargo test --all-features also passes under experimental/segmenter.

Edit:
Reran the benchmarks

HEAD: using command cargo bench --all-features -- "lstm" (It seems compiling at HEAD with -Ctarget-cpu=native lead to a performance regression?)

PR: using command RUSTFLAGS="-C opt-level=2 -C target-cpu=native" cargo bench --all-features -- "lstm"

@robertbastian robertbastian added the C-segmentation Component: Segmentation label Apr 13, 2023
@jira-pull-request-webhook
Copy link
Copy Markdown

Notice: the branch changed across the force-push!

  • components/segmenter/src/complex/lstm/matrix.rs is now changed in the branch
  • experimental/segmenter/src/lib.rs is no longer changed in the branch
  • experimental/segmenter/src/line.rs is no longer changed in the branch
  • experimental/segmenter/src/math_helper.rs is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@pdogr
Copy link
Copy Markdown
Contributor Author

pdogr commented May 23, 2023

Benchmarks for Intel Mac

main cargo bench --all-features -p icu_segmenter -- "lstm"

Line Break/UTF8/Th/lstm time:   [400.98 µs 404.45 µs 408.42 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Line Break/UTF16/Th/lstm
                        time:   [396.37 µs 400.42 µs 405.52 µs]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

dynamic feature detection using OnceCell cargo bench --all-features -p icu_segmenter -- "lstm"

Line Break/UTF8/Th/lstm time:   [367.98 µs 370.66 µs 373.47 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Line Break/UTF16/Th/lstm
                        time:   [367.89 µs 370.54 µs 373.36 µs]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

compile time feature detection RUSTFLAGS="-C opt-level=3 -C target-feature=+avx,+fma" cargo bench --all-features -p icu_segmenter -- "lstm"

Line Break/UTF8/Th/lstm time:   [350.15 µs 352.51 µs 355.14 µs]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Line Break/UTF16/Th/lstm
                        time:   [351.04 µs 354.48 µs 358.58 µs]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

@sffc
Copy link
Copy Markdown
Member

sffc commented May 23, 2023

nice!

pdogr added 9 commits May 24, 2023 10:03
Leads to a regression

This reverts commit e9d4bd3.
The dot implementation possible cases:
- "--target-feature=+avx,+fma" + (x86, x86_64) compiles avx versions for
  dot_1 and dot_2 [compile time]
- "--target-feature=+neon" + aarch64 + little endian compiles neon
  versions for dot_1 and dot_2 [compile time]
- None of the above features enabled
 - no_std defaults to using unrolled dot versions as runtime feature
   detection requires "std" [compile time]
 - if std is enabled, the fastest implementation is assigned to
   DOT_1_PTR, DOT_2_PTR during initialization, depending on the feature
   detected defaulting to the unrolled versions. We incur the penalty of
   accessing once_cell::sync::Lazy each time dot is called.
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Sep 29, 2025

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C-segmentation Component: Segmentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants