Avoid using YMM registers in gcm_ghash_vpclmulqdq_avx2_1 #2470
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff            @@
##              main    #2470    +/-  ##
=========================================
  Coverage    96.61%   96.62%
=========================================
  Files          180      180
  Lines        21814    21814
  Branches       539      539
=========================================
+ Hits         21076    21077     +1
  Misses         623      623
+ Partials       115      114     -1
[Adapted by Brian Smith to *ring*'s prior modifications.]

gcm_ghash_vpclmulqdq_avx2() executes vzeroupper only when len >= 32, as it was supposed to use only xmm registers for len == 16. But it was actually writing to two ymm registers unconditionally, for $BSWAP_MASK and $GFPOLY. Therefore, there could be a slow-down in later code using legacy SSE instructions, in the (probably rare) case where gcm_ghash_vpclmulqdq_avx2() was called with len=16 *and* wasn't followed by a function that does vzeroupper, e.g. aes_gcm_enc_update_vaes_avx2().

(The Windows xmm register restore epilogue of gcm_ghash_vpclmulqdq_avx2() itself does use legacy SSE instructions, so it probably was slowed down by this, but that's just 8 instructions.)

Fix this by updating gcm_ghash_vpclmulqdq_avx2() to correctly use only xmm registers when len=16. This makes it more closely match the similar code in aes-gcm-avx10-x86_64.pl, which does handle this correctly.

Also, make both functions execute vzeroupper unconditionally, so that it won't be missed again. It costs only about 1 cycle on the CPUs this code runs on, and it no longer seems worth executing conditionally.

Change-Id: I975cd5b526e5cdae1a567f4085c2484552bf6bea
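To make the shape of the fix concrete for anyone skimming the thread: the sketch below is plain C intrinsics, not the perlasm this PR actually changes, and the function name, signature, and placeholder XOR arithmetic are all illustrative. What it shows is the structure the commit message describes: the len == 16 path touches only 128-bit (xmm) state, and vzeroupper runs unconditionally on the way out.

```c
/*
 * Sketch only -- C intrinsics rather than the perlasm touched by this PR.
 * The function name, arguments, and the placeholder XOR "math" are
 * illustrative, not the real GHASH computation. Build with -mavx2.
 */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

void ghash_update_sketch(uint8_t x[16], const uint8_t *in, size_t len) {
    if (len >= 32) {
        /* Wide path: dirtying ymm registers here is fine, because the
         * unconditional vzeroupper below always cleans them up. */
        __m256i blocks = _mm256_loadu_si256((const __m256i *)in);
        __m128i lo = _mm256_castsi256_si128(blocks);
        __m128i hi = _mm256_extracti128_si256(blocks, 1);
        _mm_storeu_si128((__m128i *)x, _mm_xor_si128(lo, hi)); /* placeholder */
    } else {
        /* len == 16: load and combine only 128-bit values, so the upper
         * halves of the ymm registers are never written on this path. */
        __m128i block = _mm_loadu_si128((const __m128i *)in);
        __m128i acc = _mm_loadu_si128((const __m128i *)x);
        _mm_storeu_si128((__m128i *)x, _mm_xor_si128(acc, block)); /* placeholder */
    }
    /* Unconditional vzeroupper: roughly 1 cycle on the CPUs this code
     * targets, and it can never be skipped by mistake again. */
    _mm256_zeroupper();
}
```

The wide path can freely use ymm registers precisely because the cleanup no longer depends on which path was taken.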
My PR works fine on top of the b/xmm branch.
I benchmarked the changes to gcm_ghash_vpclmulqdq_avx2_1 with and without the …

Mini postmortem: Of course I was curious about whether the vzeroupper was needed when reviewing the original code. But the negative result was caused by the combination of …
Luckily, in this case, the result of the misunderstanding was just a performance issue, not a correctness issue.
Oh yeah, regarding whether it is worthwhile to remove the vzeroupper: I benchmarked with and without it, and the benchmarks showed no difference, so I left it in to match upstream.
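For anyone who wants to poke at the vzeroupper cost locally, a rough standalone micro-benchmark along these lines is enough; this is not the *ring* benchmark harness used for the numbers above, and the loop body and iteration count are arbitrary stand-ins.

```c
/* Standalone micro-benchmark sketch (not the ring benchmark harness);
 * the loop body and iteration count are arbitrary. Build with -O2 -mavx. */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

static double run(int with_zeroupper, long iters) {
    __m128i acc = _mm_setzero_si128();
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        /* Cheap xmm-only work standing in for the len == 16 GHASH path. */
        acc = _mm_add_epi64(acc, _mm_set1_epi64x(i));
        if (with_zeroupper) {
            _mm256_zeroupper();
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    /* Print a lane of the accumulator so the loop is not optimized away. */
    long long lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    fprintf(stderr, "checksum: %lld\n", lanes[0]);
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    const long iters = 100000000L;
    printf("without vzeroupper: %.3f s\n", run(0, iters));
    printf("with vzeroupper:    %.3f s\n", run(1, iters));
    return 0;
}
```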