Avoid using YMM registers in gcm_ghash_vpclmulqdq_avx2_1 #2470
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff            @@
##              main    #2470    +/-  ##
=========================================
  Coverage    96.61%   96.62%
=========================================
  Files          180      180
  Lines        21814    21814
  Branches       539      539
=========================================
+ Hits         21076    21077     +1
  Misses         623      623
+ Partials       115      114     -1
[Adapted by Brian Smith to *ring*'s prior modifications.]

gcm_ghash_vpclmulqdq_avx2() executes vzeroupper only when len >= 32, as it was supposed to use only xmm registers for len == 16. But it was actually writing to two ymm registers unconditionally, for $BSWAP_MASK and $GFPOLY. Therefore, there could be a slow-down in later code using legacy SSE instructions, in the (probably rare) case where gcm_ghash_vpclmulqdq_avx2() was called with len=16 *and* wasn't followed by a function that does vzeroupper, e.g. aes_gcm_enc_update_vaes_avx2().

(The Windows xmm register restore epilogue of gcm_ghash_vpclmulqdq_avx2() itself does use legacy SSE instructions, so it probably was slowed down by this, but that's just 8 instructions.)

Fix this by updating gcm_ghash_vpclmulqdq_avx2() to correctly use only xmm registers when len=16. This makes it more closely match the similar code in aes-gcm-avx10-x86_64.pl, which does handle this correctly.

Also, make both functions execute vzeroupper unconditionally, so that it won't be missed again. It costs only about 1 cycle on the CPUs this code runs on, and it no longer seems worth executing conditionally.

Change-Id: I975cd5b526e5cdae1a567f4085c2484552bf6bea
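To make the shape of the fix concrete for anyone skimming the thread: the sketch below is plain C intrinsics, not the perlasm this PR actually changes, and the function name, signature, and placeholder XOR arithmetic are all illustrative. What it shows is the structure the commit message describes: the len == 16 path touches only 128-bit (xmm) state, and vzeroupper runs unconditionally on the way out.

```c
/*
 * Sketch only -- C intrinsics rather than the perlasm touched by this PR.
 * The function name, arguments, and the placeholder XOR "math" are
 * illustrative, not the real GHASH computation. Build with -mavx2.
 */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

void ghash_update_sketch(uint8_t x[16], const uint8_t *in, size_t len) {
    if (len >= 32) {
        /* Wide path: dirtying ymm registers here is fine, because the
         * unconditional vzeroupper below always cleans them up. */
        __m256i blocks = _mm256_loadu_si256((const __m256i *)in);
        __m128i lo = _mm256_castsi256_si128(blocks);
        __m128i hi = _mm256_extracti128_si256(blocks, 1);
        _mm_storeu_si128((__m128i *)x, _mm_xor_si128(lo, hi)); /* placeholder */
    } else {
        /* len == 16: load and combine only 128-bit values, so the upper
         * halves of the ymm registers are never written on this path. */
        __m128i block = _mm_loadu_si128((const __m128i *)in);
        __m128i acc = _mm_loadu_si128((const __m128i *)x);
        _mm_storeu_si128((__m128i *)x, _mm_xor_si128(acc, block)); /* placeholder */
    }
    /* Unconditional vzeroupper: roughly 1 cycle on the CPUs this code
     * targets, and it can never be skipped by mistake again. */
    _mm256_zeroupper();
}
```

The wide path can freely use ymm registers precisely because the cleanup no longer depends on which path was taken.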
My PR works fine on top of the b/xmm branch.
I benchmarked the changes to gcm_ghash_vpclmulqdq_avx2_1 with and without the …

Mini postmortem: Of course I was curious about whether the vzeroupper was needed when reviewing the original code. But the negative result was caused by the combination of …
Luckily, in this case, the result of the misunderstanding was just a performance issue, not a correctness issue.
Oh yeah, regarding whether it is worthwhile to remove the vzeroupper: I benchmarked with and without it, and the benchmarks showed no difference, so I left it in to match upstream.
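For anyone who wants to poke at the vzeroupper cost locally, a rough standalone micro-benchmark along these lines is enough; this is not the *ring* benchmark harness used for the numbers above, and the loop body and iteration count are arbitrary stand-ins.

```c
/* Standalone micro-benchmark sketch (not the ring benchmark harness);
 * the loop body and iteration count are arbitrary. Build with -O2 -mavx. */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

static double run(int with_zeroupper, long iters) {
    __m128i acc = _mm_setzero_si128();
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        /* Cheap xmm-only work standing in for the len == 16 GHASH path. */
        acc = _mm_add_epi64(acc, _mm_set1_epi64x(i));
        if (with_zeroupper) {
            _mm256_zeroupper();
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    /* Print a lane of the accumulator so the loop is not optimized away. */
    long long lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    fprintf(stderr, "checksum: %lld\n", lanes[0]);
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    const long iters = 100000000L;
    printf("without vzeroupper: %.3f s\n", run(0, iters));
    printf("with vzeroupper:    %.3f s\n", run(1, iters));
    return 0;
}
```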