Fix missing vzeroupper in gcm_ghash_vpclmulqdq_avx2() for len=16
[Adapted by Brian Smith to *ring*'s prior modifications.]

gcm_ghash_vpclmulqdq_avx2() executed vzeroupper only when len >= 32, since it was supposed to use only xmm registers for len == 16. But it was actually writing to two ymm registers unconditionally, for $BSWAP_MASK and $GFPOLY. As a result, later code using legacy SSE instructions could be slowed down, in the (probably rare) case where gcm_ghash_vpclmulqdq_avx2() was called with len=16 *and* wasn't followed by a function that does vzeroupper, e.g. aes_gcm_enc_update_vaes_avx2().

(The Windows xmm register restore epilogue of gcm_ghash_vpclmulqdq_avx2() itself uses legacy SSE instructions, so it was probably slowed down by this, but that's just 8 instructions.)

Fix this by updating gcm_ghash_vpclmulqdq_avx2() to correctly use only xmm registers when len=16. This brings it more closely in line with the similar code in aes-gcm-avx10-x86_64.pl, which already handles this correctly.

Also, make both functions execute vzeroupper unconditionally, so that it won't be missed again. It costs only about 1 cycle on the CPUs this code runs on, and no longer seems worth executing conditionally.

Change-Id: I975cd5b526e5cdae1a567f4085c2484552bf6bea
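For context, a minimal sketch in C intrinsics of the hazard this fix addresses (not the perlasm in question; the function name and buffers here are hypothetical). Any write to a ymm register dirties its upper 128 bits, so a routine that touches ymm state must issue vzeroupper before legacy SSE code runs, or some CPUs pay an AVX-to-SSE transition penalty. Compile with -mavx on gcc/clang.

    #include <immintrin.h>
    #include <stdint.h>

    /* Hypothetical kernel illustrating the bug pattern: even if the
     * routine only means to use the xmm (low 128-bit) halves, a
     * 256-bit load writes a full ymm register and dirties its upper
     * half, just as the unconditional $BSWAP_MASK/$GFPOLY loads did. */
    static void ghash_like_kernel(uint8_t out[32], const uint8_t in[32]) {
        __m256i v = _mm256_loadu_si256((const __m256i *)in);
        _mm256_storeu_si256((__m256i *)out, v);

        /* Emit vzeroupper unconditionally, mirroring the fix: it costs
         * about 1 cycle and removes the length-dependent condition that
         * let the missing-vzeroupper case slip through. Without this,
         * any following legacy SSE code (such as a Windows xmm-restore
         * epilogue) can be slowed by the dirty ymm upper halves. */
        _mm256_zeroupper();
    }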