
Add x86 SIMD optimizations to crypto datatypes #507

Merged: 1 commit merged into cisco:main on Mar 14, 2023

Conversation

Lastique
Contributor

@Lastique Lastique commented Sep 21, 2020

  • The v128 operations are optimized for SSE2/SSSE3/SSE4.1.
  • srtp_octet_string_is_eq is optimized for SSE2. When SSE2 is not available, use a pair of 32-bit accumulators to speed up the bulk of the operation. We use two accumulators to leverage instruction-level parallelism supported by most modern CPUs.
  • In srtp_cleanse, use memset and prevent it from being optimized away with a dummy asm statement that the compiler must assume can consume the contents of the memory.
  • Endian conversion functions use gcc-style intrinsics, when possible.
  • In base64_block_to_octet_triple, prefer memchr to strchr as it explicitly accepts the string length, which is known at compile time.
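The scalar fallback described in the second bullet can be sketched roughly as follows. This is an illustration of the dual-accumulator idea, not libsrtp's actual srtp_octet_string_is_eq; the function name and exact loop structure here are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: compare two buffers without data-dependent branches,
 * accumulating differences into two independent 32-bit accumulators
 * so the CPU can overlap the loads and XORs of adjacent iterations. */
static int bytes_equal_scalar(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint32_t acc1 = 0, acc2 = 0;
    size_t i = 0;

    /* Bulk: two 32-bit words per iteration, one per accumulator. */
    for (; i + 8 <= len; i += 8) {
        uint32_t wa1, wb1, wa2, wb2;
        memcpy(&wa1, a + i, 4);
        memcpy(&wb1, b + i, 4);
        memcpy(&wa2, a + i + 4, 4);
        memcpy(&wb2, b + i + 4, 4);
        acc1 |= wa1 ^ wb1;
        acc2 |= wa2 ^ wb2;
    }
    /* Tail: byte at a time. */
    for (; i < len; ++i)
        acc1 |= (uint32_t)(a[i] ^ b[i]);

    return (acc1 | acc2) == 0; /* 1 if equal, 0 otherwise */
}
```

Because the two accumulators carry no dependency on each other, a superscalar core can issue both XOR/OR chains in parallel instead of serializing on a single accumulator.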

The SIMD code uses intrinsics, which are available on all modern compilers. For MSVC, config_in_cmake.h is modified to define gcc/clang-style SSE macros based on MSVC's predefined macros: all SSE versions are enabled when MSVC indicates that AVX is enabled, and SSE2 is always enabled for x86-64, or for x86 when SSE2 FP math is enabled.
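The kind of macro mapping described here can be sketched as the preprocessor fragment below. This is a hedged illustration of the approach, not the exact contents of config_in_cmake.h; the actual macro names and structure in the PR may differ.

```c
/* Sketch: derive gcc/clang-style SSE feature macros from MSVC's
 * predefined macros. MSVC defines __AVX__ under /arch:AVX, _M_X64 on
 * x86-64, and _M_IX86_FP >= 2 on 32-bit x86 under /arch:SSE2. */
#if defined(_MSC_VER) && !defined(__clang__)
#if defined(__AVX__)
/* AVX implies every SSE level. */
#define __SSE2__ 1
#define __SSSE3__ 1
#define __SSE4_1__ 1
#elif defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
/* x86-64 always has SSE2; on x86, /arch:SSE2 enables SSE2 FP math. */
#define __SSE2__ 1
#endif
#endif
```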

@pabuhler
Member

pabuhler commented Oct 8, 2020

Hi @Lastique, thanks for these.

I guess these changes are based on real-world usage; do you have any benchmarks on the speed improvements for a given configuration and platform?

@Lastique
Contributor Author

Lastique commented Oct 8, 2020

I have attached a unit benchmark to #508, but I don't have one at hand for the changes in this PR. I did benchmark the change when I originally wrote this patch for our in-house use a few years back, but unfortunately I didn't save the results. I suppose I could write one anew, if required.

Basically, these two PRs came out of our attempts to reduce libsrtp entries in the profiling reports for our WebRTC SFU software. The software maintains lots of SRTP connections (hundreds), and these changes allowed us to save a few percent of total CPU usage. The most CPU-consuming part touched by this PR that came up in our reports is bitvector_left_shift; the other places are mostly where I saw an easy opportunity to optimize.
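The v128 operations mentioned in the PR description illustrate the kind of win involved: a single SSE2 instruction replaces a four-iteration 32-bit loop. The sketch below is illustrative only; the union layout and function name are assumptions, not libsrtp's actual v128_t API.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Illustrative 128-bit value: four 32-bit lanes overlaid on an SSE2
 * register type, loosely modeled on a v128-style union. */
typedef union {
    uint32_t v32[4];
    __m128i m;
} v128_sketch;

/* XOR all 128 bits in one instruction instead of looping over lanes. */
static void v128_xor_sse2(v128_sketch *z, const v128_sketch *x,
                          const v128_sketch *y)
{
    z->m = _mm_xor_si128(x->m, y->m);
}
```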

@Lastique
Contributor Author

Rebased on top of the current master.

@pabuhler
Member

pabuhler commented Mar 1, 2023

@Lastique thanks for updating; I will take a new look at this now.

@Lastique Lastique force-pushed the optimize_datatypes branch from 52d61dd to faa7998 Compare March 5, 2023 14:51
Member

@pabuhler pabuhler left a comment


Will wait a few more days before merging in case anyone has an objection.

@@ -272,9 +480,14 @@ int srtp_octet_string_is_eq(uint8_t *a, uint8_t *b, int len)

void srtp_cleanse(void *s, size_t len)
{
#if defined(__GNUC__)
memset(s, 0, len);
__asm__ __volatile__("" : : "r"(s) : "memory");
Member


This is probably a little unrelated to the rest ... unless it was a performance problem.
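The diff excerpt above shows only the GCC branch of srtp_cleanse. A hedged sketch of the full memset-plus-barrier idiom, with a portable fallback (the fallback path and function name here are assumptions, not necessarily what the PR uses for non-GCC compilers), might look like:

```c
#include <string.h>

static void cleanse_sketch(void *s, size_t len)
{
#if defined(__GNUC__) || defined(__clang__)
    memset(s, 0, len);
    /* Empty asm that takes the pointer as input and clobbers "memory":
     * the compiler must assume the zeroed bytes are observed, so it
     * cannot delete the memset as a dead store. */
    __asm__ __volatile__("" : : "r"(s) : "memory");
#else
    /* Portable fallback: volatile byte stores cannot be elided. */
    volatile unsigned char *p = (volatile unsigned char *)s;
    while (len--)
        *p++ = 0;
#endif
}
```

Without such a barrier, a compiler is allowed to remove a memset on memory that is about to go out of scope, which is exactly the case for sensitive buffers being cleansed before free.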

@pabuhler pabuhler merged commit 398f299 into cisco:main Mar 14, 2023
@pabuhler
Member

Thanks for the PR @Lastique, and sorry it took so long to get in.

@Lastique Lastique deleted the optimize_datatypes branch March 14, 2023 19:55
@Lastique
Contributor Author

Thanks.
