Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

polyval: detect VPCLMULQDQ at runtime #184

Open
tarcieri opened this issue Jul 26, 2023 · 4 comments
Open

polyval: detect VPCLMULQDQ at runtime #184

tarcieri opened this issue Jul 26, 2023 · 4 comments

Comments

@tarcieri
Copy link
Member

As of #44, polyval will compile to VPCLMULQDQ instructions on new enough CPU architectures.

We might be able to use a trick similar to RustCrypto/password-hashes#440 where we detect the relevant CPU features and call a special function annotated with target_feature to ensure it's always used where available.

@newpavlov
Copy link
Member

newpavlov commented Jul 27, 2023

This is reply to this comment.

POLYVAL/GHASH can be broken down into a parallelizable portion and a sequential portion... there's an accumulation of the output that is inherently sequential, but multiplication of the inputs can be performed in parallel.

I am not sure I understand. In our implementation we XOR input block x with inner state y, multiply the XOR result with h, and store the multiplication result in y. I don't see where we can process 4 input blocks at once, which can be done with _mm512_clmulepi64_epi128

Maybe you had Poly1305 in mind?

@tarcieri
Copy link
Member Author

tarcieri commented Jul 27, 2023

We already implement POLYVAL in parallel using ILP. It could use VPCLMULQDQ instead (automatically, when available, as opposed to requiring special RUSTFLAGS)

@newpavlov
Copy link
Member

We process one block at a time. ILP is used only for the 3 _mm_clmulepi64_si128 calls, only 2 of which use the same immediate argument. Here is generated assembly for our current implementation: https://rust.godbolt.org/z/zs1acTozM In my understanding, at most we can explicitly merge 2 CLMUL calls with 0x00 immediate into one _mm256_clmulepi64_epi128 call.

@tarcieri
Copy link
Member Author

The optimization I wanted to explore in this particular issue is to find a way to enable VPCLMULQDQ optimizations without the user having to pass -C target-cpu=skylake as RUSTFLAGS, i.e. by enabling the required features via target_feature(enable = "...")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants