# Improving cipher parallelism #444
I would love to remove `BlockCipher::ParBlocks`. I think a better direction will be to design an API which does not expose the number of parallel blocks at the type level at all, though it will probably make some things more difficult or slightly less efficient, e.g. stacking MAC + cipher algorithms in a one-pass fashion.
Currently we have to depend on it; it's a temporary state, since the flavors were introduced relatively recently.
No, the main benefit is that it makes it much easier for the compiler to keep data in XMM registers without spilling them to the stack. At this level that can create a noticeable performance regression.
Without SIMD types that can be used between crates, we wind up round-tripping data through byte arrays in the form of a bunch of intermediate buffers. I think the best way to avoid that is to have native SIMD types that can be passed directly between e.g. a cipher and a UHF.
By using a trait with a generic parameter, a cipher or UHF type can implement several SIMD buffer types, with the specific one selected at runtime.
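As a toy sketch of that pattern (all names here are hypothetical, not part of any actual crate): one cipher type implements the same trait for two buffer widths, and the caller picks whichever one the CPU supports.

```rust
// Hypothetical sketch: a trait generic over the SIMD buffer type, so one
// cipher can implement it for several buffer widths and the caller can
// select the widest supported one at runtime.

/// Stand-ins for "storage only" SIMD buffer types of different widths.
pub struct Buf128x8(pub [u8; 128]);
pub struct Buf256x4(pub [u8; 128]);

/// The buffer is a *generic parameter* rather than an associated type,
/// so multiple impls can coexist on one cipher.
pub trait EncryptBuffer<B> {
    fn encrypt_buffer(&self, buf: &mut B);
}

pub struct ToyCipher {
    pub key: u8,
}

impl EncryptBuffer<Buf128x8> for ToyCipher {
    fn encrypt_buffer(&self, buf: &mut Buf128x8) {
        // Toy "encryption" standing in for an SSE2-width backend.
        for b in buf.0.iter_mut() {
            *b ^= self.key;
        }
    }
}

impl EncryptBuffer<Buf256x4> for ToyCipher {
    fn encrypt_buffer(&self, buf: &mut Buf256x4) {
        // Same data, AVX2-width grouping.
        for b in buf.0.iter_mut() {
            *b ^= self.key;
        }
    }
}
```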
I think the SIMD buffer types can provide that as well.
The compiler is allowed to remove subsequent loads/stores, though it indeed won't be easy to design the API in a way which allows the compiler to do so reliably.
Can't the same approach be applied there? The problem is that with runtime detection we have two runtime switches in each crate, so the compiler cannot keep data in registers: it sees branches on different memory locations and cannot merge them. I don't think we can have a reliable solution without the ability to compile dependency crates several times with different feature flags. With such a feature we would be able to push runtime detection as high as possible without sacrificing runtime-switch capabilities at lower levels. But unfortunately there is not even a pre-RFC for such a feature, and I think the community and the lang team do not currently have enough interest in improving the situation in this area.
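A toy illustration of the two-switch problem described above (names and the cached-flag mechanism are invented for illustration, not taken from any real crate): each "crate" branches on its own detected-features flag, and since the compiler cannot prove the two flags agree, the intermediate buffer must round-trip through memory between the calls.

```rust
// Toy illustration: each "crate" has its own runtime feature switch stored
// in its own memory location. The compiler cannot merge the two branches,
// so `block` is materialized in memory between the cipher and the MAC
// instead of staying in SIMD registers.
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-ins for each crate's independently cached CPU-feature flag.
static CIPHER_HAS_AVX2: AtomicBool = AtomicBool::new(false);
static MAC_HAS_AVX2: AtomicBool = AtomicBool::new(false);

fn cipher_keystream(block: &mut [u8; 64]) {
    if CIPHER_HAS_AVX2.load(Ordering::Relaxed) {
        // An AVX2 backend would go here.
    }
    block.iter_mut().for_each(|b| *b ^= 0xAA); // portable fallback
}

fn mac_update(block: &[u8; 64]) -> u8 {
    if MAC_HAS_AVX2.load(Ordering::Relaxed) {
        // An AVX2 backend would go here.
    }
    block.iter().fold(0, |acc, b| acc ^ b) // portable fallback
}

fn seal(block: &mut [u8; 64]) -> u8 {
    // Two independent switches: the data cannot stay in registers
    // across this crate boundary.
    cipher_keystream(block);
    mac_update(block)
}
```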
Yes, but only if there were a separate trait for parallel operations; it wouldn't make sense to e.g. convert the current single-block trait. But then the question remains: what makes for a better SIMD buffer type?
Implements the following SIMD types, as proposed in RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic (if we needed that, `stdsimd`/`packed_simd` would be a better choice). The implementation *does* expose optimized XOR intrinsics, however, which seems to be the main thing useful in a portable cryptographic context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")` functions as part of its API, intended to be used/inlined within SIMD backends for particular algorithms.
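To make the "storage only plus XOR" idea concrete, here is a minimal portable sketch in the spirit of the `U128x8` type above (this is an illustrative reimplementation, not the actual code from the PR; an x86 backend would wrap `[__m128i; 8]` and use `_mm_xor_si128` instead):

```rust
// Portable sketch of a "storage only" SIMD buffer: no arithmetic, just
// byte-level conversions plus an XOR, which is the main operation the
// cipher/UHF crates need when passing data between each other.
use std::convert::TryInto;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U128x8([u128; 8]);

impl U128x8 {
    /// Load from a 128-byte buffer (e.g. eight 16-byte cipher blocks).
    pub fn from_bytes(bytes: &[u8; 128]) -> Self {
        let mut lanes = [0u128; 8];
        for (i, chunk) in bytes.chunks_exact(16).enumerate() {
            lanes[i] = u128::from_le_bytes(chunk.try_into().unwrap());
        }
        Self(lanes)
    }

    /// Store back to a byte buffer.
    pub fn to_bytes(&self) -> [u8; 128] {
        let mut out = [0u8; 128];
        for (i, lane) in self.0.iter().enumerate() {
            out[i * 16..][..16].copy_from_slice(&lane.to_le_bytes());
        }
        out
    }

    /// Lane-wise XOR, e.g. keystream into plaintext.
    pub fn xor(&self, rhs: &Self) -> Self {
        let mut lanes = [0u128; 8];
        for i in 0..8 {
            lanes[i] = self.0[i] ^ rhs.0[i];
        }
        Self(lanes)
    }
}
```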
I opened RustCrypto/utils#221, which contains a WIP prototype of the proposed SIMD buffer types.
I think we can close this issue.
I'm reopening this, as we continue to get complaints about performance, and attempting to implement one-pass operation for either AES-GCM or ChaCha20Poly1305 does not yield the expected speedups.
Taking a step back from #354, I thought it'd be good to look at how and where ILP and SIMD parallelism are currently used across the project as a whole, and how that could be improved.
The only place we presently have any sort of parallelism abstraction at the trait level is `BlockCipher::ParBlocks`. Otherwise various crates leverage e.g. SIMD internally. Regarding `BlockCipher::ParBlocks` specifically, the only crate that leverages it is the `aes` crate.

The following crates have SIMD backends:
**Ciphers**
- `aes`
- `chacha20`

**UHFs/"MACs"**
- `polyval`
- `poly1305`
**AEADs**

In AEADs, we'd like to glue the above crates together in fairly fixed combinations in order to leverage ILP, passing SIMD buffers from ciphers to UHFs for authentication:

- `aes-gcm`/`aes-gcm-siv`: `aes` + `ghash`/`polyval`
- `chacha20poly1305`: `chacha20` + `poly1305`

(also `aes-siv` and `pmac`, but these are less of a priority)

In either of these cases there's a single specific buffer type I think it'd be nice for both the cipher implementation and the UHF to support in common:
- `aes-gcm`/`aes-gcm-siv`: "i128x8", i.e. `[__m128i; 8]` on x86/x86_64
- `chacha20poly1305`: "i256x4", i.e. `[__m256i; 4]` on x86/x86_64

## Concrete proposal
My suggestion is to get rid of `BlockCipher::ParBlocks` and replace it with more general SIMD types and traits designed to work with them, namely:

- A `utils` crate, e.g. `simd-buffers`, which provides "i128x8" and "i256x4" SIMD buffer types backed by `__m128i`/`__m256i` on x86/x86_64, with a portable implementation otherwise. These types don't need to implement any sort of arithmetic, just provide wrappers for passing data between SIMD implementations.
- Traits in `cipher` and `universal-hash` which operate on SIMD buffers.
- Using the above in `aes-gcm`, `aes-gcm-siv`, and `chacha20poly1305`.

## `cipher` API suggestion

I'd suggest adding traits to `cipher` which use the SIMD buffer types and which are useful for both block ciphers and stream ciphers.

I also think it might make sense to use a generic parameter rather than an associated type, to permit support for multiple buffer types (e.g. on newer CPUs, "i128x4" might be a better option for AES, but we can support both).
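A hypothetical sketch of what such traits could look like (all names and signatures here are illustrative, not an actual proposed API):

```rust
// Illustrative sketch of buffer-parallel traits for `cipher`, with the
// SIMD buffer as a generic parameter per the suggestion above.
// `SimdBuffer` is a hypothetical marker trait for types from the
// proposed `simd-buffers` crate.

pub trait SimdBuffer: Sized {
    /// XOR another buffer into this one (e.g. keystream into plaintext).
    fn xor_in_place(&mut self, rhs: &Self);
}

/// Block cipher that encrypts a whole SIMD buffer of blocks at once.
pub trait BlockEncryptPar<B: SimdBuffer> {
    fn encrypt_par(&self, blocks: &mut B);
}

/// Stream cipher that yields its next buffer of keystream.
pub trait StreamCipherPar<B: SimdBuffer> {
    fn next_keystream(&mut self) -> B;
}

/// Toy buffer to show the marker trait in use.
pub struct Buf(pub [u8; 32]);

impl SimdBuffer for Buf {
    fn xor_in_place(&mut self, rhs: &Self) {
        for (a, b) in self.0.iter_mut().zip(&rhs.0) {
            *a ^= b;
        }
    }
}
```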
## `universal-hash` API suggestion
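For `universal-hash`, a hypothetical counterpart (illustrative names only): the UHF consumes the same SIMD buffer type the cipher produced, avoiding a round-trip through byte arrays between the two crates.

```rust
// Hypothetical sketch: a buffer-parallel update trait for universal
// hash functions, generic over the SIMD buffer type.

pub trait UniversalHashPar<B> {
    /// Absorb a SIMD buffer of blocks into the hash state.
    fn update_par(&mut self, blocks: &B);
}

/// Toy "UHF" (just XOR-folds bytes) to show the trait shape; a real
/// impl would be e.g. `polyval` absorbing eight blocks at once.
pub struct XorFold(pub u8);

impl UniversalHashPar<[u8; 128]> for XorFold {
    fn update_par(&mut self, blocks: &[u8; 128]) {
        for b in blocks {
            self.0 ^= b;
        }
    }
}
```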
## SIMD `ctr` support

Trying to move the end-user-facing `aes-ctr` types into `aes` has created a very annoying circular dependency between the `block-ciphers` and `stream-ciphers` repos. Furthermore, `ctr` is quite a bit more general now than what the CTR types in the `aes` crate provide, and `aes` doesn't actually provide the CTR "flavors" (`Ctr32BE`/`Ctr32LE`) needed by `aes-gcm` and `aes-gcm-siv`.

But really, it seems like the main benefit of the implementation in the `aes` crate is being able to use `_mm_xor_si128` to XOR an "i128x8" type.

If we had `BlockEncryptPar` and `StreamCipherPar` traits, the `ctr` crate could glue the two together, accepting a SIMD buffer as input, computing the next buffer of keystream output, and XORing the latter into the former. This would allow `ctr` to be generally SIMD-optimized, and also mean we only have one `ctr` implementation to worry about instead of a separate one in the `aes` crate.
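A rough sketch of that gluing (trait and type names are invented for illustration, and the SIMD buffer is modeled as a flat byte array rather than a real `__m128i`-backed type):

```rust
// Rough sketch: `ctr` asks a parallel block cipher for the next buffer
// of encrypted counters (the keystream) and XORs it into the data
// buffer. With real SIMD buffer types, the final loop is where
// `_mm_xor_si128` would be used.

pub trait BlockEncryptPar {
    /// Encrypt 8 blocks (modeled here as a flat 128-byte buffer) in place.
    fn encrypt_par(&self, blocks: &mut [u8; 128]);
}

pub struct Ctr<C> {
    cipher: C,
    counter: u64,
}

impl<C: BlockEncryptPar> Ctr<C> {
    pub fn new(cipher: C) -> Self {
        Self { cipher, counter: 0 }
    }

    /// Apply keystream to one 128-byte chunk (8 blocks of 16 bytes).
    pub fn apply_keystream(&mut self, data: &mut [u8; 128]) {
        let mut keystream = [0u8; 128];
        // Fill each block with its counter value, then encrypt the lot.
        for block in keystream.chunks_exact_mut(16) {
            block[..8].copy_from_slice(&self.counter.to_be_bytes());
            self.counter += 1;
        }
        self.cipher.encrypt_par(&mut keystream);
        // XOR the keystream into the data.
        for (d, k) in data.iter_mut().zip(&keystream) {
            *d ^= k;
        }
    }
}
```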