# Improving cipher parallelism #444
I would love to remove `BlockCipher::ParBlocks`. I think a better direction will be to design an API which does not expose the number of parallel blocks at the type level at all, though it will probably make some things more difficult or slightly less efficient, e.g. stacking MAC + cipher algorithms in a one-pass fashion.
Currently we have to depend on it; it's a temporary state, since the flavors were introduced relatively recently.
No, the main benefit is that it makes it much easier for the compiler to keep data in XMM registers without spilling them to the stack. At this level that can create a noticeable performance regression.
Without SIMD types that can be used between crates, we wind up round-tripping data through byte arrays in the form of a bunch of intermediate buffers. I think the best way to avoid that is to have native SIMD types that can be passed directly between e.g. a cipher and a UHF.
By using a trait with a generic parameter, a cipher or UHF type can implement several SIMD buffer types, with the specific one selected at runtime.
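As a toy sketch of that pattern (all names here are hypothetical, not part of any actual crate): one cipher type implements the same trait for two buffer widths, and the caller picks whichever one the CPU supports.

```rust
// Hypothetical sketch: a trait generic over the SIMD buffer type, so one
// cipher can implement it for several buffer widths and the caller can
// select the widest supported one at runtime.

/// Stand-ins for "storage only" SIMD buffer types of different widths.
pub struct Buf128x8(pub [u8; 128]);
pub struct Buf256x4(pub [u8; 128]);

/// The buffer is a *generic parameter* rather than an associated type,
/// so multiple impls can coexist on one cipher.
pub trait EncryptBuffer<B> {
    fn encrypt_buffer(&self, buf: &mut B);
}

pub struct ToyCipher {
    pub key: u8,
}

impl EncryptBuffer<Buf128x8> for ToyCipher {
    fn encrypt_buffer(&self, buf: &mut Buf128x8) {
        // Toy "encryption" standing in for an SSE2-width backend.
        for b in buf.0.iter_mut() {
            *b ^= self.key;
        }
    }
}

impl EncryptBuffer<Buf256x4> for ToyCipher {
    fn encrypt_buffer(&self, buf: &mut Buf256x4) {
        // Same data, AVX2-width grouping.
        for b in buf.0.iter_mut() {
            *b ^= self.key;
        }
    }
}
```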
I think the SIMD buffer types can provide that as well.
The compiler is allowed to remove subsequent loads/stores, though it indeed won't be easy to design the API in a way which allows the compiler to do so reliably.
Can't the same approach be applied there? The problem is that with runtime detection we have two runtime switches in each crate, so the compiler cannot keep data in registers: it sees branches on different memory locations and cannot merge them. I don't think we can have a reliable solution without the ability to compile dependency crates several times with different feature flags. With such a feature we would be able to push runtime detection as high as possible without sacrificing runtime-switch capabilities at lower levels. But unfortunately there is not even a pre-RFC for such a feature, and I think the community and the lang team do not currently have enough interest in improving the situation in this area.
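A toy illustration of the two-switch problem described above (names and the cached-flag mechanism are invented for illustration, not taken from any real crate): each "crate" branches on its own detected-features flag, and since the compiler cannot prove the two flags agree, the intermediate buffer must round-trip through memory between the calls.

```rust
// Toy illustration: each "crate" has its own runtime feature switch stored
// in its own memory location. The compiler cannot merge the two branches,
// so `block` is materialized in memory between the cipher and the MAC
// instead of staying in SIMD registers.
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-ins for each crate's independently cached CPU-feature flag.
static CIPHER_HAS_AVX2: AtomicBool = AtomicBool::new(false);
static MAC_HAS_AVX2: AtomicBool = AtomicBool::new(false);

fn cipher_keystream(block: &mut [u8; 64]) {
    if CIPHER_HAS_AVX2.load(Ordering::Relaxed) {
        // An AVX2 backend would go here.
    }
    block.iter_mut().for_each(|b| *b ^= 0xAA); // portable fallback
}

fn mac_update(block: &[u8; 64]) -> u8 {
    if MAC_HAS_AVX2.load(Ordering::Relaxed) {
        // An AVX2 backend would go here.
    }
    block.iter().fold(0, |acc, b| acc ^ b) // portable fallback
}

fn seal(block: &mut [u8; 64]) -> u8 {
    // Two independent switches: the data cannot stay in registers
    // across this crate boundary.
    cipher_keystream(block);
    mac_update(block)
}
```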
Yes, but only if there were a separate trait for parallel operations; it wouldn't make sense to e.g. convert the current single-block trait. But then the question remains: what makes for a better SIMD buffer type?
Implements the following SIMD types, as proposed in RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic (if we needed that, `stdsimd`/`packed_simd` would be a better choice). The implementation *does* expose optimized XOR intrinsics, however, which seems to be the main thing useful in a portable cryptographic context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")` functions as part of its API, intended to be used/inlined within SIMD backends for particular algorithms.
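To make the "storage only plus XOR" idea concrete, here is a minimal portable sketch in the spirit of the `U128x8` type above (this is an illustrative reimplementation, not the actual code from the PR; an x86 backend would wrap `[__m128i; 8]` and use `_mm_xor_si128` instead):

```rust
// Portable sketch of a "storage only" SIMD buffer: no arithmetic, just
// byte-level conversions plus an XOR, which is the main operation the
// cipher/UHF crates need when passing data between each other.
use std::convert::TryInto;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U128x8([u128; 8]);

impl U128x8 {
    /// Load from a 128-byte buffer (e.g. eight 16-byte cipher blocks).
    pub fn from_bytes(bytes: &[u8; 128]) -> Self {
        let mut lanes = [0u128; 8];
        for (i, chunk) in bytes.chunks_exact(16).enumerate() {
            lanes[i] = u128::from_le_bytes(chunk.try_into().unwrap());
        }
        Self(lanes)
    }

    /// Store back to a byte buffer.
    pub fn to_bytes(&self) -> [u8; 128] {
        let mut out = [0u8; 128];
        for (i, lane) in self.0.iter().enumerate() {
            out[i * 16..][..16].copy_from_slice(&lane.to_le_bytes());
        }
        out
    }

    /// Lane-wise XOR, e.g. keystream into plaintext.
    pub fn xor(&self, rhs: &Self) -> Self {
        let mut lanes = [0u128; 8];
        for i in 0..8 {
            lanes[i] = self.0[i] ^ rhs.0[i];
        }
        Self(lanes)
    }
}
```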
I opened RustCrypto/utils#221, which contains a WIP prototype of the proposed SIMD buffer types.
I think we can close this issue.
I'm reopening this, as we continue to get complaints about performance, and attempting to implement one-pass operation for either AES-GCM or ChaCha20Poly1305 does not yield the expected speedups.
Taking a step back from #354, I thought it'd be good to look at how and where ILP and SIMD parallelism are currently used across the project as a whole, and how that could be improved.
The only place we presently have any sort of parallelism abstraction at the trait level is `BlockCipher::ParBlocks`. Otherwise various crates leverage e.g. SIMD internally. Regarding `BlockCipher::ParBlocks` specifically, the only crate that leverages it is the `aes` crate.

The following crates have SIMD backends:
**Ciphers**
- `aes`
- `chacha20`

**UHFs/"MACs"**
- `polyval`
- `poly1305`
**AEADs**

In AEADs, we'd like to glue the above crates together in fairly fixed combinations in order to leverage ILP, passing SIMD buffers from ciphers to UHFs for authentication:

- `aes-gcm`/`aes-gcm-siv`: `aes` + `ghash`/`polyval`
- `chacha20poly1305`: `chacha20` + `poly1305`

(also `aes-siv` and `pmac`, but these are less of a priority)

In either of these cases there's a single specific buffer type I think it'd be nice for both the cipher implementation and the UHF to support in common:
- `aes-gcm`/`aes-gcm-siv`: "i128x8", i.e. `[__m128i; 8]` on x86/x86_64
- `chacha20poly1305`: "i256x4", i.e. `[__m256i; 4]` on x86/x86_64

## Concrete proposal
My suggestion is to get rid of `BlockCipher::ParBlocks` and replace it with more general SIMD types and traits designed to work with them, namely:

- A `utils` crate, e.g. `simd-buffers`, which provides "i128x8" and "i256x4" SIMD buffer types backed by `__m128i`/`__m256i` on x86/x86_64, with a portable implementation otherwise. These types don't need to implement any sort of arithmetic, just provide wrappers for passing data between SIMD implementations.
- Traits in `cipher` and `universal-hash` which operate on SIMD buffers.
- Using the above in `aes-gcm`, `aes-gcm-siv`, and `chacha20poly1305`.

## `cipher` API suggestion

I'd suggest adding traits to `cipher` which use the SIMD buffer types and which are useful for both block ciphers and stream ciphers.

I also think it might make sense to use a generic parameter rather than an associated type, to permit support for multiple buffer types (e.g. on newer CPUs, "i128x4" might be a better option for AES, but we can support both).
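A hypothetical sketch of what such traits could look like (all names and signatures here are illustrative, not an actual proposed API):

```rust
// Illustrative sketch of buffer-parallel traits for `cipher`, with the
// SIMD buffer as a generic parameter per the suggestion above.
// `SimdBuffer` is a hypothetical marker trait for types from the
// proposed `simd-buffers` crate.

pub trait SimdBuffer: Sized {
    /// XOR another buffer into this one (e.g. keystream into plaintext).
    fn xor_in_place(&mut self, rhs: &Self);
}

/// Block cipher that encrypts a whole SIMD buffer of blocks at once.
pub trait BlockEncryptPar<B: SimdBuffer> {
    fn encrypt_par(&self, blocks: &mut B);
}

/// Stream cipher that yields its next buffer of keystream.
pub trait StreamCipherPar<B: SimdBuffer> {
    fn next_keystream(&mut self) -> B;
}

/// Toy buffer to show the marker trait in use.
pub struct Buf(pub [u8; 32]);

impl SimdBuffer for Buf {
    fn xor_in_place(&mut self, rhs: &Self) {
        for (a, b) in self.0.iter_mut().zip(&rhs.0) {
            *a ^= b;
        }
    }
}
```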
## `universal-hash` API suggestion
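For `universal-hash`, a hypothetical counterpart (illustrative names only): the UHF consumes the same SIMD buffer type the cipher produced, avoiding a round-trip through byte arrays between the two crates.

```rust
// Hypothetical sketch: a buffer-parallel update trait for universal
// hash functions, generic over the SIMD buffer type.

pub trait UniversalHashPar<B> {
    /// Absorb a SIMD buffer of blocks into the hash state.
    fn update_par(&mut self, blocks: &B);
}

/// Toy "UHF" (just XOR-folds bytes) to show the trait shape; a real
/// impl would be e.g. `polyval` absorbing eight blocks at once.
pub struct XorFold(pub u8);

impl UniversalHashPar<[u8; 128]> for XorFold {
    fn update_par(&mut self, blocks: &[u8; 128]) {
        for b in blocks {
            self.0 ^= b;
        }
    }
}
```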
## SIMD `ctr` support

Trying to move the end-user-facing `aes-ctr` types into `aes` has created a very annoying circular dependency between the `block-ciphers` and `stream-ciphers` repos. Furthermore, `ctr` is quite a bit more general now than what the CTR types in the `aes` crate provide, and `aes` doesn't actually provide the CTR "flavors" (`Ctr32BE`/`Ctr32LE`) needed by `aes-gcm` and `aes-gcm-siv`.

But really, it seems like the main benefit of the implementation in the `aes` crate is being able to use `_mm_xor_si128` to XOR an "i128x8" type.

If we had `BlockEncryptPar` and `StreamCipherPar` traits, the `ctr` crate could glue the two together, accepting a SIMD buffer as input, computing the next buffer of keystream output, and XORing the latter into the former. This would allow `ctr` to be generally SIMD-optimized, and also mean we only have one `ctr` implementation to worry about instead of a separate one in the `aes` crate.
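A rough sketch of that gluing (trait and type names are invented for illustration, and the SIMD buffer is modeled as a flat byte array rather than a real `__m128i`-backed type):

```rust
// Rough sketch: `ctr` asks a parallel block cipher for the next buffer
// of encrypted counters (the keystream) and XORs it into the data
// buffer. With real SIMD buffer types, the final loop is where
// `_mm_xor_si128` would be used.

pub trait BlockEncryptPar {
    /// Encrypt 8 blocks (modeled here as a flat 128-byte buffer) in place.
    fn encrypt_par(&self, blocks: &mut [u8; 128]);
}

pub struct Ctr<C> {
    cipher: C,
    counter: u64,
}

impl<C: BlockEncryptPar> Ctr<C> {
    pub fn new(cipher: C) -> Self {
        Self { cipher, counter: 0 }
    }

    /// Apply keystream to one 128-byte chunk (8 blocks of 16 bytes).
    pub fn apply_keystream(&mut self, data: &mut [u8; 128]) {
        let mut keystream = [0u8; 128];
        // Fill each block with its counter value, then encrypt the lot.
        for block in keystream.chunks_exact_mut(16) {
            block[..8].copy_from_slice(&self.counter.to_be_bytes());
            self.counter += 1;
        }
        self.cipher.encrypt_par(&mut keystream);
        // XOR the keystream into the data.
        for (d, k) in data.iter_mut().zip(&keystream) {
            *d ^= k;
        }
    }
}
```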