Use embedded broadcast to replicate constants for AVX512. #147

Shark64 · 2024-05-23T14:32:33Z

Here is the minimal pull request. The only extra thing I've sneaked in is switching `cmp reg, 0` to `test reg, reg`: it matters less than it did years ago, when only test/jcc was macro-fused, but it's still a byte shorter, and the `cmp` looked ugly ;). It shouldn't create any problems, I think :D.

Signed-off-by: Nicola Torracca <[email protected]>
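To make the "a byte shorter" remark concrete, here is a small C sketch. It is an illustration only, not code from the PR: it assumes `eax` as the example register and the usual short encodings (`83 /7 ib` for `cmp`, `85 /r` for `test`), and the function names are hypothetical. It also models the zero flag to show that both forms agree on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Machine-code bytes for the two forms, with eax as the example register. */
static const uint8_t CMP_EAX_0[]    = { 0x83, 0xF8, 0x00 }; /* cmp  eax, 0   */
static const uint8_t TEST_EAX_EAX[] = { 0x85, 0xC0 };       /* test eax, eax */

/* Number of code bytes saved by using the test form. */
size_t bytes_saved(void) {
    return sizeof CMP_EAX_0 - sizeof TEST_EAX_EAX;
}

/* ZF as `cmp reg, 0` derives it: the result of (reg - 0) is zero. */
static int zf_after_cmp_zero(uint32_t reg) {
    return (uint32_t)(reg - 0u) == 0;
}

/* ZF as `test reg, reg` derives it: the result of (reg & reg) is zero. */
static int zf_after_test_self(uint32_t reg) {
    return (reg & reg) == 0;
}

/* Returns 1 when both forms produce the same zero flag for `reg`. */
int zero_flag_matches(uint32_t reg) {
    return zf_after_cmp_zero(reg) == zf_after_test_self(reg);
}

/* Quick sanity checks over a few representative values. */
void check_cmp_vs_test(void) {
    assert(bytes_saved() == 1);
    assert(zero_flag_matches(0));
    assert(zero_flag_matches(0xFFFFFFFFu));
}
```

With a 64-bit register each form gains a REX.W prefix, so the one-byte difference stays the same.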

pablodelara

A few comments, thanks for the work!

pablodelara · 2024-05-23T15:23:17Z

sha256_mb/sha256_mb_x16_avx512.asm

+	dd	0x90befffa
+	dd	0xa4506ceb
+	dd	0xbef9a3f7
+	dd	0xc67178f2


 PSHUFFLE_BYTE_FLIP_MASK: dq 0x0405060700010203, 0x0c0d0e0f08090a0b


Same as in sha1_mb: these two need to be aligned to 64 bytes.


pablodelara · 2024-05-23T15:24:05Z

sha1_mb/sha1_mb_x16_avx512.asm

+K00_19:                 dd 0x5A827999
+K20_39:                 dd 0x6ED9EBA1
+K40_59:                 dd 0x8F1BBCDC
+K60_79:                 dd 0xCA62C1D6

 PSHUFFLE_BYTE_FLIP_MASK: dq 0x0405060700010203, 0x0c0d0e0f08090a0b


This needs to be aligned to 64 bytes (same for MASK2 below)

Shark64 · 2024-05-23

I didn't align PSHUFFLE_BYTE_FLIP_MASK because the four 32-bit constants above it are aligned to 64 bytes, so the 4 × 32-bit (16-byte) offset leaves it aligned to 16 bytes, which is the natural alignment for the 128-bit load. You're right about the transpose masks; better to keep them aligned to avoid a split load.


pablodelara · 2024-05-24

The problem is not the split load. I got a segfault because there are aligned load instructions that need this data to be 64-byte aligned.

Shark64 · 2024-05-24

Oh, that's interesting, I didn't get any SEGFAULT when running `make test`. Anyway, they should be aligned now; is it working OK for you?

pablodelara · 2024-05-24

Now yes :) Will merge this soon, thanks!
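The alignment arithmetic in this exchange can be sketched in C. This is an illustration under assumptions, not the PR's code: the struct is a hypothetical stand-in for the data section in the diff (four 32-bit constants starting on a 64-byte boundary, then the shuffle mask), and the function names are made up. It shows why a label 16 bytes past a 64-byte boundary is fine for a 128-bit aligned load but would fault if used as the operand of a 64-byte aligned load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout mirroring the diff: four 32-bit round constants
 * starting on a 64-byte boundary, followed by the 16-byte shuffle mask. */
struct sha1_consts {
    _Alignas(64) uint32_t k[4];  /* K00_19 .. K60_79 */
    uint64_t flip_mask[2];       /* PSHUFFLE_BYTE_FLIP_MASK */
};

/* Nonzero when `offset` from a 64-byte-aligned base is a multiple of `align`. */
int offset_is_aligned(size_t offset, size_t align) {
    return offset % align == 0;
}

/* Byte offset of the mask from the start of the aligned block. */
size_t flip_mask_offset(void) {
    return offsetof(struct sha1_consts, flip_mask);
}

/* The mask sits at +16: enough for a 16-byte aligned load,
 * not enough for a 64-byte aligned load. */
void check_mask_alignment(void) {
    assert(flip_mask_offset() == 16);
    assert(offset_is_aligned(flip_mask_offset(), 16));
    assert(!offset_is_aligned(flip_mask_offset(), 64));
}
```

Data that is read with a full-width aligned load needs its own 64-byte alignment directive; relying on the position of the preceding constants only guarantees 16 bytes here.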

Shark64 · 2024-05-24

BTW, if you want, I can also make a quick separate pull request to use VPTERNLOG in SM3 for the boolean functions. It's such a nice instruction that it's a shame not to use it wherever possible ;)

pablodelara · 2024-05-24

If you promise it's a short one that you can send in the next few hours :P

Shark64 · 2024-05-24

I'll give it a try :P
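For context on the VPTERNLOG idea, here is a hedged C sketch of how the instruction's 8-bit immediate can be derived for the SM3 boolean functions. It is not taken from any follow-up PR; the function names are hypothetical. It relies on the standard trick of evaluating the bitwise expression on the constants 0xF0, 0xCC and 0xAA, which stand for the first, second and third operand, and on the SM3 definitions of FF and GG (XOR of the three inputs for rounds 0 to 15; majority and choose for rounds 16 to 63).

```c
#include <assert.h>
#include <stdint.h>

/* Truth-table inputs: evaluating a three-operand bitwise function on these
 * constants yields the imm8 that makes VPTERNLOG compute that function. */
enum { TL_A = 0xF0, TL_B = 0xCC, TL_C = 0xAA };

/* FF and GG for rounds 0..15: X ^ Y ^ Z. */
static uint8_t sm3_xor3(uint8_t x, uint8_t y, uint8_t z) {
    return (uint8_t)(x ^ y ^ z);
}

/* FF for rounds 16..63: majority, (X & Y) | (X & Z) | (Y & Z). */
static uint8_t sm3_majority(uint8_t x, uint8_t y, uint8_t z) {
    return (uint8_t)((x & y) | (x & z) | (y & z));
}

/* GG for rounds 16..63: choose, (X & Y) | (~X & Z). */
static uint8_t sm3_choose(uint8_t x, uint8_t y, uint8_t z) {
    return (uint8_t)((x & y) | (~x & z));
}

/* Immediates for the three functions. */
uint8_t imm8_xor3(void)     { return sm3_xor3(TL_A, TL_B, TL_C); }
uint8_t imm8_majority(void) { return sm3_majority(TL_A, TL_B, TL_C); }
uint8_t imm8_choose(void)   { return sm3_choose(TL_A, TL_B, TL_C); }

/* The derived values match the commonly quoted immediates. */
void check_ternlog_immediates(void) {
    assert(imm8_xor3() == 0x96);
    assert(imm8_majority() == 0xE8);
    assert(imm8_choose() == 0xCA);
}
```

Each of these functions otherwise takes two to four separate logic instructions, which is the saving a single ternary-logic instruction would offer.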


pablodelara · 2024-05-24T15:23:04Z

Code is now merged, thanks for the work @Shark64!

Use embedded broadcast to replicate constants for AVX512.

f884233

Signed-off-by: Nicola Torracca <[email protected]>

pablodelara reviewed May 23, 2024


Added missing `vpbroadcast` in sha1_mb_x16_avx512.asm; align data to avoid cacheline splits for 64bytes loads.

e292e62

Signed-off-by: Nicola Torracca <[email protected]>

pablodelara closed this May 24, 2024


