Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use embedded broadcast to replicate constants for AVX512. #147

Closed
wants to merge 2 commits into from

Conversation

Shark64
Copy link
Contributor

@Shark64 Shark64 commented May 23, 2024

Here is the minimal pull request. The only thing I've sneaked in this is switching `cmp reg, 0' to 'test reg, reg': while it's less important than years ago when only test/jcc was macrofused, it's still a byte shorter and looked ugly ;). Shouldn't create any problem, i think :D .

Copy link
Contributor

@pablodelara pablodelara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, thanks for the work!

dd 0x90befffa
dd 0xa4506ceb
dd 0xbef9a3f7
dd 0xc67178f2


PSHUFFLE_BYTE_FLIP_MASK: dq 0x0405060700010203, 0x0c0d0e0f08090a0b
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as sha1_mb, these 2 need to be aligned to 64 bytes

sha1_mb/sha1_mb_x16_avx512.asm Show resolved Hide resolved
K00_19: dd 0x5A827999
K20_39: dd 0x6ED9EBA1
K40_59: dd 0x8F1BBCDC
K60_79: dd 0xCA62C1D6

PSHUFFLE_BYTE_FLIP_MASK: dq 0x0405060700010203, 0x0c0d0e0f08090a0b
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be aligned to 64 bytes (same for MASK2 below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't align PSHUFFLE_BYTE_FLIP_MASK because the 4 32bit constant above it are aligned to 64 bytes, so +4*32 bit makes it aligned to 16 bytes which is the natural size for the 128bit load. You're right about the transpose masks, better to keep them aligned to avoid a split-load.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't align PSHUFFLE_BYTE_FLIP_MASK because the 4 32bit constant above it are aligned to 64 bytes, so +4*32 bit makes it aligned to 16 bytes which is the natural size for the 128bit load. You're right about the transpose masks, better to keep them aligned to avoid a split-load.

The problem is not the split-load. I got a seg fault because there are aligned load instructions that need this data to be 64-byte aligned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's interesting, i didn't get any SEGFAULT when doing `make test'. Anyway now they should be aligned, is it working ok for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now yes :) Will merge this soon, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, if you want i can also make a quick separate pull requests to use VPTERNLOG in SM3 for the boolean functions. It's such a nice instruction that it's a shame not to use it wherever possible ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you promise it's a short one that you can send in the next few hours :P

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give it a try :P

avoid cacheline splits for 64bytes loads.

Signed-off-by: Nicola Torracca <[email protected]>
@pablodelara
Copy link
Contributor

Code is now merged, thanks for the work @Shark64!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants