would you rather have infinite gold coins but you have to dig up a treasure chest in a public park every day you've made a transaction with gold, or infinite silver coins but you can only drink wine and bathe in olive oil like the romans #144
Conversation
LGTM
I would appreciate it if you didn't comment on commits you didn't review, @HeronErin. This literally deletes 6 intrinsics, so it cannot be merged as-is.
Will address the issues today, probably. Almost everything I've done with emmintrin was really bad; presumably I was under a dark wizard's mind control when I wrote it.
- Comment out some inline asm stuff due to tests miraculously failing
- Add _mm_stream_load_si128nt
- Add _mm256_stream_load_si256nt
May or may not have erased the previous 28 commits with a force push, but fixes have been implemented... I believe the test failures on LDC release were due to some weird shenanigans with inline assembly, so I had to comment that out. Unfortunate, but at some point I'll have to look into why that was happening, as it also happened for …

Edit: the weird shenanigans may have been VEX instructions wanting me to have the return symbol as the destination, which makes sense in hindsight. Has been fixed alongside a few new changes.
- Add _mm256_permute4x64_epi64
- Fix the inline asm issue
- Mild optimization for bslli & bsrli mask generation
- Add some function attributes
When a test fails with optimization and there is assembly, it usually means the assembly was actually wrong and doesn't preserve registers correctly. In many, many cases there is an inline IR, a builtin, or a sequence of code that avoids the assembly. And yes, I'm not sure it even works for all x86 targets / combinations of flags.
I avoid writing D's agnostic inline assembly, but if you're aware of a case in which something like

```d
cast(__m256i) __asm!(long4)("vpermq $2, $1, $0", "=v,v,n", a, IMM8);
```

won't generate properly on LDC with AVX2, then I'll sink some hours into finding a higher-level way to do it, presumably with inline IR. The problem with unittests failing is fixed; I'm guessing it was because optimizations were leading to the first operand being contaminated.
Yes, saw the inline asm changing! It will probably be ok.
Your …
Commit history was wiped because I force-pushed to master, but these changes are in effect: …

I've also: …
Ah yes, my bad. I'm working on something else and will review/merge in the coming week; please hold on.
- Add `_mm_adds_epi32` (bonus)
- Add `_mm_dpbusds_epi32`
- Move AVX512 to a new folder containing the feature intrinsics
- Mark AVX512 intrinsics `nothrow` and `@nogc`
- Remove some comments that are no longer relevant
OK, this is merge day. This will be merged piece by piece on master; it's easier to review and change that way. Hence this PR will not get pulled as-is, but the content should be about the same. EDIT: I'm sorry, this stuff makes me angry.
In this case, Intel has gone and changed the signature to …
```d
/// #BONUS
__m128i _mm_adds_epi32(__m128i a, __m128i b) pure
{
    // PERF: ARM64 should use 2x vqadd_s32
    static if (LDC_with_saturated_intrinsics)
        return cast(__m128i) inteli_llvm_adds!int4(cast(int4)a, cast(int4)b);
    else
    {
        __m128i int_max   = _mm_set1_epi32(0x7FFFFFFF);
        __m128i res       = _mm_add_epi32(a, b);
        __m128i sign_bit  = _mm_srli_epi32(a, 31);
        __m128i sign_xor  = _mm_xor_si128(a, b);
        __m128i overflow  = _mm_andnot_si128(sign_xor, _mm_xor_si128(a, res));
        __m128i saturated = _mm_add_epi32(int_max, sign_bit);
        return cast(__m128i) _mm_blendv_ps(cast(__m128)res, // No CT check here
                                           cast(__m128)saturated,
                                           cast(__m128)overflow);
    }
}
```

Note: you can use any intrinsics you want, provided you use the same instruction set or an earlier one to implement later intrinsics. Because intel-intrinsics guarantees that each intrinsic is as fast as possible whatever the arch and flags, this forms a directed graph of optimal intrinsics. In this case, you can just use `_mm_blendv_ps` without worrying about whether SSE4.1 is there or not (mostly, because sometimes there isn't a simple match either, and inlining needs to be there). All intrinsics are literally always available.
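A quick sanity check of the saturating behavior described above, in the repo's usual unittest style (a sketch with values of my choosing, not a test from the PR):

```d
unittest
{
    __m128i a = _mm_setr_epi32(int.max, int.min,  1000, -1000);
    __m128i b = _mm_setr_epi32(      1,      -1,  2000, -2000);
    int4 R = cast(int4) _mm_adds_epi32(a, b);
    int[4] correct = [int.max, int.min, 3000, -3000]; // first two lanes saturate
    assert(R.array == correct);
}
```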
Opened #145 to keep track of all the remaining review and merging; it's very detailed work, as you've seen.
This is a …

```d
auto hi = _mm_slli_si128!CNT(_mm256_extractf128_si256!0(a));
auto lo = _mm_slli_si128!CNT(_mm256_extractf128_si256!1(a));
return _mm256_setr_m128i(hi, lo);
```

Beware the double inversion here (see the note below).
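To spell out what the review presumably means (my reading, not text from it): `_mm256_extractf128_si256!0` extracts the *low* lane and `!1` the *high* lane, while `_mm256_setr_m128i` takes its arguments in (low, high) order, so the `hi`/`lo` names are inverted twice and the result is only accidentally correct. The same fragment with names matching the lanes:

```d
__m128i lo = _mm_slli_si128!CNT(_mm256_extractf128_si256!0(a)); // low 128 bits
__m128i hi = _mm_slli_si128!CNT(_mm256_extractf128_si256!1(a)); // high 128 bits
return _mm256_setr_m128i(lo, hi); // setr places (low, high)
```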
When you don't know how an intrinsic should be implemented in LDC, you can look at: https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avx2intrin.h For example here: …
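As an illustration of that porting pattern (a hypothetical sketch; the specific example referenced above is lost, and the builtin/fallback pairing here is my assumption): clang's header shows which builtin each intrinsic lowers to, and GDC exposes GCC builtins under the same `__builtin_ia32_*` names.

```d
// Hypothetical sketch: porting _mm256_abs_epi32, which clang's avx2intrin.h
// lowers to a pabsd builtin, into intel-intrinsics' static-if style.
__m256i _mm256_abs_epi32(__m256i a) pure @trusted
{
    static if (GDC_with_AVX2)
        return cast(__m256i) __builtin_ia32_pabsd256(cast(int8)a);
    else
    {
        // Generic fallback: per-lane absolute value.
        int8 r = cast(int8)a;
        foreach (i; 0 .. 8)
            r[i] = (r[i] >= 0) ? r[i] : -r[i]; // -int.min wraps to int.min, matching vpabsd
        return cast(__m256i) r;
    }
}
```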
The reason I did the …
Yeah, I figured …
Absolutely.

Pros of LLVM asm: …

Cons of LLVM asm: …
That one is interesting. The `_mm_sllv_epi32` intrinsic shifts each element by its own count, and a count larger than 31 simply zeroes that element. So you could shift by, say, 78 bits. However, when implemented:

```d
__m128i _mm_sllv_epi32(__m128i a, __m128i b) pure @trusted
{
    static if (GDC_with_AVX2 || LDC_with_AVX2)
        return cast(__m128i) __builtin_ia32_psllv4si(cast(byte16)a, cast(byte16)b);
    else
    {
        return _mm_setr_epi32(
            a[0] << b[0],
            a[1] << b[1],
            a[2] << b[2],
            a[3] << b[3]
        );
    }
}
```

it uses the `<<` operator, which is UB when the shift is > 31. And indeed, look at: https://github.com/simd-everywhere/simde/blob/master/simde/x86/avx2.h#L5009
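One UB-free way to write that fallback (a sketch under the assumption that `__m128i` aliases `int4` here, as elsewhere in intel-intrinsics; not necessarily what will be merged): branch on the count so `<<` never sees a shift of 32 or more, reproducing vpsllvd's zeroing behavior.

```d
__m128i _mm_sllv_epi32(__m128i a, __m128i b) pure @trusted
{
    __m128i r = void;
    foreach (i; 0 .. 4)
    {
        // vpsllvd zeroes an element whose count is >= 32; reproduce that
        // instead of handing an out-of-range count to <<.
        uint count = cast(uint) b.array[i];
        r.ptr[i] = (count > 31) ? 0 : (a.array[i] << count);
    }
    return r;
}
```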
Done. |
This list of changes doesn't factor in what may have been added by other people; for example, `_mm256_blendv_epi8` was added upstream but I had already implemented it. I did try to take from upstream rather than from myself when there were conflicts, but this list doesn't account for that, nor is it entirely exhaustive of all my changes.

- `avx512intrin.d`
- `vpopcntdqintrin.d`
- `_mm256_setr_m128*` and `_mm256_set1_epi64x` pure
- `_mm256_shuffle_epi8`
- `_mm256_blendv_epi8`
- `_mm256_bslli_epi128`
- `_mm256_bsrli_epi128`
- `_mm256_slli_epi128`
- `_mm256_srli_epi128`
- `_mm_maskload_epi64`
- `_mm256_maskload_epi32`
- `_mm256_maskload_epi64`
- `_mm_sllv_epi32`
- `_mm_sllv_epi64`
- `_mm_srlv_epi32`
- `_mm_srlv_epi64`
- `_mm256_stream_load_si256` (implements `clflush` for correctness if the intrinsic doesn't exist)
- `_mm256_shuffle_epi32`
- `_mm256_shufflehi_epi16`
- `_mm256_shufflelo_epi16`
- `_mm256_popcnt_epi32`
- `_mm256_popcnt_epi64`
- `_mm256_popcnt` (pseudo-intrinsic)