would you rather have infinite gold coins but you have to dig a treasure chest in a public park every day you've made a transaction with gold, or infinite silver coins but you can only drink wine and bathe in olive oil like the Romans #144

Closed
wants to merge 5 commits

Conversation

@cetio cetio commented Oct 17, 2024

This list of changes doesn't factor in what may have been added by other people; for example, _mm256_blendv_epi8 was added upstream but I had already implemented it. I did try to take from upstream rather than from my own version when there were conflicts, but this list doesn't account for that, nor is it an exhaustive list of all my changes.

  1. Add flags for LDC AVX512 and files avx512intrin.d vpopcntdqintrin.d
  2. Make _mm256_setr_m128* and _mm256_set1_epi64x pure
  3. Add _mm256_shuffle_epi8
  4. Add _mm256_blendv_epi8
  5. Add _mm256_bslli_epi128
  6. Add _mm256_bsrli_epi128
  7. Add _mm256_slli_epi128
  8. Add _mm256_srli_epi128
  9. Add _mm_maskload_epi64
  10. Add _mm256_maskload_epi32
  11. Add _mm256_maskload_epi64
  12. Add _mm_sllv_epi32
  13. Add _mm_sllv_epi64
  14. Add _mm_srlv_epi32
  15. Add _mm_srlv_epi64
  16. Add _mm256_stream_load_si256 (implements clflush for correctness if the intrinsic doesn't exist)
  17. Add _mm256_shuffle_epi32
  18. Add _mm256_shufflehi_epi16
  19. Add _mm256_shufflelo_epi16
  20. Add _mm256_popcnt_epi32
  21. Add _mm256_popcnt_epi64
  22. Add _mm256_popcnt (pseudo-intrinsic)

@PsychedelicPalimpsest

LGTM

p0nce commented Oct 17, 2024

Dear Sir,

Welcome to upstreamistan. This must have been a long journey there and back.
This is a valuable and helpful, albeit massive, change request and I will see that it is merged in due time.
I hope you stay for the dessert.

My tentative plan for a Big Merge Event is the following:

    1. Read your changes and make a list of everything that I think should change.
    2. Have YOU do this stuff as much as possible instead of me.
    3. Merge stuff, whoever completes the desired changes.

Now comes the list of annoyances.

  • please remove dub.selections.json

emmintrin.d

  • _mm_setr_epi64x: well, I have the Intel Intrinsics Guide open in my browser, and it says this takes __m64 (aka long1), like in C++ compilers, not long. If the Guide has stupid names, we keep them.
  • Similarly, change the semantics to match the Guide (screenshot omitted).
    You can absolutely add new "intrinsics", but please mark them with a #BONUS tag in a comment; otherwise Intel Intrinsics Guide semantics are expected. Though sure, in this case it doesn't really hurt.
  • the => operator is not usable in all supported compilers, which may be as ancient as ldc-1.24.0; please create a function, sorry.
  • please do not remove unittests even if they look desperate to find something.

types.d => LGTM
internals.d => LGTM
package.d => LGTM
vpopcntdqintrin.d => _mm256_popcnt should have a #BONUS tag, otherwise LGTM; nice use of SAD (see the sketch below)
avxintrin.d => some functions can even be pure @safe, otherwise LGTM
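
For reference, here is a minimal sketch of the SAD-based popcount idea (hypothetical helper name, not the PR's actual code): count bits per nibble with a 16-entry LUT via pshufb, then let _mm_sad_epu8 horizontally sum the byte counts into each 64-bit lane.

import inteli.emmintrin;  // _mm_and_si128, _mm_sad_epu8, ...
import inteli.tmmintrin;  // _mm_shuffle_epi8

// Hypothetical sketch: 4-bit LUT popcount, then PSADBW against zero
// horizontally adds the per-byte counts into each 64-bit lane.
__m128i popcnt_epi64_sketch(__m128i v) pure
{
    __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    __m128i mask = _mm_set1_epi8(0x0F);
    __m128i lo   = _mm_and_si128(v, mask);
    __m128i hi   = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
    __m128i cnt  = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                                _mm_shuffle_epi8(lut, hi));
    return _mm_sad_epu8(cnt, _mm_setzero_si128());
}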

avx2intrin.d

  • _mm256_bslli_epi128 crashes in dub test --compiler ldc2 -a x86_64 -b unittest-release-inst -f
  • 6 recent intrinsics would be erased by the merge, along with a few comments, so this has to be merged carefully (or rebased)
  • I still think the scope(exit) _mm_clflush(mem_addr); line is wrong: streaming loads are looser than regular loads. Fences are needed when you use streaming loads/stores, not when you don't. Think about it (see the sketch after this list).
  • If you have stamina left (or if you didn't leave from the start), you can add a // PERF arm64 or // PERF DMD passive-aggressive comment for the maintainer
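
A minimal sketch of the fencing convention described above, assuming intel-intrinsics' SSE/SSE2 API (streamCopy is a hypothetical example function): the fence pairs with non-temporal stores, while plain loads need neither a fence nor a clflush.

import inteli.emmintrin;  // _mm_loadu_si128, _mm_stream_si128
import inteli.xmmintrin;  // _mm_sfence

// Hypothetical example: copy with non-temporal stores, then fence once so
// the streamed data is globally visible before anything that depends on it.
void streamCopy(__m128i* dst, const(__m128i)* src, size_t n)
{
    foreach (i; 0 .. n)
        _mm_stream_si128(dst + i, _mm_loadu_si128(src + i)); // NT store
    _mm_sfence(); // pairs with the streaming stores above
}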

I hope you stay and keep bringing stuff up.

p0nce commented Oct 18, 2024

LGTM

I would appreciate it if you didn't comment on commits you haven't reviewed, @HeronErin. This literally deletes 6 intrinsics, so it cannot be merged as is.

cetio commented Oct 18, 2024

Will address the issues today, probably. Almost everything I've done with emmintrin was really bad, and presumably I was under a dark wizard's mind control when I wrote it.

Comment out some inline asm stuff due to tests miraculously failing
Add _mm_stream_load_si128nt
Add _mm256_stream_load_si256nt
cetio commented Oct 18, 2024

May or may not have erased the previous 28 commits with a force push, but fixes have been implemented...

I believe the test failures on LDC release builds were due to some weird shenanigans with inline assembly, so I had to comment that out. Unfortunate, but at some point I'll have to look into why that was happening, as it also happened for the _mm256_popcnt unittest but none of the others.

Edit

The weird shenanigans may have been VEX instructions wanting the return symbol as the destination operand, which makes sense in hindsight. This has been fixed alongside a few new changes.

cetio added 2 commits October 19, 2024 03:20
Add _mm256_permute4x64_epi64
Fix the inline asm issue
Mild optimization for bslli & bsrli mask generation
Add some function attributes
p0nce commented Oct 19, 2024

When a test fails with optimization enabled and there is inline assembly, it usually means the assembly was actually wrong and doesn't preserve registers correctly. In many, many cases there is inline IR, a builtin, or a sequence of code that avoids the assembly. And yes, I'm not sure it even works for all x86 targets / combinations of flags.

cetio commented Oct 19, 2024

I avoid writing D's agnostic inline assembly, but if you're aware of a case in which something like

cast(__m256i)__asm!(long4)("
    vpermq $2, $1, $0"
, "=v,v,n", a, IMM8);

won't generate properly on LDC with AVX2, then I'll sink some hours into finding a higher-level way to do it, presumably with shufflevector. But personally I'm unaware of such a case, and since it's LLVM I would imagine it should always generate properly.

The problem with the failing unittests is fixed; I'm guessing it was because optimizations were leading to the first operand being clobbered.

p0nce commented Oct 19, 2024

Yes, saw the inline asm changing! It will probably be ok.

p0nce commented Oct 19, 2024

Your avx2intrin.d changes have disappeared from the PR, is this intentional?

cetio commented Oct 19, 2024

Commit history was wiped because I force-pushed to master, but these changes are in effect:

  1. Add flags for LDC AVX512 and files avx512intrin.d vpopcntdqintrin.d vnniintrin.d
  2. Make _mm256_setr_m128* and _mm256_set1_epi64x pure
  3. Add _mm256_shuffle_epi8
  4. Add _mm256_bslli_epi128
  5. Add _mm256_bsrli_epi128
  6. Add _mm256_slli_epi128
  7. Add _mm256_srli_epi128
  8. Add _mm_sllv_epi32
  9. Add _mm_sllv_epi64
  10. Add _mm_srlv_epi32
  11. Add _mm_srlv_epi64
  12. Add _mm256_shuffle_epi32
  13. Add _mm256_shufflehi_epi16
  14. Add _mm256_shufflelo_epi16
  15. Add _mm256_popcnt_epi32
  16. Add _mm256_popcnt_epi64
  17. Add _mm256_popcnt (bonus)
  18. Add _mm256_permute4x64_epi64
  19. Add _mm_dpbusd_epi32
  20. Add _mm_stream_load_si128nt (bonus)
  21. Add _mm256_stream_load_si256nt (bonus)
  22. Add _mm256_cvtepi32lo_epi16 (bonus)
  23. Add _mm_dpbusds_epi32
  24. Add _mm_adds_epi32 (bonus)

I've also:

  1. Done a little optimization
  2. Added #BONUS tags, like you said, for sequences that aren't actually intrinsics
  3. Reverted anything that isn't on this list, like with emmintrin
  4. Added some attributes like pure when possible (some AVX512 stuff could be marked a little more strictly)
  5. Made sure changes from upstream aren't lost (as far as I'm aware)
  6. Added more comments about performance and whatnot for clarity
  7. Made miscellaneous fixes, like the unittest failures on release LDC

p0nce commented Oct 19, 2024

Ah yes, my bad. I'm working on something else and will review/merge in the coming week; please hold on.

cetio added 2 commits October 20, 2024 23:26
Add `_mm_adds_epi32` (bonus)
Add `_mm_dpbusds_epi32`
Move AVX512 to a new folder containing the feature intrinsics
Make AVX512 intrinsics marked `nothrow` and `@nogc`
Remove some comments that are no longer relevant
p0nce commented Oct 22, 2024

OK, this is merge day. This will be merged piece by piece on master; it's easier to review and change that way. Hence this PR will not get pulled as is, but the content should be about the same.

EDIT: I'm sorry this stuff makes me angry

p0nce commented Oct 22, 2024

// NOTE Why is this not const(**) like _mm256_stream_load_si256?

In this case, Intel has gone and changed the signature to void* since we implemented it, so we're also going with void*, even though it should be const(void)*.
They also added _mm_load_si64 to fix _mm_loadu_epi64's weird signature.

p0nce commented Oct 22, 2024

/// #BONUS
__m128i _mm_adds_epi32(__m128i a, __m128i b) pure
{
    // PERF: ARM64 should use 2x vqadd_s32
    static if (LDC_with_saturated_intrinsics)
        return cast(__m128i)inteli_llvm_adds!int4(cast(int4)a, cast(int4)b);
    else
    {
        __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);
        __m128i res = _mm_add_epi32(a, b);
        __m128i sign_bit = _mm_srli_epi32(a, 31);
        __m128i sign_xor  = _mm_xor_si128(a, b);
        __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a, res));
        __m128i saturated = _mm_add_epi32(int_max, sign_bit);
        return cast(__m128i) _mm_blendv_ps(cast(__m128)res,  // No CT check here
            cast(__m128)saturated, 
            cast(__m128)overflow);
    }
}

Note: you can use any intrinsics you want, provided that you use the same instruction set or earlier to implement later intrinsics. Because intel-intrinsics guarantees that each intrinsic is as fast as possible whatever the arch and flags, this makes a directed graph of optimal intrinsics. In this case, you can just use _mm_blendv_ps without worrying about whether SSE4.1 is there or not (mostly, because sometimes there isn't a simple match either, and inlining needs to be there). All intrinsics are literally always available.
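
To make the saturation semantics concrete, a hypothetical unittest sketch for the _mm_adds_epi32 above:

unittest
{
    // int.max + 1 saturates instead of wrapping around to int.min.
    __m128i r = _mm_adds_epi32(_mm_set1_epi32(int.max), _mm_set1_epi32(1));
    assert(r.array[0] == int.max);

    // int.min + (-1) saturates at the negative end.
    r = _mm_adds_epi32(_mm_set1_epi32(int.min), _mm_set1_epi32(-1));
    assert(r.array[0] == int.min);
}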

p0nce commented Oct 22, 2024

Opened #145 to keep track of all remaining review and merging; it's very detailed work, as you've seen.

p0nce commented Oct 22, 2024

// PERF This is almost definitely not the best way to do this.
// Don't quote me on this but I'm pretty sure that there isn't a need to add extra
// code for obvious things like CNT == 8 zeroing half of each lane or whatever because
// shuffle should be able to complete fast enough that whatever optimizations will likely
// lead to negligible performance benefit.

This is a static if.
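
In other words, the branch is resolved at compile time, so the special case costs nothing at run time. A minimal hypothetical sketch of the pattern:

import inteli.emmintrin;  // _mm_slli_si128, _mm_setzero_si128

// Hypothetical sketch: `static if` picks a branch during compilation, so
// the CNT >= 16 special case never generates any runtime check.
__m128i bslli_sketch(ubyte CNT)(__m128i a) pure
{
    static if (CNT >= 16)
        return _mm_setzero_si128(); // every byte shifted out
    else
        return _mm_slli_si128!CNT(a);
}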

p0nce commented Oct 22, 2024

auto hi = _mm_slli_si128!CNT(_mm256_extractf128_si256!0(a));
auto lo = _mm_slli_si128!CNT(_mm256_extractf128_si256!1(a));
return _mm256_setr_m128i(hi, lo);

Beware the double inversion here:

  • _mm256_setr_m128i takes the low lane first, then the high lane
  • hi is extracted as the low lane and lo as the high lane, which is why it works anyway
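
For comparison, the same fragment with the names matching the lanes (identical behavior; purely a readability sketch):

auto lo = _mm_slli_si128!CNT(_mm256_extractf128_si256!0(a)); // low 128-bit lane
auto hi = _mm_slli_si128!CNT(_mm256_extractf128_si256!1(a)); // high 128-bit lane
return _mm256_setr_m128i(lo, hi); // setr takes the low lane first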

p0nce commented Oct 22, 2024

When you don't know how an intrinsic should be implemented in LDC, you can look at: https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avx2intrin.h

For example here: _mm256_bslli_epi128 uses a builtin named __builtin_ia32_pslldqi256_byteshift, which we do not have in D. However, it's sometimes possible to find its trace in LLVM in this file: https://github.com/ldc-developers/llvm-project/blob/ldc-release/18.x/llvm/include/llvm/IR/IntrinsicsX86.td (in which case it would be available with a pragma intrinsic). Here there is nothing, so the instruction is probably available with shufflevector in LDC and a builtin in GDC.
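
As an illustration of the shufflevector route (a hedged sketch assuming LDC; shown on _mm256_permute4x64_epi64 rather than the byte shift, since its lane mapping is simpler, and the helper name is hypothetical):

import ldc.simd : shufflevector;
import inteli.types; // __m256i, long4

// Hypothetical sketch: decode IMM8 into compile-time lane indices and let
// LLVM's shufflevector select the instruction (vpermq on AVX2 targets).
__m256i permute4x64_sketch(int imm8)(__m256i a)
{
    return cast(__m256i) shufflevector!(long4,
        (imm8 >> 0) & 3,
        (imm8 >> 2) & 3,
        (imm8 >> 4) & 3,
        (imm8 >> 6) & 3)(cast(long4)a, cast(long4)a);
}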

cetio commented Oct 22, 2024

(Quoting the _mm_adds_epi32 example and note above.)

The reason I did the static if is that I'd rather have better control over the operations so I can fine-tune optimization, and it also makes it simpler to add AVX512 optimizations later; it's not that I was worried about being unable to access an intrinsic. It's just not always best to do broad operations when you could specialize based on hardware/flags instead.

cetio commented Oct 22, 2024

Yeah, I figured shufflevector could be used almost everywhere I used inline asm, which would probably simplify things and make outputs more reliable. I didn't implement it that way because I figured it was simpler to just use assembly, as it should always output fine given the proper flags, and I try to make sure the slow path is optimal. I should have implemented GDC builtins, though.

p0nce commented Oct 22, 2024

Absolutely.
Yes, I think the pros and cons are:

pros of LLVM asm:

cons of LLVM asm:

  • inline IR like shufflevector works for any arch
  • almost always optimal; I never had a case where it was inferior
  • not sure about the VEX effect for intrinsics that can be AVX or non-AVX

p0nce commented Oct 22, 2024

That one is interesting.

The _mm_srlv_xxx and _mm_sllv_xxx are wrong because the instruction and intrinsic have a defined semantic for shifts larger than or equal to the bitness (Intel Intrinsics Guide screenshot omitted: per-lane counts >= 32 produce 0).

So you could shift by, say, 78 bits.

However when implemented:

__m128i _mm_sllv_epi32(__m128i a, __m128i b) pure @trusted
{
    static if (GDC_with_AVX2 || LDC_with_AVX2)
        return cast(__m128i)__builtin_ia32_psllv4si(cast(int4)a, cast(int4)b);
    else
    {
        return _mm_setr_epi32(
            a[0] << b[0],
            a[1] << b[1],
            a[2] << b[2],
            a[3] << b[3]
        );
    }
}

it uses the << operator, which is UB in D when the shift count is greater than 31 (screenshot of the spec omitted).
And indeed the results will differ on x86 vs ARM.
So we have to make this one slower to imitate the instruction semantics.

And indeed look at: https://github.com/simd-everywhere/simde/blob/master/simde/x86/avx2.h#L5009
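
A sketch of such a semantics-correct fallback (hypothetical helper name; assumes intel-intrinsics' int4 lane access via .array):

import inteli.emmintrin;
import inteli.types; // int4

// Hypothetical sketch: imitate VPSLLVD exactly, so any per-lane count >= 32
// yields 0 instead of hitting D's UB for oversized shifts.
__m128i sllv_epi32_fixed(__m128i a, __m128i count) pure
{
    int4 r;
    foreach (i; 0 .. 4)
    {
        uint c = cast(uint) count.array[i];
        r.array[i] = (c < 32) ? (a.array[i] << c) : 0;
    }
    return cast(__m128i) r;
}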

p0nce commented Oct 26, 2024

Done.

@p0nce p0nce closed this Oct 26, 2024