would you rather have infinite gold coins but you have to dig a treasure chest in a public park every day you've made a transaction with gold, or infinite silver coins but you can only drink wine and bathe in olive oil like the Romans #144

Closed
wants to merge 5 commits

Conversation

@cetio cetio commented Oct 17, 2024

This list of changes doesn't factor in what may have been added by other people; for example, _mm256_blendv_epi8 was added upstream but I had already implemented it. I did try to take from upstream rather than from my own version when there were conflicts, but this list doesn't account for that, nor is it an exhaustive list of all my changes.

  1. Add flags for LDC AVX512 and files avx512intrin.d vpopcntdqintrin.d
  2. Make _mm256_setr_m128* and _mm256_set1_epi64x pure
  3. Add _mm256_shuffle_epi8
  4. Add _mm256_blendv_epi8
  5. Add _mm256_bslli_epi128
  6. Add _mm256_bsrli_epi128
  7. Add _mm256_slli_epi128
  8. Add _mm256_srli_epi128
  9. Add _mm_maskload_epi64
  10. Add _mm256_maskload_epi32
  11. Add _mm256_maskload_epi64
  12. Add _mm_sllv_epi32
  13. Add _mm_sllv_epi64
  14. Add _mm_srlv_epi32
  15. Add _mm_srlv_epi64
  16. Add _mm256_stream_load_si256 (implements clflush for correctness if the intrinsic doesn't exist)
  17. Add _mm256_shuffle_epi32
  18. Add _mm256_shufflehi_epi16
  19. Add _mm256_shufflelo_epi16
  20. Add _mm256_popcnt_epi32
  21. Add _mm256_popcnt_epi64
  22. Add _mm256_popcnt (pseudo-intrinsic)

@PsychedelicPalimpsest

LGTM

p0nce commented Oct 17, 2024

Dear Sir,

Welcome to upstreamistan. This must have been a long journey there and back.
This is a valuable and helpful, albeit massive, change request and I will see that it is merged in due time.
I hope you stay for the dessert.

My tentative plan for a Big Merge Event is the following:

    1. Read your changes and make a list of everything that I think should change.
    2. Have YOU do this stuff as much as possible instead of me.
    3. Merge stuff, whoever completes the desired changes.

Now comes the list of annoyances.

  • please remove dub.selections.json

emmintrin.d

  • _mm_setr_epi64x: well, I have the Intel Intrinsics Guide open in my browser, and it says this takes __m64 (aka long1), like in C++ compilers, not long. If the Guide has stupid names, we keep them.
  • Similarly, change the semantics to match the Guide (screenshot omitted).
    You can absolutely add new "intrinsics", but please mark them with a #BONUS tag in a comment; otherwise Intel Intrinsics Guide semantics are expected. Though sure, in this case it doesn't really hurt.
  • the => operator is not usable in all supported compilers, which may be as ancient as ldc-1.24.0; please create a function, sorry.
  • please do not remove unittests even if they look desperate to find something.

types.d => LGTM
internals.d => LGTM
package.d => LGTM
vpopcntdqintrin.d => _mm256_popcnt should have a #BONUS tag, otherwise LGTM; nice use of SAD (see the sketch below)
avxintrin.d => some functions can even be pure @safe, otherwise LGTM
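
For reference, here is a minimal sketch of the SAD-based popcount idea (hypothetical helper name, not the PR's actual code): count bits per nibble with a 16-entry LUT via pshufb, then let _mm_sad_epu8 horizontally sum the byte counts into each 64-bit lane.

import inteli.emmintrin;  // _mm_and_si128, _mm_sad_epu8, ...
import inteli.tmmintrin;  // _mm_shuffle_epi8

// Hypothetical sketch: 4-bit LUT popcount, then PSADBW against zero
// horizontally adds the per-byte counts into each 64-bit lane.
__m128i popcnt_epi64_sketch(__m128i v) pure
{
    __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    __m128i mask = _mm_set1_epi8(0x0F);
    __m128i lo   = _mm_and_si128(v, mask);
    __m128i hi   = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
    __m128i cnt  = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                                _mm_shuffle_epi8(lut, hi));
    return _mm_sad_epu8(cnt, _mm_setzero_si128());
}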

avx2intrin.d

  • _mm256_bslli_epi128 crashes in dub test --compiler ldc2 -a x86_64 -b unittest-release-inst -f
  • 6 recent intrinsics would be erased by the merge, along with a few comments, so this has to be merged carefully (or rebased)
  • I still think the scope(exit) _mm_clflush(mem_addr); line is wrong: streaming loads are looser than regular loads. Fences are needed when you use streaming loads/stores, not when you don't. Think about it (see the sketch after this list).
  • If you have stamina left (or if you didn't leave from the start), you can add a // PERF arm64 or // PERF DMD passive-aggressive comment for the maintainer
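
A minimal sketch of the fencing convention described above, assuming intel-intrinsics' SSE/SSE2 API (streamCopy is a hypothetical example function): the fence pairs with non-temporal stores, while plain loads need neither a fence nor a clflush.

import inteli.emmintrin;  // _mm_loadu_si128, _mm_stream_si128
import inteli.xmmintrin;  // _mm_sfence

// Hypothetical example: copy with non-temporal stores, then fence once so
// the streamed data is globally visible before anything that depends on it.
void streamCopy(__m128i* dst, const(__m128i)* src, size_t n)
{
    foreach (i; 0 .. n)
        _mm_stream_si128(dst + i, _mm_loadu_si128(src + i)); // NT store
    _mm_sfence(); // pairs with the streaming stores above
}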

I hope you stay and keep bringing stuff up.

p0nce commented Oct 18, 2024

LGTM

I would appreciate it if you didn't comment on commits you haven't reviewed, @HeronErin. This literally deletes 6 intrinsics, so it cannot be merged as is.

cetio commented Oct 18, 2024

Will address the issues today, probably. Almost everything I've done with emmintrin was really bad, and presumably I was under a dark wizard's mind control when I wrote it.

Comment out some inline asm stuff due to tests miraculously failing
Add _mm_stream_load_si128nt
Add _mm256_stream_load_si256nt
cetio commented Oct 18, 2024

May or may not have erased the previous 28 commits with a force push, but fixes have been implemented...

I believe the test failures on LDC release builds were due to some weird shenanigans with inline assembly, so I had to comment that out. Unfortunate, but at some point I'll have to look into why that was happening, as it also happened for the _mm256_popcnt unittest but none of the others.

Edit

The weird shenanigans may have been VEX instructions wanting the return symbol as the destination operand, which makes sense in hindsight. This has been fixed alongside a few new changes.

cetio added 2 commits October 19, 2024 03:20
Add _mm256_permute4x64_epi64
Fix the inline asm issue
Mild optimization for bslli & bsrli mask generation
Add some function attributes
p0nce commented Oct 19, 2024

When a test fails with optimization enabled and there is inline assembly, it usually means the assembly was actually wrong and doesn't preserve registers correctly. In many, many cases there is inline IR, a builtin, or a sequence of code that avoids the assembly. And yes, I'm not sure it even works for all x86 targets / combinations of flags.

cetio commented Oct 19, 2024

I avoid writing D's agnostic inline assembly, but if you're aware of a case in which something like

cast(__m256i)__asm!(long4)("
    vpermq $2, $1, $0"
, "=v,v,n", a, IMM8);

won't generate properly on LDC with AVX2, then I'll sink some hours into finding a higher-level way to do it, presumably with shufflevector. But personally I'm unaware of such a case, and since it's LLVM I would imagine it should always generate properly.

The problem with the failing unittests is fixed; I'm guessing it was because optimizations were leading to the first operand being clobbered.

p0nce commented Oct 19, 2024

Yes, saw the inline asm changing! It will probably be ok.

p0nce commented Oct 19, 2024

Your avx2intrin.d changes have disappeared from the PR, is this intentional?

cetio commented Oct 19, 2024

Commit history was wiped because I force-pushed to master, but these changes are in effect:

  1. Add flags for LDC AVX512 and files avx512intrin.d vpopcntdqintrin.d vnniintrin.d
  2. Make _mm256_setr_m128* and _mm256_set1_epi64x pure
  3. Add _mm256_shuffle_epi8
  4. Add _mm256_bslli_epi128
  5. Add _mm256_bsrli_epi128
  6. Add _mm256_slli_epi128
  7. Add _mm256_srli_epi128
  8. Add _mm_sllv_epi32
  9. Add _mm_sllv_epi64
  10. Add _mm_srlv_epi32
  11. Add _mm_srlv_epi64
  12. Add _mm256_shuffle_epi32
  13. Add _mm256_shufflehi_epi16
  14. Add _mm256_shufflelo_epi16
  15. Add _mm256_popcnt_epi32
  16. Add _mm256_popcnt_epi64
  17. Add _mm256_popcnt (bonus)
  18. Add _mm256_permute4x64_epi64
  19. Add _mm_dpbusd_epi32
  20. Add _mm_stream_load_si128nt (bonus)
  21. Add _mm256_stream_load_si256nt (bonus)
  22. Add _mm256_cvtepi32lo_epi16 (bonus)
  23. Add _mm_dpbusds_epi32
  24. Add _mm_adds_epi32 (bonus)

I've also:

  1. Done a little optimization
  2. Added #BONUS tags, like you said, for sequences that aren't actually intrinsics
  3. Reverted anything that isn't on this list, like with emmintrin
  4. Added some attributes like pure when possible (some AVX512 stuff could be marked a little more strictly)
  5. Made sure changes from upstream aren't lost (as far as I'm aware)
  6. Added more comments about performance and whatnot for clarity
  7. Made miscellaneous fixes, like the unittest failures on release LDC

p0nce commented Oct 19, 2024

Ah yes, my bad. I'm working on something else and will review/merge in the coming week; please hold on.

cetio added 2 commits October 20, 2024 23:26
Add `_mm_adds_epi32` (bonus)
Add `_mm_dpbusds_epi32`
Move AVX512 to a new folder containing the feature intrinsics
Make AVX512 intrinsics marked `nothrow` and `@nogc`
Remove some comments that are no longer relevant
p0nce commented Oct 22, 2024

OK, this is merge day. This will be merged piece by piece on master; it's easier to review and change that way. Hence this PR will not get pulled as is, but the content should be about the same.

EDIT: I'm sorry this stuff makes me angry

p0nce commented Oct 22, 2024

// NOTE Why is this not const(**) like _mm256_stream_load_si256?

In this case, Intel has gone and changed the signature to void* since we implemented it, so we're also going with void*, even though it should be const(void)*.
They also added _mm_load_si64 to fix _mm_loadu_epi64's weird signature.

p0nce commented Oct 22, 2024

/// #BONUS
__m128i _mm_adds_epi32(__m128i a, __m128i b) pure
{
    // PERF: ARM64 should use 2x vqadd_s32
    static if (LDC_with_saturated_intrinsics)
        return cast(__m128i)inteli_llvm_adds!int4(cast(int4)a, cast(int4)b);
    else
    {
        __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);
        __m128i res = _mm_add_epi32(a, b);
        __m128i sign_bit = _mm_srli_epi32(a, 31);
        __m128i sign_xor  = _mm_xor_si128(a, b);
        __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a, res));
        __m128i saturated = _mm_add_epi32(int_max, sign_bit);
        return cast(__m128i) _mm_blendv_ps(cast(__m128)res,  // No CT check here
            cast(__m128)saturated, 
            cast(__m128)overflow);
    }
}

Note: you can use any intrinsics you want, provided that you use the same instruction set or earlier to implement later intrinsics. Because intel-intrinsics guarantees that each intrinsic is as fast as possible whatever the arch and flags, this makes a directed graph of optimal intrinsics. In this case, you can just use _mm_blendv_ps without worrying about whether SSE4.1 is there or not (mostly, because sometimes there isn't a simple match either, and inlining needs to be there). All intrinsics are literally always available.
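
To make the saturation semantics concrete, a hypothetical unittest sketch for the _mm_adds_epi32 above:

unittest
{
    // int.max + 1 saturates instead of wrapping around to int.min.
    __m128i r = _mm_adds_epi32(_mm_set1_epi32(int.max), _mm_set1_epi32(1));
    assert(r.array[0] == int.max);

    // int.min + (-1) saturates at the negative end.
    r = _mm_adds_epi32(_mm_set1_epi32(int.min), _mm_set1_epi32(-1));
    assert(r.array[0] == int.min);
}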

p0nce commented Oct 22, 2024

Opened #145 to keep track of all remaining review and merging; it's very detailed work, as you've seen.

p0nce commented Oct 22, 2024

// PERF This is almost definitely not the best way to do this.
// Don't quote me on this but I'm pretty sure that there isn't a need to add extra
// code for obvious things like CNT == 8 zeroing half of each lane or whatever because
// shuffle should be able to complete fast enough that whatever optimizations will likely
// lead to negligible performance benefit.

This is a static if.
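
In other words, the branch is resolved at compile time, so the special case costs nothing at run time. A minimal hypothetical sketch of the pattern:

import inteli.emmintrin;  // _mm_slli_si128, _mm_setzero_si128

// Hypothetical sketch: `static if` picks a branch during compilation, so
// the CNT >= 16 special case never generates any runtime check.
__m128i bslli_sketch(ubyte CNT)(__m128i a) pure
{
    static if (CNT >= 16)
        return _mm_setzero_si128(); // every byte shifted out
    else
        return _mm_slli_si128!CNT(a);
}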

p0nce commented Oct 22, 2024

auto hi = _mm_slli_si128!CNT(_mm256_extractf128_si256!0(a));
auto lo = _mm_slli_si128!CNT(_mm256_extractf128_si256!1(a));
return _mm256_setr_m128i(hi, lo);

Beware the double inversion here:

  • _mm256_setr_m128i takes the low lane first, then the high lane
  • hi is extracted as the low lane and lo as the high lane, which is why it works anyway
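
For comparison, the same fragment with the names matching the lanes (identical behavior; purely a readability sketch):

auto lo = _mm_slli_si128!CNT(_mm256_extractf128_si256!0(a)); // low 128-bit lane
auto hi = _mm_slli_si128!CNT(_mm256_extractf128_si256!1(a)); // high 128-bit lane
return _mm256_setr_m128i(lo, hi); // setr takes the low lane first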

p0nce commented Oct 22, 2024

When you don't know how an intrinsic should be implemented in LDC, you can look at: https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avx2intrin.h

For example here: _mm256_bslli_epi128 uses a builtin named __builtin_ia32_pslldqi256_byteshift, which we do not have in D. However, it's sometimes possible to find its trace in LLVM in this file: https://github.com/ldc-developers/llvm-project/blob/ldc-release/18.x/llvm/include/llvm/IR/IntrinsicsX86.td (in which case it would be available with a pragma intrinsic). Here there is nothing, so the instruction is probably available with shufflevector in LDC and a builtin in GDC.
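
As an illustration of the shufflevector route (a hedged sketch assuming LDC; shown on _mm256_permute4x64_epi64 rather than the byte shift, since its lane mapping is simpler, and the helper name is hypothetical):

import ldc.simd : shufflevector;
import inteli.types; // __m256i, long4

// Hypothetical sketch: decode IMM8 into compile-time lane indices and let
// LLVM's shufflevector select the instruction (vpermq on AVX2 targets).
__m256i permute4x64_sketch(int imm8)(__m256i a)
{
    return cast(__m256i) shufflevector!(long4,
        (imm8 >> 0) & 3,
        (imm8 >> 2) & 3,
        (imm8 >> 4) & 3,
        (imm8 >> 6) & 3)(cast(long4)a, cast(long4)a);
}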

cetio commented Oct 22, 2024

(Quoting the _mm_adds_epi32 example and note above.)

The reason I did the static if is that I'd rather have better control over the operations so I can fine-tune optimization, and it also makes it simpler to add AVX512 optimizations later; it's not that I was worried about being unable to access an intrinsic. It's just not always best to do broad operations when you could specialize based on hardware/flags instead.

cetio commented Oct 22, 2024

Yeah, I figured shufflevector could be used almost everywhere I used inline asm, which would probably simplify things and make outputs more reliable. I didn't implement it that way because I figured it was simpler to just use assembly, as it should always output fine given the proper flags, and I try to make sure the slow path is optimal. I should have implemented GDC builtins, though.

p0nce commented Oct 22, 2024

Absolutely.
Yes, I think the pros and cons are:

pros of LLVM asm:

cons of LLVM asm:

  • inline IR like shufflevector works for any arch
  • almost always optimal; I never had a case where it was inferior
  • not sure about the VEX effect for intrinsics that can be AVX or non-AVX

p0nce commented Oct 22, 2024

That one is interesting.

The _mm_srlv_xxx and _mm_sllv_xxx are wrong because the instruction and intrinsic have a defined semantic for shifts larger than or equal to the bitness (Intel Intrinsics Guide screenshot omitted: per-lane counts >= 32 produce 0).

So you could shift by, say, 78 bits.

However when implemented:

__m128i _mm_sllv_epi32(__m128i a, __m128i b) pure @trusted
{
    static if (GDC_with_AVX2 || LDC_with_AVX2)
        return cast(__m128i)__builtin_ia32_psllv4si(cast(int4)a, cast(int4)b);
    else
    {
        return _mm_setr_epi32(
            a[0] << b[0],
            a[1] << b[1],
            a[2] << b[2],
            a[3] << b[3]
        );
    }
}

it uses the << operator, which is UB in D when the shift count is greater than 31 (screenshot of the spec omitted).
And indeed the results will differ on x86 vs ARM.
So we have to make this one slower to imitate the instruction semantics.

And indeed look at: https://github.com/simd-everywhere/simde/blob/master/simde/x86/avx2.h#L5009
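
A sketch of such a semantics-correct fallback (hypothetical helper name; assumes intel-intrinsics' int4 lane access via .array):

import inteli.emmintrin;
import inteli.types; // int4

// Hypothetical sketch: imitate VPSLLVD exactly, so any per-lane count >= 32
// yields 0 instead of hitting D's UB for oversized shifts.
__m128i sllv_epi32_fixed(__m128i a, __m128i count) pure
{
    int4 r;
    foreach (i; 0 .. 4)
    {
        uint c = cast(uint) count.array[i];
        r.array[i] = (c < 32) ? (a.array[i] << c) : 0;
    }
    return cast(__m128i) r;
}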

p0nce commented Oct 26, 2024

Done.

@p0nce p0nce closed this Oct 26, 2024