Add U8 copy operation for K16 MMA #374

Open · wants to merge 27 commits into base: sycl-develop

Conversation

aacostadiaz (Collaborator)

This PR adds the U8 copy operation that works correctly with the K16 MMA for FP8 GEMM and mixed-dtype GEMM.

sanchitintel commented May 21, 2025

With FP8xFP8 GEMM, the following config didn't work, although the corresponding config does work for FP16xFP16 GEMM:

  using GmemTiledCopyA = XE_2D_U8x32x32_LD_N;
  using GmemTiledCopyB = XE_2D_U8x32x32_LD_V;

  using TileShape = Shape<_64, _256, _32>;

  using TiledMma =
      typename TiledMMAHelper<MMA_Atom<XE_8x16x16_F32F16F16F32_TT>, Layout<TileShape>,
      Layout<Shape<_2, _8, _1>, Stride<_8, _1, _0>>>::TiledMMA;

The compile-time error was:

include/cute/atom/copy_traits_xe.hpp:78:19: error: static assertion failed due to requirement 'size(cute::Layout<cute::tuple<cute::C<16>, cute::C<8>>, cute::tuple<cute::C<0>, cute::C<1>>>{}) % size(cute::tuple<cute::C<8>, cute::C<64>>{}) == 0'
   78 |     static_assert(size(LayoutIn{}) % size(BlockShape{}) == 0);

It seems to be a bug since the shapes are correct.
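
For reference, the failing check reduces to a size-divisibility test: with the types printed in the message, size(LayoutIn) is 16 * 8 = 128 while size(BlockShape) is 8 * 64 = 512, and 128 is not divisible by 512. Here is a standalone sketch of the same comparison (a minimal reproduction, assuming the CuTe headers are available; the *Repro aliases are illustrative names, not library types):

  // Sketch, not the library code: reproduce the size check from
  // copy_traits_xe.hpp:78 with the types printed in the error message.
  #include <cute/tensor.hpp>
  using namespace cute;

  using LayoutInRepro   = Layout<Shape<_16, _8>, Stride<_0, _1>>;  // size = 16 * 8 = 128
  using BlockShapeRepro = Shape<_8, _64>;                          // size = 8 * 64 = 512

  static_assert(size(LayoutInRepro{})   == 128, "matches the LayoutIn in the error");
  static_assert(size(BlockShapeRepro{}) == 512, "matches the BlockShape in the error");
  // 128 % 512 == 128, so the library's requirement
  //   size(LayoutIn{}) % size(BlockShape{}) == 0
  // fails exactly as reported above.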

Thanks!

…ked-copy

# Conflicts:
#	CMakeLists.txt
#	include/cute/arch/copy_xe_U16.hpp
#	include/cute/arch/copy_xe_U32.hpp
#	include/cute/arch/copy_xe_U4.hpp
#	include/cute/arch/copy_xe_U64.hpp
#	include/cute/arch/copy_xe_U8.hpp
#	include/cute/arch/copy_xe_builtin.hpp
#	include/cute/arch/copy_xe_spirv.hpp
#	include/cutlass/epilogue/collective/xe_epilogue.hpp
aacostadiaz removed the incremental (Incremental changes) label on May 27, 2025
Comment on lines +217 to +228
struct XE_2D_U8x32x32_LD_N {
  using BlockShape = Shape<_32, _32>;

  template <class T>
  CUTE_HOST_DEVICE static void copy(const void *baseoffset, int width,
                                    int height, int pitch, intel::coord_t coord,
                                    T *dst) {
#if defined(CUTE_ARCH_COPY_XE_ENABLED)
    static_assert(sizeof(T) == 1, "Expected T to have size 1");
    // detail::XeSubgroup2DBlockLoad<1, 16, 32, 2>{}(baseoffset, width, height, pitch, coord, dst);
    // Use the transform (VNNI) version as it provides better performance when loading the A matrix for
    // GEMM FP8 and GEMM mixed-precision types.

sanchitintel commented May 27, 2025

Hi @aacostadiaz, can you please elaborate on why loading A in VNNI format is faster? I assume it's later being converted back to plain layout, since the output is correct, so that layout conversion should have had some overhead.

Did you make this change on the basis of an empirical observation of it being faster, or is there any reason why this approach should be expected to perform better? Thanks!

BTW, the DstLayout in the atom traits for this copy atom is Layout<Shape<_16, Shape<_8, _2, _32>>, Stride<_16, Stride<_1, _128, _256>>>, which seems to correspond to a plain layout. So, does this mean that when the data is copied, it is first transformed into the VNNI layout before being written to the registers, and is later converted to DstLayout somehow? If yes, can you please point out which part of the code handles that?
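
For anyone who wants to poke at this, here is a quick host-side sketch (assuming the CuTe headers are available; the variable names are just for illustration) that prints that DstLayout and evaluates it at a few coordinates, which makes the element ordering easy to see:

  #include <cute/tensor.hpp>
  #include <cstdio>

  int main() {
    using namespace cute;
    // The DstLayout quoted above, from the copy atom's traits.
    auto dst = Layout<Shape<_16, Shape<_8, _2, _32>>,
                      Stride<_16, Stride<_1, _128, _256>>>{};
    print(dst); printf("\n");   // prints the layout as shape:stride

    // Assuming the usual (thread, value) ordering of the modes, these are the
    // element offsets held by lane 0: 0..7, then 128..131, and so on.
    for (int v = 0; v < 12; ++v) {
      printf("lane 0, value %2d -> offset %3d\n", v, int(dst(0, v)));
    }
    return 0;
  }

The inner strides (_1, _128, _256) put consecutive values at consecutive element offsets, which is consistent with the plain (non-VNNI) arrangement described above.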

Also, I don't see any shfl-based instructions in the generated assembly dump. Is it possible that the shuffle (for the VNNI -> plain layout conversion) isn't happening directly lane registers -> lane registers (I understand this isn't possible on Nvidia GPUs, but it appears to be possible on Intel GPUs based on the documentation), but instead goes lane registers -> shared local memory -> lane registers?

Thanks!

cc @pengzhao-intel @yuankuns
