
[AMDGPU] Support merging 16-bit TBUFFER load/store instruction #145078


Open
harrisonGPU wants to merge 1 commit into main

Conversation

harrisonGPU
Contributor

SILoadStoreOptimizer can now recognise consecutive 16-bit
TBUFFER_LOAD/TBUFFER_STORE instructions that each access

  • a single component (X), or
  • two components (XY),

and fold them into the wider native variants:

X  +  X                 -->  XY
X  +  X  +  X  +  X     -->  XYZW
XY +  XY                -->  XYZW

The optimisation cuts the number of TBUFFER instructions, shrinking code
size and improving memory throughput.
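
For illustration only, the pairing decision reduces to an element-index adjacency test on the two byte offsets. The sketch below is a simplified, standalone version of that check, not the actual SILoadStoreOptimizer code; the helper name canMerge16BitTBuffer is made up for this sketch, and it assumes both accesses use the same 16-bit-per-component format:

#include <cassert>

// Simplified stand-in for the back-to-back test this patch adds to
// offsetsCanBeCombined(). Offsets are in bytes, widths in components.
static bool canMerge16BitTBuffer(unsigned Offset0, unsigned Width0,
                                 unsigned Offset1, unsigned Width1) {
  const unsigned BytesPerComp = 2;           // 16-bit components
  if (Offset0 % BytesPerComp || Offset1 % BytesPerComp)
    return false;                            // offsets must stay element-aligned
  if (Width0 + Width1 > 4)
    return false;                            // at most XYZW after merging
  unsigned Elem0 = Offset0 / BytesPerComp;
  unsigned Elem1 = Offset1 / BytesPerComp;
  // The two element ranges must touch exactly, in either order.
  return Elem0 + Width0 == Elem1 || Elem1 + Width1 == Elem0;
}

int main() {
  assert(canMerge16BitTBuffer(0, 1, 2, 1));  // X@0  + X@2  -> XY
  assert(canMerge16BitTBuffer(0, 2, 4, 2));  // XY@0 + XY@4 -> XYZW
  assert(!canMerge16BitTBuffer(0, 1, 4, 1)); // one-element gap: no merge
}

The real pass additionally requires matching cache policy and a buffer format that actually exists for the combined component count (via getBufferFormatWithCompCount), as shown in the diff below.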

@llvmbot
Member

llvmbot commented Jun 20, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

(Same description as above.)

Patch is 39.91 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/145078.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp (+32-6)
  • (modified) llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir (+455)
diff --git a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
index b0d6fd95cd271..83dbad9a1ba20 100644
--- a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
@@ -1040,8 +1040,21 @@ bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
   if (CI.Offset == Paired.Offset)
     return false;
 
+  // Use 2-byte element size if both tbuffer formats are 16-bit.
+  unsigned EltSize = CI.EltSize;
+  auto Has16BitComponents = [&](unsigned Format) -> bool {
+    const auto *Info = AMDGPU::getGcnBufferFormatInfo(Format, STI);
+    return Info && Info->BitsPerComp == 16;
+  };
+
+  if ((CI.InstClass == TBUFFER_LOAD || CI.InstClass == TBUFFER_STORE)) {
+    // TODO: Support merging 8-bit tbuffer load/store instructions
+    if (Has16BitComponents(CI.Format) && Has16BitComponents(Paired.Format))
+      EltSize = 2;
+  }
+
   // This won't be valid if the offset isn't aligned.
-  if ((CI.Offset % CI.EltSize != 0) || (Paired.Offset % CI.EltSize != 0))
+  if ((CI.Offset % EltSize != 0) || (Paired.Offset % EltSize != 0))
     return false;
 
   if (CI.InstClass == TBUFFER_LOAD || CI.InstClass == TBUFFER_STORE) {
@@ -1059,13 +1072,26 @@ bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
         Info0->NumFormat != Info1->NumFormat)
       return false;
 
-    // TODO: Should be possible to support more formats, but if format loads
-    // are not dword-aligned, the merged load might not be valid.
-    if (Info0->BitsPerComp != 32)
+    // Buffer instructions support up to 4 components per access (e.g., x, xy,
+    // xyz, xyzw).
+    unsigned NumCombinedComponents = CI.Width + Paired.Width;
+    if (NumCombinedComponents > 4)
       return false;
 
-    if (getBufferFormatWithCompCount(CI.Format, CI.Width + Paired.Width, STI) == 0)
+    if (getBufferFormatWithCompCount(CI.Format, NumCombinedComponents, STI) ==
+        0)
       return false;
+
+    // Merge only when the two access ranges are strictly back-to-back;
+    // any gap or overlap could overwrite data or leave holes.
+    unsigned BytePerComp = Info0->BitsPerComp / 8;
+    unsigned ElemIndex0 = CI.Offset / BytePerComp;
+    unsigned ElemIndex1 = Paired.Offset / BytePerComp;
+    if (!(ElemIndex0 + CI.Width == ElemIndex1 ||
+          ElemIndex1 + Paired.Width == ElemIndex0))
+      return false;
+
+    return true;
   }
 
   uint32_t EltOffset0 = CI.Offset / CI.EltSize;
@@ -1076,7 +1102,7 @@ bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
   // Handle all non-DS instructions.
   if ((CI.InstClass != DS_READ) && (CI.InstClass != DS_WRITE)) {
     if (EltOffset0 + CI.Width != EltOffset1 &&
-            EltOffset1 + Paired.Width != EltOffset0)
+        EltOffset1 + Paired.Width != EltOffset0)
       return false;
     if (CI.CPol != Paired.CPol)
       return false;
diff --git a/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir b/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir
index 9766b427b4325..4a604513e9bbe 100644
--- a/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir
+++ b/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir
@@ -8706,3 +8706,458 @@ body:             |
     %8:vgpr_32 = TBUFFER_LOAD_FORMAT_X_BOTHEN_exact %4, %5:sgpr_128, 0, 8, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
     %9:vgpr_32 = TBUFFER_LOAD_FORMAT_X_BOTHEN_exact %4, %5:sgpr_128, 0, 12, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
 ...
+---
+
+name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+body: |
+  bb.0.entry:
+    liveins: $sgpr0,$sgpr1,$sgpr2,$sgpr3,$vgpr0
+    ; GFX9-LABEL: name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+    ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX9-NEXT: {{  $}}
+    ; GFX9-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; GFX9-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX9-NEXT: %x0:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+    ; GFX9-NEXT: %x1:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+    ; GFX9-NEXT: %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+    ;
+    ; GFX10-LABEL: name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+    ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX10-NEXT: {{  $}}
+    ; GFX10-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; GFX10-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XY_IDXEN:%[0-9]+]]:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN [[COPY]], %rsrc, 0, 0, 29, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 8)
+    ; GFX10-NEXT: %x0:vgpr_32 = COPY [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub0
+    ; GFX10-NEXT: %x1:vgpr_32 = COPY killed [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub1
+    ; GFX10-NEXT: %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+    ;
+    ; GFX11-LABEL: name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+    ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; GFX11-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XY_IDXEN:%[0-9]+]]:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN [[COPY]], %rsrc, 0, 0, 29, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 8)
+    ; GFX11-NEXT: %x0:vgpr_32 = COPY [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub0
+    ; GFX11-NEXT: %x1:vgpr_32 = COPY killed [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub1
+    ; GFX11-NEXT: %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+    %0:vgpr_32 = COPY $vgpr0
+    %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0,%subreg.sub0,$sgpr1,%subreg.sub1,$sgpr2,%subreg.sub2,$sgpr3,%subreg.sub3
+    %x0:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %0, %rsrc, 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s16),align 2,addrspace 8)
+    %x1:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %0, %rsrc, 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s16),align 2,addrspace 8)
+    %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %0, %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16),align 2,addrspace 8)
+...
+---
+
+name: gfx11_tbuffer_load_idxen_16_bit
+body: |
+  bb.0.entry:
+    liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+    ; GFX9-LABEL: name: gfx11_tbuffer_load_idxen_16_bit
+    ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+    ; GFX9-NEXT: {{  $}}
+    ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+    ; GFX9-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+    ; GFX9-NEXT: [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+    ; GFX9-NEXT: [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+    ; GFX9-NEXT: [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+    ; GFX9-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+    ; GFX9-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN1:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN2:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN3:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 6, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN4:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 16, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN5:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 18, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN6:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 20, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN7:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 22, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN8:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ;
+    ; GFX10-LABEL: name: gfx11_tbuffer_load_idxen_16_bit
+    ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+    ; GFX10-NEXT: {{  $}}
+    ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+    ; GFX10-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+    ; GFX10-NEXT: [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+    ; GFX10-NEXT: [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+    ; GFX10-NEXT: [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+    ; GFX10-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+    ; GFX10-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 0, 71, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+    ; GFX10-NEXT: [[COPY6:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX10-NEXT: [[COPY7:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    ; GFX10-NEXT: [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY6]].sub0
+    ; GFX10-NEXT: [[COPY9:%[0-9]+]]:vgpr_32 = COPY killed [[COPY6]].sub1
+    ; GFX10-NEXT: [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[COPY7]].sub0
+    ; GFX10-NEXT: [[COPY11:%[0-9]+]]:vgpr_32 = COPY killed [[COPY7]].sub1
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 16, 71, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+    ; GFX10-NEXT: [[COPY12:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub0_sub1
+    ; GFX10-NEXT: [[COPY13:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub2_sub3
+    ; GFX10-NEXT: [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY12]].sub0
+    ; GFX10-NEXT: [[COPY15:%[0-9]+]]:vgpr_32 = COPY killed [[COPY12]].sub1
+    ; GFX10-NEXT: [[COPY16:%[0-9]+]]:vgpr_32 = COPY [[COPY13]].sub0
+    ; GFX10-NEXT: [[COPY17:%[0-9]+]]:vgpr_32 = COPY killed [[COPY13]].sub1
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ;
+    ; GFX11-LABEL: name: gfx11_tbuffer_load_idxen_16_bit
+    ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+    ; GFX11-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+    ; GFX11-NEXT: [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+    ; GFX11-NEXT: [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+    ; GFX11-NEXT: [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+    ; GFX11-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+    ; GFX11-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 0, 57, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+    ; GFX11-NEXT: [[COPY6:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX11-NEXT: [[COPY7:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    ; GFX11-NEXT: [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY6]].sub0
+    ; GFX11-NEXT: [[COPY9:%[0-9]+]]:vgpr_32 = COPY killed [[COPY6]].sub1
+    ; GFX11-NEXT: [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[COPY7]].sub0
+    ; GFX11-NEXT: [[COPY11:%[0-9]+]]:vgpr_32 = COPY killed [[COPY7]].sub1
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 16, 57, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+    ; GFX11-NEXT: [[COPY12:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub0_sub1
+    ; GFX11-NEXT: [[COPY13:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub2_sub3
+    ; GFX11-NEXT: [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY12]].sub0
+    ; GFX11-NEXT: [[COPY15:%[0-9]+]]:vgpr_32 = COPY killed [[COPY12]].sub1
+    ; GFX11-NEXT: [[COPY16:%[0-9]+]]:vgpr_32 = COPY [[COPY13]].sub0
+    ; GFX11-NEXT: [[COPY17:%[0-9]+]]:vgpr_32 = COPY killed [[COPY13]].sub1
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %4:sgpr_32 = COPY $sgpr4
+    %3:sgpr_32 = COPY $sgpr3
+    %2:sgpr_32 = COPY $sgpr2
+    %1:sgpr_32 = COPY $sgpr1
+    %0:sgpr_32 = COPY $sgpr0
+    %5:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3
+    %8:vgpr_32 = COPY %4:sgpr_32
+    %7:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %9:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %11:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %13:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 6, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %15:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 16, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %17:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 18, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %19:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 20, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %21:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 22, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %22:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+...
+---
+
+name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+body: |
+  bb.0.entry:
+    liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX9-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+    ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX9-NEXT: {{  $}}
+    ; GFX9-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX9-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX9-NEXT: %v0:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN %idx, %rsrc, 0, 0, 27, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 4)
+    ; GFX9-NEXT: %v1:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN %idx, %rsrc, 0, 4, 27, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 4)
+    ;
+    ; GFX10-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+    ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX10-NEXT: {{  $}}
+    ; GFX10-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX10-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 69, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+    ; GFX10-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX10-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    ;
+    ; GFX11-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+    ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX11-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 55, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+    ; GFX11-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX11-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1,%subreg.sub1, $sgpr2,%subreg.sub2, $sgpr3,%subreg.sub3
+    %idx:vgpr_32 = COPY $vgpr0
+    %v0:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN  %idx, %rsrc, 0, 0, 27, 0, 0, implicit $exec :: (dereferenceable load (s32),align 2,addrspace 4)
+    %v1:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN  %idx, %rsrc, 0, 4, 27, 0, 0, implicit $exec :: (dereferenceable load (s32),align 2,addrspace 4)
+...
+---
+
+name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+body: |
+  bb.0.entry:
+    liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX9-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+    ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX9-NEXT: {{  $}}
+    ; GFX9-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX9-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 28, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+    ; GFX9-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX9-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    ;
+    ; GFX10-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+    ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX10-NEXT: {{  $}}
+    ; GFX10-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX10-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 70, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+    ; GFX10-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX10-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    ;
+    ; GFX11-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+    ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX11-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, ...
[truncated]

@harrisonGPU harrisonGPU requested a review from shiltian June 20, 2025 17:37
@harrisonGPU harrisonGPU self-assigned this Jun 20, 2025