Rocm jaxlib v0.5.0 warpsize #169
base: rocm-jaxlib-v0.5.0
Conversation
Force-pushed from 971541b to 7d58776.
I wonder if the test MultiOutputFusionTest.MultiOutputReduceFusionMajorWithExtraOutput would still fail with this warp size config?
```diff
-      analysis, /*minor_dim=*/input_shape_.back(), WarpSize());
       int64_t num_warps_per_column = WarpSize();
       num_threads_ = {num_warps_per_column, WarpSize()};
+      analysis, /*minor_dim=*/input_shape_.back(), kTileSize);
```
So we need to change the tile size to 32 here instead of WarpSize(device_info)? May I ask why?
This is temporary; I believe the reduction algorithm needs modifications in order to work with warp_size == 64. Without this, some tests fail.
Yes, I also did not find a good solution here. This only applies to column-wise reductions.
They work as follows: one block of 1024 threads (32x32) performs a column reduction for one vertical stripe of N rows and 32 columns. Basically, each warp loads and reduces N/32 rows (each having 32 elements) and writes its resulting reduced row to shared memory. As a result, we have 32 rows of 32 elements in shared memory.
After that, we do a syncthreads, and each warp reads one vertical column from shared memory and performs a warp-level reduction on it. Finally, each warp writes its single reduced element back to global memory. As a result, the Nx32 stripe is reduced to a 1x32 row. A standalone sketch of this scheme follows below.
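To make the scheme easier to follow, here is a minimal standalone CUDA sketch of it (a sum reduction with warp_size == 32). This is illustrative only, not the XLA reduction emitter: the kernel name, kWarpSize, and the flat row-major stripe layout are assumptions for the example.

```cuda
// Illustrative sketch of the column-reduction scheme described above.
// One 32x32 block reduces a (num_rows x 32) stripe to a single 1x32 row.
constexpr int kWarpSize = 32;

__global__ void column_reduce(const float* input, float* output,
                              int num_rows) {
  // +1 column of padding keeps the transposed reads in phase 2 conflict-free.
  __shared__ float smem[kWarpSize][kWarpSize + 1];

  const int col = threadIdx.x;   // column within the 32-wide stripe
  const int warp = threadIdx.y;  // warp id, 0..31

  // Phase 1: each warp strides over num_rows / 32 rows and accumulates them
  // element-wise, producing one partial row of 32 elements per warp.
  float acc = 0.0f;
  for (int row = warp; row < num_rows; row += kWarpSize) {
    acc += input[row * kWarpSize + col];
  }
  smem[warp][col] = acc;  // 32 partial rows of 32 elements in shared memory
  __syncthreads();

  // Phase 2: warp w reads vertical column w of the 32x32 tile and reduces it
  // with warp shuffles; lane 0 writes the single reduced element.
  float v = smem[col][warp];  // transposed read: lane `col` reads row `col`
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
    v += __shfl_down_sync(0xffffffffu, v, offset);
  }
  if (col == 0) {
    output[warp] = v;  // the Nx32 stripe is now a 1x32 row
  }
}
// Launched as: column_reduce<<<1, dim3(32, 32)>>>(input, output, num_rows);
// the real emitter tiles many such stripes across blocks.
```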
To make it work for warp_size = 64, we could have 16 warps (16 * 64 = 1024 threads) processing one vertical stripe of N rows and 64 columns. But each warp would then process N/16 rows and perform 4 writes to shared memory (instead of 1). As a result, we would have one large shared-memory array of size 64x64 to be transposed. But I don't have a clear idea how to express this in terms of the indexing maps used in the reduction emitter.
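As a hedged sketch of how that warp_size == 64 variant could look (HIP-flavored for AMD's 64-lane wavefronts; all names are hypothetical, and this deliberately sidesteps the open indexing-map question, showing only the 16-wave / 4-writes shape):

```cuda
#include <hip/hip_runtime.h>

// Speculative sketch of the proposed warp_size == 64 scheme: 16 wavefronts
// of 64 lanes reduce a (num_rows x 64) stripe to a single 1x64 row.
constexpr int kWaveSize = 64;
constexpr int kNumWaves = 16;  // 16 * 64 = 1024 threads per block

__global__ void column_reduce_w64(const float* input, float* output,
                                  int num_rows) {
  // 64 partial rows of 64 elements; each of the 16 waves owns 4 of them.
  __shared__ float smem[kWaveSize][kWaveSize + 1];

  const int col = threadIdx.x;   // 0..63, column within the 64-wide stripe
  const int wave = threadIdx.y;  // 0..15

  // Phase 1: each wave strides over num_rows / 16 rows, round-robining the
  // accumulation over 4 slots so the block produces 64 partial rows in total.
  float acc[4] = {0.f, 0.f, 0.f, 0.f};
  int slot = 0;
  for (int row = wave; row < num_rows; row += kNumWaves) {
    acc[slot] += input[row * kWaveSize + col];
    slot = (slot + 1) & 3;
  }
  for (int i = 0; i < 4; ++i) {
    smem[wave * 4 + i][col] = acc[i];  // 4 shared-memory writes instead of 1
  }
  __syncthreads();

  // Phase 2: each wave reduces 4 columns of the 64x64 tile, one at a time.
  for (int i = 0; i < 4; ++i) {
    const int c = wave * 4 + i;
    float v = smem[col][c];
    for (int offset = kWaveSize / 2; offset > 0; offset /= 2) {
      v += __shfl_down(v, offset, kWaveSize);
    }
    if (col == 0) output[c] = v;
  }
}
```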
```diff
@@ -87,7 +88,7 @@ TransposeFusion::TransposeFusion(const HloFusionAnalysis& analysis)
       permutation_(transpose_.permutation),
       input_shape_(
           Permute(transpose_.dimensions, InversePermutation(permutation_))),
-      base_block_size_(WarpSize(analysis_.device_info())) {
+      base_block_size_(kTileSize) {
```
ditto
@zoranjovanovic-ns WDYT about this PR: #170
There are a number of #ifdefs that we cannot upstream, and it has the same issue with reduce as this PR (the reduction algorithm probably needs to be modified), but if it fixes more tests then we can use it as a temporary solution.
No description provided.