@alihassanijr
Contributor

Bugs fixed:

  1. When a preferred cluster is used, there needs to be a branch so that the universal GEMM wrapper finds the correct base params.
  2. On Blackwell, workspace sizes can change depending on the problem shape, and DistGEMM was previously using the per-device shape instead of the per-GEMM shape to evaluate the workspace size.
  3. The flattened size used to initialize host tensors can overflow (this affects the Hopper example as well).
  4. Preferred and fallback cluster args need to be set explicitly; otherwise, if someone modifies the example to use a preferred cluster, it will just fail.
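
To illustrate item 3, here is a minimal sketch of how a flattened element count can exceed a 32-bit `int`, and how widening to 64 bits before multiplying avoids it. The helper names (`flat_size`, `fits_in_int32`, `checked_flat_size`) are hypothetical and not taken from the PR; the actual fix in the examples may look different. The 16384 x 106496 extents are the ones from the test runs in this thread; the batch count of 2 is an assumption chosen only to show the overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the PR): flattened element count of a
// rows x cols x batch tensor, widened to 64 bits before multiplying.
constexpr int64_t flat_size(int rows, int cols, int batch) {
  return static_cast<int64_t>(rows) * cols * batch;
}

// True when the count would still fit in a signed 32-bit int.
constexpr bool fits_in_int32(int64_t count) {
  return count <= std::numeric_limits<int32_t>::max();
}

// Runtime-checked variant that rejects negative extents.
inline int64_t checked_flat_size(int rows, int cols, int batch) {
  assert(rows >= 0 && cols >= 0 && batch >= 0);
  return flat_size(rows, cols, batch);
}

// 16384 x 106496 is 13 * 2^27 elements, which still fits in int32.
static_assert(flat_size(16384, 106496, 1) == 1744830464LL, "unexpected count");
static_assert(fits_in_int32(flat_size(16384, 106496, 1)), "should fit");
// Doubling it (assumed batch of 2) no longer fits in int32.
static_assert(!fits_in_int32(flat_size(16384, 106496, 2)), "should not fit");
```

The point of the sketch is only that the cast has to happen before the multiplication; casting the already-multiplied `int` result would be too late.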

@alihassanijr
Contributor Author

Will try to check #2696 as well while this is open.

@alihassanijr alihassanijr force-pushed the dist-gemm-blackwell-fixes branch from aca3c0b to acc7b79 Compare October 22, 2025 17:20
@alihassanijr
Contributor Author

acc7b79 should address #2696 as well.

@alihassanijr
Contributor Author

alihassanijr commented Oct 22, 2025

Tested on GB200:

  TP: 4
  Problem Size: 16384 x 106496 x 16384 x 1
  Local GEMM Problem Size: 4096 x 26624 x 16384 x 1
  Avg runtime: 4.80512 ms
  TFLOPS: 2974.67

B200 (DGX):

  TP: 8
  Problem Size: 16384 x 106496 x 16384 x 1
  Local GEMM Problem Size: 2048 x 13312 x 16384 x 1
  Avg runtime: 2.53334 ms
  TFLOPS: 2821.11

H100 SXM (DGX):

  TP: 8
  Problem Size: 16384 x 106496 x 16384 x 1
  Local GEMM Problem Size: 2048 x 13312 x 16384 x 1
  Avg runtime: 9.37504 ms
  TFLOPS: 762.325
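
For reference, the TFLOPS figures above appear consistent with counting 2·M·N·K·L floating-point operations for the full problem and dividing by the average runtime and by the TP degree, i.e. they read as per-device throughput. The sketch below reproduces that arithmetic; the function name `tflops_per_device` is hypothetical, and the formula is an assumption inferred from the reported numbers rather than taken from the example's source.

```cpp
#include <cstdint>

// Hypothetical helper (not from the PR). Assumed formula: the full GEMM
// performs 2*M*N*K*L FLOPs, and the reported figure is that total divided
// by the average runtime and by the number of devices (TP).
inline double tflops_per_device(int64_t m, int64_t n, int64_t k, int64_t l,
                                int tp, double avg_runtime_ms) {
  const double flops = 2.0 * static_cast<double>(m) * static_cast<double>(n) *
                       static_cast<double>(k) * static_cast<double>(l);
  const double seconds = avg_runtime_ms * 1e-3;
  return flops / seconds / static_cast<double>(tp) / 1e12;
}
```

Under that assumption, `tflops_per_device(16384, 106496, 16384, 1, 4, 4.80512)` lands at roughly 2974.67, matching the GB200 run, and the B200 and H100 runs match in the same way with TP set to 8.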

@alihassanijr alihassanijr force-pushed the dist-gemm-blackwell-fixes branch from 08d37d3 to 763f54c Compare October 22, 2025 21:41
@alihassanijr
Contributor Author

Ready for review, @hwu36.

@alihassanijr alihassanijr force-pushed the dist-gemm-blackwell-fixes branch from e5c5316 to a932f25 Compare November 5, 2025 14:31
@hwu36 hwu36 merged commit d1ef0e8 into NVIDIA:main Nov 6, 2025
@alihassanijr alihassanijr deleted the dist-gemm-blackwell-fixes branch November 6, 2025 18:37
guocuimi pushed a commit to vectorch-ai/cutlass that referenced this pull request Nov 6, 2025
* Blackwell DistGEMM bug fixes

* Fix example runtimes

* Set default fallback cluster shapes to the static ones