Conversation

@zhang-hui-yulo

This extends MMF_ROWS_PER_BLOCK in mmf beyond warp_size; MMF_ROWS_PER_BLOCK itself is kept at the old value, since I have not done any performance tuning.

Tested with MMF_ROWS_PER_BLOCK = 64 on my 3080; there is not enough shared memory when MMF_ROWS_PER_BLOCK = 128.
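
For context, a minimal sketch of the back-of-the-envelope check involved is below. The tile width, fp16 element size, and double buffering are illustrative assumptions, not the actual mmf kernel layout; only the device-property queries are real CUDA API.

```cpp
// Minimal sketch (not the actual mmf kernel): query the device's shared-memory
// limits and estimate how a tile's footprint scales with rows per block.
// cols_per_tile, fp16 elements, and double buffering are assumed values.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    printf("sharedMemPerBlock      : %zu bytes\n", prop.sharedMemPerBlock);
    printf("sharedMemPerBlockOptin : %zu bytes\n", prop.sharedMemPerBlockOptin);

    const int cols_per_tile = 256; // assumed K-tile width
    for (int rows_per_block = 32; rows_per_block <= 128; rows_per_block *= 2) {
        // rows x cols fp16 tile, double-buffered (all assumptions)
        const size_t bytes = (size_t) rows_per_block * cols_per_tile * 2 /*fp16*/ * 2 /*buffers*/;
        printf("rows_per_block = %3d -> ~%zu KiB (%s opt-in limit)\n",
               rows_per_block, bytes / 1024,
               bytes <= prop.sharedMemPerBlockOptin ? "within" : "over");
    }
    return 0;
}
```

Since the footprint grows linearly with rows per block, under these assumed numbers 64 rows fit within the ~99 KiB opt-in limit of a CC 8.6 device like the 3080, while 128 rows do not, matching the behavior observed above.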

zhang-hui-yulo changed the title from "extended MMF_ROWS_PER_BLOCK" to "cuda: extended MMF_ROWS_PER_BLOCK" on Nov 6, 2025
github-actions bot added the labels "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Nov 6, 2025
@am17an
Collaborator

am17an commented Nov 6, 2025

You can use this PR as a base for other PRs, but as such there is no use for it in its current form, right? I.e., this path is not exercised?

@JohannesGaessler
Collaborator

Please keep this in the PR for WMMA support; I cannot evaluate these changes in a vacuum.

@zhang-hui-yulo
Author

> You can use this PR as a base for other PRs, but as such there is no use for it in its current form, right? I.e., this path is not exercised?

Yep, the path is not exercised unless you set MMF_ROWS_PER_BLOCK to 64. Performance in test-backend-ops might increase or decrease due to the different shape, so I just kept the old value of 32.
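
To illustrate, exercising the new path is just a matter of changing the compile-time constant; the invariant it has to satisfy is sketched below. The names mirror this discussion, and the actual ggml-cuda source may differ.

```cpp
// Sketch of the relaxed invariant: MMF_ROWS_PER_BLOCK used to equal warp_size;
// it may now be any whole multiple of it. Set it to 64 to exercise the new path.
// Names mirror the discussion; the actual ggml-cuda source may differ.
constexpr int warp_size = 32;
constexpr int MMF_ROWS_PER_BLOCK = 64; // default remains 32, i.e. one warp

static_assert(MMF_ROWS_PER_BLOCK % warp_size == 0,
              "MMF_ROWS_PER_BLOCK must be a whole number of warps");
```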
