GEMM much slower than GEMV for multiplying column or row vectors #1238
Comments
Hi @peterbell10,
Just wanted to update this and let you know that we have changes in the works. Whether gemv outperforms gemm when m == 1 || n == 1 isn't necessarily clear across all architectures, matrix operations, and sizes, so this might take a little while longer while we make sure performance improves all around as expected.
* Add numerical checking helper to Level 3 rocBLAS
* Added check to see if the input is const
* Enclosed the kernel function of TRSM with brackets to invoke the destructor and release the handle memory
* Addressed the comments
@peterbell10 Can you please check if your issue still exists with the latest ROCm 6.1.2? If resolved, please close the ticket. Thanks!
Sorry, but I no longer have access to an AMD GPU.
What is the expected behavior
GEMM where one of m or n is 1 should perform similarly to the equivalent GEMV call, because it could simply call the GEMV kernel (as cuBLAS does).
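The dispatch idea described above can be sketched in plain Python. This is only an illustration of the arithmetic equivalence, not how rocBLAS or cuBLAS is actually implemented: the function names `gemv`, `gemm_reference`, and `gemm` are hypothetical, matrices are row-major lists of lists, and transpose flags, leading dimensions, and batching are left out.

```python
def gemv(alpha, A, x, beta, y):
    """Return alpha * A @ x + beta * y, with A given as a list of rows."""
    return [alpha * sum(a * b for a, b in zip(row, x)) + beta * yi
            for row, yi in zip(A, y)]


def gemm_reference(alpha, A, B, beta, C):
    """Plain triple-loop GEMM: alpha * A @ B + beta * C."""
    m, k, n = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(n)]
            for i in range(m)]


def gemm(alpha, A, B, beta, C):
    """Hypothetical GEMM front end that falls back to gemv for vector shapes."""
    m, n = len(A), len(B[0])
    if n == 1:
        # B and C are column vectors: C[:, 0] = alpha * A @ B[:, 0] + beta * C[:, 0]
        x = [row[0] for row in B]
        y = [row[0] for row in C]
        return [[v] for v in gemv(alpha, A, x, beta, y)]
    if m == 1:
        # A and C are row vectors: C[0, :] = alpha * B^T @ A[0, :] + beta * C[0, :]
        Bt = [list(col) for col in zip(*B)]
        return [gemv(alpha, Bt, A[0], beta, C[0])]
    # General case: no vector shortcut applies.
    return gemm_reference(alpha, A, B, beta, C)
```

Both special cases produce the same result as the general routine, so a library could in principle route these shapes to its GEMV kernel. Whether that is actually faster depends on the architecture and problem size.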
What actually happens
GEMM performs much worse than GEMV.
How to reproduce
For simplicity I'm using cupy's wrappers, but these just call the underlying rocBLAS functions with appropriate checks. I have a script, gemm.py, and run it under rocprof to get kernel timings. The output shows two distinct kernels, each called 100 times. The gemv kernel takes around 12.5 us, but the gemm kernel takes 1273 us to perform the same computation, which is roughly 100 times slower.
Environment