🚀 Motivation and context
Performance metrics like TFLOPS (10^12 floating-point operations per second) and memory bandwidth utilization (GB per second) are crucial for optimizing the performance of matrix multiplication operators and for understanding how well those operators utilize the GPU hardware. These metrics are not immediately available from a trace, but they can be derived from it using the operator input dimensions, kernel execution times, etc. Thus, we request that these TFLOPS metrics be added to HTA.
Description
FLOPS calculation
Assuming a matrix multiplication $A_{M \times K} \times B_{K \times N}$ takes $t$ seconds to finish, each of the $M \times N$ output elements requires $K$ multiplications and $K - 1$ additions, i.e. $2K - 1 \approx 2K$ floating-point operations, so we can compute the TFLOPS by $\mathrm{TFLOPS} = 2 \times 10^{-12} \times M \times K \times N / t$.
Here, $M$, $K$, and $N$ can be extracted from the "input_dim" column of the trace; $t$ is the total time the operator's GPU kernels spend executing on the GPU.
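For illustration, here is a minimal Python sketch of this calculation. It is not HTA's actual implementation: the layout assumed for the "input_dim" entry and the example sizes/duration are assumptions, and in practice $t$ would be summed over the operator's GPU kernel durations from the trace.

```python
import ast


def matmul_tflops(m: int, k: int, n: int, t_seconds: float) -> float:
    """Estimate achieved TFLOPS for an (m x k) @ (k x n) matmul.

    Each of the m * n output elements needs k multiplies and k - 1 adds,
    i.e. roughly 2 * k floating-point operations.
    """
    flops = 2.0 * m * k * n            # total floating-point operations
    return flops / t_seconds / 1e12    # FLOPS -> TFLOPS


def dims_from_input_dim(input_dim: str) -> tuple[int, int, int]:
    """Parse M, K, N from an "input_dim" entry.

    Assumes the two matmul inputs are recorded as "[[M, K], [K, N]]";
    the exact layout of this column is an assumption, not HTA's API.
    """
    (m, k), (k2, n) = ast.literal_eval(input_dim)[:2]
    assert k == k2, "inner dimensions must match"
    return m, k, n


# Hypothetical example: a 4096 x 4096 x 4096 GEMM whose kernels ran for 1.5 ms.
m, k, n = dims_from_input_dim("[[4096, 4096], [4096, 4096]]")
print(f"{matmul_tflops(m, k, n, 1.5e-3):.1f} TFLOPS")  # ~91.6 TFLOPS
```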
Alternatives
No response
Additional context
No response
fengxizhou changed the title from "Estimate TFLOPS and Memory Bandwidth of PyTorch Matrix Multiplication Operators from Kineto Trace" to "Estimate TFLOPS of PyTorch Matrix Multiplication Operators from Kineto Trace" on Apr 18, 2024.