🚀 Motivation and context
Performance metrics like TFLOPS (10^12 floating-point operations per second) and memory bandwidth utilization (GB per second) are crucial for optimizing the performance of matrix multiplication operators and for understanding how well those operators utilize the GPU hardware. These metrics are not immediately available from a trace, but they can be derived from it using the operator input dimensions, kernel execution times, etc. Thus, we request that these TFLOPS metrics be added to HTA.
Description
FLOPS calculation
Assuming a matrix multiplication $A_{M \times K} \times B_{K \times N}$ takes $t$ seconds to finish, each of the $M \times N$ output elements requires $K$ multiplications and $K - 1$ additions, i.e. $2K - 1 \approx 2K$ floating-point operations, so we can compute the TFLOPS by $\mathrm{TFLOPS} = 2 \times 10^{-12} \times M \times K \times N / t$.
Here, $M$, $K$, and $N$ can be extracted from the "input_dim" column of the trace; $t$ is the total time the operator's GPU kernels spend executing on the GPU.
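For illustration, here is a minimal Python sketch of this calculation. It is not HTA's actual implementation: the layout assumed for the "input_dim" entry and the example sizes/duration are assumptions, and in practice $t$ would be summed over the operator's GPU kernel durations from the trace.

```python
import ast


def matmul_tflops(m: int, k: int, n: int, t_seconds: float) -> float:
    """Estimate achieved TFLOPS for an (m x k) @ (k x n) matmul.

    Each of the m * n output elements needs k multiplies and k - 1 adds,
    i.e. roughly 2 * k floating-point operations.
    """
    flops = 2.0 * m * k * n            # total floating-point operations
    return flops / t_seconds / 1e12    # FLOPS -> TFLOPS


def dims_from_input_dim(input_dim: str) -> tuple[int, int, int]:
    """Parse M, K, N from an "input_dim" entry.

    Assumes the two matmul inputs are recorded as "[[M, K], [K, N]]";
    the exact layout of this column is an assumption, not HTA's API.
    """
    (m, k), (k2, n) = ast.literal_eval(input_dim)[:2]
    assert k == k2, "inner dimensions must match"
    return m, k, n


# Hypothetical example: a 4096 x 4096 x 4096 GEMM whose kernels ran for 1.5 ms.
m, k, n = dims_from_input_dim("[[4096, 4096], [4096, 4096]]")
print(f"{matmul_tflops(m, k, n, 1.5e-3):.1f} TFLOPS")  # ~91.6 TFLOPS
```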
Alternatives
No response
Additional context
No response
fengxizhou changed the title from "Estimate TFLOPS and Memory Bandwidth of PyTorch Matrix Multiplication Operators from Kineto Trace" to "Estimate TFLOPS of PyTorch Matrix Multiplication Operators from Kineto Trace" on Apr 18, 2024.