
Estimate TFLOPS of PyTorch Matrix Multiplication Operators from Kineto Trace #124

Open
fengxizhou opened this issue Apr 18, 2024 · 0 comments

🚀 Motivation and context

Performance metrics such as TFLOPS ($10^{12}$ floating-point operations per second) and memory bandwidth utilization (GB per second) are crucial for optimizing the performance of matrix multiplication operators and for understanding how those operators utilize the GPU hardware. These metrics are not immediately available in a Kineto trace, but they can be derived from it using the operators' input dimensions, kernel execution times, etc. Thus, we request that these TFLOPS metrics be added to HTA.

Description

FLOPS calculation

Assuming a matrix multiplication $A_{M \times K} \times B_{K \times N}$ takes $t$ seconds to finish, it performs $M \times N \times (2K - 1)$ floating-point operations ($K$ multiplications and $K - 1$ additions per output element), conventionally approximated as $2 \times M \times N \times K$. We can then compute the TFLOPS by
$TFLOPS = 2 \times 10^{-12} \times M \times N \times K / t$.
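
For example, a $4096 \times 4096 \times 4096$ multiplication performs $2 \times 4096^3 \approx 1.37 \times 10^{11}$ FLOPs; if its kernels run for $1$ ms ($t = 10^{-3}$), this yields roughly $137$ TFLOPS.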

Here, $M$, $K$, and $N$ can be extracted from the "input_dim" column; $t$ is the total duration for which the operator's GPU kernels execute on the GPU.
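
As a rough illustration of the computation (a minimal sketch, not HTA's implementation), the snippet below estimates TFLOPS for `aten::mm` events in a raw Kineto trace JSON. The argument field names (`Input Dims`, `dur`) vary across PyTorch versions and are assumed here, and the operator's own duration is used as a stand-in for $t$; a real implementation would instead sum the durations of the GPU kernels correlated with the operator.

```python
import json

def mm_tflops(m: int, k: int, n: int, t_us: float) -> float:
    """TFLOPS of an (m x k) @ (k x n) matmul whose kernels ran for t_us microseconds."""
    flops = 2.0 * m * n * k        # ~ 2 * M * N * K floating-point operations
    seconds = t_us * 1e-6          # Kineto durations are reported in microseconds
    return flops / seconds / 1e12  # 10^12 FLOPs per second = 1 TFLOPS

with open("trace.json") as f:
    trace = json.load(f)

for event in trace.get("traceEvents", []):
    # "Input Dims" (an assumed field name) holds the operand shapes,
    # e.g. [[M, K], [K, N]] for aten::mm.
    dims = event.get("args", {}).get("Input Dims")
    if event.get("name") == "aten::mm" and dims and len(dims) >= 2:
        (m, k), (_, n) = dims[0], dims[1]
        print(f"aten::mm {m}x{k}x{n}: {mm_tflops(m, k, n, event['dur']):.2f} TFLOPS")
```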

Alternatives

No response

Additional context

No response

@fengxizhou fengxizhou self-assigned this Apr 18, 2024
@fengxizhou fengxizhou changed the title Estimate TFLOPS and Memory Bandwidth of PyTorch Matrix Multiplication Operators from Kineto Trace Estimate TFLOPS of PyTorch Matrix Multiplication Operators from Kineto Trace Apr 18, 2024