Kernel Breakdown by Annotation Range #180

Open

jeromeku opened this issue Aug 22, 2024 · 3 comments

jeromeku commented Aug 22, 2024

🚀 Motivation and context

Is it possible to correlate kernel distribution with ranges annotated either through torch.cuda.nvtx or torch.profiler.profile?

The use case is model architecture optimization. I'd like to understand where the bottlenecks are in a model's forward / backward passes and where the opportunities are for kernel fusion, CUDA graphs, etc. Exporting a Chrome / TensorBoard trace can be helpful for visualizing such areas when model regions are annotated with torch.profiler.record_function (or nvtx), but it would be helpful to have this information available for further analysis as a dataframe.
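For reference, the annotation + export flow I have in mind looks roughly like this (a minimal sketch; the model and the range names are invented for illustration):

```python
# Minimal sketch: annotate model regions and export a Chrome trace for later analysis.
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU()).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("my_block"):              # shows up as a user annotation range
        y = model(x)
    torch.cuda.nvtx.range_push("loss_backward")    # nvtx alternative, visible in nsys
    y.sum().backward()
    torch.cuda.nvtx.range_pop()

prof.export_chrome_trace("trace.json")             # input for Chrome tracing / HTA / pandas
```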

Description

It would be useful to have the kernel breakdown by annotation range aggregated into a dataframe to further investigate problematic modules and layers within the model (a rough sketch of what I mean follows the list):

  • kernel breakdown by annotation region
  • full correlation trace of the aten / torch ops that dispatched these kernels
  • additional kernel stats: cudaLaunch time, launch stats (occupancy, grid dim, block dim, kernel args), latency, FLOPs, I/O, etc.
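
To make the first bullet concrete, here is one rough way such a per-annotation kernel breakdown could be assembled today from the Chrome trace exported above. This is a sketch, not a proposed implementation: the event categories ("kernel", "cuda_runtime", "user_annotation") and the "correlation" arg are assumptions about the PyTorch trace schema and may differ across versions.

```python
# Sketch: correlate GPU kernels back to the user annotation range that launched them.
import json

import pandas as pd

with open("trace.json") as f:
    events = json.load(f)["traceEvents"]

df = pd.DataFrame([e for e in events if e.get("ph") == "X"])

kernels = df[df["cat"] == "kernel"].copy()
runtime = df[df["cat"] == "cuda_runtime"].copy()
annot = df[df["cat"] == "user_annotation"]

# Kernel -> launching runtime call (e.g. cudaLaunchKernel), matched on the correlation id.
get_corr = lambda a: a.get("correlation") if isinstance(a, dict) else None
kernels["corr"] = kernels["args"].apply(get_corr)
runtime["corr"] = runtime["args"].apply(get_corr)
launches = runtime[["corr", "ts", "dur", "tid"]].rename(
    columns={"ts": "launch_ts", "dur": "launch_dur", "tid": "launch_tid"})
merged = kernels.merge(launches, on="corr", how="left")

# Launching call -> enclosing annotation range, by time containment on the same CPU thread.
def enclosing_annotation(row):
    cand = annot[(annot["tid"] == row["launch_tid"])
                 & (annot["ts"] <= row["launch_ts"])
                 & (annot["ts"] + annot["dur"] >= row["launch_ts"] + row["launch_dur"])]
    return cand["name"].iloc[-1] if len(cand) else None

merged["annotation"] = merged.apply(enclosing_annotation, axis=1)
print(merged.groupby("annotation")["dur"].sum().sort_values(ascending=False))
```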

Alternatives

No response

Additional context

No response

jeromeku added the feature request and needs triage labels on Aug 22, 2024
briancoutinho (Contributor) commented

@jeromeku Are you expecting something like a kernel_dataframe with a call_stack column = ["aten:op1_", "aten:op", "module name", ...]?

Does the call_stack logic help to achieve something similar to your request?
https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/hta/common/call_stack.py

It should be able to link from the kernel up to the operators (and likely user annotations like profiler.profile).
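
For reference, the aggregated entry point that exists today looks roughly like this (a sketch; the argument names are taken from the repo docs and may have changed between releases). The request here would essentially extend this with per-annotation, non-aggregated rows.

```python
# Sketch: aggregated GPU kernel breakdown via HTA, returned as DataFrames.
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")  # directory of Chrome traces

# Per-kernel-type and per-kernel metrics as DataFrames (no visualization).
kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown(visualize=False)
print(kernel_df.head())
```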

jeromeku (Author) commented

Something like the "Events" view in nsys, where you can see a trace of kernels over time, grouped by nvtx range. See this example from this thread.

Essentially what you see when you do prof.key_averages().table(), except:

  • not aggregated -- full trace by time
  • retains the nested structure of the annotated range and call stack - e.g., if I annotate a range with record_function('my_range'), I should see a top-level my_range followed by the entire call stack of operators and the kernels they ultimately dispatch to, ordered by time, along with other collected stats.
  • can be exported as a pd.DataFrame (see the sketch below)
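
To illustrate the shape of the output I'm after, here is a rough sketch built over the exported Chrome trace (not a proposed API; the category names "user_annotation" / "cpu_op" are assumptions about the trace schema):

```python
# Sketch: non-aggregated, time-ordered CPU-side events with nesting depth
# reconstructed via interval containment per thread, as a DataFrame.
import json

import pandas as pd

with open("trace.json") as f:
    events = json.load(f)["traceEvents"]

rows = [e for e in events
        if e.get("ph") == "X" and e.get("cat") in ("user_annotation", "cpu_op")]
df = (pd.DataFrame(rows)
        .sort_values(["tid", "ts", "dur"], ascending=[True, True, False])
        .reset_index(drop=True))

# Depth = number of still-open ranges on the same thread that enclose this event.
depths, open_ends = [], {}
for _, e in df.iterrows():
    stack = open_ends.setdefault(e["tid"], [])
    while stack and stack[-1] <= e["ts"]:    # drop ranges that have already ended
        stack.pop()
    depths.append(len(stack))
    stack.append(e["ts"] + e["dur"])         # remember this range's end time
df["depth"] = depths

df["pretty_name"] = df["depth"].map(lambda d: "  " * d) + df["name"]
print(df[["ts", "dur", "pretty_name"]].to_string(index=False))
```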

hychiang-git commented

Same question
