🚀 Motivation and context
Is it possible to correlate the kernel distribution with ranges annotated through either torch.cuda.nvtx or torch.profiler.profile?
The use case is model architecture optimization. I'd like to understand where the bottlenecks are in a model's forward / backward passes and where the opportunities are for kernel fusion, CUDA graphs, etc. Exporting a Chrome / TensorBoard trace is helpful for visualizing such areas when model regions are annotated with torch.profiler.record_function (or nvtx), but it would also be helpful to have this information available for further analysis as a dataframe.
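For reference, a minimal sketch of the annotation workflow described above, using the public torch.profiler API (the model and range name here are placeholders):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

# CPU-only here so the sketch runs anywhere; add ProfilerActivity.CUDA
# (and CUDA tensors) to also capture kernel events on a GPU machine.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("my_block"):
        y = model(x)
        y.sum().backward()

# Chrome trace for visual inspection (open in chrome://tracing or Perfetto)
prof.export_chrome_trace("trace.json")
```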
Description
It would be useful to have the kernel breakdown per annotation range aggregated into a dataframe, to further investigate problematic modules and layers within the model:
kernel breakdown by annotation region
full correlation trace of the aten / torch ops that dispatched these kernels
additional kernel stats: launch time (cudaLaunchKernel), launch stats (occupancy, grid dim, block dim, kernel args), latency, FLOPs, I/O, etc.
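As a rough sketch of the dataframe idea (not an existing PyTorch API): the aggregated stats from key_averages() can already be flattened into pandas by hand; the attribute names below follow torch.autograd.profiler's FunctionEventAvg:

```python
import pandas as pd
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(32, 32)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("layer_fwd"):
        model(torch.randn(4, 32))

# Flatten the aggregated events into rows; device-time / FLOP columns are
# only populated when CUDA profiling and with_flops=True are enabled.
df = pd.DataFrame(
    {
        "name": evt.key,
        "count": evt.count,
        "cpu_time_total_us": evt.cpu_time_total,
    }
    for evt in prof.key_averages()
)
print(df.sort_values("cpu_time_total_us", ascending=False).head())
```

This loses the per-event trace and nesting, which is exactly the gap the request is about.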
Alternatives
No response
Additional context
No response
Something like the "Events" view in nsys, where you can see a trace of kernels over time, grouped by nvtx range. See this example from this thread.
Essentially what you see from prof.key_averages().table(), except:
not aggregated: a full trace ordered by time
retains the nested structure of the annotated ranges and the call stack; e.g., if I annotate a range with record_function('my_range'), I should see a top-level my_range entry followed by the entire call stack of operators and the kernels they ultimately dispatch, ordered by time, along with the other collected stats.
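A sketch of walking that nested structure with the event tree the profiler already exposes (cpu_parent / cpu_children on FunctionEvent), which is roughly the shape being requested:

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(16, 16)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("my_range"):
        model(torch.randn(2, 16))

def walk(evt, depth=0):
    # Indent by nesting depth; kernels would hang off their launching op
    # when CUDA activity is recorded.
    print("  " * depth + evt.name)
    for child in evt.cpu_children:
        walk(child, depth + 1)

# Top-level events (no parent) include the record_function range itself.
for evt in prof.events():
    if evt.cpu_parent is None:
        walk(evt)
```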