🚀 Motivation and context
Is it possible to correlate the kernel distribution with ranges annotated through either torch.cuda.nvtx or torch.profiler.profile?
The use case is model architecture optimization. I'd like to understand where the bottlenecks are in a model's forward / backward passes and where the opportunities are for kernel fusion, CUDA graphs, etc. Exporting a Chrome / TensorBoard trace is helpful for visualizing such areas when model regions are annotated with torch.profiler.record_function (or nvtx), but it would also be helpful to have this information available for further analysis as a dataframe.
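For reference, a minimal sketch of the annotation workflow described above, using the public torch.profiler API (the model and range name here are placeholders):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

# CPU-only here so the sketch runs anywhere; add ProfilerActivity.CUDA
# (and CUDA tensors) to also capture kernel events on a GPU machine.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("my_block"):
        y = model(x)
        y.sum().backward()

# Chrome trace for visual inspection (open in chrome://tracing or Perfetto)
prof.export_chrome_trace("trace.json")
```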
Description
It would be useful to have the kernel breakdown per annotation range aggregated into a dataframe, to further investigate problematic modules and layers within the model:
kernel breakdown by annotation region
full correlation trace of the aten / torch ops that dispatched these kernels
additional kernel stats: launch time (cudaLaunchKernel), launch stats (occupancy, grid dim, block dim, kernel args), latency, FLOPs, I/O, etc.
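As a rough sketch of the dataframe idea (not an existing PyTorch API): the aggregated stats from key_averages() can already be flattened into pandas by hand; the attribute names below follow torch.autograd.profiler's FunctionEventAvg:

```python
import pandas as pd
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(32, 32)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("layer_fwd"):
        model(torch.randn(4, 32))

# Flatten the aggregated events into rows; device-time / FLOP columns are
# only populated when CUDA profiling and with_flops=True are enabled.
df = pd.DataFrame(
    {
        "name": evt.key,
        "count": evt.count,
        "cpu_time_total_us": evt.cpu_time_total,
    }
    for evt in prof.key_averages()
)
print(df.sort_values("cpu_time_total_us", ascending=False).head())
```

This loses the per-event trace and nesting, which is exactly the gap the request is about.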
Alternatives
No response
Additional context
No response
Something like the "Events" view in nsys, where you can see a trace of kernels over time, grouped by nvtx range. See this example from this thread.
Essentially what you see from prof.key_averages().table(), except:
not aggregated: a full trace ordered by time
retains the nested structure of the annotated ranges and the call stack; e.g., if I annotate a range with record_function('my_range'), I should see a top-level my_range entry followed by the entire call stack of operators and the kernels they ultimately dispatch, ordered by time, along with the other collected stats.
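A sketch of walking that nested structure with the event tree the profiler already exposes (cpu_parent / cpu_children on FunctionEvent), which is roughly the shape being requested:

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(16, 16)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("my_range"):
        model(torch.randn(2, 16))

def walk(evt, depth=0):
    # Indent by nesting depth; kernels would hang off their launching op
    # when CUDA activity is recorded.
    print("  " * depth + evt.name)
    for child in evt.cpu_children:
        walk(child, depth + 1)

# Top-level events (no parent) include the record_function range itself.
for evt in prof.events():
    if evt.cpu_parent is None:
        walk(evt)
```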