Performance Investigation
- Make sure you are using a RelWithDebInfo build. (A Debug build is significantly slower and should not be used for benchmarking or performance investigation.)
- Turn on model dumping and logging by passing DebugOptions to the ORTModule constructor:

  ```python
  from torch_ort import ORTModule, DebugOptions, LogLevel

  ort_model = ORTModule(pt_model, DebugOptions(save_onnx=True, onnx_prefix='<MODEL NAME>', log_level=LogLevel.VERBOSE))
  ```
- To add NVTX markers that associate each kernel with the op that triggered it, build ORT with the --enable_nvtx_profile option. This option adds performance overhead and must not be used for comparative perf measurement. (A sketch for adding your own step-level markers follows this list.)
- Run profiling with the data sizes (batch_size/seq_len) that will be used in real training, so that kernel performance is evaluated at the target sizes.
- In the profile, identify which kernel timings scale with the number of steps/layers, in order to correctly determine the bottleneck.
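As referenced above, you can complement ORT's built-in NVTX markers with your own ranges from the training script, so that step boundaries are easy to find in the profiler timeline. A minimal sketch using torch.cuda.nvtx; ort_model, optimizer, and data_loader are placeholders for your own objects, and the model is assumed to return the loss directly:

```python
import torch

def train_steps(ort_model, optimizer, data_loader, num_steps=10):
    for step, (inputs, labels) in enumerate(data_loader):
        if step == num_steps:
            break
        # Mark the whole step so it stands out in the nvprof/Nsight timeline.
        torch.cuda.nvtx.range_push(f"train_step_{step}")

        torch.cuda.nvtx.range_push("forward")
        loss = ort_model(inputs, labels)  # assumption: model returns the loss
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("backward")
        loss.backward()
        torch.cuda.nvtx.range_pop()

        optimizer.step()
        optimizer.zero_grad()

        torch.cuda.nvtx.range_pop()  # end of train_step range
```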
For better visualization in Netron, you can shrink the model to 1 or 2 layers.
If you see a CUDA-related failure, please set the environment variable with export CUDA_LAUNCH_BLOCKING=1 so that kernel launches are synchronous and the failing op is easier to pinpoint.
The dumped ONNX files are:
- *_torch_exported_<mode>.onnx is the ONNX model coming directly out of the exporter, without any graph transformation.
- *_pre_grad_optimized_<mode>.onnx is the optimized ONNX model before the gradient graph is built and before the gradient graph transformations are applied.
- *_optimized_<mode>.onnx is the training graph built on top of the *_pre_grad_optimized_<mode>.onnx graph, after building the gradient graph and applying the gradient graph transformations.
- *_execution_model_<mode>.onnx is the final optimized training graph, i.e. the actual graph executed by the execution engine.
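To get a quick overview of any of these dumped graphs, you can load them with the onnx Python package and count node types. A small sketch; the file name is illustrative (substitute the prefix you passed to DebugOptions and the mode you are interested in):

```python
from collections import Counter

import onnx

# Illustrative file name; use your own onnx_prefix and mode.
model = onnx.load("bert_execution_model_training.onnx")

op_counts = Counter(node.op_type for node in model.graph.node)
for op_type, count in op_counts.most_common():
    print(f"{op_type:30s} {count}")
```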
Excessive memcpy nodes
Search for 'memcpy' in the *_execution_model_<mode>.onnx graph. In the ideal case, there should be zero memcpy nodes in the final optimized training graph.
- If the CUDA kernel is missing for an op, you will commonly see the node sandwiched between MemcpyToHost and MemcpyFromHost nodes.
- If the producer node and consumer node expect the tensor to be on different devices, a memcpy node will be inserted between them.
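A sketch that locates such sandwiched nodes programmatically (file name illustrative):

```python
import onnx

model = onnx.load("bert_execution_model_training.onnx")  # illustrative name
nodes = model.graph.node

# Map tensor names to the node producing them and the nodes consuming them.
produced_by = {out: node for node in nodes for out in node.output}
consumed_by = {}
for node in nodes:
    for name in node.input:
        consumed_by.setdefault(name, []).append(node)

for node in nodes:
    fed_by_to_host = any(
        name in produced_by and produced_by[name].op_type == "MemcpyToHost"
        for name in node.input
    )
    feeds_from_host = any(
        consumer.op_type == "MemcpyFromHost"
        for out in node.output
        for consumer in consumed_by.get(out, [])
    )
    if fed_by_to_host and feeds_from_host:
        print(f"{node.op_type} ({node.name}) runs between memcpy nodes")
```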
CUDA Kernel Missing for an Op
This can usually be discovered by the following methods:
- Look for logs with the following pattern: CUDA Kernel not found in registries for Op type: Clip node name: Clip_150
- Look for a node sandwiched between MemcpyToHost and MemcpyFromHost nodes in the *_optimized_<mode>.onnx graph.
Excessive Cast nodes
If the graph is converted under Apex O1, autocast, or DeepSpeed fp16 mode, there might be some excessive Cast nodes left in the graph.
- Look for nodes surrounded (sandwiched) by Cast nodes, and check whether ORT has already implemented fp16-safe kernels for them, e.g. LayerNorm, Gelu.
- Common cast target types (check TensorProto::DataType in onnx.proto for the complete list):

  | to | type    |
  |----|---------|
  | 1  | float   |
  | 10 | float16 |
  | 6  | int32   |
  | 7  | int64   |
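A sketch that counts Cast nodes by their target type in one of the dumped graphs (file name illustrative):

```python
from collections import Counter

import onnx
from onnx import TensorProto

model = onnx.load("bert_pre_grad_optimized_training.onnx")  # illustrative name

cast_targets = Counter()
for node in model.graph.node:
    if node.op_type == "Cast":
        # The 'to' attribute holds the TensorProto::DataType enum value.
        to = next(attr.i for attr in node.attribute if attr.name == "to")
        cast_targets[TensorProto.DataType.Name(to)] += 1

for dtype, count in cast_targets.most_common():
    print(f"Cast to {dtype}: {count}")
```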
Missing Graph Transformers
ORT uses pattern matching to look for opportunities to apply graph transformations. If the graph differs from the coded pattern, the graph transformation may fail to kick in.
- Look for (Simplified)LayerNormalization in the *_pre_grad_optimized_<mode>.onnx graph. The LayerNorm subgraph (search for a Pow node to begin with) should be fused into a single node.
- Look for (Fast)Gelu in the *_pre_grad_optimized_<mode>.onnx graph. The Gelu subgraph (search for an Erf node to begin with) should be fused into a single node.
- Look for stand-alone MatMul nodes in the *_execution_model_<mode>.onnx graph. Most of the MatMuls should have been fused with a leading Transpose/Scale into FusedMatMul nodes, or with a following Add into Gemm nodes. Examine any unfused MatMul nodes to see whether they should have been fused with surrounding ops.
- Look for stand-alone Dropout nodes in the *_execution_model_<mode>.onnx graph. Examine whether they should be fused with surrounding Add ops into BiasDropout nodes.
- Look for stand-alone Softmax nodes in the *_execution_model_<mode>.onnx graph. Examine whether they should be fused with the leading Add ops into BiasSoftmax nodes.
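A sketch that tallies the fused ops listed above against the stand-alone ops that often indicate a missed fusion (file names illustrative):

```python
from collections import Counter

import onnx

def op_histogram(path):
    return Counter(node.op_type for node in onnx.load(path).graph.node)

# Illustrative file names; substitute your own prefix and mode.
pre_grad = op_histogram("bert_pre_grad_optimized_training.onnx")
execution = op_histogram("bert_execution_model_training.onnx")

# Fusions expected in the pre-grad optimized graph.
for op in ("LayerNormalization", "SimplifiedLayerNormalization", "Gelu", "FastGelu"):
    print(f"{op}: {pre_grad[op]}")

# Stand-alone ops in the final training graph that may indicate a missed fusion,
# followed by their fused counterparts.
for op in ("MatMul", "Dropout", "Softmax"):
    print(f"stand-alone {op}: {execution[op]}")
for op in ("FusedMatMul", "Gemm", "BiasDropout", "BiasSoftmax"):
    print(f"{op}: {execution[op]}")
```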
nvprof
- Try running with and without --print-gpu-summary
- Try --profile-child-processes
- Action: profile a training run
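A hypothetical launcher that wraps a training script with nvprof using the flags above, shown in Python to keep the examples in one language; train.py is a placeholder for your own entry point:

```python
import subprocess
import sys

# --print-gpu-summary prints a per-kernel GPU time summary; try with and
# without it. --profile-child-processes is needed when the training script
# spawns worker processes. "train.py" is a placeholder script name.
cmd = [
    "nvprof",
    "--print-gpu-summary",
    "--profile-child-processes",
    sys.executable,
    "train.py",
]
subprocess.run(cmd, check=True)
```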
Visual Profiler UI
- Use the ruler to measure a time span
- Identify the top-hitter kernels
- Compare two sets of profiling results to identify the performance gap
- Can you identify the start/end of a train_step from the timeline view?
torch profiler
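A minimal torch.profiler sketch for profiling a few training steps of an ORTModule-wrapped model; ort_model, optimizer, and data_loader are placeholders, and the model is assumed to return the loss directly:

```python
from torch.profiler import ProfilerActivity, profile

def profile_steps(ort_model, optimizer, data_loader, num_steps=10):
    # Collect both CPU and CUDA activity for a few steps.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for step, (inputs, labels) in enumerate(data_loader):
            if step == num_steps:
                break
            loss = ort_model(inputs, labels)  # assumption: model returns the loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    # Print the ops/kernels sorted by total CUDA time.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```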
Linux perf
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.