[Feature Request] Support ONNX Q/DQ Autotuning with Subgraph Mode #1015

Hale423 wants to merge 9 commits into NVIDIA:main from dev-wahao-autotune-subgraph-profile
Conversation
Signed-off-by: Will Guo <willg@nvidia.com>
- Add export_profile_path support; append --exportProfile/--profilingVerbosity when requested
- Skip adding --separateProfileRun if already present in user trtexec args
- On trtexec 'Unknown option' error, strip profiling flags and retry once without them
- Set _profile_unsupported so later runs use total-latency comparison only
- Extract _exec_and_log for shared run-and-log logic

Made-with: Cursor
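The "Unknown option" retry described in the commit message might look roughly like this. The flag names come from the PR text; the helper names (`strip_profiling_flags`, `run_trtexec`, `exec_fn`) are illustrative, not the actual API in `tensorrt_utils.py`:

```python
# Sketch of the retry logic: run trtexec once, and if the build rejects the
# profiling flags, strip them and retry, falling back to total latency.
PROFILING_FLAGS = ("--exportProfile", "--profilingVerbosity", "--separateProfileRun")

def strip_profiling_flags(args):
    """Drop profiling flags (including their '=value' forms) from trtexec args."""
    return [a for a in args if a.split("=")[0] not in PROFILING_FLAGS]

def run_trtexec(args, exec_fn):
    """Run trtexec via exec_fn(args); exec_fn raises RuntimeError with the
    tool's error text on failure (injectable here so the sketch is testable).

    Returns (output, profile_unsupported).
    """
    try:
        return exec_fn(args), False
    except RuntimeError as e:
        if "Unknown option" not in str(e):
            raise
        # Retry once without profiling flags; the caller records
        # profile_unsupported so later runs compare total latency only.
        return exec_fn(strip_profiling_flags(args)), True
```

The single retry keeps the failure mode cheap: one extra trtexec invocation at most, after which the run proceeds with total-latency comparison only.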
cjluo-nv
left a comment
This PR introduces 16k+ lines of changes. Please consider sharing a design and getting a design review.
Thanks for the feedback. Sharing this design, please kindly take a look.

Design: ONNX Q/DQ Autotuning with Subgraph Mode

Design review document for PR #1015

1. Background

TensorRT performance for quantized ONNX models depends not only on whether Q/DQ nodes exist, but also on where they are inserted. In practice:
This branch introduces an ONNX Q/DQ autotuning system that searches for better Q/DQ placement using actual TensorRT latency measurements. The design intentionally supports two workflows:
2. Goals
3. Non-goals
4. Scope Relative to
Pull Request: ONNX Q/DQ Autotuning with Subgraph Mode
Branch: dev-wahao-autotune-subgraph-profile → main

Type: Feature
Summary
This PR adds automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models using TensorRT performance measurements. It introduces two workflow modes:

- Region mode: full-model Q/DQ placement tuning (no trtexec required).
- Subgraph mode: fusion-aware tuning driven by TensorRT layer information (`graph.json`); profiles isolated subgraphs for much faster tuning on large or dynamic-shape models (~30 min vs ~25 h in practice).

Subgraph mode is the main addition over a baseline "auto QDQ placement" implementation: it uses TRT fusion info, optional per-layer timing, incremental full-model validation, and cache/resume.
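Subgraph mode's fusion-aware grouping starts from the layer names TensorRT exports to `graph.json`. A minimal sketch of the idea, assuming a simplified export layout in which a fused engine layer's `"Name"` joins the original ONNX node names with `" + "` (the real `--exportLayerInfo` output has more fields and the name format can vary):

```python
import json

# Assumed, simplified layout of a trtexec --exportLayerInfo graph.json:
# {"Layers": [{"Name": "Conv_0 + Relu_1", ...}, ...]}
def fusion_groups_from_graph_json(path):
    """Group ONNX node names by the engine layer that fused them.

    Hypothetical sketch: assumes a fused layer's "Name" concatenates the
    absorbed node names with ' + ', a simplification of the real export.
    """
    with open(path) as f:
        graph = json.load(f)
    groups = []
    for layer in graph.get("Layers", []):
        names = [n.strip() for n in layer.get("Name", "").split("+") if n.strip()]
        if names:
            groups.append(names)
    return groups
```

Each resulting group is a candidate unit for subgraph extraction, so Q/DQ schemes are profiled per fusion group rather than per node.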
What’s New (vs main)
- New `modelopt.onnx.quantization.autotune` package: region discovery, scheme generation, TensorRT benchmarking (Python API + optional trtexec), pattern cache, QDQ baseline import.
- `--mode subgraph`: fusion-aware grouping from TensorRT `graph.json`; per-subgraph QDQ scheme profiling; optional per-layer timing when trtexec supports it (with fallback to total latency).
- `fusion_grouping.py`: parse TRT `graph.json`, build fusion groups, infer shapes for extracted subgraphs. If `--graph-json` is omitted, runs trtexec once to generate `graph.json` (FP16 build with `--exportLayerInfo`).
- Incremental validation: produces `optimized_raw.onnx` (all qualifying QDQ) and `optimized_final.onnx` (validated). Default: on (`--incremental-validation`); use `--no-incremental-validation` to disable.
- Cache/resume: `autotune_cache.json` for Phase 2 (subgraph profiling) and Phase 3 (incremental validation). Re-running the same command resumes from the last checkpoint.
- `--use-trtexec` plus `--trtexec-args` for benchmarking with dynamic shapes (e.g. `--optShapes`) and custom options (e.g. `--useCudaGraph`, `--stronglyTyped`). trtexec profiling flags are optional; on "Unknown option" the code strips them and retries (fallback to total latency).
- `examples/qdq_placement/`: README (Quick Start, region vs subgraph, output layout, subgraph best practices) and `set_batch_size.py` for fixed-batch ResNet50.

Key Files
- `modelopt/onnx/quantization/autotune/__main__.py` – CLI: `--mode`, `--graph-json`, `--incremental-validation`, `--use-trtexec`, `--trtexec-args`, etc.
- `modelopt/onnx/quantization/autotune/subgraph_workflow.py`
- `modelopt/onnx/quantization/autotune/fusion_grouping.py` – parse `graph.json`, create fusion groups, `generate_graph_json()` (trtexec FP16 build when no graph is provided).
- `modelopt/onnx/quantization/autotune/subgraph_extractor.py`
- `modelopt/onnx/quantization/autotune/tensorrt_utils.py` – `export_profile_path`, profiling-flag dedup and "Unknown option" retry without profiling.
- `modelopt/onnx/quantization/autotune/workflows.py` – `benchmark_onnx_model()`; passes through `export_profile_path` when using trtexec.
- `modelopt/onnx/quantization/autotune/autotuner.py`
- `modelopt/onnx/quantization/autotune/region_*.py`
- `examples/qdq_placement/README.md`
- `examples/qdq_placement/set_batch_size.py`

How to Test
Region mode (no trtexec):
Subgraph mode with trtexec (FP8, optional graph.json):
Resume: kill the subgraph run mid-way, then re-run the same command; it should resume from `autotune_cache.json`.

Checklist
- Region and subgraph runs work with `--use-trtexec` (with or without `--graph-json`).
- Without `--graph-json`, one trtexec FP16 build runs and produces `*.fp16.graph.json` in the output dir.
- `examples/qdq_placement/README.md` matches behavior (region vs subgraph, outputs, best practices).

Documentation
- `examples/qdq_placement/README.md` – Quick Start, subgraph best practices, output layout, optional graph generation.
- `docs/source/guides/9_qdq_placement.rst` and `docs/source/reference/2_qdq_placement.rst`; confirm they align with the CLI and behavior above when submitting.

Notes
- trtexec builds that do not support `--exportProfile`/`--profilingVerbosity` are handled by retrying without those flags and using total latency for scheme selection.
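The cache/resume behavior described under What's New (Phase 2/3 checkpoints in `autotune_cache.json`) amounts to checkpointing after each profiled unit. A rough sketch with hypothetical names; the real cache layout is richer (per-phase sections, scheme metadata):

```python
import json
import os

def load_checkpoint(cache_path):
    """Return previously profiled results, or an empty dict on a first run."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    return {}

def profile_all(subgraphs, profile_fn, cache_path="autotune_cache.json"):
    """Profile each subgraph, writing the cache after every result so a
    killed run re-invoked with the same command resumes where it stopped.

    Hypothetical sketch; `profile_fn` stands in for the actual
    per-subgraph QDQ scheme benchmarking.
    """
    results = load_checkpoint(cache_path)
    for name, sg in subgraphs.items():
        if name in results:          # already profiled in a previous run
            continue
        results[name] = profile_fn(sg)
        with open(cache_path, "w") as f:
            json.dump(results, f)    # checkpoint after each subgraph
    return results
```

Writing the cache after every subgraph trades a little I/O for the guarantee that at most one subgraph's profiling work is lost on interruption.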