[Example] Add 2:4 sparsity -> INT8 SmoothQuant PTQ -> ONNX -> TensorRT pipeline#1664
ajrasane wants to merge 5 commits into
New example examples/sparse_quant_trt/pipeline.py: an end-to-end flow for Qwen2.5-1.5B-Instruct that applies optional 2:4 structured sparsity and INT8 W8A8 SmoothQuant PTQ (with optional QAT), exports to ONNX, builds a strongly-typed TensorRT engine, validates structured-sparse INT8 kernel selection, and runs greedy text inference. Tested in nvcr.io/nvidia/pytorch:26.01-py3 (PyTorch 2.10.0a0+a36e1d39eb, ONNX 1.18.0, TensorRT 10.14.1.48, CUDA 13.1, ModelOpt 0.45.0rc0, transformers 5.9.0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
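The SmoothQuant step named in this commit can be illustrated with a minimal pure-Python sketch. This is not ModelOpt's implementation: the helper names (`matmul`, `smooth_scales`), the `alpha=0.5` default, and the toy shapes are assumptions chosen for illustration. The sketch shows the core idea only: per-input-channel scales move activation outliers into the weights without changing the layer output.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def smooth_scales(x, w, alpha=0.5, eps=1e-8):
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    x: calibration activations, shape (tokens, in_features)
    w: weights, shape (in_features, out_features), so y = x @ w
    alpha is an assumed default, not a value taken from the PR.
    """
    act_absmax = [max(abs(v) for v in col) for col in zip(*x)]
    w_absmax = [max(abs(v) for v in row) for row in w]
    return [(a + eps) ** alpha / (m + eps) ** (1.0 - alpha)
            for a, m in zip(act_absmax, w_absmax)]

random.seed(0)
x = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
for row in x:
    row[3] *= 50.0  # inject one outlier activation channel
w = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

s = smooth_scales(x, w)
x_smooth = [[v / sj for v, sj in zip(row, s)] for row in x]  # divide activations by s
w_smooth = [[v * sj for v in row] for row, sj in zip(w, s)]  # fold s into the weights

# The smoothed layer computes the same output as the original layer.
y_ref = matmul(x, w)
y_smooth = matmul(x_smooth, w_smooth)
max_err = max(abs(a - b) for ra, rb in zip(y_ref, y_smooth) for a, b in zip(ra, rb))
```

Because `(x / s) @ (s * w)` equals `x @ w` up to floating-point rounding, the smoothing itself is lossless; the accuracy impact comes only from the INT8 quantization applied afterwards.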
Documents examples/sparse_quant_trt: pipeline overview, tested Docker container and library versions, setup, usage and key flags, per-stage description, when TensorRT selects structured-sparse INT8 kernels (Found vs Chose), and the 2:4-sparsity accuracy caveat (requires QAT/SAT recovery). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #1664      +/-   ##
==========================================
- Coverage   56.59%   56.58%   -0.02%
==========================================
  Files         507      507
  Lines       55794    55794
==========================================
- Hits        31579    31573       -6
- Misses      24215    24221       +6
…example Add --compare-baseline: build an FP16 (unquantized, dense) TensorRT engine from the same model and profile it against the optimized engine with trtexec, using the same profiling parameters as modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py (--warmUp / --avgRuns / --iterations / --noDataTransfers / --useCudaGraph / --useSpinWait). Reports throughput (qps) and median GPU-compute/latency for both engines plus the optimized engine's speedup. Documents the flag and method in the README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
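The profiling step described in this commit could be sketched roughly as follows. This is an illustration, not the code in pipeline.py or engine_builder.py: the function names, the engine file name, and the numeric defaults are hypothetical placeholders. Only the trtexec flags themselves are taken from the commit message, plus `--loadEngine`, which trtexec uses to profile an already-built engine.

```python
import shlex

def profile_command(engine_path, warm_up_ms=500, avg_runs=100, iterations=100):
    """Build a trtexec profiling command for an already-built engine.

    The numeric defaults are placeholders, not the values used by
    engine_builder.py.
    """
    return [
        "trtexec",
        f"--loadEngine={engine_path}",
        f"--warmUp={warm_up_ms}",
        f"--avgRuns={avg_runs}",
        f"--iterations={iterations}",
        "--noDataTransfers",  # exclude host/device copies from the timing
        "--useCudaGraph",     # capture inference in a CUDA graph to cut launch overhead
        "--useSpinWait",      # spin instead of blocking when synchronizing
    ]

def speedup(baseline_qps, optimized_qps):
    """Throughput speedup of the optimized engine over the baseline."""
    if baseline_qps <= 0:
        raise ValueError("baseline throughput must be positive")
    return optimized_qps / baseline_qps

cmd = profile_command("model_int8.engine")  # hypothetical engine file name
print(shlex.join(cmd))
```

Running the same command against both engines keeps the comparison fair, since both measurements use identical warm-up, iteration, and synchronization settings. The resulting list can be passed to `subprocess.run` on a machine where trtexec is installed.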
INT8-quantize the attention BMMs and softmax in addition to the linear projections by default (--no-quant-attention reverts to a linears-only graph). Update the script docstring and the example README accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
INT8-quantize the attention BMMs and softmax (q/k/v_bmm + softmax) unconditionally as part of the default INT8 path, and remove the optional --quant-attention / --no-quant-attention flag. Update the docstring and README accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
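As a conceptual aid for what INT8-quantizing an attention BMM means, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization applied to a toy query-key product. It is an assumption-laden illustration, not the Q/DQ graph that ModelOpt exports or the kernels TensorRT runs: the helper names, the per-tensor granularity, and the toy values are all hypothetical.

```python
def quantize_int8(matrix):
    """Symmetric per-tensor INT8 quantization: returns (integer rows, scale)."""
    amax = max(abs(v) for row in matrix for v in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return q, scale

def int8_matmul(a_q, a_scale, b_q, b_scale):
    """Integer matmul with wide accumulation, dequantized with both scales."""
    return [[sum(x * y for x, y in zip(row, col)) * a_scale * b_scale
             for col in zip(*b_q)] for row in a_q]

q = [[0.5, -1.0, 0.25], [1.5, 0.75, -0.5]]        # toy "query" rows
k_t = [[1.0, -0.5], [0.25, 0.75], [-1.25, 0.5]]   # toy "key" matrix, transposed

q_int, q_scale = quantize_int8(q)
k_int, k_scale = quantize_int8(k_t)
scores_int8 = int8_matmul(q_int, q_scale, k_int, k_scale)

# Floating-point reference for comparison.
scores_fp = [[sum(x * y for x, y in zip(row, col)) for col in zip(*k_t)] for row in q]
max_err = max(abs(a - b) for ra, rb in zip(scores_int8, scores_fp) for a, b in zip(ra, rb))
```

The integer product is rescaled by the product of the two input scales, which is why both BMM operands need their own quantizer.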
What does this PR do?
Type of change: new example
Adds examples/sparse_quant_trt/ (pipeline.py + README.md), a self-contained end-to-end flow on Qwen2.5-1.5B-Instruct:

[2:4 structured sparsity] -> INT8 W8A8 SmoothQuant PTQ -> [QAT] -> ONNX export (opset 20) -> strongly-typed TensorRT engine (trtexec) -> structured-sparse INT8 kernel validation -> real greedy text generation.

- The default INT8 path quantizes the attention BMMs and softmax (q/k/v_bmm + softmax) in addition to the linear projections.
- 2:4 sparsity (--sparsity) and QAT (--qat) are opt-in and OFF by default. With --sparsity, INT8 output quantizers are auto-enabled so each GEMM is INT8-in/INT8-out, which is the epilogue condition under which TensorRT actually selects structured-sparse INT8 (SPMMA) kernels.
- --weights-dtype fp16 (default) exports a native fp16 graph; both fp16 and fp32 build with TensorRT --stronglyTyped.
- --compare-baseline additionally builds an FP16 (unquantized, dense) engine and profiles both engines with trtexec, using the profiling parameters from modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py (--warmUp / --avgRuns / --iterations / --noDataTransfers / --useCudaGraph / --useSpinWait), and reports the optimized engine's throughput/latency speedup.

Container + library versions/commits used:

- Container: nvcr.io/nvidia/pytorch:26.01-py3
- PyTorch: 2.10.0a0+a36e1d39eb (git commit a36e1d39eb)
- ONNX: 1.18.0
- TensorRT: 10.14.1.48 (trtexec v101401)
- CUDA: 13.1
- ModelOpt: 0.45.0rc0 (this repository, installed editable)
- transformers: 5.9.0 (supported range >=4.56,<5.10), accelerate 1.13.0
- Model: Qwen/Qwen2.5-1.5B-Instruct; GPU: NVIDIA RTX 6000 Ada Generation (sm_89)

Usage
Testing
Run inside nvcr.io/nvidia/pytorch:26.01-py3 on an RTX 6000 Ada (sm_89):

- … Paris.).
- --sparsity path: the exported ONNX carries 2:4-sparse weight tensors and TensorRT selects structured-sparse INT8 kernels for the projections; engine runs end-to-end (output is degraded without QAT, as documented).
- --compare-baseline: builds and profiles an FP16 baseline against the optimized engine with trtexec.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
Make sure you read and follow the Security Best Practices.
CONTRIBUTING.md: N/A
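The Testing note that the exported ONNX carries 2:4-sparse weight tensors refers to the pattern in which every aligned group of four weights has at least two zeros. A minimal pure-Python sketch of that pattern follows. The magnitude-based selection in `prune_2_4` is an assumption for illustration and may differ from the sparsification method the pipeline uses; both function names are hypothetical, and each list stands in for one row of weights along the pruned dimension.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in every aligned group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = list(row)
    for start in range(0, len(row), 4):
        group = range(start, start + 4)
        # Indices of the two smallest |values| in this group.
        drop = sorted(group, key=lambda i: abs(row[i]))[:2]
        for i in drop:
            out[i] = 0.0
    return out

def is_2_4_sparse(row):
    """True if every aligned group of four has at least two zeros."""
    if len(row) % 4 != 0:
        return False
    return all(sum(1 for v in row[s:s + 4] if v == 0) >= 2
               for s in range(0, len(row), 4))

weights = [0.9, -0.1, 0.05, -1.2, 0.3, 0.0, -0.7, 0.2]
pruned = prune_2_4(weights)
print(pruned)                 # [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.7, 0.0]
print(is_2_4_sparse(pruned))  # True
```

A check like `is_2_4_sparse` could be applied to each row of a weight initializer read from the exported ONNX file to confirm the pattern survived export, assuming the rows are laid out along the dimension that was pruned.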