[Example] Add 2:4 sparsity -> INT8 SmoothQuant PTQ -> ONNX -> TensorRT pipeline #1664

Draft
ajrasane wants to merge 5 commits into main from ajrasane/sparse-quant-trt-example

ajrasane commented Jun 10, 2026

What does this PR do?

Type of change: new example

Adds examples/sparse_quant_trt/ (pipeline.py + README.md), a self-contained end-to-end flow on Qwen2.5-1.5B-Instruct:

[2:4 structured sparsity] -> INT8 W8A8 SmoothQuant PTQ -> [QAT] -> ONNX export (opset 20) -> strongly-typed TensorRT engine (trtexec) -> structured-sparse INT8 kernel validation -> real greedy text generation.
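
For readers unfamiliar with the SmoothQuant step in the flow above, here is a minimal pure-Python sketch of the smoothing idea: per-input-channel scales move activation outliers into the weights without changing the matmul result. It uses the commonly cited formula `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`; the helper names (`smooth_scales`, `apply_smoothing`, `matmul`) and the toy tensors are illustrative assumptions and are not taken from pipeline.py or from ModelOpt.

```python
def smooth_scales(act_absmax, weight, alpha=0.5):
    """Per-input-channel scales: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    scales = []
    for j, a_max in enumerate(act_absmax):
        # Row j of `weight` (shape: in_features x out_features) multiplies input channel j.
        w_max = max(abs(v) for v in weight[j])
        scales.append((a_max ** alpha) / (w_max ** (1.0 - alpha)))
    return scales


def apply_smoothing(x, weight, scales):
    """Divide activations by s and multiply the matching weight rows by s."""
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[v * s for v in row] for row, s in zip(weight, scales)]
    return x_s, w_s


def matmul(x, w):
    """Plain dense matmul for small nested lists."""
    return [
        [sum(row[k] * w[k][n] for k in range(len(w))) for n in range(len(w[0]))]
        for row in x
    ]


# Toy data (hypothetical): input channel 1 carries activation outliers.
x = [[0.5, 40.0], [-1.0, -60.0]]
w = [[0.2, -0.4], [0.01, 0.02]]

act_absmax = [max(abs(row[j]) for row in x) for j in range(len(x[0]))]
scales = smooth_scales(act_absmax, w)
x_s, w_s = apply_smoothing(x, w, scales)

reference = matmul(x, w)
smoothed = matmul(x_s, w_s)
```

After smoothing, the outlier channel has a much smaller activation range, which is what makes per-tensor INT8 activation quantization less lossy, while `smoothed` stays numerically equal to `reference` up to floating-point rounding.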

  • INT8 quantization covers both the linear projections AND the attention math (q/k/v_bmm + softmax).
  • Sparsity (--sparsity) and QAT (--qat) are opt-in and OFF by default. With --sparsity, INT8 output quantizers are auto-enabled so each GEMM is INT8-in/INT8-out — the epilogue condition under which TensorRT actually selects structured-sparse INT8 (SPMMA) kernels.
  • --weights-dtype fp16 (default) exports a native fp16 graph; both fp16 and fp32 build with TensorRT --stronglyTyped.
  • --compare-baseline additionally builds an FP16 (unquantized, dense) engine and profiles both engines with trtexec, using the profiling parameters from modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py (--warmUp/--avgRuns/--iterations/--noDataTransfers/--useCudaGraph/--useSpinWait), and reports the optimized engine's throughput/latency speedup.
  • The built-in QAT loop is a minimal placeholder; integrate your own dataset and training pipeline for real accuracy recovery, which is required after 2:4 sparsification.
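
To make the 2:4 pattern concrete, the sketch below prunes a flat weight row so that every group of four consecutive values keeps at most two non-zeros, then checks the pattern. Magnitude-based selection and the helper names (`prune_2_4`, `is_2_4_sparse`) are assumptions chosen for illustration; the sparsification method actually used by pipeline.py may differ.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda k: abs(group[k]), reverse=True)[:2]
        pruned.extend(v if k in keep else 0.0 for k, v in enumerate(group))
    return pruned


def is_2_4_sparse(row):
    """True if every group of four has at most two non-zero values."""
    return len(row) % 4 == 0 and all(
        sum(1 for v in row[start:start + 4] if v != 0) <= 2
        for start in range(0, len(row), 4)
    )


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]  # toy values
pruned = prune_2_4(weights)
print(pruned)                  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
print(is_2_4_sparse(pruned))   # True
```

Half of the weights are discarded in one shot, which is why accuracy recovery through training is needed afterwards.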

Container + library versions/commits used:

  • Docker container: nvcr.io/nvidia/pytorch:26.01-py3
  • PyTorch 2.10.0a0+a36e1d39eb (git commit a36e1d39eb)
  • ONNX 1.18.0
  • TensorRT 10.14.1.48 (trtexec v101401)
  • CUDA 13.1
  • NVIDIA ModelOpt 0.45.0rc0 (this repository, installed editable)
  • transformers 5.9.0 (supported range >=4.56,<5.10), accelerate 1.13.0
  • Model: Qwen/Qwen2.5-1.5B-Instruct; GPU: NVIDIA RTX 6000 Ada Generation (sm_89)

Usage

# Default: INT8 W8A8 SmoothQuant (linear projections + attention) -> ONNX -> TensorRT -> generation
python examples/sparse_quant_trt/pipeline.py

# Add 2:4 structured sparsity (TensorRT selects sparse INT8 kernels); add --qat to recover accuracy
python examples/sparse_quant_trt/pipeline.py --sparsity [--qat]

# Also build an FP16 (unquantized, dense) baseline engine and profile both with trtexec
python examples/sparse_quant_trt/pipeline.py --compare-baseline
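
As a rough illustration of how the flags in the commands above could fit together, here is a hypothetical `argparse` sketch. Only the flag names, the fp16 default, and the rule that `--sparsity` turns on INT8 output quantizers come from this description; the function names, help strings, and the `quantize_outputs` key are assumptions and do not reflect the actual contents of pipeline.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags named in this PR description."""
    parser = argparse.ArgumentParser(description="Sparse + INT8 -> ONNX -> TensorRT example")
    parser.add_argument("--sparsity", action="store_true",
                        help="apply 2:4 structured sparsity (off by default)")
    parser.add_argument("--qat", action="store_true",
                        help="run the placeholder QAT loop (off by default)")
    parser.add_argument("--weights-dtype", choices=["fp16", "fp32"], default="fp16",
                        help="dtype of the exported graph")
    parser.add_argument("--compare-baseline", action="store_true",
                        help="also build and profile an FP16 dense baseline engine")
    return parser


def resolve_options(argv):
    """Parse argv and derive the settings that depend on other flags."""
    args = build_parser().parse_args(argv)
    return {
        "sparsity": args.sparsity,
        "qat": args.qat,
        "weights_dtype": args.weights_dtype,
        "compare_baseline": args.compare_baseline,
        # Assumption based on the description: --sparsity implies INT8 output
        # quantizers so that each GEMM is INT8-in/INT8-out.
        "quantize_outputs": args.sparsity,
    }


default_opts = resolve_options([])
sparse_opts = resolve_options(["--sparsity", "--qat"])
```

Deriving `quantize_outputs` from `--sparsity` in one place keeps the dependency between the two settings explicit instead of spreading it across the pipeline stages.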

Testing

Run inside nvcr.io/nvidia/pytorch:26.01-py3 on an RTX 6000 Ada (sm_89):

  • Default INT8 path (linear projections + attention): strongly-typed engine builds; greedy generation matches the fp16 reference (e.g. "What is the capital of France? Answer in one word." -> Paris.).
  • --sparsity path: the exported ONNX carries 2:4-sparse weight tensors and TensorRT selects structured-sparse INT8 kernels for the projections; engine runs end-to-end (output is degraded without QAT, as documented).
  • --compare-baseline: builds and profiles an FP16 baseline against the optimized engine with trtexec.
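
The baseline comparison can be pictured with the short sketch below, which assembles a `trtexec` profiling command from the flags listed in this PR and computes a latency speedup. The function names, the default numeric values, and the engine file name are placeholders chosen for illustration; they are not the values used in engine_builder.py or in pipeline.py, and no measured results are implied.

```python
def profile_command(engine_path, warm_up_ms=500, avg_runs=100, iterations=100):
    """Build a trtexec profiling command for an already-built engine.

    The numeric defaults are placeholders, not the project's settings.
    """
    return [
        "trtexec",
        f"--loadEngine={engine_path}",
        f"--warmUp={warm_up_ms}",
        f"--avgRuns={avg_runs}",
        f"--iterations={iterations}",
        "--noDataTransfers",  # exclude host<->device copies from timing
        "--useCudaGraph",     # capture the inference launch in a CUDA graph
        "--useSpinWait",      # spin instead of blocking when synchronizing
    ]


def speedup(baseline_latency_ms, optimized_latency_ms):
    """Latency speedup of the optimized engine over the baseline."""
    if optimized_latency_ms <= 0:
        raise ValueError("optimized latency must be positive")
    return baseline_latency_ms / optimized_latency_ms


cmd = profile_command("model_int8.engine")  # hypothetical file name
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine with TensorRT installed; both engines should be profiled with identical parameters so the ratio is meaningful.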

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices.

  • Is this change backward compatible?: N/A (new example; no APIs changed)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (example script)
  • Did you update Changelog?: N/A (new example)

New example examples/sparse_quant_trt/pipeline.py: an end-to-end flow for
Qwen2.5-1.5B-Instruct that applies optional 2:4 structured sparsity and INT8
W8A8 SmoothQuant PTQ (with optional QAT), exports to ONNX, builds a
strongly-typed TensorRT engine, validates structured-sparse INT8 kernel
selection, and runs greedy text inference. Tested in
nvcr.io/nvidia/pytorch:26.01-py3 (PyTorch 2.10.0a0+a36e1d39eb, ONNX 1.18.0,
TensorRT 10.14.1.48, CUDA 13.1, ModelOpt 0.45.0rc0, transformers 5.9.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

copy-pr-bot Bot commented Jun 10, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai Bot commented Jun 10, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Documents examples/sparse_quant_trt: pipeline overview, tested Docker container and
library versions, setup, usage and key flags, per-stage description, when TensorRT
selects structured-sparse INT8 kernels (Found vs Chose), and the 2:4-sparsity
accuracy caveat (requires QAT/SAT recovery).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

codecov Bot commented Jun 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.58%. Comparing base (d3acf45) to head (dff1aa1).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1664      +/-   ##
==========================================
- Coverage   56.59%   56.58%   -0.02%     
==========================================
  Files         507      507              
  Lines       55794    55794              
==========================================
- Hits        31579    31573       -6     
- Misses      24215    24221       +6     
| Flag | Coverage Δ |
|------|------------|
| unit | 54.40% <ø> (-0.02%) ⬇️ |


ajrasane and others added 3 commits June 10, 2026 00:49
…example

Add --compare-baseline: build an FP16 (unquantized, dense) TensorRT engine from the
same model and profile it against the optimized engine with trtexec, using the same
profiling parameters as modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py
(--warmUp / --avgRuns / --iterations / --noDataTransfers / --useCudaGraph / --useSpinWait).
Reports throughput (qps) and median GPU-compute/latency for both engines plus the
optimized engine's speedup. Documents the flag and method in the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
INT8-quantize the attention BMMs and softmax in addition to the linear projections
by default (--no-quant-attention reverts to a linears-only graph). Update the script
docstring and the example README accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
INT8-quantize the attention BMMs and softmax (q/k/v_bmm + softmax) unconditionally
as part of the default INT8 path, and remove the optional --quant-attention /
--no-quant-attention flag. Update the docstring and README accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>