[Example] Add 2:4 sparsity -> INT8 SmoothQuant PTQ -> ONNX -> TensorRT pipeline by ajrasane · Pull Request #1664 · NVIDIA/Model-Optimizer

ajrasane · 2026-06-10T00:25:46Z

What does this PR do?

Type of change: new example

Adds examples/sparse_quant_trt/ (pipeline.py + README.md), a self-contained end-to-end flow on Qwen2.5-1.5B-Instruct:

[2:4 structured sparsity] -> INT8 W8A8 SmoothQuant PTQ -> [QAT] -> ONNX export (opset 20) -> strongly-typed TensorRT engine (trtexec) -> structured-sparse INT8 kernel validation -> real greedy text generation.
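For readers new to the SmoothQuant stage, the smoothing idea can be sketched in a few lines of plain Python. This follows the published SmoothQuant formulation (per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)) and is not taken from pipeline.py or from ModelOpt. The function names, the toy tensors, and alpha = 0.5 are assumptions for illustration only.

```python
def smooth_scales(act_absmax, weight, alpha=0.5):
    """Per-input-channel scales: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    scales = []
    for j, a_max in enumerate(act_absmax):
        w_max = max(abs(w) for w in weight[j])  # row j holds input channel j
        scales.append((a_max ** alpha) / (w_max ** (1.0 - alpha)))
    return scales


def apply_smoothing(x, weight, scales):
    """Divide activations by s and multiply weight rows by s, so X @ W is unchanged."""
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[w * s for w in row] for row, s in zip(weight, scales)]
    return x_s, w_s


def matmul(a, b):
    """Naive matrix product, used only to check the identity above."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Toy data (hypothetical): input channel 1 carries an activation outlier.
x = [[0.5, 40.0], [-1.0, -64.0]]   # shape (tokens, in_channels)
w = [[0.2, -0.4], [0.1, 0.25]]     # shape (in_channels, out_channels)

act_absmax = [max(abs(row[j]) for row in x) for j in range(len(x[0]))]
scales = smooth_scales(act_absmax, w)
x_s, w_s = apply_smoothing(x, w, scales)
# scales[1] is 16.0, so the outlier channel's peak magnitude drops from 64 to 4
# while matmul(x_s, w_s) matches matmul(x, w) up to floating-point rounding.
```

The point of the rescaling is that the activation range becomes easier to cover with a per-tensor INT8 scale, at the cost of a somewhat wider weight range.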

INT8 quantization covers both the linear projections AND the attention math (q/k/v_bmm + softmax).
Sparsity (--sparsity) and QAT (--qat) are opt-in and OFF by default. With --sparsity, INT8 output quantizers are auto-enabled so each GEMM is INT8-in/INT8-out — the epilogue condition under which TensorRT actually selects structured-sparse INT8 (SPMMA) kernels.
--weights-dtype fp16 (default) exports a native fp16 graph; both fp16 and fp32 build with TensorRT --stronglyTyped.
--compare-baseline additionally builds an FP16 (unquantized, dense) engine and profiles both engines with trtexec, using the profiling parameters from modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py (--warmUp/--avgRuns/--iterations/--noDataTransfers/--useCudaGraph/--useSpinWait), and reports the optimized engine's throughput/latency speedup.
The built-in QAT loop is a minimal placeholder. For real accuracy recovery, which is required after 2:4 sparsification, integrate your own dataset and training pipeline.
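As a reminder of what the 2:4 pattern means, here is a minimal sketch that keeps the two largest-magnitude values in every group of four and zeroes the rest. Magnitude-based selection is an assumption made for illustration; this PR description does not state which selection criterion pipeline.py uses, and `prune_2_of_4` is a hypothetical helper, not a ModelOpt API.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in every contiguous group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes; ties keep the lower index.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.2, -0.3, 0.0]
sparse = prune_2_of_4(weights)
# sparse == [0.9, 0.0, 0.0, -0.7, 0.2, 0.0, -0.3, 0.0]
```

Half of the weights are removed regardless of their importance to the model, which is why accuracy recovery through training is needed afterwards.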

Container + library versions/commits used:

Docker container: nvcr.io/nvidia/pytorch:26.01-py3
PyTorch 2.10.0a0+a36e1d39eb (git commit a36e1d39eb)
ONNX 1.18.0
TensorRT 10.14.1.48 (trtexec v101401)
CUDA 13.1
NVIDIA ModelOpt 0.45.0rc0 (this repository, installed editable)
transformers 5.9.0 (supported range >=4.56,<5.10), accelerate 1.13.0
Model: Qwen/Qwen2.5-1.5B-Instruct; GPU: NVIDIA RTX 6000 Ada Generation (sm_89)

Usage

# Default: INT8 W8A8 SmoothQuant (linear projections + attention) -> ONNX -> TensorRT -> generation
python examples/sparse_quant_trt/pipeline.py

# Add 2:4 structured sparsity (TensorRT selects sparse INT8 kernels); add --qat to recover accuracy
python examples/sparse_quant_trt/pipeline.py --sparsity [--qat]

# Also build an FP16 (unquantized, dense) baseline engine and profile both with trtexec
python examples/sparse_quant_trt/pipeline.py --compare-baseline
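For orientation, the profiling step behind --compare-baseline can be pictured as assembling a trtexec command from the flags listed above and then dividing the baseline latency by the optimized latency. The sketch below is an assumption about the shape of that step, not the code in pipeline.py: the function names, the numeric defaults, and the engine file name are hypothetical placeholders, and `--loadEngine` is a standard trtexec option that the PR text does not mention explicitly.

```python
def trtexec_profile_cmd(engine_path, warm_up_ms=500, avg_runs=100, iterations=100):
    """Build a trtexec argument list for profiling an already-built engine.

    The numeric defaults are placeholders, not the values used by
    engine_builder.py or pipeline.py.
    """
    return [
        "trtexec",
        f"--loadEngine={engine_path}",
        f"--warmUp={warm_up_ms}",
        f"--avgRuns={avg_runs}",
        f"--iterations={iterations}",
        "--noDataTransfers",   # exclude host<->device copies from timing
        "--useCudaGraph",      # capture and replay inference with a CUDA graph
        "--useSpinWait",       # spin-wait on the GPU for more stable timings
    ]


def speedup(baseline_latency_ms, optimized_latency_ms):
    """Latency speedup of the optimized engine relative to the baseline."""
    return baseline_latency_ms / optimized_latency_ms


# Hypothetical engine path; pass the list to subprocess.run(...) to execute it.
cmd = trtexec_profile_cmd("model_int8.engine")
```

Building the command as a list rather than a single string avoids shell quoting issues when it is handed to `subprocess.run`.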

Testing

Run inside nvcr.io/nvidia/pytorch:26.01-py3 on an RTX 6000 Ada (sm_89):

Default INT8 path (linear projections + attention): strongly-typed engine builds; greedy generation matches the fp16 reference (e.g. "What is the capital of France? Answer in one word." -> Paris.).
--sparsity path: the exported ONNX carries 2:4-sparse weight tensors and TensorRT selects structured-sparse INT8 kernels for the projections; engine runs end-to-end (output is degraded without QAT, as documented).
--compare-baseline: builds and profiles an FP16 baseline against the optimized engine with trtexec.
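The claim that the exported weights are 2:4-sparse can be spot-checked independently of TensorRT. The following is a minimal sketch of such a check, assuming the weight has already been loaded into nested Python lists with each row laid out along the reduction (input) dimension; `is_2_to_4_sparse` is a hypothetical helper and not part of the example script.

```python
def is_2_to_4_sparse(matrix):
    """Return True if every contiguous group of four in each row has at most two nonzeros."""
    for row in matrix:
        if len(row) % 4 != 0:
            return False  # the pattern is only defined for multiples of four
        for start in range(0, len(row), 4):
            nonzeros = sum(1 for v in row[start:start + 4] if v != 0)
            if nonzeros > 2:
                return False
    return True


dense_row = [[0.9, -0.1, 0.05, -0.7]]
sparse_row = [[0.9, 0.0, 0.0, -0.7]]
# is_2_to_4_sparse(sparse_row) is True; is_2_to_4_sparse(dense_row) is False
```

A passing check only shows that the weights are eligible for sparse kernels; whether TensorRT actually chooses them still has to be read from the build log, as the --sparsity item above describes.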

Before your PR is "Ready for review"

Make sure you read and follow the Contributor guidelines and that your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices.

Is this change backward compatible?: N/A (new example; no APIs changed)
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: N/A (example script)
Did you update Changelog?: N/A (new example)

New example examples/sparse_quant_trt/pipeline.py: an end-to-end flow for Qwen2.5-1.5B-Instruct that applies optional 2:4 structured sparsity and INT8 W8A8 SmoothQuant PTQ (with optional QAT), exports to ONNX, builds a strongly-typed TensorRT engine, validates structured-sparse INT8 kernel selection, and runs greedy text inference. Tested in nvcr.io/nvidia/pytorch:26.01-py3 (PyTorch 2.10.0a0+a36e1d39eb, ONNX 1.18.0, TensorRT 10.14.1.48, CUDA 13.1, ModelOpt 0.45.0rc0, transformers 5.9.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

copy-pr-bot · 2026-06-10T00:25:49Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-10T00:25:53Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 16af6479-36d5-4fd5-8dfc-23bec679e10f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Documents examples/sparse_quant_trt: pipeline overview, tested Docker container and library versions, setup, usage and key flags, per-stage description, when TensorRT selects structured-sparse INT8 kernels (Found vs Chose), and the 2:4-sparsity accuracy caveat (requires QAT/SAT recovery).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

codecov · 2026-06-10T00:40:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.58%. Comparing base (d3acf45) to head (dff1aa1).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1664      +/-   ##
==========================================
- Coverage   56.59%   56.58%   -0.02%     
==========================================
  Files         507      507              
  Lines       55794    55794              
==========================================
- Hits        31579    31573       -6     
- Misses      24215    24221       +6

| Flag | Coverage Δ |
|------|------------|
| unit | `54.40% <ø> (-0.02%)` ⬇️ |


…example

Add --compare-baseline: build an FP16 (unquantized, dense) TensorRT engine from the same model and profile it against the optimized engine with trtexec, using the same profiling parameters as modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py (--warmUp / --avgRuns / --iterations / --noDataTransfers / --useCudaGraph / --useSpinWait). Reports throughput (qps) and median GPU-compute/latency for both engines plus the optimized engine's speedup. Documents the flag and method in the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

INT8-quantize the attention BMMs and softmax in addition to the linear projections by default (--no-quant-attention reverts to a linears-only graph). Update the script docstring and the example README accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

INT8-quantize the attention BMMs and softmax (q/k/v_bmm + softmax) unconditionally as part of the default INT8 path, and remove the optional --quant-attention / --no-quant-attention flag. Update the docstring and README accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

ajrasane and others added 3 commits June 10, 2026 00:49

ajrasane wants to merge 5 commits into main from ajrasane/sparse-quant-trt-example
