Torch PTQ ONNX export example for Windows + small fixes to existing torch ONNX export path #1027
hthadicherla wants to merge 4 commits into main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.

No actionable comments were generated in the recent review. 🎉

🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

A new Windows-optimized LLM-to-ONNX export pipeline is introduced with NVFP4 quantization support, alongside infrastructure updates enabling INT8 SmoothQuant, dynamic onnx_quantizer_type parameter propagation, improved Cast node handling, and DynamicCache-based KV-cache management for export workflows.
Sequence Diagram(s)

sequenceDiagram
actor User as User/CLI
participant Args as llm_arguments<br/>(Parser)
participant Config as get_config_path<br/>(Config Resolver)
participant Loader as ModelLoader
participant Quant as Quantizer
participant Export as export_raw_llm<br/>(ONNX Export)
participant Surgery as surgeon_llm<br/>(Graph Surgery)
participant Output as Output Files
User->>Args: Parse CLI args (dtype, model_path, etc.)
Args->>User: Return parsed arguments
User->>Config: Resolve config.json location
Config->>User: Return config path
User->>Loader: Load HF model (if hf_model_path)
Loader->>User: Return model instance
User->>Quant: Quantize model (FP8/INT4/INT8/NVFP4)
Quant->>User: Return quantized model
User->>Export: Export to raw ONNX
Export->>Export: Apply LLM export (fp16 or quantized)
Export->>User: Return raw ONNX path
User->>Surgery: Apply graph surgery (dtype fixes, GQA, opset updates)
Surgery->>Surgery: Quantize weights to NVFP4<br/>(if NVFP4 mode)
Surgery->>Surgery: Apply GQA surgery<br/>(if hf_model_path provided)
Surgery->>Surgery: Fix logits shape & external data
Surgery->>User: Return optimized ONNX path
User->>Output: Save config.json alongside ONNX
Output->>User: Export complete
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)
✅ Passed checks (3 passed)
…e in windows and changed torch and onnx export related files which were broken Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
6180e38 to 0b83d06
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/tensor_quant.py (1)
405-408: ⚠️ Potential issue | 🔴 Critical

Backward gradient count mismatch after adding `onnx_quantizer_type`.

The `forward` method now accepts 11 input arguments (excluding `ctx`), but `backward` returns `num_args=10`. This will cause the gradient tuple to have insufficient elements. The count should be updated to 11.

🐛 Proposed fix
```diff
 @staticmethod
 def backward(ctx, grad_outputs):
     """Implements straight through estimation with clipping."""
-    return _fake_quant_backward_function(ctx, grad_outputs, num_args=10)
+    return _fake_quant_backward_function(ctx, grad_outputs, num_args=11)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/tensor_quant.py` around lines 405 - 408, The backward implementation of the custom autograd Function (method backward) still calls _fake_quant_backward_function with num_args=10 although forward now accepts an additional onnx_quantizer_type argument (total 11 inputs excluding ctx); update the backward call in backward(ctx, grad_outputs) to pass num_args=11 so the returned gradient tuple matches the forward inputs (reference symbols: backward, forward, _fake_quant_backward_function, onnx_quantizer_type).
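The invariant behind this finding can be sketched without the real quantizer: `torch.autograd.Function.backward` must return exactly one gradient slot per `forward` input (excluding `ctx`), with `None` for non-differentiable arguments. The helper below is illustrative only — it mirrors the shape of `_fake_quant_backward_function` but is not the real implementation:

```python
def fake_quant_backward(grad_output, num_args):
    """Illustrative stand-in for _fake_quant_backward_function:
    straight-through gradient for the first (tensor) input, None for every
    config-only argument. The tuple length must equal the number of
    forward inputs, or autograd rejects it."""
    return (grad_output,) + (None,) * (num_args - 1)

# forward grew from 10 to 11 inputs when onnx_quantizer_type was added,
# so backward must now produce an 11-element tuple:
grads = fake_quant_backward("dL/dx", num_args=11)
```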
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/windows/torch_onnx/llm_export/llm_export.py`:
- Around line 740-746: The current ONNX-only branch renames the external data
file (pre_gqa_data -> model.onnx_data) but does not update the ONNX protobuf
(pre_gqa_onnx/final_onnx), so the model still references the old filename; to
fix, either keep the external data filename unchanged (stop renaming
output_dir/pre_gqa_data) when you rename pre_gqa_onnx -> final_onnx, or load the
protobuf at pre_gqa_onnx and re-save it to final_onnx while updating the
external data location to the new name (so the external_data_location in the
model matches the renamed file); locate the branch handling pre_gqa_onnx,
final_onnx, output_dir and pre_gqa_data and apply one of these two fixes.
- Around line 726-736: The call to replace_attention_with_gqa(...) is hardcoding
max_seq_len=4096; instead read the model config (e.g., load the HuggingFace
config from hf_model_path or the extracted config.json) and derive the
context/window length (e.g., config.max_position_embeddings,
config.context_length, or config.max_seq_len depending on model family) and pass
that value as max_seq_len to replace_attention_with_gqa; update the invocation
that uses pre_gqa_onnx, final_onnx and hf_model_path to compute max_seq_len
before the call and fall back to a sensible default if the config field is
missing.
- Around line 291-322: get_config_path may return None but downstream code calls
shutil.copy(config_path, ...) and os.path.exists(config_path) unconditionally;
update the callers to handle a None config_path (or alternatively make
get_config_path raise a clear FileNotFoundError). Specifically, either (A)
change get_config_path to raise a descriptive exception when no config is found
so callers can fail-fast, or (B) more minimally, guard every use of config_path
(the shutil.copy(config_path, ...) calls and any os.path.exists(config_path)
checks) with an explicit if config_path is not None: ... else: log a warning and
skip the copy/check so ONNX-only runs without a colocated config.json degrade
gracefully. Ensure references to get_config_path and the
shutil.copy(config_path, ...) and os.path.exists(config_path) sites are updated
accordingly.
- Around line 786-809: The current logic overwrites a provided --onnx_path
because the hf_model_path branch always re-exports; change the precedence so
that if args.onnx_path is set you do not re-export. Concretely, keep assigning
raw_onnx_path from args.onnx_path when present, and guard the
ModelLoader/load_model and export_raw_llm calls behind "if args.hf_model_path
and not args.onnx_path" (or equivalent) so ModelLoader, model =
model_loader.load_model(...) and export_raw_llm(...) only run when no onnx_path
was supplied; refer to symbols args.onnx_path, args.hf_model_path,
raw_onnx_path, ModelLoader, load_model, and export_raw_llm to locate and update
the conditional logic.
- Around line 387-423: Pre-quantized local models skip ONNX/checkpoint export
because the block guarded by model_needs_quantization (computed from
modelopt_state) contains llm_to_onnx and export_hf_checkpoint; move or duplicate
the export steps so they run for already-quantized models too. Concretely:
adjust the control flow around model_needs_quantization/modelopt_state so that
after loading or skipping quantize you still call llm_to_onnx (when dtype in
{"fp8","int4_awq","int8_sq","nvfp4"} or otherwise needed) and run
export_hf_checkpoint into quantized_model_dir; keep existing calls to
quantize(), _override_trt_high_precision_dtype, and the dtype-specific Linear
handling (preserve symbols quantize, _override_trt_high_precision_dtype,
llm_to_onnx, export_hf_checkpoint, quantized_model_dir, output_dir) but ensure
exports are performed even when model_needs_quantization is False so
surgeon_llm/main can find output_dir/model.onnx.
- Around line 457-459: infer_shapes_path() is being called with only
raw_onnx_path which mutates the source file; change the call to write to a
separate temp file and load that instead (e.g., create a temporary path like
temp_inferred_path = tempfile.NamedTemporaryFile(suffix=".onnx",
delete=False).name or construct raw_onnx_path + ".inferred.onnx"), call
onnx.shape_inference.infer_shapes_path(raw_onnx_path,
output_path=temp_inferred_path), and then pass temp_inferred_path to
gs.import_onnx(onnx.load(...)) so the original raw_onnx_path (and the saved
original in {output_dir}_raw/) are not modified.
In `@modelopt/onnx/llm_export_utils/export_utils.py`:
- Around line 79-118: The monkey-patches (DynamicLayer.update,
transformers.masking_utils.create_causal_mask, per-model create_causal_mask, and
sdpa_mod.use_gqa_in_sdpa) are applied at import time and never restored; move
these mutations into the export wrapper's forward() method and restore originals
in a finally block. Specifically, in the class that implements forward() (the
exporter wrapper), capture the original symbols (DynamicLayer.update,
transformers.masking_utils.create_causal_mask, the model-specific
create_causal_mask in transformers.models.{model_type}.modeling_{model_type},
and transformers.integrations.sdpa_attention.use_gqa_in_sdpa), apply the patched
lambdas/ functions at the start of forward(), run the export logic, and always
reassign the originals back inside a finally clause so the global state is not
permanently mutated. Ensure you reference and patch the same symbols shown
(DynamicLayer.update, create_causal_mask, and sdpa_mod.use_gqa_in_sdpa) and
handle ImportError/ModuleNotFoundError when restoring the model-specific
create_causal_mask just as in the original diff.
In `@modelopt/torch/quantization/export_onnx.py`:
- Around line 203-206: The code is casting `inputs` instead of the variable
returned (`out`), so the cast is dead; update the block in export_onnx.py to
cast `out` (use g.op("Cast", out, to_i=onnx_dtype_map[input_type])) when
trt_high_precision_dtype != input_type, or if the cast is unnecessary remove the
entire if-block; ensure references are to `out`, `input_type`,
`trt_high_precision_dtype`, `onnx_dtype_map`, and the g.op("Cast") call so the
returned tensor has the intended dtype.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cf6b1f93-b833-49d6-97b1-94f01c4b66de
📒 Files selected for processing (10)

- examples/windows/torch_onnx/llm_export/README.md
- examples/windows/torch_onnx/llm_export/llm_export.py
- examples/windows/torch_onnx/llm_export/requirements.txt
- modelopt/onnx/export/int4_exporter.py
- modelopt/onnx/graph_surgery/__init__.py
- modelopt/onnx/llm_export_utils/export_utils.py
- modelopt/onnx/llm_export_utils/quantization_utils.py
- modelopt/torch/quantization/export_onnx.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/tensor_quant.py
```python
def get_config_path(args):
    """Get config.json file path from the arguments.

    The default priority is: config_path > hf_model_path/config.json > onnx_path/../config.json
    """
    if args.config_path and os.path.exists(args.config_path):
        return args.config_path
    if args.hf_model_path:
        if os.path.isdir(args.hf_model_path):
            torch_config = os.path.join(args.hf_model_path, "config.json")
            if os.path.exists(torch_config):
                return torch_config
        else:
            try:
                config = AutoConfig.from_pretrained(
                    args.hf_model_path, trust_remote_code=args.trust_remote_code
                )
                temp_config_path = os.path.join(
                    tempfile.gettempdir(), f"config_{args.hf_model_path.replace('/', '_')}.json"
                )
                with open(temp_config_path, "w") as f:
                    json.dump(config.to_dict(), f, indent=2)
                return temp_config_path
            except Exception as e:
                print(f"Warning: Could not download config for {args.hf_model_path}: {e}")

    if args.onnx_path:
        onnx_config = os.path.join(os.path.dirname(args.onnx_path), "config.json")
        if os.path.exists(onnx_config):
            return onnx_config
    print("Warning: cannot find config.json. Please pass in --config_path.")
    return None
```
config_path is optional in the CLI, but not in the implementation.
get_config_path() returns None on this warning path, yet Line 381 and Line 418 unconditionally shutil.copy(config_path, ...), and Line 706 later calls os.path.exists(config_path). An ONNX-only run without a colocated config.json, or a model directory missing that file, will crash instead of degrading gracefully. Either fail fast here or make the downstream copy/check paths handle a missing config explicitly.
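Option (B) from the comment can be sketched as a small guard around the copy sites. This is a hypothetical helper, not a function in the example script:

```python
import os
import shutil

def copy_config_if_present(config_path, output_dir):
    """Guarded variant of the shutil.copy(config_path, ...) call sites:
    degrade gracefully instead of crashing when get_config_path()
    returned None or points at a missing file."""
    if config_path is None or not os.path.exists(config_path):
        print("Warning: no config.json available; skipping copy. "
              "Pass --config_path if later steps need it.")
        return None
    dest = os.path.join(output_dir, "config.json")
    shutil.copy(config_path, dest)
    return dest
```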
```python
    if os.path.isdir(hf_model_path):
        modelopt_state = os.path.join(hf_model_path, "modelopt_state.pth")
        model_needs_quantization = not os.path.exists(modelopt_state)
    else:
        model_needs_quantization = True

    if model_needs_quantization:
        model = quantize(
            model, tokenizer, dtype, lm_head_precision, dataset_dir, calib_size=calib_size
        )

        _override_trt_high_precision_dtype(model, "Half")

        if dtype == "nvfp4":
            for module in model.modules():
                assert not isinstance(module, torch.nn.Linear) or is_quantized_linear(module)
                if isinstance(module, torch.nn.Linear):
                    module.input_quantizer._trt_high_precision_dtype = "Half"
                    module.input_quantizer._onnx_quantizer_type = "dynamic"
                    module.weight_quantizer._onnx_quantizer_type = "static"

        if dtype in {"fp8", "int4_awq", "int8_sq", "nvfp4"}:
            print(f"Exporting {dtype} ONNX model from quantized PyTorch model...")
            llm_to_onnx(
                wrapper_cls(
                    model,
                ),
                output_dir,
                extra_inputs=extra_inputs,
                extra_dyn_axes=extra_dyn_axes,
            )
            shutil.copy(config_path, os.path.join(output_dir, "config.json"))

        quantized_model_dir = f"{output_dir}_{dtype}_quantized"
        os.makedirs(quantized_model_dir, exist_ok=True)
        with torch.inference_mode():
            export_hf_checkpoint(model, dtype=torch.float16, export_dir=quantized_model_dir)
```
Already-quantized local models never produce model.onnx.
When modelopt_state.pth exists, model_needs_quantization is false and this block skips both llm_to_onnx() and export_hf_checkpoint(). main() still points surgeon_llm() at <output_dir>/model.onnx, so the pre-quantized checkpoint path fails before surgery.
♻️ Suggested control-flow fix

```diff
     if model_needs_quantization:
         model = quantize(
             model, tokenizer, dtype, lm_head_precision, dataset_dir, calib_size=calib_size
         )
         _override_trt_high_precision_dtype(model, "Half")
         if dtype == "nvfp4":
             for module in model.modules():
                 assert not isinstance(module, torch.nn.Linear) or is_quantized_linear(module)
                 if isinstance(module, torch.nn.Linear):
                     module.input_quantizer._trt_high_precision_dtype = "Half"
                     module.input_quantizer._onnx_quantizer_type = "dynamic"
                     module.weight_quantizer._onnx_quantizer_type = "static"
-        if dtype in {"fp8", "int4_awq", "int8_sq", "nvfp4"}:
-            print(f"Exporting {dtype} ONNX model from quantized PyTorch model...")
-            llm_to_onnx(
-                wrapper_cls(
-                    model,
-                ),
-                output_dir,
-                extra_inputs=extra_inputs,
-                extra_dyn_axes=extra_dyn_axes,
-            )
-            shutil.copy(config_path, os.path.join(output_dir, "config.json"))
-
+    print(f"Exporting {dtype} ONNX model from quantized PyTorch model...")
+    llm_to_onnx(
+        wrapper_cls(model),
+        output_dir,
+        extra_inputs=extra_inputs,
+        extra_dyn_axes=extra_dyn_axes,
+    )
+    if config_path:
+        shutil.copy(config_path, os.path.join(output_dir, "config.json"))
+
+    if model_needs_quantization:
         quantized_model_dir = f"{output_dir}_{dtype}_quantized"
         os.makedirs(quantized_model_dir, exist_ok=True)
         with torch.inference_mode():
             export_hf_checkpoint(model, dtype=torch.float16, export_dir=quantized_model_dir)
```
```python
    t0 = time.time()
    onnx.shape_inference.infer_shapes_path(raw_onnx_path)
    graph = gs.import_onnx(onnx.load(raw_onnx_path))
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
# Locate and examine the file
fd "llm_export.py" examples/ --type f
wc -l examples/windows/torch_onnx/llm_export/llm_export.py
# Read the relevant section and surrounding context (lines 450-470)
sed -n '450,470p' examples/windows/torch_onnx/llm_export/llm_export.py
# Search for --save_original and --onnx_path to understand the intent
rg "save_original|onnx_path" examples/windows/torch_onnx/llm_export/llm_export.py -A 2 -B 2
# Check what happens with the output after surgery - where is it exported?
sed -n '450,480p' examples/windows/torch_onnx/llm_export/llm_export.py
# Look for where the final ONNX is exported/saved
sed -n '460,500p' examples/windows/torch_onnx/llm_export/llm_export.py
```
🌐 Web query:
ONNX shape_inference.infer_shapes_path output_path parameter behavior documentation
💡 Result:
In onnx.shape_inference.infer_shapes_path(model_path, output_path=...), output_path defaults to the empty string "". If you leave it as "", ONNX replaces it with model_path internally (if output_path == "": output_path = model_path) and then writes the inferred model to that path—i.e., it overwrites the input file. [1]
The API docs describe the same behavior: it “outputs the inferred model to the output_path” and “the original model path is used if not specified.” [2]
Sources
[1] ONNX 1.22.0 onnx.shape_inference source code (infer_shapes_path)
[2] ONNX 1.22.0 onnx.shape_inference.infer_shapes_path API documentation
Call infer_shapes_path() with an explicit output_path to avoid mutating the input file.
onnx.shape_inference.infer_shapes_path() without an output_path parameter overwrites the input file in-place. This violates the semantics of --save_original (the file in {output_dir}_raw/ should remain the original export, not the shape-inferred version) and modifies user-supplied files when --onnx_path is used, which users do not expect.
Specify a temporary or separate output path for shape inference, or use an alternative approach that does not mutate the source.
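A minimal sketch of the suggested fix. The helper names (`inferred_copy_path`, `load_with_inferred_shapes`) are illustrative, not existing functions in the script, and the heavy imports are deferred so only the path logic runs without `onnx` installed:

```python
import os

def inferred_copy_path(raw_onnx_path):
    """Sibling path for the shape-inferred copy, leaving the original export
    (or a user-supplied --onnx_path file) untouched."""
    root, ext = os.path.splitext(raw_onnx_path)
    return f"{root}.inferred{ext}"

def load_with_inferred_shapes(raw_onnx_path):
    # Deferred imports: onnx and onnx-graphsurgeon are only needed at run time.
    import onnx
    import onnx_graphsurgeon as gs

    out_path = inferred_copy_path(raw_onnx_path)
    # Passing output_path explicitly avoids the documented in-place overwrite
    # that happens when it defaults to "".
    onnx.shape_inference.infer_shapes_path(raw_onnx_path, output_path=out_path)
    return gs.import_onnx(onnx.load(out_path))
```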
```python
    replace_attention_with_gqa(
        model_path=pre_gqa_onnx,
        output_path=final_onnx,
        hf_model_id=hf_model_path,
        max_seq_len=4096,
        io_dtype="float16",
        use_external_data=True,
        external_data_name="model.onnx_data",
        ir_version=10,
        trust_remote_code=trust_remote_code,
    )
```
Don’t hardcode max_seq_len=4096 for every model.
replace_attention_with_gqa() takes max_seq_len because the rewritten attention graph depends on model context length. Baking 4096 here will undersize models whose config advertises a different window and can misbuild the RoPE/cache state in the exported graph. Please derive this from config.json instead.
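One way to derive the value from the colocated config.json, as the comment suggests. The helper is a hypothetical sketch; the probed field names are the common ones and vary by model family:

```python
import json

def max_seq_len_from_config(config_path, default=4096):
    """Probe the common context-length fields in a HuggingFace-style
    config.json and fall back to a default when none is present."""
    with open(config_path) as f:
        cfg = json.load(f)
    for key in ("max_position_embeddings", "context_length", "max_seq_len"):
        value = cfg.get(key)
        if isinstance(value, int) and value > 0:
            return value
    return default
```

The result would then replace the literal `4096` in the `replace_attention_with_gqa(...)` call.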
```python
    if args.onnx_path:
        raw_onnx_path = args.onnx_path

    model_loader = ModelLoader(args.hf_model_path, args.config_path)

    if args.hf_model_path:
        model = model_loader.load_model(trust_remote_code=args.trust_remote_code)
        onnx_dir = args.output_dir + "_raw" if args.save_original else args.output_dir
        raw_onnx_path = f"{onnx_dir}/model.onnx"
        extra_inputs, extra_dyn_axes = {}, {}
        export_raw_llm(
            model=model,
            output_dir=onnx_dir,
            dtype=args.dtype,
            config_path=args.config_path,
            hf_model_path=args.hf_model_path,
            lm_head_precision=args.lm_head,
            dataset_dir=args.dataset_dir,
            wrapper_cls=WrapperModelForCausalLM,
            extra_inputs=extra_inputs,
            extra_dyn_axes=extra_dyn_axes,
            calib_size=args.calib_size,
            trust_remote_code=args.trust_remote_code,
        )
```
--onnx_path does not actually skip export.
The help text says this flag should reuse an existing ONNX, but as soon as hf_model_path is also present this branch loads the HF model, re-exports model.onnx, and overwrites raw_onnx_path. That makes it impossible to reuse an existing ONNX while still passing hf_model_path only for GQA metadata/config.
🐛 Suggested precedence fix

```diff
-    if args.onnx_path:
-        raw_onnx_path = args.onnx_path
-
-    model_loader = ModelLoader(args.hf_model_path, args.config_path)
-
-    if args.hf_model_path:
+    if args.onnx_path:
+        raw_onnx_path = args.onnx_path
+    elif args.hf_model_path:
+        model_loader = ModelLoader(args.hf_model_path, args.config_path)
         model = model_loader.load_model(trust_remote_code=args.trust_remote_code)
         onnx_dir = args.output_dir + "_raw" if args.save_original else args.output_dir
         raw_onnx_path = f"{onnx_dir}/model.onnx"
         extra_inputs, extra_dyn_axes = {}, {}
         export_raw_llm(
```
```python
# Patch DynamicLayer.lazy_initialization so it does NOT create empty
# tensors (which torch.jit.trace bakes as constants). Instead, set
# keys/values to None; the update() cat path handles the rest.
from transformers.cache_utils import DynamicLayer

def _patched_update(self_layer, key_states, value_states, cache_kwargs=None):
    if not self_layer.is_initialized:
        self_layer.dtype = key_states.dtype
        self_layer.device = key_states.device
        self_layer.is_initialized = True
        self_layer.keys = key_states
        self_layer.values = value_states
        return self_layer.keys, self_layer.values
    self_layer.keys = torch.cat([self_layer.keys, key_states], dim=-2)
    self_layer.values = torch.cat([self_layer.values, value_states], dim=-2)
    return self_layer.keys, self_layer.values

DynamicLayer.update = _patched_update

# Monkey-patch create_causal_mask to return None during export.
# This avoids baking mask shapes as constants during JIT tracing.
# SDPA uses is_causal=True internally so the explicit mask is unnecessary.
import importlib

import transformers.masking_utils

setattr(transformers.masking_utils, "create_causal_mask", lambda *args, **kwargs: None)
model_type = getattr(self.config, "model_type", "llama")
try:
    mod = importlib.import_module(f"transformers.models.{model_type}.modeling_{model_type}")
    setattr(mod, "create_causal_mask", lambda *args, **kwargs: None)
except (ImportError, ModuleNotFoundError):
    pass

# Force use_gqa_in_sdpa to return False so SDPA does manual repeat_kv
# instead of using enable_gqa=True (which torch.onnx.export doesn't support).
# With attention_mask=None and enable_gqa=False, SDPA uses is_causal=True.
import transformers.integrations.sdpa_attention as sdpa_mod

sdpa_mod.use_gqa_in_sdpa = lambda *args, **kwargs: False
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
# See the full context of the file
cat -n modelopt/onnx/llm_export_utils/export_utils.py | head -200
# Also check the referenced pattern file for comparison
head -250 modelopt/torch/quantization/plugins/huggingface.py | tail -100
# Check for any cleanup mechanism in the WrapperModelForCausalLM class
grep -A 50 "class WrapperModelForCausalLM" modelopt/onnx/llm_export_utils/export_utils.py | head -80
```
Move these monkey patches into forward() and restore them in a finally block.
Lines 96, 105, 109, and 118 mutate transformers globals from __init__, and no restoration mechanism exists. This causes the patches to persist indefinitely at the process level—once one wrapper is instantiated, all subsequent inference/export in the same process uses the patched behavior, and if export fails after construction, the process remains permanently patched. Move these patches into the forward() method and use try/finally to restore the originals, following the pattern already established in modelopt/torch/quantization/plugins/huggingface.py (lines 166–214).
```python
        # Cast the output back if needed.
        if trt_high_precision_dtype != input_type:
            inputs = g.op("Cast", inputs, to_i=onnx_dtype_map[input_type])
```
Dead code: casting inputs instead of out has no effect.
The comment says "Cast the output back if needed" but the code casts inputs (line 205), not out. Since out is what's returned (line 207), this cast has no effect. Either remove this dead code or fix it to cast out if the intent was to restore the original dtype.
🔧 Option 1: Remove dead code

```diff
-    # Cast the output back if needed.
-    if trt_high_precision_dtype != input_type:
-        inputs = g.op("Cast", inputs, to_i=onnx_dtype_map[input_type])
-
     return out
```

🔧 Option 2: Cast output if needed

```diff
     # Cast the output back if needed.
     if trt_high_precision_dtype != input_type:
-        inputs = g.op("Cast", inputs, to_i=onnx_dtype_map[input_type])
+        out = g.op("Cast", out, to_i=onnx_dtype_map[input_type])
     return out
```
return out📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Current code:

```python
# Cast the output back if needed.
if trt_high_precision_dtype != input_type:
    inputs = g.op("Cast", inputs, to_i=onnx_dtype_map[input_type])
return out
```

Suggested change:

```python
# Cast the output back if needed.
if trt_high_precision_dtype != input_type:
    out = g.op("Cast", out, to_i=onnx_dtype_map[input_type])
return out
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/export_onnx.py` around lines 203 - 206, The code
is casting `inputs` instead of the variable returned (`out`), so the cast is
dead; update the block in export_onnx.py to cast `out` (use g.op("Cast", out,
to_i=onnx_dtype_map[input_type])) when trt_high_precision_dtype != input_type,
or if the cast is unnecessary remove the entire if-block; ensure references are
to `out`, `input_type`, `trt_high_precision_dtype`, `onnx_dtype_map`, and the
g.op("Cast") call so the returned tensor has the intended dtype.
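The point of Option 2 can be exercised with a tiny stand-in for the symbolic graph object. `FakeGraph`, `quantize_symbolic`, and the two-entry dtype map below are illustrative stubs, not the real `torch.onnx` machinery:

```python
# Illustrative stub: the key detail is that the *returned* tensor, not
# the input, must receive the dtype-restoring Cast.
onnx_dtype_map = {"Float": 1, "Half": 10}

class FakeGraph:
    """Records ops the way g.op(...) would create ONNX nodes."""
    def __init__(self):
        self.ops = []
    def op(self, op_type, *args, **kwargs):
        node = (op_type, kwargs)
        self.ops.append(node)
        return node

def quantize_symbolic(g, inputs, input_type, trt_high_precision_dtype):
    out = g.op("FakeQuant")  # stands in for the QDQ subgraph
    # Cast the output back if needed -- cast `out`, not `inputs`.
    if trt_high_precision_dtype != input_type:
        out = g.op("Cast", out, to_i=onnx_dtype_map[input_type])
    return out

g = FakeGraph()
result = quantize_symbolic(g, "x", input_type="Float",
                           trt_high_precision_dtype="Half")
print(result[0])  # the returned node is the Cast, so dtype is restored
```

With the original (buggy) version, the returned node would still be the `FakeQuant` node and the Cast would dangle unused.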
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/quantization/tensor_quant.py (1)
404-408: ⚠️ Potential issue | 🟠 Major

`backward` returns incorrect `num_args` after adding `onnx_quantizer_type`.

The `forward` method now has 11 parameters after `ctx` (inputs, amax, bias, num_bits, unsigned, narrow_range, trt_high_precision_dtype, pass_through_bwd, block_size, axis, onnx_quantizer_type), but `backward` returns `num_args=10`. This mismatch will cause incorrect gradient propagation.

🐛 Proposed fix

```diff
 @staticmethod
 def backward(ctx, grad_outputs):
     """Implements straight through estimation with clipping."""
-    return _fake_quant_backward_function(ctx, grad_outputs, num_args=10)
+    return _fake_quant_backward_function(ctx, grad_outputs, num_args=11)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/tensor_quant.py` around lines 404 - 408, The backward staticmethod is calling _fake_quant_backward_function with num_args=10 but forward now accepts 11 args after ctx (inputs, amax, bias, num_bits, unsigned, narrow_range, trt_high_precision_dtype, pass_through_bwd, block_size, axis, onnx_quantizer_type); update the call in backward to use num_args=11 (or compute it dynamically from the forward signature) so the number of saved tensor/argument slots passed to _fake_quant_backward_function matches forward; modify the call to _fake_quant_backward_function(ctx, grad_outputs, num_args=11) in the backward method.

modelopt/onnx/llm_export_utils/quantization_utils.py (1)
113-114: ⚠️ Potential issue | 🔴 Critical

`# nosec B105` comment violates coding guidelines.

Per the coding guidelines, any use of `# nosec` comments to bypass Bandit security checks is not allowed. If this security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved by @NVIDIA/modelopt-setup-codeowners with an explicit justification in the PR description.

The comparison `tokenizer.pad_token != "<unk>"` triggers B105 (hardcoded password string), but since this is clearly not a password but a token comparison, consider refactoring to avoid the Bandit false positive:

🔧 Suggested refactor to avoid nosec

```diff
-    if tokenizer.pad_token != "<unk>":  # nosec B105
+    UNK_TOKEN = "<unk>"
+    if tokenizer.pad_token != UNK_TOKEN:
         tokenizer.pad_token = tokenizer.eos_token
```

As per coding guidelines: "Any use of '# nosec' comments to bypass Bandit security checks is not allowed."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/llm_export_utils/quantization_utils.py` around lines 113 - 114, Remove the '# nosec B105' bypass and eliminate the hardcoded string comparison by replacing the literal "<unk>" with a well-named constant or tokenizer attribute; specifically update the conditional that references tokenizer.pad_token (in quantization_utils.py) to compare against a constant like UNK_TOKEN or, better, against tokenizer.unk_token if that attribute exists, then set tokenizer.pad_token = tokenizer.eos_token as before—this removes the Bandit false positive without suppressing the check and preserves the existing behavior in the block that assigns tokenizer.pad_token from tokenizer.eos_token.
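The prompt's stronger suggestion — compare against the tokenizer's own `unk_token` attribute rather than any literal — can be sketched with a minimal stand-in tokenizer. `StubTokenizer` and `ensure_pad_token` below are illustrative names; the real code operates on a Hugging Face tokenizer object:

```python
# Sketch of the refactor with a minimal stand-in tokenizer.
class StubTokenizer:
    def __init__(self):
        self.unk_token = "<unk>"
        self.pad_token = None
        self.eos_token = "</s>"

def ensure_pad_token(tokenizer):
    # Comparing against the tokenizer's own unk_token attribute instead of
    # a hardcoded literal sidesteps Bandit's B105 false positive entirely.
    if tokenizer.pad_token != tokenizer.unk_token:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

tok = ensure_pad_token(StubTokenizer())
print(tok.pad_token)  # → </s>
```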
♻️ Duplicate comments (1)
examples/windows/torch_onnx/llm_export/llm_export.py (1)
789-789: ⚠️ Potential issue | 🟠 Major

`ModelLoader` instantiation fails when only `--onnx_path` is provided.

Line 789 unconditionally creates `ModelLoader(args.hf_model_path, args.config_path)`, but when `--onnx_path` is provided without `--hf_model_path`, `args.hf_model_path` is `None`. This will cause `ModelLoader.get_model_type()` to fail when trying to open `self.config_path` (which may also be `None`).

🐛 Proposed fix

```diff
 if args.onnx_path:
     raw_onnx_path = args.onnx_path
-model_loader = ModelLoader(args.hf_model_path, args.config_path)
 if args.hf_model_path:
+    model_loader = ModelLoader(args.hf_model_path, args.config_path)
     model = model_loader.load_model(trust_remote_code=args.trust_remote_code)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/windows/torch_onnx/llm_export/llm_export.py` at line 789, The code always constructs ModelLoader(args.hf_model_path, args.config_path) even when only args.onnx_path is supplied, causing ModelLoader.get_model_type() to open a None config; fix by making the ModelLoader instantiation conditional: if args.hf_model_path or args.config_path is provided instantiate ModelLoader(hf_model_path, config_path) and use its get_model_type(), otherwise skip creating ModelLoader and branch to the ONNX-only path (using args.onnx_path) or pass a safe default to downstream logic; reference ModelLoader, get_model_type, args.hf_model_path, args.config_path, and args.onnx_path when updating the control flow.
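The conditional control flow the prompt describes can be sketched as follows. `ModelLoader` here is a stub and `resolve_paths` is a hypothetical helper; only the attribute names (`onnx_path`, `hf_model_path`, `config_path`) mirror the real argparse flags:

```python
from types import SimpleNamespace

class ModelLoader:
    """Stub: raises like the real loader would when both paths are None."""
    def __init__(self, hf_model_path, config_path):
        if hf_model_path is None and config_path is None:
            raise ValueError("need hf_model_path or config_path")
        self.hf_model_path = hf_model_path

def resolve_paths(args):
    raw_onnx_path, model = None, None
    if args.onnx_path:
        # ONNX-only invocation: skip ModelLoader entirely.
        raw_onnx_path = args.onnx_path
    if args.hf_model_path:
        loader = ModelLoader(args.hf_model_path, args.config_path)
        model = loader.hf_model_path  # stands in for loader.load_model(...)
    return raw_onnx_path, model

# ONNX-only args no longer construct ModelLoader with None paths.
onnx_only = SimpleNamespace(onnx_path="m.onnx", hf_model_path=None,
                            config_path=None)
print(resolve_paths(onnx_only))  # ('m.onnx', None)
```

With the original unconditional instantiation, the `onnx_only` case would raise inside the stub's constructor.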
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/llm_export_utils/export_utils.py`:
- Around line 122-124: The current call to
DynamicCache(ddp_cache_data=past_key_values) is invalid; instantiate the cache
as DynamicCache(config=self.config) (e.g., cache =
DynamicCache(config=self.config)) and then seed it with existing past_key_values
using the correct API (for example a factory like
DynamicCache.from_past_key_values(past_key_values, config=self.config) or a
setter such as cache.set_past_key_values(past_key_values) /
cache.update_from_past_key_values(...) depending on the available methods); then
pass that cache into self.model(input_ids=..., past_key_values=cache,
use_cache=True). Ensure you convert past_key_values into the cache's expected
internal format if needed and prefer the library-provided constructor/factory
rather than the invalid ddp_cache_data parameter.
---
Outside diff comments:
In `@modelopt/onnx/llm_export_utils/quantization_utils.py`:
- Around line 113-114: Remove the '# nosec B105' bypass and eliminate the
hardcoded string comparison by replacing the literal "<unk>" with a well-named
constant or tokenizer attribute; specifically update the conditional that
references tokenizer.pad_token (in quantization_utils.py) to compare against a
constant like UNK_TOKEN or, better, against tokenizer.unk_token if that
attribute exists, then set tokenizer.pad_token = tokenizer.eos_token as
before—this removes the Bandit false positive without suppressing the check and
preserves the existing behavior in the block that assigns tokenizer.pad_token
from tokenizer.eos_token.
In `@modelopt/torch/quantization/tensor_quant.py`:
- Around line 404-408: The backward staticmethod is calling
_fake_quant_backward_function with num_args=10 but forward now accepts 11 args
after ctx (inputs, amax, bias, num_bits, unsigned, narrow_range,
trt_high_precision_dtype, pass_through_bwd, block_size, axis,
onnx_quantizer_type); update the call in backward to use num_args=11 (or compute
it dynamically from the forward signature) so the number of saved
tensor/argument slots passed to _fake_quant_backward_function matches forward;
modify the call to _fake_quant_backward_function(ctx, grad_outputs, num_args=11)
in the backward method.
---
Duplicate comments:
In `@examples/windows/torch_onnx/llm_export/llm_export.py`:
- Line 789: The code always constructs ModelLoader(args.hf_model_path,
args.config_path) even when only args.onnx_path is supplied, causing
ModelLoader.get_model_type() to open a None config; fix by making the
ModelLoader instantiation conditional: if args.hf_model_path or args.config_path
is provided instantiate ModelLoader(hf_model_path, config_path) and use its
get_model_type(), otherwise skip creating ModelLoader and branch to the
ONNX-only path (using args.onnx_path) or pass a safe default to downstream
logic; reference ModelLoader, get_model_type, args.hf_model_path,
args.config_path, and args.onnx_path when updating the control flow.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: de0b28f6-9c77-4067-8539-e7e0acb2e7a9
📒 Files selected for processing (10)
- examples/windows/torch_onnx/llm_export/README.md
- examples/windows/torch_onnx/llm_export/llm_export.py
- examples/windows/torch_onnx/llm_export/requirements.txt
- modelopt/onnx/export/int4_exporter.py
- modelopt/onnx/graph_surgery/__init__.py
- modelopt/onnx/llm_export_utils/export_utils.py
- modelopt/onnx/llm_export_utils/quantization_utils.py
- modelopt/torch/quantization/export_onnx.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/tensor_quant.py
✅ Files skipped from review due to trivial changes (1)
- examples/windows/torch_onnx/llm_export/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
```python
# (inserted when trt_high_precision_dtype is set to "Half")
if matmul_node.op_type == "Cast":
    cast_after_transpose = matmul_node
    nodes_to_remove.append(cast_after_transpose.name)
```
Don't key node removal off `NodeProto.name`.

These new `append(...name)` calls make the existing deletion logic easy to break on valid unnamed ONNX nodes. If either Cast has `name == ""`, the later `node.name not in nodes_to_remove` filter will drop every unnamed node in the graph, not just this cast. Remove by object identity or by a unique output tensor name instead.
Also applies to: 273-273
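The failure mode above is easy to reproduce with a stand-in for `onnx.NodeProto` (the `Node` dataclass and three-node graph below are illustrative, not the actual graph from the PR):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op_type: str
    name: str = ""          # ONNX allows empty node names
    output: list = field(default_factory=list)

graph = [Node("Transpose", name="transpose_1", output=["t0"]),
         Node("Cast", output=["c0"]),     # unnamed Cast we want removed
         Node("MatMul", output=["m0"])]   # also unnamed, must survive

# Buggy: once "" lands in the removal list, name-based filtering drops
# EVERY unnamed node in the graph.
names_to_remove = [graph[1].name]         # [""]
buggy = [n for n in graph if n.name not in names_to_remove]

# Safer: remove by object identity (or by a unique output tensor name).
nodes_to_remove = {id(graph[1])}
fixed = [n for n in graph if id(n) not in nodes_to_remove]

print([n.op_type for n in buggy])  # ['Transpose'] -- MatMul wrongly gone
print([n.op_type for n in fixed])  # ['Transpose', 'MatMul']
```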
```diff
+cache = DynamicCache(ddp_cache_data=past_key_values, config=self.config)
-outputs = self.model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
+outputs = self.model(input_ids=input_ids, past_key_values=cache, use_cache=True)
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Search for DynamicCache constructor signature in transformers
pip show transformers | grep Version
# Check the DynamicCache class definition
python3 -c "from transformers import DynamicCache; import inspect; print(inspect.signature(DynamicCache.__init__))" 2>/dev/null || echo "Could not inspect DynamicCache"
```

Repository: NVIDIA/Model-Optimizer
Length of output: 157

🏁 Script executed:

```shell
# First, find the file mentioned in the review
find . -name "export_utils.py" -path "*/onnx/llm_export_utils/*" | head -5
```

Repository: NVIDIA/Model-Optimizer
Length of output: 114

🏁 Script executed:

```shell
# Search for DynamicCache imports and definitions in the repository
rg "DynamicCache" --type py -B 2 -A 2 | head -50
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2982

🏁 Script executed:

```shell
# Check if there's a custom DynamicCache implementation in the repo
fd "*.py" | xargs rg "class DynamicCache" 2>/dev/null
```

Repository: NVIDIA/Model-Optimizer
Length of output: 351

🏁 Script executed:

```shell
# Look at the specific file context around lines 122-124
if [ -f "modelopt/onnx/llm_export_utils/export_utils.py" ]; then
  sed -n '110,135p' "modelopt/onnx/llm_export_utils/export_utils.py"
fi
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1133

🏁 Script executed:

```shell
# Check DynamicCache documentation or usage in transformers version info
rg "class DynamicCache" --type py -A 20 2>/dev/null | head -100
```

Repository: NVIDIA/Model-Optimizer
Length of output: 48

🏁 Script executed:

```shell
# Search for how cache is initialized or used with past_key_values elsewhere
rg "past_key_values" modelopt/onnx/llm_export_utils/export_utils.py -B 3 -A 3
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1606

🏁 Script executed:

```shell
# Check if there are any comments or hints about the cache initialization
cat -n modelopt/onnx/llm_export_utils/export_utils.py | sed -n '100,150p'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2683

🏁 Script executed:

```shell
# Search for any git history or comments about this ddp_cache_data parameter
rg "ddp_cache_data" --type py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 193

🏁 Script executed:

```shell
# Check if there's documentation or tests for WrapperModelForCausalLM
rg "WrapperModelForCausalLM" --type py -B 2 -A 10 | head -80
```

Repository: NVIDIA/Model-Optimizer
Length of output: 5775

🏁 Script executed:

```shell
# Look for how DynamicCache is properly used with cache data in transformers patterns
rg "DynamicCache\(" --type py -A 1 | head -40
```

Repository: NVIDIA/Model-Optimizer
Length of output: 532
Fix invalid `DynamicCache` parameter `ddp_cache_data`.

The `ddp_cache_data` parameter is not a valid parameter for the transformers `DynamicCache` constructor. Standard usage in this codebase and across transformers shows `DynamicCache(config=config)`. Verify the intended cache initialization pattern—this may require converting `past_key_values` to the proper cache format or using a different initialization approach.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/llm_export_utils/export_utils.py` around lines 122 - 124, The
current call to DynamicCache(ddp_cache_data=past_key_values) is invalid;
instantiate the cache as DynamicCache(config=self.config) (e.g., cache =
DynamicCache(config=self.config)) and then seed it with existing past_key_values
using the correct API (for example a factory like
DynamicCache.from_past_key_values(past_key_values, config=self.config) or a
setter such as cache.set_past_key_values(past_key_values) /
cache.update_from_past_key_values(...) depending on the available methods); then
pass that cache into self.model(input_ids=..., past_key_values=cache,
use_cache=True). Ensure you convert past_key_values into the cache's expected
internal format if needed and prefer the library-provided constructor/factory
rather than the invalid ddp_cache_data parameter.
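The seeding pattern the prompt describes — build the cache through a supported constructor or factory, then feed it legacy `(key, value)` tuples via the cache's own update API — can be sketched with a stub. `SimpleCache` below is a stand-in, not the transformers class; real code would use `DynamicCache` itself (many transformers versions expose a `from_legacy_cache` factory for exactly this conversion, though availability varies by version):

```python
class SimpleCache:
    """Stand-in for transformers' DynamicCache."""
    def __init__(self, config=None):
        self.config = config
        self.key_cache, self.value_cache = [], []

    def update(self, key, value, layer_idx):
        # Append per-layer keys/values the way DynamicCache.update does.
        self.key_cache.append(key)
        self.value_cache.append(value)

    @classmethod
    def from_legacy_cache(cls, past_key_values, config=None):
        cache = cls(config=config)
        for layer_idx, (key, value) in enumerate(past_key_values):
            cache.update(key, value, layer_idx)
        return cache

# Legacy format: tuple of (key, value) pairs, one per layer.
past_key_values = ((["k0"], ["v0"]), (["k1"], ["v1"]))
cache = SimpleCache.from_legacy_cache(past_key_values,
                                      config={"n_layers": 2})
print(len(cache.key_cache))  # 2
```

The wrapper's forward would then pass `cache` (not the raw tuples) as `past_key_values` to the underlying model.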
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1027      +/-   ##
==========================================
+ Coverage   70.09%   70.10%   +0.01%
==========================================
  Files         221      221
  Lines       25459    25463       +4
==========================================
+ Hits        17845    17852       +7
+ Misses       7614     7611       -3
```

☔ View full report in Codecov by Sentry.

🚀 New features to boost your workflow:
|
…y use hidden_size // num_attention_heads when head_dim is not specified Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
What does this PR do?
Type of change: new example and bug fix
Added a Windows example for Torch PTQ followed by ONNX export and GQA graph surgery (replacing the entire attention subgraph with a single custom node in the graph).
Fixed some issues with the existing export path: past key values were not reflected as inputs in the final model because they were passed in a different format during export, and the INT8 SmoothQuant export path now quantizes both activations and weights as QDQ, instead of DQ-only for weights with QDQ for attention.
Usage
NVFP4:
INT4_AWQ:
INT8 Smooth Quant:
Summary by CodeRabbit