Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Note — Reviews paused: It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds a CLI flag (`--calibrate_kv_cache`) controlling KV cache calibration. A summary of the changes follows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as CLI
    participant HF as hf_ptq.py
    participant Config as QuantizerConfig
    participant Calib as CalibrationEngine
    participant Export as Exporter
    User->>HF: run with/without --calibrate_kv_cache
    HF->>Config: derive KV quantizer configs (deepcopy)
    alt calibrate_kv_cache = false
        HF->>Config: set constant_amax = 448.0 for KV quantizers
        HF->>Calib: skip KV calibration, enable calibration for other quantizers
    else calibrate_kv_cache = true
        HF->>Calib: enable calibration for KV quantizers only (disable others)
    end
    Calib->>Config: collect stats / compute scales
    HF->>Export: postprocess_state_dict (write KV scales if calibrated)
    Export->>User: export checkpoint (KV scales present when calibrated)
```
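The branch in the diagram can be sketched with plain dictionaries. This is an illustrative stand-in, not the actual modelopt API; the helper name, the `enable_calibration` key, and the config shape are assumptions:

```python
import copy

FP8_E4M3_MAX = 448.0  # maxbound of the FP8 E4M3 format


def derive_kv_quantizer_cfg(base_cfg: dict, calibrate_kv_cache: bool) -> dict:
    """Derive KV quantizer settings from a base config (deepcopy, never mutate)."""
    cfg = copy.deepcopy(base_cfg)
    kv = cfg.setdefault("*[kv]_bmm_quantizer", {})
    if not calibrate_kv_cache:
        # Fixed amax = maxbound means scale = amax / maxbound = 1.0,
        # so no data-driven calibration pass is needed.
        kv["constant_amax"] = FP8_E4M3_MAX
    else:
        # Opt-in: run data-driven calibration for the KV quantizers.
        kv["enable_calibration"] = True
    return cfg
```

The deepcopy mirrors the "derive KV quantizer configs (deepcopy)" step above: the shared base config is never mutated between runs.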
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Codecov Report — ❌ patch coverage warning. Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##             main    #1012    +/-   ##
==========================================
- Coverage   71.73%   70.08%    -1.66%
==========================================
  Files         211      221       +10
  Lines       23948    25504     +1556
==========================================
+ Hits        17180    17874      +694
- Misses       6768     7630      +862
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 4
🧹 Nitpick comments (1)
examples/llm_ptq/hf_ptq.py (1)
304-323: Consider using a defensive `.pop()` to avoid a potential `KeyError`. Line 311 uses `kv_cache_quant_cfg.pop("default")`, which will raise `KeyError` if the `"default"` key doesn't exist. While current KV configs should have this key, a defensive approach would be more robust:

```diff
-    kv_cache_quant_cfg.pop("default")  # keep other quantizers from auto_quantize
+    kv_cache_quant_cfg.pop("default", None)  # keep other quantizers from auto_quantize
```

The overall KV cache calibration logic looks correct:

- When `calibrate_kv_cache=False`: uses constant amax=448.0 (scale=1.0)
- When `calibrate_kv_cache=True`: runs data-driven calibration on KV quantizers only

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 304 - 323, Replace the direct pop("default") call with a defensive removal to avoid KeyError: check for the key or use a safe-pop pattern on kv_cache_quant_cfg (the dict built from getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]) before proceeding; keep the rest of the KV cache path the same (including setting kv_cache_quant_cfg["*[kv]_bmm_quantizer"]["constant_amax"] when not args.calibrate_kv_cache and using mtq.set_quantizer_by_cfg / mtq.set_quantizer_by_cfg_context with language_model and mtq.calibrate when args.calibrate_kv_cache).
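The safe-pop pattern the prompt asks for is easiest to see with a plain dict; the helper name here is hypothetical:

```python
def strip_default_quantizer(kv_cache_quant_cfg: dict) -> dict:
    """Remove the 'default' entry without assuming it exists.

    dict.pop(key, default) returns the default instead of raising KeyError
    when the key is absent, so this is safe for configs with or without it.
    """
    kv_cache_quant_cfg.pop("default", None)  # no KeyError when the key is absent
    return kv_cache_quant_cfg
```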
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 946-949: The current assignment to
quant_cfg["quant_cfg"]["*[kv]_bmm_quantizer"]["constant_amax"] can KeyError for
nvfp4_rotate because that format uses separate "*q_bmm_quantizer",
"*k_bmm_quantizer", "*v_bmm_quantizer" keys; update the branch that runs when
args.kv_cache_qformat != "none" and not args.calibrate_kv_cache to detect
args.kv_cache_qformat == "nvfp4_rotate" and in that case set "constant_amax" on
each of quant_cfg["quant_cfg"]["*q_bmm_quantizer"], "*k_bmm_quantizer" and
"*v_bmm_quantizer" (checking each key exists before assignment), otherwise keep
the existing assignment to "*[kv]_bmm_quantizer" (also guarding existence) so no
KeyError occurs.
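The fix the comment describes — branching on the KV format and guarding each key before assignment — can be sketched as follows. The structure mirrors the described config dicts; the constant value is a placeholder (the FP8 E4M3 maxbound) and may not be the right choice for non-FP8 formats, as a later review comment notes:

```python
FP8_E4M3_MAX = 448.0  # placeholder constant; valid for FP8 E4M3 only


def set_kv_constant_amax(quant_cfg: dict, kv_cache_qformat: str) -> None:
    """Set constant_amax on whichever KV quantizer keys the format defines."""
    if kv_cache_qformat == "nvfp4_rotate":
        # This format uses separate q/k/v quantizer keys.
        keys = ("*q_bmm_quantizer", "*k_bmm_quantizer", "*v_bmm_quantizer")
    else:
        keys = ("*[kv]_bmm_quantizer",)
    for key in keys:
        if key in quant_cfg["quant_cfg"]:  # guard existence to avoid KeyError
            quant_cfg["quant_cfg"][key]["constant_amax"] = FP8_E4M3_MAX
```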
In `@modelopt/torch/quantization/config.py`:
- Around line 1033-1043: The new constant_amax config must be validated to be a
finite, strictly positive number to prevent downstream broken scales; update the
ModeloptField for constant_amax so that its validator rejects None? (keep None
allowed) and any values that are <= 0, NaN, or infinite, and raise a clear
validation error. Specifically, add a validation callback or constraint on the
constant_amax ModeloptField declaration (symbol: constant_amax) that uses
math.isfinite(value) and value > 0, and ensure TensorQuantizer._get_amax() can
continue to trust the validated value without additional checks. Ensure the
error message references constant_amax and explains it must be a finite positive
number.
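A validation callback of the shape the comment asks for might look like this. It is a standalone sketch, not the actual `ModeloptField` validator hookup:

```python
import math


def validate_constant_amax(value):
    """Allow None; otherwise require a finite, strictly positive number."""
    if value is None:
        return value
    if not (isinstance(value, (int, float)) and math.isfinite(value) and value > 0):
        raise ValueError(
            f"constant_amax must be a finite positive number, got {value!r}"
        )
    return value
```

With this in place, downstream consumers such as `TensorQuantizer._get_amax()` can trust the value without re-checking it.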
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 709-712: The current early `continue` when module._constant_amax
is set skips updating per-module calibration flags and so prevents static-bias
calibrators from collecting bias stats and also leaves stale _if_quant/_if_calib
state; change the branch in model_calib.py so that when getattr(module,
"_constant_amax", None) is not None you do not unconditionally continue but
instead: if module._calibrator indicates a static-bias calibrator (or if
module._calibrator is not None and supports bias calibration) set
module._if_calib = True (and ensure module._if_quant is set/cleared
consistently) so load_calib_bias()/compute_bias() can run, otherwise explicitly
clear module._if_calib and module._if_quant to avoid preserving stale state;
reference the attributes _constant_amax, _calibrator, _if_calib, _if_quant and
the function finish_stats_collection()/load_calib_bias()/compute_bias() when
making the change.
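The flag handling the comment describes can be modeled with a toy quantizer. Everything here is a stand-in: `DummyQuantizer` is not the real `TensorQuantizer`, and `collects_bias` is a hypothetical marker for "this calibrator still needs stats even under a fixed amax":

```python
class DummyQuantizer:
    """Minimal stand-in with the attributes named in the comment."""

    def __init__(self, constant_amax=None, calibrator=None):
        self._constant_amax = constant_amax
        self._calibrator = calibrator
        self._if_calib = False
        self._if_quant = True  # stale state to be cleared


def enable_stats_collection(module):
    if getattr(module, "_constant_amax", None) is not None:
        # Fixed-amax quantizer: only collect stats if a bias calibrator needs them,
        # and explicitly clear flags rather than leaving stale state behind.
        needs_bias = module._calibrator is not None and getattr(
            module._calibrator, "collects_bias", False
        )
        module._if_calib = needs_bias
        module._if_quant = False
        return
    module._if_calib = True
    module._if_quant = False
```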
In `@modelopt/torch/quantization/nn/modules/tensor_quantizer.py`:
- Around line 617-618: get_kv_cache_scaling_factor currently assumes every entry
from get_scaling_factor() is non-None and calls factor.item(), which crashes
when a KV quantizer uses constant_amax (export_amax has no _amax buffer and
get_scaling_factor returns None). Fix by guarding against None: in
get_kv_cache_scaling_factor (the loop that iterates scaling_factors when dtype
== KV_CACHE_FP8), either filter out None values before iterating or add an if
factor is None: continue check before calling factor.item(); ensure this
respects constant_amax/_constant_amax semantics so fixed-amax quantizers are
skipped rather than dereferenced.
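The None-guard the comment asks for is a small change to the collection loop. A sketch under the stated assumption that fixed-amax quantizers yield `None` from `get_scaling_factor()`:

```python
def collect_kv_scaling_factors(scaling_factors):
    """Collect per-tensor KV scales, skipping fixed-amax quantizers with no _amax."""
    result = []
    for factor in scaling_factors:
        if factor is None:  # constant_amax quantizer: nothing to export
            continue
        # Real entries are 0-d tensors; accept plain floats here for the sketch.
        result.append(float(factor.item()) if hasattr(factor, "item") else float(factor))
    return result
```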
---
Nitpick comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 304-323: Replace the direct pop("default") call with a defensive
removal to avoid KeyError: check for the key or use a safe-pop pattern on
kv_cache_quant_cfg (the dict built from getattr(mtq,
KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]) before proceeding;
keep the rest of the KV cache path the same (including setting
kv_cache_quant_cfg["*[kv]_bmm_quantizer"]["constant_amax"] when not
args.calibrate_kv_cache and using mtq.set_quantizer_by_cfg /
mtq.set_quantizer_by_cfg_context with language_model and mtq.calibrate when
args.calibrate_kv_cache).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e7b126c8-ce02-4772-8f0c-d36987d90865
📒 Files selected for processing (7)
- CHANGELOG.rst
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- tests/_test_utils/torch/quantization/tensor_quantizer_common.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 313-323: The nvfp4_rotate KV format is left uncalibrated because
the code only sets constant_amax when "*[kv]_bmm_quantizer" is present and only
runs calibration when args.calibrate_kv_cache is True; fix by detecting the
nvfp4_rotate KV format and forcing KV calibration or rejecting the combination:
either set args.calibrate_kv_cache = True when the parsed KV format equals
"nvfp4_rotate" before you build kv_cache_quant_cfg (or immediately before
calling mtq.set_quantizer_by_cfg/mtq.calibrate), or validate arguments earlier
and raise an error if --kv_cache_qformat nvfp4_rotate is used without
--calibrate_kv_cache; apply the same change at the other similar spots that call
mtq.set_quantizer_by_cfg, mtq.set_quantizer_by_cfg_context, and mtq.calibrate so
nvfp4_rotate never runs on the uncalibrated default path.
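The "validate arguments earlier and raise" option from the comment can be sketched as a small check run right after argument parsing; the function name is illustrative:

```python
def validate_kv_args(kv_cache_qformat: str, calibrate_kv_cache: bool) -> bool:
    """Reject the unsupported combination before any quantizer config is built."""
    if kv_cache_qformat == "nvfp4_rotate" and not calibrate_kv_cache:
        raise ValueError(
            "--kv_cache_qformat nvfp4_rotate requires --calibrate_kv_cache; "
            "this format has no uncalibrated constant-scale path"
        )
    return calibrate_kv_cache
```

The alternative fix (silently forcing `calibrate_kv_cache = True`) avoids the error but hides the extra calibration cost from the user, so failing fast is arguably the better design here.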
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5aaec202-14ac-4b91-92cf-eea7abee0dd2
📒 Files selected for processing (1)
examples/llm_ptq/hf_ptq.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 964-975: The current shortcut unconditionally sets constant_amax
to 448.0 for "*[kv]_bmm_quantizer" which is valid only for FP8 E4M3; change the
logic in the block that references kv_quantizer_cfg, args.kv_cache_qformat, and
args.calibrate_kv_cache so that you either (a) only apply the constant_amax
shortcut when the quantizer format is FP8 (detect via format identifier or
kv_quantizer_cfg properties), or (b) compute constant_amax from the quantizer’s
configured maxbound (read the quantizer format’s maxbound property from
kv_quantizer_cfg) and set constant_amax = maxbound instead of the hardcoded
448.0; ensure you deepcopy quant_cfg before mutating "*[kv]_bmm_quantizer" as
already done.
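Option (b) from the comment — deriving `constant_amax` from the format's maxbound instead of hardcoding 448.0 — can be sketched with a lookup table. The table is a hypothetical stand-in for reading the maxbound off the quantizer config; the two values are the standard format maxima (FP8 E4M3 → 448, FP4 E2M1 → 6):

```python
MAXBOUND = {"fp8": 448.0, "nvfp4": 6.0}  # hypothetical format -> maxbound table


def constant_amax_for(kv_cache_qformat: str) -> float:
    """Derive constant_amax from the format's maxbound instead of hardcoding 448.0."""
    try:
        return MAXBOUND[kv_cache_qformat]
    except KeyError:
        raise ValueError(
            f"no constant-amax shortcut for format {kv_cache_qformat!r}"
        ) from None
```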
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 72a1fc78-55a0-4637-b95d-7e2673599983
📒 Files selected for processing (1)
examples/llm_ptq/hf_ptq.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/gpu/torch/export/test_export.py (1)
211-225: ⚠️ Potential issue | 🟡 Minor — Cover the new default "no KV scale emitted" branch.

These updated assertions only validate the calibrated path (`_amax` present → exported scale = `amax / maxbound`). The PR's default behavior is the opposite branch: with `constant_amax`, `postprocess_state_dict()` should omit `k_scale`/`v_scale` entirely so downstream falls back to 1.0. Please add a `KV_CACHE_FP8` case with no `k_bmm_quantizer._amax`/`v_bmm_quantizer._amax` entries; otherwise the main behavior change can regress unnoticed.

Suggested test addition:

```diff
 @pytest.mark.parametrize(
     ("state_dict", "quantization", "maxbound", "expected_state_dict"),
     [
+        (  # Default constant-amax KV cache path should emit no KV scales
+            {
+                "layer1.input_quantizer._pre_quant_scale": torch.tensor([0.128]),
+            },
+            KV_CACHE_FP8,
+            128.0,
+            {
+                "layer1.pre_quant_scale": torch.tensor([0.128]),
+            },
+        ),
         (  # Test replacements and KV cache scaling
             {
                 "layer1.k_bmm_quantizer._amax": torch.tensor([0.128]),
                 "layer1.v_bmm_quantizer._amax": torch.tensor([256.0]),
                 "layer1.input_quantizer._pre_quant_scale": torch.tensor([0.128]),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/export/test_export.py` around lines 211 - 225, Add a test that covers the default "no KV scale emitted" branch by invoking postprocess_state_dict() with KV_CACHE_FP8 and constant_amax but without providing "layer1.k_bmm_quantizer._amax" or "layer1.v_bmm_quantizer._amax" keys; assert that "layer1.k_proj.k_scale" and "layer1.v_proj.v_scale" are not present in the returned state dict (i.e., omitted so downstream will default to 1.0) rather than being computed, mirroring the existing calibrated-case tests that include _amax keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/gpu/torch/export/test_export.py`:
- Around line 211-225: Add a test that covers the default "no KV scale emitted"
branch by invoking postprocess_state_dict() with KV_CACHE_FP8 and constant_amax
but without providing "layer1.k_bmm_quantizer._amax" or
"layer1.v_bmm_quantizer._amax" keys; assert that "layer1.k_proj.k_scale" and
"layer1.v_proj.v_scale" are not present in the returned state dict (i.e.,
omitted so downstream will default to 1.0) rather than being computed, mirroring
the existing calibrated-case tests that include _amax keys.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: beab94cb-122c-4edd-94aa-71aa4021788c
📒 Files selected for processing (1)
tests/gpu/torch/export/test_export.py
In `modelopt/torch/quantization/config.py`:

```python
constant_amax: float | None = ModeloptField(
```
@realAsma please share your opinion on this one. My opinion is that we should not introduce this.

@cjluo-nv what would you suggest instead? I thought we wanted a way to specify whether to use constant amax or calibrated amax.

Could you please give a few suggestions to do this?

My thinking is that we should have a way to simulate the inference behavior of dummy scales, so adding this option here.
examples/llm_ptq/hf_ptq.py (Outdated)

```python
parser.add_argument(
    "--calibrate_kv_cache",
```
Can we just use `kv_cache_qformat`? `none` means do not calibrate, and we will also not export scales. We can validate whether TRTLLM, vLLM, and SGLang will enable FP8 KV cache by default.

Is this addressed in the latest patch?
Pull request overview
This PR changes the default KV cache quantization behavior in hf_ptq.py. Previously, FP8 KV cache quantization was the default and required data-driven calibration. Now, KV cache quantization defaults to none, and when enabled, uses a constant scale of 1.0 (amax=448.0) without calibration by default. A new --calibrate_kv_cache flag opts into the previous data-driven calibration behavior. The implementation adds a constant_amax field to QuantizerAttributeConfig that allows quantizers to skip calibration entirely and use a fixed amax value.
Changes:
- Added a `constant_amax` field to `QuantizerAttributeConfig` and integrated it into `TensorQuantizer._get_amax()` and `enable_stats_collection()` to skip calibration for constant-amax quantizers.
- Changed the `--kv_cache_qformat` default from `fp8` to `none`, added a `--calibrate_kv_cache` flag, and removed the KV scale floor/clamp logic in the export path.
- Added unit tests for `constant_amax` behavior and updated export tests to match the removed clamping.
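The `constant_amax` semantics summarized above can be modeled with a toy quantizer. This is illustrative only — not the real `TensorQuantizer` — but it captures the three behaviors: calibration is skipped entirely, the fixed amax is used during quantization, and no `_amax` buffer appears in the state dict:

```python
class SketchQuantizer:
    """Toy model: fixed amax means no calibration and no _amax buffer."""

    def __init__(self, constant_amax=None):
        self._constant_amax = constant_amax
        self._amax = None  # only populated by calibration

    def calibrate(self, data):
        if self._constant_amax is not None:
            return  # calibration skipped entirely; no forward pass needed
        self._amax = max(abs(v) for v in data)

    def get_amax(self):
        return self._constant_amax if self._constant_amax is not None else self._amax

    def state_dict(self):
        # Constant-amax quantizers produce no _amax buffer to export.
        return {} if self._amax is None else {"_amax": self._amax}
```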
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `modelopt/torch/quantization/config.py` | Added `constant_amax` field to `QuantizerAttributeConfig` |
| `modelopt/torch/quantization/nn/modules/tensor_quantizer.py` | Added `constant_amax` to attribute setter mapping and `_get_amax` short-circuit |
| `modelopt/torch/quantization/model_calib.py` | Skip calibration for `constant_amax` quantizers in `enable_stats_collection` |
| `modelopt/torch/export/quant_utils.py` | Handle `None` scaling factors; remove KV scale floor/clamp |
| `examples/llm_ptq/hf_ptq.py` | Changed defaults, added `--calibrate_kv_cache`, `constant_amax` injection logic |
| `tests/_test_utils/torch/quantization/tensor_quantizer_common.py` | Added tests for `constant_amax` behavior |
| `tests/gpu/torch/export/test_export.py` | Updated expected values after removing KV scale clamping |
| `CHANGELOG.rst` | Documented new features and changes |
Edwardf0t1 left a comment:

I think currently by default we still calibrate kv cache for fp8, right? But in the export stage, most values are clamped to 1.0.

@Edwardf0t1, yes that's correct. An example would be Nemotron models. All KV are actually clamped to 1.0.
What does this PR do?
Type of change: New feature
By default, FP8 KV cache quantization in hf_ptq.py now uses a constant scale of 1.0 (amax=448.0) without a data-driven calibration pass. No KV scales are written to the exported checkpoint — inference engines (TRT-LLM, vLLM) use scale=1.0 when no scale is
present, so this is lossless. Pass --calibrate_kv_cache to opt into data-driven per-tensor KV scale calibration (previous default behavior).
To support this cleanly in the quantization stack, a constant_amax field is added to QuantizerAttributeConfig. Quantizers configured with constant_amax skip calibration entirely (no forward pass needed), use the fixed amax during fake-quant, and produce no
_amax buffer in the state dict.
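The "lossless" claim above follows from the export arithmetic: the exported KV scale is `amax / maxbound`, so a constant amax of 448.0 against the FP8 E4M3 maxbound of 448.0 yields exactly the scale of 1.0 that engines assume when no scale is present. A tiny check of that arithmetic:

```python
FP8_E4M3_MAX = 448.0  # maxbound of the FP8 E4M3 format


def kv_scale(amax: float, maxbound: float = FP8_E4M3_MAX) -> float:
    """Exported KV scale as inference engines reconstruct it."""
    return amax / maxbound
```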
Usage
Testing
Before your PR is "Ready for review"
- Make sure you read and follow the Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

Additional Information
Summary by CodeRabbit
New Features
Bug Fixes
Tests