
[OMNIML-3277] Update kv cache behavior #1012

Open
meenchen wants to merge 14 commits into main from weimingc/update_kv_cache_behavior

Conversation

@meenchen
Contributor

@meenchen meenchen commented Mar 9, 2026

What does this PR do?

Type of change: New feature

By default, FP8 KV cache quantization in hf_ptq.py now uses a constant scale of 1.0 (amax=448.0) without a data-driven calibration pass. No KV scales are written to the exported checkpoint — inference engines (TRT-LLM, vLLM) use scale=1.0 when no scale is
present, so this is lossless. Pass --calibrate_kv_cache to opt into data-driven per-tensor KV scale calibration (previous default behavior).

To support this cleanly in the quantization stack, a constant_amax field is added to QuantizerAttributeConfig. Quantizers configured with constant_amax skip calibration entirely (no forward pass needed), use the fixed amax during fake-quant, and produce no
_amax buffer in the state dict.
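The semantics can be sketched roughly as follows (the class and method names here are hypothetical illustrations, not the actual TensorQuantizer API):

```python
# Hypothetical sketch of constant-amax semantics; not the ModelOpt source.
FP8_E4M3_MAX = 448.0  # maxbound of the FP8 E4M3 format

class ConstantAmaxQuantizer:
    def __init__(self, constant_amax=None, maxbound=FP8_E4M3_MAX):
        self._constant_amax = constant_amax
        self.maxbound = maxbound
        self._amax = None  # calibrated amax buffer; stays empty with a constant amax

    def needs_calibration(self):
        # A quantizer with a constant amax skips the calibration forward pass.
        return self._constant_amax is None

    def get_amax(self):
        # The constant amax short-circuits any calibrated value.
        return self._constant_amax if self._constant_amax is not None else self._amax

    def scale(self):
        # Exported scale = amax / maxbound, so amax == maxbound gives exactly 1.0,
        # the same scale engines assume when no scale is present in the checkpoint.
        return self.get_amax() / self.maxbound

q = ConstantAmaxQuantizer(constant_amax=448.0)
```

With amax equal to the FP8 E4M3 maxbound of 448.0, the exported scale is exactly 1.0, matching what engines assume when no scale is present.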

Usage

  # Quantize with default constant KV scale (no calibration pass for KV)
  python hf_ptq.py --model ... --qformat fp8 --kv_cache_qformat fp8

  # Opt into data-driven KV calibration
  python hf_ptq.py --model ... --qformat fp8 --kv_cache_qformat fp8 --calibrate_kv_cache

  # Use constant_amax in a custom quant config
  quant_cfg = {
      "quant_cfg": {
          "*[kv]_bmm_quantizer": {"num_bits": (4, 3), "enable": True, "constant_amax": 448.0},
      },
      "algorithm": "max",
  }
  model = mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)

Testing

  • All 8 existing GPU HF export tests pass (tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py)
  • Two new CPU unit tests added to TensorQuantizerTester: test_constant_amax and test_constant_amax_skips_calibration

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ (--calibrate_kv_cache flag defaults to False; existing scripts without the flag now skip KV calibration)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: ✅

Additional Information

Summary by CodeRabbit

  • New Features

    • New CLI flag --calibrate_kv_cache to optionally enable data-driven KV cache calibration; default uses a fixed KV scale and omits KV scales from exported checkpoints.
    • Added constant_amax option to set fixed quantizer scales and skip dynamic calibration for configured quantizers.
  • Bug Fixes

    • Removed forced flooring/clamp of KV cache scales; out-of-range activations now emit a shorter warning.
  • Tests

    • Added tests for constant_amax behavior and calibration interaction.

@copy-pr-bot

copy-pr-bot bot commented Mar 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds CLI flag --calibrate_kv_cache to choose between fixed or data-driven FP8 KV cache quantization (default fixed amax=448.0). Introduces constant_amax on quantizers to set a fixed amax and skip calibration when present; adjusts calibration flow, export postprocessing, and tests accordingly.

Changes

  • Changelog & Config (CHANGELOG.rst, modelopt/torch/quantization/config.py): Documented the new flag and added a `constant_amax: float | None` field to QuantizerAttributeConfig.
  • CLI / PTQ flow (examples/llm_ptq/hf_ptq.py): Added the --calibrate_kv_cache flag and deepcopies KV quantizer configs to avoid in-place mutation. Conditional KV handling: set constant_amax=448.0 and skip KV calibration when the flag is disabled, else run KV-only calibration. Updated the kv_cache_qformat help text and argument validation.
  • Quantizer runtime / calibration (modelopt/torch/quantization/nn/modules/tensor_quantizer.py, modelopt/torch/quantization/model_calib.py): Added internal _constant_amax via the config setter; _get_amax returns the constant amax when present. enable_stats_collection skips calibration for quantizers with a constant amax and keeps them quantized.
  • Export / postprocess (modelopt/torch/export/quant_utils.py): Skips processing when a KV scaling factor is None; removed the clamp-to-1.0 flooring for KV scales and shortened the warning message. Computed KV scales are preserved when present.
  • Tests (tests/_test_utils/torch/quantization/tensor_quantizer_common.py, tests/gpu/torch/export/test_export.py): Added test_constant_amax and test_constant_amax_skips_calibration. Updated KV FP8 postprocessing expectations (k_scale changed from 1.0 to 0.001) in two export tests.
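The hf_ptq.py flow summarized above can be sketched as follows (simplified; the helper name is hypothetical, and the real script wires the resulting config through modelopt's quantize/calibrate APIs):

```python
import copy

# Sketch of the KV-cache branch described above (simplified, hypothetical helper).
KV_CONSTANT_AMAX = 448.0  # FP8 E4M3 maxbound, so the exported scale is 1.0

def build_kv_quant_cfg(base_kv_cfg, calibrate_kv_cache):
    # Deepcopy so the shared preset config is not mutated in place.
    kv_cfg = copy.deepcopy(base_kv_cfg)
    if not calibrate_kv_cache:
        # Default path: fixed amax, no data-driven calibration pass for KV.
        kv_cfg["*[kv]_bmm_quantizer"]["constant_amax"] = KV_CONSTANT_AMAX
    return kv_cfg

base = {"*[kv]_bmm_quantizer": {"num_bits": (4, 3), "enable": True}}
cfg = build_kv_quant_cfg(base, calibrate_kv_cache=False)
```

The deepcopy matters because the same preset dict may be reused for other quantizers; mutating it in place would leak the constant_amax into unrelated configs.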

Sequence Diagram(s)

sequenceDiagram
    participant User as CLI
    participant HF as hf_ptq.py
    participant Config as QuantizerConfig
    participant Calib as CalibrationEngine
    participant Export as Exporter

    User->>HF: run with/without --calibrate_kv_cache
    HF->>Config: derive KV quantizer configs (deepcopy)
    alt calibrate_kv_cache = false
        HF->>Config: set constant_amax = 448.0 for KV quantizers
        HF->>Calib: skip KV calibration, enable calibration for other quantizers
    else calibrate_kv_cache = true
        HF->>Calib: enable calibration for KV quantizers only (disable others)
    end
    Calib->>Config: collect stats / compute scales
    HF->>Export: postprocess_state_dict (write KV scales if calibrated)
    Export->>User: export checkpoint (KV scales present when calibrated)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 61.54%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title '[OMNIML-3277] Update kv cache behavior' is directly related to the changeset, which centers on the new constant_amax mechanism and calibration flag for KV cache quantization.
  • Security Anti-Patterns ✅ Passed: The PR introduces KV cache quantization changes and a constant_amax field, with no security anti-patterns found in the SECURITY.md review.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@meenchen meenchen self-assigned this Mar 10, 2026
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen force-pushed the weimingc/update_kv_cache_behavior branch from edb0827 to 4176c6d Compare March 10, 2026 07:22
@codecov

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 70.08%. Comparing base (5d0e012) to head (d7fe11c).
⚠️ Report is 19 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/quantization/model_calib.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1012      +/-   ##
==========================================
- Coverage   71.73%   70.08%   -1.66%     
==========================================
  Files         211      221      +10     
  Lines       23948    25504    +1556     
==========================================
+ Hits        17180    17874     +694     
- Misses       6768     7630     +862     

☔ View full report in Codecov by Sentry.

@meenchen meenchen marked this pull request as ready for review March 10, 2026 17:34
@meenchen meenchen requested review from a team as code owners March 10, 2026 17:34
@meenchen meenchen requested a review from sugunav14 March 10, 2026 17:34
Contributor

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 4

🧹 Nitpick comments (1)
examples/llm_ptq/hf_ptq.py (1)

304-323: Consider using defensive .pop() to avoid potential KeyError.

Line 311 uses kv_cache_quant_cfg.pop("default") which will raise KeyError if the "default" key doesn't exist. While current KV configs should have this key, using a defensive approach would be more robust:

-        kv_cache_quant_cfg.pop("default")  # keep other quantizers from auto_quantize
+        kv_cache_quant_cfg.pop("default", None)  # keep other quantizers from auto_quantize

The overall KV cache calibration logic looks correct:

  • When calibrate_kv_cache=False: Uses constant amax=448.0 (scale=1.0)
  • When calibrate_kv_cache=True: Runs data-driven calibration on KV quantizers only
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 304 - 323, Replace the direct
pop("default") call with a defensive removal to avoid KeyError: check for the
key or use a safe-pop pattern on kv_cache_quant_cfg (the dict built from
getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]) before
proceeding; keep the rest of the KV cache path the same (including setting
kv_cache_quant_cfg["*[kv]_bmm_quantizer"]["constant_amax"] when not
args.calibrate_kv_cache and using mtq.set_quantizer_by_cfg /
mtq.set_quantizer_by_cfg_context with language_model and mtq.calibrate when
args.calibrate_kv_cache).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 946-949: The current assignment to
quant_cfg["quant_cfg"]["*[kv]_bmm_quantizer"]["constant_amax"] can KeyError for
nvfp4_rotate because that format uses separate "*q_bmm_quantizer",
"*k_bmm_quantizer", "*v_bmm_quantizer" keys; update the branch that runs when
args.kv_cache_qformat != "none" and not args.calibrate_kv_cache to detect
args.kv_cache_qformat == "nvfp4_rotate" and in that case set "constant_amax" on
each of quant_cfg["quant_cfg"]["*q_bmm_quantizer"], "*k_bmm_quantizer" and
"*v_bmm_quantizer" (checking each key exists before assignment), otherwise keep
the existing assignment to "*[kv]_bmm_quantizer" (also guarding existence) so no
KeyError occurs.

In `@modelopt/torch/quantization/config.py`:
- Around line 1033-1043: The new constant_amax config must be validated to be a
finite, strictly positive number to prevent downstream broken scales; update the
ModeloptField for constant_amax so that its validator rejects None? (keep None
allowed) and any values that are <= 0, NaN, or infinite, and raise a clear
validation error. Specifically, add a validation callback or constraint on the
constant_amax ModeloptField declaration (symbol: constant_amax) that uses
math.isfinite(value) and value > 0, and ensure TensorQuantizer._get_amax() can
continue to trust the validated value without additional checks. Ensure the
error message references constant_amax and explains it must be a finite positive
number.
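The suggested validation could be sketched as a standalone check (a hypothetical helper; an actual fix would attach equivalent logic to the ModeloptField validator):

```python
import math

# Hypothetical standalone version of the suggested constant_amax validation.
def validate_constant_amax(value):
    # None keeps the default calibrated-amax behavior.
    if value is None:
        return None
    # Reject non-finite or non-positive values, which would break downstream scales.
    if (not isinstance(value, (int, float)) or isinstance(value, bool)
            or not math.isfinite(value) or value <= 0):
        raise ValueError("constant_amax must be a finite positive number")
    return float(value)
```

Validating at config time lets TensorQuantizer._get_amax() trust the stored value without re-checking it on every forward pass.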

In `@modelopt/torch/quantization/model_calib.py`:
- Around line 709-712: The current early `continue` when module._constant_amax
is set skips updating per-module calibration flags and so prevents static-bias
calibrators from collecting bias stats and also leaves stale _if_quant/_if_calib
state; change the branch in model_calib.py so that when getattr(module,
"_constant_amax", None) is not None you do not unconditionally continue but
instead: if module._calibrator indicates a static-bias calibrator (or if
module._calibrator is not None and supports bias calibration) set
module._if_calib = True (and ensure module._if_quant is set/cleared
consistently) so load_calib_bias()/compute_bias() can run, otherwise explicitly
clear module._if_calib and module._if_quant to avoid preserving stale state;
reference the attributes _constant_amax, _calibrator, _if_calib, _if_quant and
the function finish_stats_collection()/load_calib_bias()/compute_bias() when
making the change.

In `@modelopt/torch/quantization/nn/modules/tensor_quantizer.py`:
- Around line 617-618: get_kv_cache_scaling_factor currently assumes every entry
from get_scaling_factor() is non-None and calls factor.item(), which crashes
when a KV quantizer uses constant_amax (export_amax has no _amax buffer and
get_scaling_factor returns None). Fix by guarding against None: in
get_kv_cache_scaling_factor (the loop that iterates scaling_factors when dtype
== KV_CACHE_FP8), either filter out None values before iterating or add an if
factor is None: continue check before calling factor.item(); ensure this
respects constant_amax/_constant_amax semantics so fixed-amax quantizers are
skipped rather than dereferenced.
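The suggested guard amounts to skipping None entries before dereferencing them (illustrative helper; the real code iterates torch tensors and calls .item()):

```python
# Illustrative version of the suggested None guard; the helper name is
# hypothetical and the real code operates on torch tensors via .item().
def kv_scale_values(scaling_factors):
    # Constant-amax quantizers export no _amax buffer, so their scaling
    # factor comes back as None and must be skipped, not dereferenced.
    return [float(f) for f in scaling_factors if f is not None]
```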

---

Nitpick comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 304-323: Replace the direct pop("default") call with a defensive
removal to avoid KeyError: check for the key or use a safe-pop pattern on
kv_cache_quant_cfg (the dict built from getattr(mtq,
KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]) before proceeding;
keep the rest of the KV cache path the same (including setting
kv_cache_quant_cfg["*[kv]_bmm_quantizer"]["constant_amax"] when not
args.calibrate_kv_cache and using mtq.set_quantizer_by_cfg /
mtq.set_quantizer_by_cfg_context with language_model and mtq.calibrate when
args.calibrate_kv_cache).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e7b126c8-ce02-4772-8f0c-d36987d90865

📥 Commits

Reviewing files that changed from the base of the PR and between a56b6f3 and 4176c6d.

📒 Files selected for processing (7)
  • CHANGELOG.rst
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/nn/modules/tensor_quantizer.py
  • tests/_test_utils/torch/quantization/tensor_quantizer_common.py

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen force-pushed the weimingc/update_kv_cache_behavior branch from bde9317 to ba93ebd Compare March 10, 2026 18:47
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 313-323: The nvfp4_rotate KV format is left uncalibrated because
the code only sets constant_amax when "*[kv]_bmm_quantizer" is present and only
runs calibration when args.calibrate_kv_cache is True; fix by detecting the
nvfp4_rotate KV format and forcing KV calibration or rejecting the combination:
either set args.calibrate_kv_cache = True when the parsed KV format equals
"nvfp4_rotate" before you build kv_cache_quant_cfg (or immediately before
calling mtq.set_quantizer_by_cfg/mtq.calibrate), or validate arguments earlier
and raise an error if --kv_cache_qformat nvfp4_rotate is used without
--calibrate_kv_cache; apply the same change at the other similar spots that call
mtq.set_quantizer_by_cfg, mtq.set_quantizer_by_cfg_context, and mtq.calibrate so
nvfp4_rotate never runs on the uncalibrated default path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5aaec202-14ac-4b91-92cf-eea7abee0dd2

📥 Commits

Reviewing files that changed from the base of the PR and between 4176c6d and ba93ebd.

📒 Files selected for processing (1)
  • examples/llm_ptq/hf_ptq.py

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 964-975: The current shortcut unconditionally sets constant_amax
to 448.0 for "*[kv]_bmm_quantizer" which is valid only for FP8 E4M3; change the
logic in the block that references kv_quantizer_cfg, args.kv_cache_qformat, and
args.calibrate_kv_cache so that you either (a) only apply the constant_amax
shortcut when the quantizer format is FP8 (detect via format identifier or
kv_quantizer_cfg properties), or (b) compute constant_amax from the quantizer’s
configured maxbound (read the quantizer format’s maxbound property from
kv_quantizer_cfg) and set constant_amax = maxbound instead of the hardcoded
448.0; ensure you deepcopy quant_cfg before mutating "*[kv]_bmm_quantizer" as
already done.
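Option (b) could be sketched as a lookup keyed on the KV format (the helper name and mapping are hypothetical; only the FP8 E4M3 value of 448.0 comes from this PR):

```python
# Hypothetical maxbound lookup; only the FP8 E4M3 entry (448.0) is taken
# from this PR, and the helper name is illustrative.
FORMAT_MAXBOUND = {
    "fp8": 448.0,  # FP8 E4M3
}

def constant_amax_for(kv_cache_qformat):
    # Deriving constant_amax from the format's maxbound keeps the exported
    # scale at 1.0 (scale = amax / maxbound) without hardcoding 448.0.
    return FORMAT_MAXBOUND.get(kv_cache_qformat)
```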

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 72a1fc78-55a0-4637-b95d-7e2673599983

📥 Commits

Reviewing files that changed from the base of the PR and between 5614009 and 41cba84.

📒 Files selected for processing (1)
  • examples/llm_ptq/hf_ptq.py

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen force-pushed the weimingc/update_kv_cache_behavior branch from f238de1 to 0a0439c Compare March 11, 2026 04:06
Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/gpu/torch/export/test_export.py (1)

211-225: ⚠️ Potential issue | 🟡 Minor

Cover the new default “no KV scale emitted” branch.

These updated assertions only validate the calibrated path (_amax present → exported scale = amax / maxbound). The PR’s default behavior is the opposite branch: with constant_amax, postprocess_state_dict() should omit k_scale / v_scale entirely so downstream falls back to 1.0. Please add a KV_CACHE_FP8 case with no k_bmm_quantizer._amax / v_bmm_quantizer._amax entries; otherwise the main behavior change can regress unnoticed.

Suggested test addition
 @pytest.mark.parametrize(
     ("state_dict", "quantization", "maxbound", "expected_state_dict"),
     [
+        (  # Default constant-amax KV cache path should emit no KV scales
+            {
+                "layer1.input_quantizer._pre_quant_scale": torch.tensor([0.128]),
+            },
+            KV_CACHE_FP8,
+            128.0,
+            {
+                "layer1.pre_quant_scale": torch.tensor([0.128]),
+            },
+        ),
         (  # Test replacements and KV cache scaling
             {
                 "layer1.k_bmm_quantizer._amax": torch.tensor([0.128]),
                 "layer1.v_bmm_quantizer._amax": torch.tensor([256.0]),
                 "layer1.input_quantizer._pre_quant_scale": torch.tensor([0.128]),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/export/test_export.py` around lines 211 - 225, Add a test
that covers the default "no KV scale emitted" branch by invoking
postprocess_state_dict() with KV_CACHE_FP8 and constant_amax but without
providing "layer1.k_bmm_quantizer._amax" or "layer1.v_bmm_quantizer._amax" keys;
assert that "layer1.k_proj.k_scale" and "layer1.v_proj.v_scale" are not present
in the returned state dict (i.e., omitted so downstream will default to 1.0)
rather than being computed, mirroring the existing calibrated-case tests that
include _amax keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/gpu/torch/export/test_export.py`:
- Around line 211-225: Add a test that covers the default "no KV scale emitted"
branch by invoking postprocess_state_dict() with KV_CACHE_FP8 and constant_amax
but without providing "layer1.k_bmm_quantizer._amax" or
"layer1.v_bmm_quantizer._amax" keys; assert that "layer1.k_proj.k_scale" and
"layer1.v_proj.v_scale" are not present in the returned state dict (i.e.,
omitted so downstream will default to 1.0) rather than being computed, mirroring
the existing calibrated-case tests that include _amax keys.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: beab94cb-122c-4edd-94aa-71aa4021788c

📥 Commits

Reviewing files that changed from the base of the PR and between 41cba84 and 0a0439c.

📒 Files selected for processing (1)
  • tests/gpu/torch/export/test_export.py

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen force-pushed the weimingc/update_kv_cache_behavior branch from e30aded to c2e91ed Compare March 11, 2026 18:05
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
In modelopt/torch/quantization/config.py (review thread on `constant_amax: float | None = ModeloptField(...)`):
Collaborator

@realAsma please share your opinion on this one. My opinion is that we should not introduce this.

Contributor

@cjluo-nv what would you suggest instead? I thought we wanted a way to specify whether to use constant amax or calibrated amax

Contributor

Could you please give a few suggestions to do this?

Contributor Author

My thinking is that we should have a way to simulate the inference behavior of dummy scales, so adding this option here.

Contributor Author

@realAsma, after syncing with @cjluo-nv offline, I changed this to cast_to_fp8, a flag to toggle instead of a constant value that users must set. wdyt?

In examples/llm_ptq/hf_ptq.py (review thread on `parser.add_argument("--calibrate_kv_cache", ...)`):
Collaborator

Can we just use kv_cache_qformat? A value of none would mean do not calibrate and also do not export scales. We can validate whether TRT-LLM, vLLM, and SGLang enable FP8 KV cache by default.

Collaborator

Is this addressed in the latest patch?

Contributor

Copilot AI left a comment

Pull request overview

This PR changes the default KV cache quantization behavior in hf_ptq.py. Previously, FP8 KV cache quantization was the default and required data-driven calibration. Now, KV cache quantization defaults to none, and when enabled, uses a constant scale of 1.0 (amax=448.0) without calibration by default. A new --calibrate_kv_cache flag opts into the previous data-driven calibration behavior. The implementation adds a constant_amax field to QuantizerAttributeConfig that allows quantizers to skip calibration entirely and use a fixed amax value.

Changes:

  • Added constant_amax field to QuantizerAttributeConfig and integrated it into TensorQuantizer._get_amax() and enable_stats_collection() to skip calibration for constant-amax quantizers.
  • Changed --kv_cache_qformat default from fp8 to none, added --calibrate_kv_cache flag, and removed the KV scale floor/clamp logic in the export path.
  • Added unit tests for constant_amax behavior and updated export tests to match the removed clamping.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
modelopt/torch/quantization/config.py Added constant_amax field to QuantizerAttributeConfig
modelopt/torch/quantization/nn/modules/tensor_quantizer.py Added constant_amax to attribute setter mapping and _get_amax short-circuit
modelopt/torch/quantization/model_calib.py Skip calibration for constant_amax quantizers in enable_stats_collection
modelopt/torch/export/quant_utils.py Handle None scaling factors; remove KV scale floor/clamp
examples/llm_ptq/hf_ptq.py Changed defaults, added --calibrate_kv_cache, constant_amax injection logic
tests/_test_utils/torch/quantization/tensor_quantizer_common.py Added tests for constant_amax behavior
tests/gpu/torch/export/test_export.py Updated expected values after removing KV scale clamping
CHANGELOG.rst Documented new features and changes


Contributor

@Edwardf0t1 Edwardf0t1 left a comment

I think currently by default we still calibrate kv cache for fp8, right? But in the export stage, most values are clamped to 1.0.

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen
Contributor Author

I think currently by default we still calibrate kv cache for fp8, right? But in the export stage, most values are clamped to 1.0.

@Edwardf0t1, yes, that's correct. An example would be Nemotron models, where all KV scales are actually clamped to 1.0.

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen requested review from Edwardf0t1 and cjluo-nv March 13, 2026 17:17
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>