Conversation
jenchen13 reviewed Dec 30, 2025
jenchen13 reviewed Jan 6, 2026
updated/cleaned up tests
KV Cache clean ups; added MCore hybrid tests
Added amax sync for KVCache Quantization
minor
Signed-off-by: realAsma <akuriparambi@nvidia.com>
force-pushed from 249227d to 89f507a (Compare)
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #727      +/-   ##
==========================================
- Coverage   74.69%   74.69%   -0.01%
==========================================
  Files         192      192
  Lines       18946    18953       +7
==========================================
+ Hits        14152    14156       +4
- Misses       4794     4797       +3

☔ View full report in Codecov by Sentry.
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
ChenhanYu reviewed Jan 6, 2026
jenchen13 approved these changes Jan 6, 2026
force-pushed from 7c6ea5a to 724451e (Compare)
Signed-off-by: realAsma <akuriparambi@nvidia.com>
jenchen13 approved these changes Jan 6, 2026
kevalmorabia97 pushed a commit that referenced this pull request on Jan 12, 2026
…uted Sync of KVCache Quantizer params (#727)

## What does this PR do?

**Type of change:** Fix MCore KV Cache Quantization: Amax Device Placement Bug; Code clean up; Distributed Sync of KVCache Quantizer params; unittest expansion to hybrid models

**Overview:** Fixes bugs preventing MCore KV Cache quantization from working during checkpoint restore.

### Bug Chain

**Bug 1:** `is_enabled = self.weight_quantizer.is_enabled if hasattr(self, "weight_quantizer") else False`
No `weight_quantizer` for KV-cache-only quant → `is_enabled=False` → metadata not saved → `modelopt_post_restore()` never called. *(Thanks to @jenchen13)*

**Bug 2:** After fixing Bug 1, `_amax` restored on CPU (via `_reset_pytorch_state_from_metadata`). Fallback `_calibrate_quantizers()` never called because `_amax` exists.

**Bug 3:** Even if called, `_calibrate_quantizers()` fails — `core_attention` has no parameters → can't determine device/dtype.

### The Fix

1. Remove `is_enabled` check entirely — disabled modules may still need metadata restore. Explicitly skip `output_layer` from extra state callbacks (never quantized)
2. Set `dtype`/`device` on `core_attention` from parent Attention module; `modelopt_post_restore()` calls `self.to(device, dtype)`
3. Remove dead `_calibrate_quantizers()` code (will bring back similar logic for KV cache affine quantization)

### Previous Unit Test Was Wrong

`model_test` was `mtq.quantize()`'d, not `mto.restore()`'d. Never tested actual restore path.

### Additional Fixes

- Amax sync across DP/TP for KV cache quantizers
- `flash_decode` auto-disabled

### Code Cleanup

Removed ~100 lines of dead code.

## Testing

1. MCore KV Cache QAD with Nano V3 + Context Parallel works
2. Unit tests: hybrid models, KV+GEMM configs, correct restore workflow, backward pass validation

## Before your PR is "*Ready for review*"

- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No
- **Did you update Changelog?**: Yes

---------

Signed-off-by: realAsma <akuriparambi@nvidia.com>
Co-authored-by: Asma Thekkumpate <akuriparambi@cw-dfw-cs-001-vscode-02.cm.cluster>
What does this PR do?
Type of change: Fix MCore KV Cache Quantization: Amax Device Placement Bug; Code clean up; Distributed Sync of KVCache Quantizer params; unittest expansion to hybrid models
Overview: Fixes bugs preventing MCore KV Cache quantization from working during checkpoint restore.
Bug Chain
Bug 1: `is_enabled = self.weight_quantizer.is_enabled if hasattr(self, "weight_quantizer") else False`
No `weight_quantizer` for KV-cache-only quant → `is_enabled=False` → metadata not saved → `modelopt_post_restore()` never called. (Thanks to @jenchen13)
Bug 2: After fixing Bug 1, `_amax` is restored on CPU (via `_reset_pytorch_state_from_metadata`). The fallback `_calibrate_quantizers()` is never called because `_amax` already exists.
Bug 3: Even if it were called, `_calibrate_quantizers()` fails — `core_attention` has no parameters → can't determine device/dtype.
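For context, a minimal sketch of the Bug 1 guard described above. The wrapper class and the `_collect_quantizer_metadata` helper are illustrative stand-ins, not the plugin's actual code:

```python
# Illustrative sketch of the Bug 1 pattern; the real logic lives in
# modelopt's Megatron-Core plugin.
class QuantAttentionSketch:
    def get_extra_state(self):
        # A KV-cache-only module has k_bmm_quantizer / v_bmm_quantizer but no
        # weight_quantizer, so this guard evaluates to False and the
        # quantizer metadata never reaches the checkpoint.
        is_enabled = (
            self.weight_quantizer.is_enabled
            if hasattr(self, "weight_quantizer")
            else False
        )
        if not is_enabled:
            return None  # metadata dropped → modelopt_post_restore() never runs
        return self._collect_quantizer_metadata()  # hypothetical helper
```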
The Fix
1. Remove the `is_enabled` check entirely — disabled modules may still need metadata restore. Explicitly skip `output_layer` from extra state callbacks (never quantized).
2. Set `dtype`/`device` on `core_attention` from the parent Attention module; `modelopt_post_restore()` calls `self.to(device, dtype)` (see the sketch after this list).
3. Remove dead `_calibrate_quantizers()` code (will bring back similar logic for KV cache affine quantization).
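A hedged sketch of Fixes 1–2 under assumed shapes; the `_is_output_layer` flag, `_device`/`_dtype` attributes, and `_collect_quantizer_metadata` helper are illustrative, not the plugin's real names:

```python
import torch

class QuantAttentionSketch(torch.nn.Module):
    def get_extra_state(self):
        # Fix 1: always emit metadata; only output_layer is skipped because
        # it is never quantized. No more gating on weight_quantizer.is_enabled.
        if getattr(self, "_is_output_layer", False):  # hypothetical flag
            return None
        return self._collect_quantizer_metadata()  # hypothetical helper

    def modelopt_post_restore(self):
        # Fix 2: core_attention owns no parameters, so device/dtype recorded
        # from the parent Attention module are applied here; this moves
        # restored buffers such as k_bmm_quantizer._amax off the CPU
        # (Bugs 2 and 3).
        self.to(device=self._device, dtype=self._dtype)
```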
Previous Unit Test Was Wrong
`model_test` was `mtq.quantize()`'d, not `mto.restore()`'d, so the actual restore path was never tested.
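A sketch of the corrected test shape; the `build_model`, `calib_loop`, and `kv_cache_config` fixtures are placeholders, not the repo's actual test utilities:

```python
import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq

def test_kv_cache_quantize_then_restore(tmp_path, build_model, calib_loop, kv_cache_config):
    # Quantize a reference model and save its modelopt state.
    model_ref = mtq.quantize(build_model(), kv_cache_config, calib_loop)
    mto.save(model_ref, tmp_path / "quantized.pth")

    # The old test called mtq.quantize() on model_test as well, so checkpoint
    # restore was never exercised. Restoring instead hits the fixed path.
    model_test = mto.restore(build_model(), tmp_path / "quantized.pth")
    # ...then compare quantizer state and run forward/backward checks.
```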
Additional Fixes
Amax sync across DP/TP for KV cache quantizers (sketched below)
`flash_decode` auto-disabled
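The DP/TP amax sync plausibly reduces to an elementwise max across ranks. A sketch under assumptions: the group handles would come from MCore's `megatron.core.parallel_state`, and `quantizer.amax` follows modelopt's TensorQuantizer attribute:

```python
import torch.distributed as dist

def sync_kv_cache_amax(quantizer, dp_group, tp_group):
    # Skip quantizers that were never calibrated.
    if quantizer.amax is None:
        return
    # Take the elementwise max across data- and tensor-parallel ranks so
    # every rank ends up quantizing the KV cache with identical scales.
    dist.all_reduce(quantizer.amax, op=dist.ReduceOp.MAX, group=dp_group)
    dist.all_reduce(quantizer.amax, op=dist.ReduceOp.MAX, group=tp_group)
```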
Code Cleanup
Removed ~100 lines of dead code.
Testing
1. MCore KV Cache QAD with Nano V3 + Context Parallel works
2. Unit tests: hybrid models, KV+GEMM configs, correct restore workflow, backward pass validation
Before your PR is "Ready for review"
Is this change backward compatible?: Yes
Did you write any new necessary tests?: Yes
Did you add or update any necessary documentation?: No
Did you update Changelog?: Yes