
fix: cast aq score to float64 to avoid overflow#1126

Merged
kevalmorabia97 merged 1 commit into main from fridah/fix-aq-overflow
Mar 28, 2026

Conversation

@Fridah-nv
Contributor

What does this PR do?

Type of change: Bug fix

Mamba mixer layers produce gradients with magnitudes up to ~1e22. Squaring these in float32 overflows to inf.
Fix: compute `grad * diff` first in float32, then cast to float64 only for the square-and-sum step, to minimize performance overhead.
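The fix described above can be sketched as follows. This is an illustrative NumPy sketch, not the repository's actual implementation; the function name and shapes are hypothetical. The point is that (1e22)^2 = 1e44 exceeds the float32 maximum (~3.4e38) and overflows to inf, while float64 (max ~1.8e308) holds it comfortably:

```python
import numpy as np

def score_contribution(grad, diff):
    """Hypothetical sketch of the overflow fix: keep the elementwise
    multiply in float32, then upcast to float64 only for the
    square + sum, so the expensive tensors stay in float32."""
    prod = grad * diff                    # float32 multiply (cheap)
    prod64 = prod.astype(np.float64)      # upcast just before squaring
    return np.sum(prod64 * prod64)        # square + sum in float64

# Gradients at Mamba-mixer scale (~1e22) trigger the overflow.
grad = np.full(4, 1e22, dtype=np.float32)
diff = np.ones(4, dtype=np.float32)

naive = np.sum((grad * diff) ** 2)       # all-float32 path: overflows to inf
fixed = score_contribution(grad, diff)   # finite float64 result (~4e44)
```

Casting after the multiply rather than before keeps the largest intermediate tensors in float32, which is why the description frames it as a performance trade-off.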

Usage

# Add a code snippet demonstrating how to use this

Testing

python examples/llm_ptq/hf_ptq.py --pyt_ckpt_path nvidia/Nemotron-H-4B-Base-8K --qformat nvfp4_mse,fp8 --calib_size 64 --export_path ./output/nemotron-h-4b-fp8 --trust_remote_code --dataset cnn_dailymail --auto_quantize_bits 4.75

Before your PR is "Ready for review"

Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv Fridah-nv requested a review from a team as a code owner March 26, 2026 10:18
@github-actions
Contributor

github-actions bot commented Mar 26, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-03-28 19:54 UTC

@codecov

codecov bot commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.16%. Comparing base (b1f9f01) to head (9407536).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1126   +/-   ##
=======================================
  Coverage   70.16%   70.16%           
=======================================
  Files         229      229           
  Lines       26008    26009    +1     
=======================================
+ Hits        18248    18249    +1     
  Misses       7760     7760           

☔ View full report in Codecov by Sentry.

@kevalmorabia97 kevalmorabia97 merged commit 610707a into main Mar 28, 2026
58 of 60 checks passed
@kevalmorabia97 kevalmorabia97 deleted the fridah/fix-aq-overflow branch March 28, 2026 19:53
kevalmorabia97 pushed a commit that referenced this pull request Mar 28, 2026
