Skip-Softmax calibration in vLLM by kaix-nv · Pull Request #1622 · NVIDIA/Model-Optimizer

kaix-nv · 2026-06-03T19:50:21Z

What does this PR do?

Type of change: New feature

Adds skip-softmax calibration support for vLLM so that calibration measures the sparsity the model will actually exhibit at serve time. It reuses the ModelOpt Triton calibration kernel over vLLM's paged KV cache and supports both the FlashAttention and FlashInfer backends. It also fixes a padding-row bug in the PyTorch flash_skip_softmax calibration that made it disagree with the Triton kernel.
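For readers new to the idea, here is a rough pure-Python sketch of how a skip-softmax sparsity ratio could be measured and why padding rows matter. The skip rule used here (a tile is skipped when its max logit is more than `threshold` below the running max of earlier tiles in the same row) is an assumption for illustration, and `tile_skip_sparsity`, `tile`, and `valid_rows` are hypothetical names. This is not the ModelOpt Triton kernel or the actual flash_skip_softmax code, and the real padding-row bug may have a different mechanism.

```python
import math


def tile_skip_sparsity(scores, valid_rows, tile=2, threshold=5.0):
    """Return the fraction of (row, tile) pairs that the assumed rule skips.

    scores:     list of rows, each a list of attention logits
    valid_rows: number of real query rows; rows after that are padding
    """
    skipped = 0
    total = 0
    # Only real rows are counted; padding rows are left out of both totals.
    for row in scores[:valid_rows]:
        running_max = -math.inf
        for start in range(0, len(row), tile):
            tile_max = max(row[start:start + tile])
            total += 1
            # Assumed rule: skip a tile that sits far below the running max.
            if tile_max < running_max - threshold:
                skipped += 1
            running_max = max(running_max, tile_max)
    return skipped / total if total else 0.0


# One real query row followed by one all-zero padding row.
example_scores = [
    [10.0, 9.0, 0.0, 1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]

ratio_real_only = tile_skip_sparsity(example_scores, valid_rows=1)
ratio_with_padding = tile_skip_sparsity(example_scores, valid_rows=2)
```

In this toy input the real row skips 2 of its 3 tiles, while the all-zero padding row skips none. Counting the padding row therefore halves the reported ratio, which is the kind of disagreement two implementations can show if only one of them masks padding.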

Usage

# Calibrate skip-softmax thresholds and update the checkpoint config
python examples/vllm_serve/calibrate_sparse_attn.py <CKPT> \
    --prompts_file prompts.txt --target_sparse_ratio 0.5 \
    --decode_tokens 32 --attention_backend FLASHINFER \
    --update_checkpoint_config

# Serve the calibrated checkpoint with vLLM (eager mode, tensor parallel size 8)
python examples/vllm_serve/vllm_serve_sparse_attn.py <CKPT> --enforce-eager -tp 8
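To make the role of `--target_sparse_ratio` concrete, the sketch below picks, from a small grid of candidate thresholds, the one whose mean measured sparsity lands closest to the target. This is an assumption about what calibration toward a target ratio could look like. `pick_threshold` and the sample measurements are hypothetical, and the actual script may fit a model or search differently.

```python
def pick_threshold(measurements, target_sparse_ratio):
    """Choose the candidate threshold whose mean sparsity is nearest the target.

    measurements: dict mapping a candidate threshold to a list of
                  per-sample sparsity ratios measured at that threshold
    """
    def mean(values):
        return sum(values) / len(values)

    return min(
        measurements,
        key=lambda t: abs(mean(measurements[t]) - target_sparse_ratio),
    )


# Made-up per-sample ratios for three candidate thresholds (illustrative only).
sample_measurements = {
    2.0: [0.80, 0.70],
    5.0: [0.55, 0.45],
    10.0: [0.20, 0.10],
}

chosen = pick_threshold(sample_measurements, target_sparse_ratio=0.5)
```

With these made-up numbers the mean ratios are roughly 0.75, 0.50, and 0.15, so the middle candidate is selected for a target of 0.5.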

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information


copy-pr-bot · 2026-06-03T19:50:25Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai · 2026-06-03T19:50:28Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0c91e644-6dbd-4a68-920f-e3b943973bcc



rohansjoshi and others added 6 commits June 2, 2026 22:38

First commit

a5772b3

Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>

Added decode calibration

8490c42

Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>

Fix decode calibration: full-cache kv_bound + 128x128 block to match PyTorch

f092f8c

Signed-off-by: Kai Xu <kaix@nvidia.com>

Fix decode calibration: padded row in decode

d847f63

Signed-off-by: Kai Xu <kaix@nvidia.com>

Apply per-phase calibrated skip threshold at HF inference

419aca1

Signed-off-by: Kai Xu <kaix@nvidia.com>

Add sink-pattern decode calibration test (full cache + nonzero sparsity)

61dc593

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv changed the title ~~Attn-QATKaix/vllm skip calib~~ vllm skip calib Jun 3, 2026

kaix-nv added 6 commits June 3, 2026 14:16

Calibrate skip-softmax thresholds through the vLLM integration

efbcfe3

Signed-off-by: Kai Xu <kaix@nvidia.com>

Cross-validate vLLM/Triton skip-softmax calibration against PyTorch

df73cbf

Signed-off-by: Kai Xu <kaix@nvidia.com>

Support FlashInfer backend for skip-softmax calibration

88a0602

Signed-off-by: Kai Xu <kaix@nvidia.com>

Add FlashInfer-vs-PyTorch calibration regression test

605fd0f

Signed-off-by: Kai Xu <kaix@nvidia.com>

Fix FlashInfer metadata-builder patch and verify it is used in calibration

a7a747c

Signed-off-by: Kai Xu <kaix@nvidia.com>

Fix flash_skip_softmax padding-row bug; reconcile with Triton kernel

d2bf1c1

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv force-pushed the kaix/vllm_skip_calib branch from 22bbfe8 to d2bf1c1 Compare June 3, 2026 21:20

kaix-nv changed the title ~~vllm skip calib~~ Skip-Softmax calibration in vLLM Jun 3, 2026

kaix-nv added 2 commits June 3, 2026 14:32

Report per-sample sparsity ratios in skip-softmax calibration

d7cae16

Signed-off-by: Kai Xu <kaix@nvidia.com>

Support FlashInfer backend for sparse-attention serving

ca9822d

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv wants to merge 14 commits into main from kaix/vllm_skip_calib

2 participants
