
Skip-Softmax calibration in vLLM#1622

Draft
kaix-nv wants to merge 14 commits into main from kaix/vllm_skip_calib

Contributor

@kaix-nv kaix-nv commented Jun 3, 2026

What does this PR do?

Type of change: New feature

Adds skip-softmax calibration support for vLLM, so the calibration measures the sparsity the model will actually exhibit at serve time. It reuses the ModelOpt Triton calibration kernel over vLLM's paged KV cache and supports both the FlashAttention and FlashInfer backends. Also fixes a padding-row bug in the PyTorch flash_skip_softmax calibration that made it disagree with the Triton kernel.
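To make the padding-row point concrete, here is a minimal NumPy sketch of how a skip ratio could be measured. It is not the ModelOpt Triton kernel or the PyTorch flash_skip_softmax code; the function name `block_skip_ratio`, the per-row skip rule (a key block is skipped when its max logit falls more than `threshold` below the running row max), and the `valid_rows` mask are all assumptions made for illustration.

```python
import numpy as np


def block_skip_ratio(scores, valid_rows, block_size, threshold):
    """Fraction of (row, key-block) pairs a hypothetical skip rule would skip.

    scores:     [num_rows, num_keys] pre-softmax attention logits
    valid_rows: boolean [num_rows]; False marks padding rows
    Assumed rule: skip a key block for a row when the block's max logit is
    more than `threshold` below the running max over blocks seen so far.
    """
    num_rows, num_keys = scores.shape
    assert num_keys % block_size == 0, "sketch assumes whole key blocks"
    num_blocks = num_keys // block_size

    # Max logit of each key block, per row: [num_rows, num_blocks]
    block_max = scores.reshape(num_rows, num_blocks, block_size).max(axis=2)
    # Running max across key blocks, as an online-softmax kernel would track
    running_max = np.maximum.accumulate(block_max, axis=1)
    skipped = (running_max - block_max) > threshold

    # Count only real rows; padding rows would otherwise distort the ratio
    return float(skipped[valid_rows].mean())


scores = np.array(
    [
        [10.0, 9.0, 0.0, 1.0],  # second block is 9 below the running max
        [0.0, 1.0, 2.0, 3.0],   # logits keep rising, nothing is skipped
        [0.0, 0.0, 0.0, 0.0],   # padding row
        [0.0, 0.0, 0.0, 0.0],   # padding row
    ]
)
valid = np.array([True, True, False, False])

masked = block_skip_ratio(scores, valid, block_size=2, threshold=5.0)
unmasked = block_skip_ratio(scores, np.ones(4, dtype=bool), block_size=2, threshold=5.0)
print(masked, unmasked)
```

In this toy input the masked ratio is 0.25 while counting the padding rows halves it to 0.125, which is the kind of disagreement between two implementations that a padding-row bug could produce.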

Usage

# Calibrate skip-softmax on the checkpoint with a target sparse ratio of 0.5
python examples/vllm_serve/calibrate_sparse_attn.py <CKPT> \
      --prompts_file prompts.txt --target_sparse_ratio 0.5 \
      --decode_tokens 32 --attention_backend FLASHINFER \
      --update_checkpoint_config

# Serve the checkpoint with vLLM
python examples/vllm_serve/vllm_serve_sparse_attn.py <CKPT> --enforce-eager -tp 8
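As a rough illustration of what a target sparse ratio means, the sketch below picks a threshold so that about the requested fraction of blocks would be skipped. This is an assumption about the general idea, not a description of what calibrate_sparse_attn.py does internally; `threshold_for_target` and the `gaps` array (running max minus block max, one value per block) are hypothetical names.

```python
import numpy as np


def threshold_for_target(gaps, target_sparse_ratio):
    """Return a threshold such that roughly `target_sparse_ratio` of the
    observed gaps exceed it (hypothetical helper, quantile-based)."""
    return float(np.quantile(gaps, 1.0 - target_sparse_ratio))


# Toy gaps collected during a calibration pass (made-up values)
gaps = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
threshold = threshold_for_target(gaps, target_sparse_ratio=0.5)
achieved = float((gaps > threshold).mean())
print(threshold, achieved)
```

A quantile is used here because the skipped fraction is monotone in the threshold; a real calibration may fit something more elaborate, for example a dependence on sequence length.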

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

rohansjoshi and others added 6 commits June 2, 2026 22:38
Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>
Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>
…PyTorch

Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>

copy-pr-bot Bot commented Jun 3, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Contributor

coderabbitai Bot commented Jun 3, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0c91e644-6dbd-4a68-920f-e3b943973bcc


@kaix-nv kaix-nv changed the title from "Attn-QATKaix/vllm skip calib" to "vllm skip calib" Jun 3, 2026
@kaix-nv kaix-nv force-pushed the kaix/vllm_skip_calib branch from 22bbfe8 to d2bf1c1 Compare June 3, 2026 21:20
@kaix-nv kaix-nv changed the title from "vllm skip calib" to "Skip-Softmax calibration in vLLM" Jun 3, 2026
kaix-nv added 2 commits June 3, 2026 14:32
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>