[OMNIML-2850] [3/n] Adds sparse attention calibration #538
base: main
Conversation
Codecov Report: additional details and impacted files
@@ Coverage Diff @@
## main #538 +/- ##
==========================================
+ Coverage 74.69% 74.70% +0.01%
==========================================
Files 192 198 +6
Lines 18948 19691 +743
==========================================
+ Hits 14153 14711 +558
- Misses 4795 4980 +185
@kaix-nv GitHub is showing 7000+ lines of code as part of this PR. Is that accurate?
I suggest keeping just the calibration core logic + the unit tests for calibration in this PR. Make a 4/n PR for the rest.
)

max_seqlen: int = ModeloptField(
    default=32768,
Can we support a 32K sequence length with eager attention?
I’ve added chunked prefill to support long sequence lengths. The default chunk_size is set to 2048, and users can reduce chunk_size if they encounter OOM issues.
For decoding, FlashAttention is used during prefill, which also supports long sequence lengths.
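For readers unfamiliar with the technique, here is a minimal sketch of chunked prefill, assuming a Hugging Face-style causal LM with a key/value cache; the function name and call pattern are illustrative, not the PR's actual implementation.

import torch

def chunked_prefill(model, input_ids: torch.Tensor, chunk_size: int = 2048):
    # Process a long prompt in fixed-size chunks so peak attention memory is
    # bounded by chunk_size rather than the full sequence length.
    past_key_values = None
    logits = None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start : start + chunk_size]
        # Each chunk attends to all previously cached keys/values.
        out = model(chunk, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        logits = out.logits
    return logits, past_key_values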
config: Sparse attention configuration with calibration settings
forward_loop: Optional callable that forwards calibration data through the model.
    If provided, uses this for calibration data.
    If None, will auto-generate RULER dataset for calibration.
Does the auto-generation work in general, e.g., for Megatron Core models?
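For context, the forward_loop contract described in the docstring above is just a callable that takes the model; a hedged sketch of a caller-supplied implementation (the dataloader and device handling are assumptions):

def make_forward_loop(dataloader, device="cuda"):
    def forward_loop(model):
        # Push each calibration batch through the model so the sparse
        # attention modules can record their statistics.
        for batch in dataloader:
            model(batch["input_ids"].to(device))
    return forward_loop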
samples: int = ModeloptField(
    default=24,
    title="Calibration samples",
    description="Total number of RULER samples for calibration (distributed across length bins).",
Why are we hardcoding the dataset here?
This is a setup-dependent magic number. The default calibration settings are:
num_length_bins = 4
num_tasks = 6
which results in a minimum of 24 data samples.
I’ve updated the description to explain the details.
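To make the arithmetic explicit, a small sketch of where the default of 24 comes from (the even per-bin split is illustrative):

num_length_bins = 4
num_tasks = 6
samples = num_length_bins * num_tasks         # 24: one sample per (length bin, task) pair
samples_per_bin = samples // num_length_bins  # 6 samples in each length bin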
# Configuration with RULER calibration
# Note: threshold field is omitted - calibration determines dynamic threshold λ = a / length
# The calibrated threshold adapts to sequence length for optimal sparsity
SKIP_SOFTMAX_CALIB = {
Will this config work for Megatron? Otherwise I suggest removing the dataset-specific config; dataset and num_samples should be handled by example files.
My understanding is that the dataset-specific configs are framework-agnostic: they describe what calibration data to generate, not how to generate it. Both HF and Megatron helpers can consume these parameters to generate appropriate calibration data.
Megatron support (e.g., adding forward_loop) will be added in future PRs.
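For illustration only, a hedged sketch of how a calibrated dynamic threshold of the form λ = a / length might be applied when deciding which attention blocks to skip; the function and variable names are hypothetical:

def dynamic_threshold(a: float, seq_len: int) -> float:
    # Calibration fits the scale factor `a`; the threshold then shrinks as the
    # sequence grows, keeping sparsity roughly length-independent.
    return a / seq_len

# Example decision for one block (hypothetical variables):
# keep_block = block_scores.max() >= dynamic_threshold(a, seq_len)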
examples/llm_eval/mmlu.py (Outdated)
# Force eager attention if sparse attention is requested
if sparse_cfg:
    kwargs["attn_implementation"] = "eager"
    warnings.warn(
        "Sparse attention requires attn_implementation='eager'. "
        "Forcing eager attention implementation."
    )
Can we move this to sparsity/plugins/huggingface? We should detect whether this is a HF model and, if yes, apply this (see
AutoQuantizeGradientSearcher.register_custom_support(
I've moved this check to the huggingface plugin.
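For reference, a hedged sketch of what such a plugin-level check could look like; the helper name, placement, and the config attribute are assumptions rather than the PR's final code:

def _is_hf_model(model) -> bool:
    # Local import so transformers stays an optional dependency.
    try:
        from transformers import PreTrainedModel
    except ImportError:
        return False
    return isinstance(model, PreTrainedModel)

# In the sparse-attention entry point (sketch):
# if _is_hf_model(model) and getattr(model.config, "_attn_implementation", None) != "eager":
#     warnings.warn("Sparse attention requires eager attention; forcing it.")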
Same as https://github.com/NVIDIA/Model-Optimizer/pull/538/files#r2646356349; please avoid repeated attention modification.
I've removed this check.
from .dataset import RulerDatasetBuilder


def _extract_tokenizer_from_model(model: nn.Module) -> str:
This only works with Huggingface transformers, right?
Yes, Megatron support (e.g., adding forward_loop) will be added in future PRs.
Can we move this to sparsity/plugins? Otherwise this change will make the transformers library a required dependency of ModelOpt.
Please use local imports of third party libraries wherever necessary
I’ve switched to local imports. Since most of the dataset construction logic can be reused by Megatron, I’d prefer to keep this file under the calibration folder.
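A minimal sketch of the local-import pattern being discussed, reusing the _extract_tokenizer_from_model signature from the diff above; the function body is illustrative:

import torch.nn as nn

def _extract_tokenizer_from_model(model: nn.Module) -> str:
    # Import transformers inside the function so it stays an optional dependency.
    from transformers import PreTrainedModel

    if isinstance(model, PreTrainedModel):
        # The pretrained name/path can be used to load a matching tokenizer.
        return model.config._name_or_path
    raise ValueError("Cannot infer a tokenizer name from this model type.")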
# For causal attention, only count lower triangle blocks (including diagonal)
num_causal_blocks = num_block_rows * (2 * num_block_cols - num_block_rows + 1) // 2
total_valid_blocks = batch_size * num_heads * num_causal_blocks
density = float(block_mask.sum()) / total_valid_blocks
Can we keep this as a torch tensor? float(tensor_in_gpu) causes an unnecessary CPU-GPU sync.
Good catch. It’s been fixed.
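Concretely, the fix amounts to keeping the reduction on the GPU and converting to a Python float only when the value is reported (sketch):

# Before: float(block_mask.sum()) forces a CPU-GPU synchronization.
# After: keep the result as a tensor on the same device as block_mask.
density = block_mask.sum() / total_valid_blocks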
    density = float(block_mask.sum()) / total_valid_blocks
    total_blocks = num_causal_blocks
else:
    density = float(block_mask.sum() / block_mask.numel())
It’s been fixed.
jy-yuan left a comment:
Reviewed the PR, and the calibration logic looks correct (dynamic thresholding via regression and phase separation seems solid). I tested the code and verified that the unit tests pass locally.
One small observation regarding dependencies: I noticed I had to manually install nltk and wonderwords to run the tests/calibration. It seems they are currently added to dev-test in setup.py, so they aren't included in pip install nvidia-modelopt[all]. If the RULER-based calibration is intended to be a supported feature for users (i.e., when they don't provide a custom forward loop), we might want to consider moving these to a user-facing optional dependency group (like calibration or hf) or catching the ModuleNotFoundError to suggest installation.
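A hedged sketch of the second option mentioned above, catching the failed import and pointing users at the missing packages (error message wording is illustrative):

try:
    import nltk  # noqa: F401
    import wonderwords  # noqa: F401
except ImportError as e:
    raise ImportError(
        "RULER-based calibration requires the optional packages 'nltk' and 'wonderwords'. "
        "Install them with `pip install nltk wonderwords`."
    ) from e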
Thanks Jiayi. Good catch. I’ve moved the dependency to the
What does this PR do?
Type of change: new feature
Overview: ?
Usage
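A hedged sketch of how the calibration might be invoked, based on the SKIP_SOFTMAX_CALIB config and the forward_loop parameter discussed in the review; the module path and entry-point name are assumptions, not the PR's confirmed API:

from transformers import AutoModelForCausalLM

# Hypothetical import path; the real names come from this PR.
from modelopt.torch.sparsity.attention import SKIP_SOFTMAX_CALIB, calibrate_sparse_attention

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Thinking-2507", attn_implementation="eager"
)

# forward_loop=None lets calibration auto-generate a RULER dataset.
model = calibrate_sparse_attention(model, config=SKIP_SOFTMAX_CALIB, forward_loop=None)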
Testing
The calibration results for Qwen/Qwen3-30B-A3B-Thinking-2507 are shown below and are consistent with our offline calibration results.
Before your PR is "Ready for review"
Additional Information