Fix documentation and references for Flash Sparse Attention #207
Conversation
Exposes backend availability flags to let callers probe supported runtimes without import errors. Provides an auto-selection helper that falls back to the first available backend for attention execution.
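As a rough sketch of what such flags and a fallback helper could look like (the module and flag names below are illustrative assumptions, not the package's actual API):

```python
import importlib.util
import torch

# Availability flags: probe each runtime without raising ImportError.
CUDA_AVAILABLE = torch.cuda.is_available() and importlib.util.find_spec("flash_sparse_attn") is not None
TRITON_AVAILABLE = importlib.util.find_spec("triton") is not None
try:
    from torch.nn.attention.flex_attention import flex_attention  # noqa: F401
    FLEX_AVAILABLE = True
except ImportError:
    FLEX_AVAILABLE = False

def select_backend(preferred: str | None = None) -> str:
    """Return the preferred backend if usable, else the first available one."""
    available = {"cuda": CUDA_AVAILABLE, "triton": TRITON_AVAILABLE, "flex": FLEX_AVAILABLE}
    if preferred is not None and available.get(preferred, False):
        return preferred
    for name, ok in available.items():
        if ok:
            return name
    raise RuntimeError("No flash sparse attention backend is available.")
```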
Introduces a Flex Attention forward path that constructs causal block masks, normalizes mask and bias defaults, and applies compile-friendly kernel options to ease sparse Flash workloads.
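For context, a causal block mask with PyTorch's Flex Attention (torch >= 2.5) can be built roughly as follows; this is a generic sketch of the mechanism, not the exact forward path added in this PR:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def causal(b, h, q_idx, kv_idx):
    # Keep only keys at or before the query position.
    return q_idx >= kv_idx

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# The block mask is broadcast over batch and heads (B=None, H=None).
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
flex_attention_compiled = torch.compile(flex_attention)  # compile-friendly execution
out = flex_attention_compiled(q, k, v, block_mask=block_mask)
```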
- Deleted `modeling_flash_dynamic_mask_attention_utils.py`, which contained redundant code and was not being used.
- Removed `mask.py` and `padding.py`, which were not necessary for the current implementation, streamlining the codebase.
Adds fused forward/backward kernels in Triton to accelerate sparse attention with masking, bias, and GQA support for PyTorch integration.
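The fused kernels themselves are not shown in this conversation; as a minimal illustration of the masking-plus-bias step they fuse, here is a small stand-alone Triton kernel that adds a bias to attention scores and sends masked-out positions to -inf (the kernel name and flat layout are assumptions for the sketch, not the PR's kernel):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def mask_bias_kernel(scores_ptr, bias_ptr, keep_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    in_bounds = offs < n_elements
    s = tl.load(scores_ptr + offs, mask=in_bounds, other=0.0)
    b = tl.load(bias_ptr + offs, mask=in_bounds, other=0.0)
    keep = tl.load(keep_ptr + offs, mask=in_bounds, other=0)
    # Kept positions get the additive bias; masked positions become -inf so softmax ignores them.
    s = tl.where(keep != 0, s + b, float("-inf"))
    tl.store(out_ptr + offs, s, mask=in_bounds)

scores = torch.randn(8 * 128, device="cuda")
bias = torch.randn_like(scores)
keep = (torch.rand_like(scores) > 0.5).to(torch.int32)
out = torch.empty_like(scores)
grid = (triton.cdiv(scores.numel(), 1024),)
mask_bias_kernel[grid](scores, bias, keep, out, scores.numel(), BLOCK=1024)
```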
Enables calling sparse Flash attention CUDA kernels through custom autograd helpers. Registers fake implementations and padding logic so torch.compile stays compatible with varying head shapes.
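Registering a fake (meta) implementation for a custom CUDA op typically looks like the sketch below, assuming PyTorch 2.4+'s `torch.library.custom_op` API; the op name and signature here are placeholders, not the ones added in this PR:

```python
import torch

@torch.library.custom_op("flash_sparse_attn::fwd_sketch", mutates_args=())
def fsa_fwd_sketch(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # The real op would call into the compiled CUDA extension here.
    raise NotImplementedError("placeholder for the CUDA kernel call")

@fsa_fwd_sketch.register_fake
def _(q, k, v):
    # Shape/dtype-only implementation: lets torch.compile trace the op without
    # running the kernel, which is where head-shape padding logic would live.
    return torch.empty_like(q)
```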
Updates package and repo naming so installation commands match the published distribution. Repositions performance benchmarks after usage guidance for both languages and aligns tensor examples to current API expectations.
Aligns packaging metadata with new repository identity.
Clarifies security instructions under the Flash Sparse Attention brand so users follow the right guidance for installation, reporting, and support.
Aligns packaging metadata and build hooks with the flash_sparse_attn name so prebuilt wheels, env vars, and CUDA builds resolve correctly.
Points contribution guide links at flash-sparse-attention to avoid outdated references.
Reflects updated project title and repository location to keep citation metadata current.
Introduces cached availability checks so integrations can detect flash sparse attention without importing local modules and ensures CUDA backed torch is present before enabling features.
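A cached probe of this kind can be written without importing the package itself; the sketch below uses the `flash_sparse_attn` module name from this PR, while the helper name is illustrative:

```python
from functools import lru_cache
import importlib.util

import torch

@lru_cache(maxsize=None)
def is_flash_sparse_attn_available() -> bool:
    # Detect the package without importing it, so a broken install cannot raise here.
    if importlib.util.find_spec("flash_sparse_attn") is None:
        return False
    # Require a CUDA-backed torch build before enabling the feature.
    return torch.version.cuda is not None and torch.cuda.is_available()
```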
Supports future HF integration by routing calls through flash sparse attention logic and normalizing autocast, causal, and dtype handling.
Introduces lazy import plumbing for flash sparse attention kernels to streamline future integrations. Prepares padding-aware helpers and kwarg validation so padding-free flows and PEFT casting stay compatible with the kernels.
Introduces mask utilities for top-k and relu masking to support flash sparse attention. Enables optional block smoothing to stabilize dynamic sparsity patterns.
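Top-k and ReLU style masking can be sketched as follows; the function names and exact semantics are assumptions rather than the module's real API:

```python
import torch

def topk_mask(attn_bias: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest bias entries along the key dimension."""
    idx = attn_bias.topk(k, dim=-1).indices
    mask = torch.zeros_like(attn_bias, dtype=torch.bool)
    return mask.scatter_(-1, idx, True)

def relu_mask(attn_bias: torch.Tensor) -> torch.Tensor:
    """Keep positions whose bias is positive (ReLU-style gating)."""
    return attn_bias > 0
```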
Introduces reusable padding helpers to consolidate ragged tensor handling and avoid recomputing per-layer indices. Addresses static-cache overflow by slicing KV states and provides local indexing to keep the computation graph-friendly.
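The core of such padding helpers is usually an index and cu_seqlens computation over a 2D attention mask; a minimal sketch under that assumption (HuggingFace-style mask of shape `(batch, seq_len)`, not the PR's actual helper) is:

```python
import torch
import torch.nn.functional as F

def get_unpad_data(attention_mask: torch.Tensor):
    # attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding.
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)  # tokens per sequence
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen = int(seqlens.max())
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))  # prefix offsets
    return indices, cu_seqlens, max_seqlen
```

Computing these once and reusing them across layers is what avoids the per-layer index recomputation mentioned above.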
Points the integration to the renamed sparse attention package so setup guidance stays accurate.
…integration and v1.0.0 technical report. These files have been superseded by updated documentation reflecting recent changes and improvements in the codebase.
Updates API reference to reflect the flash_sparse_attn branding so installation instructions, imports, and backend descriptions stay consistent with the renamed package.
Updates terminology to reflect the flash sparse attention rebranding so readers follow accurate package names, imports, and integration guidance.
Updates benchmark integrations to load the flash_sparse_attn implementations so the renamed package continues to back the CUDA, Triton, and Flex runs. Renames the availability guards and status messages to keep diagnostic output aligned with the new module namespace.
Updates the sparse attention backend to drop the old dynamic mask name so future errors and docs consistently refer to FlashSparseAttention.
Maintains naming consistency after the FSA rebrand.
Pull Request Overview
This PR updates all documentation, error messages, and references throughout the codebase to reflect the rebranding from "Flash Dynamic Mask Attention" to "Flash Sparse Attention". The changes maintain consistency across code, documentation, and configuration files.
- Updated function/variable naming from `flash_dmattn` to `flash_sparse_attn`
- Renamed references in error messages and documentation
- Updated repository URLs and package names
Reviewed Changes
Copilot reviewed 19 out of 337 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| csrc/flash_sparse_attn/src/generate_kernels.py | Updated kernel generation description |
| csrc/flash_sparse_attn/src/flash_fwd_launch_template.h | Updated error message for unsupported architecture |
| csrc/flash_sparse_attn/src/flash_bwd_launch_template.h | Updated error message for unsupported architecture |
| csrc/flash_sparse_attn/flash_api.cpp | Updated error messages and module documentation |
| benchmarks/forward_performance.py | Updated import statements and variable names |
| benchmarks/forward_equivalence.py | Updated import statements and variable names |
| benchmarks/backward_performance.py | Updated import statements and variable names |
| benchmarks/backward_equivalence.py | Updated import statements and variable names |
| SECURITY.md | Updated project name references and URLs |
| README_zh.md | Updated installation instructions and reorganized content |
| README.md | Updated installation instructions and reorganized content |
| CONTRIBUTING.md | Updated repository URLs and references |
| CITATION.cff | Updated title and repository URL |
| .github/workflows/manual_publish.yml | Updated environment variable name |
| .github/workflows/_build.yml | Updated environment variable names |
| .github/ISSUE_TEMPLATE/feature_request.yml | Updated description abbreviation |
| .github/ISSUE_TEMPLATE/feature_request.md | Updated description and context |
| .github/ISSUE_TEMPLATE/bug_report.yml | Updated description abbreviation |
| .github/ISSUE_TEMPLATE/bug_report.md | Updated description and version label |
Diff under review (comment translated from Chinese):

  # Create bias for sparse attention
- attn_bias = torch.randn(batch_size, num_kv_heads, seq_len, seq_len, device=device, dtype=dtype)
+ attn_bias = torch.randn(batch_size, num_kv_heads, 1, seq_len, device=device, dtype=dtype)
Copilot AI commented on Nov 9, 2025
The attention bias shape has been changed from (batch_size, num_kv_heads, seq_len, seq_len) to (batch_size, num_kv_heads, 1, seq_len), but this dimension change is not documented anywhere in the PR. This appears to be a functional code change rather than just a documentation/naming update, which contradicts the stated purpose of this PR. This change could break existing code that depends on the original shape and needs proper documentation and migration guidance.
Suggested change:

- attn_bias = torch.randn(batch_size, num_kv_heads, 1, seq_len, device=device, dtype=dtype)
+ attn_bias = torch.randn(batch_size, num_kv_heads, seq_len, seq_len, device=device, dtype=dtype)
Co-authored-by: Copilot <[email protected]>
Pull Request Overview
Copilot reviewed 19 out of 337 changed files in this pull request and generated no new comments.