feat[gpu]: dyn dispatch patches infrastructure #7431
Draft
Support patched BitPacked arrays in both output and input stages of the fused dynamic dispatch kernel. Previously, patched BitPacked in input stages (dict values, runend ends/values) was demoted to a pending subtree requiring a separate kernel launch.

Plan builder cleanup:
- Remove allow_bp_patches bool threaded through every walk method.
- Add push_pending helper (deduplicates unfusable subtree recording).
- Replace walk_mixed_width_child with walk_child.
- walk_dict and walk_runend use walk_child for all children.

Kernel:
- Add unpack_source_patches, apply_source_patches_chunk (output stage), and apply_source_patches_all (input stage) helpers.
- Per-stage patches_ptr on PackedStage (single u64, 0 = none).
- Patches packed into a single device buffer [lane_offsets|indices|values] during materialization via pack_patches_for_fused.

ALP patches are unchanged (still applied post-kernel via the existing separate scatter in hybrid_dispatch).

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
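To make the `[lane_offsets|indices|values]` layout concrete, here is a minimal host-side sketch in Rust. It is not the PR's `pack_patches_for_fused`: the `Patches` struct, the all-`u32` little-endian section widths, the `num_lanes + 1` offsets convention, and the function names `pack_patches` and `apply_lane_patches` are all assumptions made for illustration.

```rust
/// Hypothetical host-side view of BitPacked patches (not the PR's types).
struct Patches {
    /// `lane_offsets[l]..lane_offsets[l + 1]` selects lane `l`'s patches,
    /// so there are `num_lanes + 1` entries and the last one is the total.
    lane_offsets: Vec<u32>,
    /// Positions in the decoded output that must be overwritten.
    indices: Vec<u32>,
    /// Replacement values, in the same order as `indices`.
    values: Vec<u32>,
}

/// Pack the three sections back to back as little-endian u32 words:
/// [lane_offsets | indices | values]. The word width is an assumption.
fn pack_patches(p: &Patches) -> Vec<u8> {
    assert_eq!(p.indices.len(), p.values.len());
    p.lane_offsets
        .iter()
        .chain(&p.indices)
        .chain(&p.values)
        .flat_map(|w| w.to_le_bytes())
        .collect()
}

/// Read the `word`-th little-endian u32 from the packed buffer.
fn read_word(buf: &[u8], word: usize) -> u32 {
    let b = &buf[word * 4..word * 4 + 4];
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Overwrite the entries of `out` covered by one lane's patches.
/// Section bases are derived from `num_lanes` and the final lane offset.
fn apply_lane_patches(buf: &[u8], num_lanes: usize, lane: usize, out: &mut [u32]) {
    let num_patches = read_word(buf, num_lanes) as usize;
    let indices_base = num_lanes + 1;
    let values_base = indices_base + num_patches;
    let start = read_word(buf, lane) as usize;
    let end = read_word(buf, lane + 1) as usize;
    for i in start..end {
        let idx = read_word(buf, indices_base + i) as usize;
        out[idx] = read_word(buf, values_base + i);
    }
}
```

Keeping all three sections in one allocation means a stage only has to carry a single base pointer, which is consistent with the single `u64` `patches_ptr` (0 = none) described above.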
Narrow field types and reorder struct fields by decreasing alignment to minimize padding in the GPU dispatch plan wire format:
- SourceOpCode/ScalarOpCode: int → uint8_t
- RunEndParams: num_runs/offset u64 → u32, smem offsets u32 → u16
- BitunpackParams: element_offset u32 → u16
- DictParams: values_smem_byte_offset u32 → u16
- PackedStage/Stage: smem_byte_offset u32 → u16
- Reorder all struct fields largest-first to eliminate padding

Resulting sizes (nvcc, naturally aligned, no packing):
- SourceOp: 32 → 24 bytes
- PackedStage: 56 → 48 bytes
- Stage: 80 → 56 bytes

Add bounds checks in plan_builder for the narrowed u16/u32 fields.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
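The effect of largest-first ordering can be reproduced with a small Rust sketch using `#[repr(C)]`, which follows the same natural-alignment rules as the C structs. The two structs below are hypothetical and much smaller than the real `SourceOp`, `PackedStage`, and `Stage`; the field names are borrowed only for readability, and a 64-bit target (where `u64` is 8-byte aligned) is assumed.

```rust
use std::mem::size_of;

/// Hypothetical stage with fields in an arbitrary order.
/// Layout: op at 0, 7 bytes padding, input_ptr at 8, smem_byte_offset at 16,
/// 2 bytes padding, len at 20.
#[repr(C)]
struct StageUnordered {
    op: u8,
    input_ptr: u64,
    smem_byte_offset: u16,
    len: u32,
}

/// The same fields sorted by decreasing alignment.
/// Layout: input_ptr at 0, len at 8, smem_byte_offset at 12, op at 14,
/// 1 byte of tail padding to keep the struct 8-byte aligned.
#[repr(C)]
struct StageOrdered {
    input_ptr: u64,
    len: u32,
    smem_byte_offset: u16,
    op: u8,
}

/// Returns (unordered size, ordered size) in bytes.
fn layout_sizes() -> (usize, usize) {
    (size_of::<StageUnordered>(), size_of::<StageOrdered>())
}
```

On a 64-bit target `layout_sizes()` returns `(24, 16)`: the field contents total 15 bytes in both cases, and reordering removes 8 of the 9 padding bytes.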
Add test_bitpacked_with_patches and test_for_bitpacked_with_patches to verify that the fused dynamic dispatch kernel correctly applies source patches (BitPacked exceptions) in both standalone and FoR(BitPacked) configurations. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BitPacked with patches is not yet applied in fused dispatch. Restore the patches().is_none() check and remove premature tests. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9579b22 to
771d471
0ax1
commented
Apr 14, 2026
/// element type.
struct DictParams {
-   uint32_t values_smem_byte_offset; // byte offset to decoded dict values in smem
+   uint16_t values_smem_byte_offset; // byte offset to decoded dict values in smem
The current maximum shared memory, with the guard, is 48 KB, so a u16 byte offset (max 65,535) is sufficient.
0ax1
commented
Apr 14, 2026
    uint64_t input_ptr;   // global memory pointer to this stage's encoded input
    uint64_t patches_ptr; // device ptr to packed source patches (0 = none)
    struct SourceOp source;
    uint32_t len;         // number of elements this stage produces
Happy to expand this again to u64, but for now let's assume, and guard, that we stay within u32. If we go past u32, we should also try micro/macro-benchmarking in that range.
0ax1
commented
Apr 14, 2026
let values_len: u32 = values
    .len()
    .try_into()
    .map_err(|_| vortex_err!("Dict values length {} exceeds u32::MAX", values.len()))?;
This doesn't fail decompression but leads to falling back to the CPU.
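The error-then-fallback behavior can be sketched as follows. This is an illustration only: `DictPlan`, `build_dict_plan`, `Backend`, and `choose_backend` are hypothetical names, plain `String` errors stand in for `vortex_err!`, and how the real caller selects the CPU path is not shown in this PR excerpt. A 64-bit `usize` is assumed.

```rust
/// Hypothetical result of planning: which path will decode the array.
#[derive(Debug, PartialEq)]
enum Backend {
    Gpu,
    Cpu,
}

/// Hypothetical plan entry with the narrowed field widths.
struct DictPlan {
    values_len: u32,
    values_smem_byte_offset: u16,
}

/// Narrow host-side `usize` values into the plan's fixed-width fields.
/// Any value that does not fit yields an error instead of truncating.
fn build_dict_plan(values_len: usize, smem_byte_offset: usize) -> Result<DictPlan, String> {
    let len = u32::try_from(values_len)
        .map_err(|_| format!("Dict values length {} exceeds u32::MAX", values_len))?;
    let offset = u16::try_from(smem_byte_offset)
        .map_err(|_| format!("smem byte offset {} exceeds u16::MAX", smem_byte_offset))?;
    Ok(DictPlan {
        values_len: len,
        values_smem_byte_offset: offset,
    })
}

/// A planning error is not a decompression error: the caller can simply
/// decode on the CPU when no GPU plan could be built.
fn choose_backend(values_len: usize, smem_byte_offset: usize) -> Backend {
    match build_dict_plan(values_len, smem_byte_offset) {
        Ok(_) => Backend::Gpu,
        Err(_) => Backend::Cpu,
    }
}
```

Using `try_from` rather than an `as` cast matters here: an `as` cast would silently wrap an oversized length or offset, whereas the checked conversion turns it into a recoverable planning error.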
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Adds the patches infrastructure for dynamic dispatch, without yet applying patches to BitPacked or ALP arrays. As part of that, the types used in dynamic dispatch are trimmed down in size, so that the added state does not increase register allocation on the GPU and thereby reduce warp occupancy on the SMs.