
feat[gpu]: dyn dispatch patches infrastructure#7431

Draft
0ax1 wants to merge 7 commits into develop from ad/cuda-patches-clean

Conversation

Contributor

@0ax1 0ax1 commented Apr 14, 2026

Adds patches infrastructure for dyn dispatch, without applying it to bitpacking and ALP yet. As part of that, the types used in dyn dispatch are trimmed down in size, to avoid reducing warp occupancy on the SMs by allocating more registers on the GPU.

0ax1 and others added 6 commits April 14, 2026 17:09
Support patched BitPacked arrays in both output and input stages of the
fused dynamic dispatch kernel. Previously, patched BitPacked in input
stages (dict values, runend ends/values) was demoted to a pending
subtree requiring a separate kernel launch.

Plan builder cleanup:
- Remove allow_bp_patches bool threaded through every walk method.
- Add push_pending helper (deduplicates unfusable subtree recording).
- Replace walk_mixed_width_child with walk_child.
- walk_dict and walk_runend use walk_child for all children.

Kernel:
- Add unpack_source_patches, apply_source_patches_chunk (output stage),
  and apply_source_patches_all (input stage) helpers.
- Per-stage patches_ptr on PackedStage (single u64, 0 = none).
- Patches packed into a single device buffer [lane_offsets|indices|values]
  during materialization via pack_patches_for_fused.

ALP patches are unchanged (still applied post-kernel via the existing
separate scatter in hybrid_dispatch).

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
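The single-buffer `[lane_offsets|indices|values]` packing described above can be sketched roughly as follows. This is a hypothetical host-side illustration: the section widths (u32), names, and the `unpack_patches` helper are assumptions for clarity, not the PR's actual `pack_patches_for_fused` implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Pack the three patch sections back to back into one contiguous buffer,
// so a stage only needs a single base pointer (patches_ptr, 0 = none).
std::vector<uint8_t> pack_patches(const std::vector<uint32_t>& lane_offsets,
                                  const std::vector<uint32_t>& indices,
                                  const std::vector<uint32_t>& values) {
    assert(indices.size() == values.size());
    std::vector<uint8_t> buf(
        (lane_offsets.size() + indices.size() + values.size()) * sizeof(uint32_t));
    uint8_t* p = buf.data();
    auto append = [&p](const std::vector<uint32_t>& v) {
        std::memcpy(p, v.data(), v.size() * sizeof(uint32_t));
        p += v.size() * sizeof(uint32_t);
    };
    append(lane_offsets); // per-lane start offsets into indices/values
    append(indices);      // element positions to patch
    append(values);       // replacement values for those positions
    return buf;
}

// Recover section pointers from the single base pointer plus known counts,
// as a device-side unpack helper might do.
struct PatchViews {
    const uint32_t* lane_offsets;
    const uint32_t* indices;
    const uint32_t* values;
};

PatchViews unpack_patches(const uint8_t* base, size_t num_lanes, size_t num_patches) {
    const uint32_t* p = reinterpret_cast<const uint32_t*>(base);
    return PatchViews{p, p + num_lanes, p + num_lanes + num_patches};
}
```

The one-allocation layout keeps the per-stage wire format down to a single `uint64_t` pointer and avoids three separate device allocations per patched stage.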
Narrow field types and reorder struct fields by decreasing alignment to
minimize padding in the GPU dispatch plan wire format:

- SourceOpCode/ScalarOpCode: int → uint8_t
- RunEndParams: num_runs/offset u64 → u32, smem offsets u32 → u16
- BitunpackParams: element_offset u32 → u16
- DictParams: values_smem_byte_offset u32 → u16
- PackedStage/Stage: smem_byte_offset u32 → u16
- Reorder all struct fields largest-first to eliminate padding

Resulting sizes (nvcc, naturally aligned, no packing):
  SourceOp:   32 → 24 bytes
  PackedStage: 56 → 48 bytes
  Stage:       80 → 56 bytes

Add bounds checks in plan_builder for the narrowed u16/u32 fields.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
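The effect of largest-first field ordering can be seen with a pair of toy structs (these are not the PR's actual `PackedStage`/`Stage` definitions; sizes assume natural alignment on a typical 64-bit ABI, as with nvcc and no packing pragmas):

```cpp
#include <cstdint>

// Fields in arbitrary order: the compiler inserts padding to satisfy
// each member's natural alignment.
struct Unordered {
    uint8_t  op_code;          // 1 byte, then 7 bytes padding before input_ptr
    uint64_t input_ptr;        // must be 8-byte aligned
    uint16_t smem_byte_offset; // 2 bytes
    uint32_t len;              // needs 4-byte alignment -> 2 bytes padding first
};                             // 24 bytes total

// Same fields, ordered by decreasing alignment: padding shrinks to the
// minimum tail needed to round the size up to the struct's alignment.
struct Ordered {
    uint64_t input_ptr;
    uint32_t len;
    uint16_t smem_byte_offset;
    uint8_t  op_code;          // 1 byte of tail padding only
};                             // 16 bytes total

static_assert(sizeof(Unordered) == 24, "interior padding inflates the struct");
static_assert(sizeof(Ordered) == 16, "largest-first minimizes padding");
```

The same reordering, combined with the `int` → `uint8_t` and `u32` → `u16` narrowing above, is what takes `Stage` from 80 down to 56 bytes.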
Add test_bitpacked_with_patches and test_for_bitpacked_with_patches
to verify that the fused dynamic dispatch kernel correctly applies
source patches (BitPacked exceptions) in both standalone and
FoR(BitPacked) configurations.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BitPacked with patches is not yet applied in fused dispatch.
Restore the patches().is_none() check and remove premature tests.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 force-pushed the ad/cuda-patches-clean branch from 9579b22 to 771d471 on April 14, 2026 17:55
@0ax1 0ax1 requested a review from a10y April 14, 2026 17:55
/// element type.
 struct DictParams {
-  uint32_t values_smem_byte_offset; // byte offset to decoded dict values in smem
+  uint16_t values_smem_byte_offset; // byte offset to decoded dict values in smem
Contributor Author
Current max shared memory with guard is 48KB.

 uint64_t input_ptr;   // global memory pointer to this stage's encoded input
+uint64_t patches_ptr; // device ptr to packed source patches (0 = none)
 struct SourceOp source;
 uint32_t len;         // number of elements this stage produces
Contributor Author

@0ax1 0ax1 Apr 14, 2026
Happy to expand this to u64 again, but for now let's assume, and guard, that we stay within u32. If we go past u32, we should also try micro/macro-benchmarking in that range.

let values_len: u32 = values
.len()
.try_into()
.map_err(|_| vortex_err!("Dict values length {} exceeds u32::MAX", values.len()))?;
Contributor Author
This doesn't fail decompression; it just falls back to the CPU.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 added the changelog/feature A new feature label Apr 14, 2026
@0ax1 0ax1 marked this pull request as draft April 14, 2026 18:40