
feat[gpu]: dyn dispatch patches infrastructure#7431

Draft
0ax1 wants to merge 7 commits into develop from ad/cuda-patches-clean

Conversation

Contributor

@0ax1 0ax1 commented Apr 14, 2026

Adds patches infrastructure for dyn dispatch, without applying it to bitpacking and ALP yet. As part of that, the types used in dyn dispatch are trimmed down in size, to avoid reducing warp occupancy on the SMs by allocating more registers on the GPU.

0ax1 and others added 6 commits April 14, 2026 17:09
Support patched BitPacked arrays in both output and input stages of the
fused dynamic dispatch kernel. Previously, patched BitPacked in input
stages (dict values, runend ends/values) was demoted to a pending
subtree requiring a separate kernel launch.

Plan builder cleanup:
- Remove allow_bp_patches bool threaded through every walk method.
- Add push_pending helper (deduplicates unfusable subtree recording).
- Replace walk_mixed_width_child with walk_child.
- walk_dict and walk_runend use walk_child for all children.

Kernel:
- Add unpack_source_patches, apply_source_patches_chunk (output stage),
  and apply_source_patches_all (input stage) helpers.
- Per-stage patches_ptr on PackedStage (single u64, 0 = none).
- Patches packed into a single device buffer [lane_offsets|indices|values]
  during materialization via pack_patches_for_fused.

ALP patches are unchanged (still applied post-kernel via the existing
separate scatter in hybrid_dispatch).

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
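The single-buffer `[lane_offsets|indices|values]` packing described above can be sketched roughly as follows. This is a hypothetical host-side illustration: the section widths (u32), names, and the `unpack_patches` helper are assumptions for clarity, not the PR's actual `pack_patches_for_fused` implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Pack the three patch sections back to back into one contiguous buffer,
// so a stage only needs a single base pointer (patches_ptr, 0 = none).
std::vector<uint8_t> pack_patches(const std::vector<uint32_t>& lane_offsets,
                                  const std::vector<uint32_t>& indices,
                                  const std::vector<uint32_t>& values) {
    assert(indices.size() == values.size());
    std::vector<uint8_t> buf(
        (lane_offsets.size() + indices.size() + values.size()) * sizeof(uint32_t));
    uint8_t* p = buf.data();
    auto append = [&p](const std::vector<uint32_t>& v) {
        std::memcpy(p, v.data(), v.size() * sizeof(uint32_t));
        p += v.size() * sizeof(uint32_t);
    };
    append(lane_offsets); // per-lane start offsets into indices/values
    append(indices);      // element positions to patch
    append(values);       // replacement values for those positions
    return buf;
}

// Recover section pointers from the single base pointer plus known counts,
// as a device-side unpack helper might do.
struct PatchViews {
    const uint32_t* lane_offsets;
    const uint32_t* indices;
    const uint32_t* values;
};

PatchViews unpack_patches(const uint8_t* base, size_t num_lanes, size_t num_patches) {
    const uint32_t* p = reinterpret_cast<const uint32_t*>(base);
    return PatchViews{p, p + num_lanes, p + num_lanes + num_patches};
}
```

The one-allocation layout keeps the per-stage wire format down to a single `uint64_t` pointer and avoids three separate device allocations per patched stage.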
Narrow field types and reorder struct fields by decreasing alignment to
minimize padding in the GPU dispatch plan wire format:

- SourceOpCode/ScalarOpCode: int → uint8_t
- RunEndParams: num_runs/offset u64 → u32, smem offsets u32 → u16
- BitunpackParams: element_offset u32 → u16
- DictParams: values_smem_byte_offset u32 → u16
- PackedStage/Stage: smem_byte_offset u32 → u16
- Reorder all struct fields largest-first to eliminate padding

Resulting sizes (nvcc, naturally aligned, no packing):
  SourceOp:   32 → 24 bytes
  PackedStage: 56 → 48 bytes
  Stage:       80 → 56 bytes

Add bounds checks in plan_builder for the narrowed u16/u32 fields.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
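The effect of largest-first field ordering can be seen with a pair of toy structs (these are not the PR's actual `PackedStage`/`Stage` definitions; sizes assume natural alignment on a typical 64-bit ABI, as with nvcc and no packing pragmas):

```cpp
#include <cstdint>

// Fields in arbitrary order: the compiler inserts padding to satisfy
// each member's natural alignment.
struct Unordered {
    uint8_t  op_code;          // 1 byte, then 7 bytes padding before input_ptr
    uint64_t input_ptr;        // must be 8-byte aligned
    uint16_t smem_byte_offset; // 2 bytes
    uint32_t len;              // needs 4-byte alignment -> 2 bytes padding first
};                             // 24 bytes total

// Same fields, ordered by decreasing alignment: padding shrinks to the
// minimum tail needed to round the size up to the struct's alignment.
struct Ordered {
    uint64_t input_ptr;
    uint32_t len;
    uint16_t smem_byte_offset;
    uint8_t  op_code;          // 1 byte of tail padding only
};                             // 16 bytes total

static_assert(sizeof(Unordered) == 24, "interior padding inflates the struct");
static_assert(sizeof(Ordered) == 16, "largest-first minimizes padding");
```

The same reordering, combined with the `int` → `uint8_t` and `u32` → `u16` narrowing above, is what takes `Stage` from 80 down to 56 bytes.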
Add test_bitpacked_with_patches and test_for_bitpacked_with_patches
to verify that the fused dynamic dispatch kernel correctly applies
source patches (BitPacked exceptions) in both standalone and
FoR(BitPacked) configurations.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BitPacked with patches is not yet applied in fused dispatch.
Restore the patches().is_none() check and remove premature tests.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 force-pushed the ad/cuda-patches-clean branch from 9579b22 to 771d471 on April 14, 2026 17:55
@0ax1 0ax1 requested a review from a10y April 14, 2026 17:55
/// element type.
 struct DictParams {
-  uint32_t values_smem_byte_offset; // byte offset to decoded dict values in smem
+  uint16_t values_smem_byte_offset; // byte offset to decoded dict values in smem
Contributor Author
Current max shared memory with guard is 48KB.

 uint64_t input_ptr;   // global memory pointer to this stage's encoded input
+uint64_t patches_ptr; // device ptr to packed source patches (0 = none)
 struct SourceOp source;
 uint32_t len;         // number of elements this stage produces
Contributor Author

@0ax1 0ax1 Apr 14, 2026
Happy to expand this to u64 again, but for now let's assume, and guard, that we stay within u32. If we go past u32, we should also try micro/macro-benchmarking in that range.

let values_len: u32 = values
.len()
.try_into()
.map_err(|_| vortex_err!("Dict values length {} exceeds u32::MAX", values.len()))?;
Contributor Author
This doesn't fail decompression; it just falls back to the CPU.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 added the changelog/feature A new feature label Apr 14, 2026
@0ax1 0ax1 marked this pull request as draft April 14, 2026 18:40