
Wire calculatePartialSums to native SIMD via Panama FFI downcall #651

Open
r-devulap wants to merge 2 commits into main from use-native-calcpartialsum

r-devulap commented Apr 2, 2026

This change uses a native implementation of calculatePartialSums to accelerate PQ query scoring.
On ada002-100k with FUSED_PQ (numPQsubspaces/M = 96, JDK build 23.0.1+11-39), it delivers 1.3–2.9× higher QPS and 28–63% lower mean latency across the overquery settings measured below; at topK = 10 the gains are 1.7–2.9× QPS and 41–63% lower latency. Index build time, disk usage, and heap usage show no meaningful regression. The optimization is isolated to the PQ path; non-PQ queries are unaffected.
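
For readers unfamiliar with the PQ path, here is a scalar reference sketch of what a partial-sums table usually holds in product quantization: for one subspace, the similarity between the query's sub-vector and every centroid in that subspace's codebook. It assumes dot-product similarity and a row-major codebook; the function names, parameters, and memory layout are hypothetical and are not taken from this PR's code.

```c
#include <stddef.h>

/* Plain scalar dot product over n floats. */
static float dot_f32_ref(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Fill one subspace's slice of the partial-sums table.
 * codebook:  cluster_count centroids of sub_dim floats each, row-major
 *            (layout is an assumption made for this sketch).
 * query_sub: the query's sub-vector for this subspace (sub_dim floats).
 * out:       receives cluster_count partial sums. */
static void partial_sums_ref(const float *codebook, size_t cluster_count,
                             size_t sub_dim, const float *query_sub,
                             float *out) {
    for (size_t c = 0; c < cluster_count; c++) {
        out[c] = dot_f32_ref(codebook + c * sub_dim, query_sub, sub_dim);
    }
}

/* Score one PQ-encoded vector: one table lookup per subspace.
 * table is subspaces * cluster_count floats, one slice per subspace. */
static float score_ref(const float *table, size_t subspaces,
                       size_t cluster_count, const unsigned char *codes) {
    float score = 0.0f;
    for (size_t m = 0; m < subspaces; m++) {
        score += table[m * cluster_count + codes[m]];
    }
    return score;
}
```

The table is built once per query and then reused for every candidate vector, which is why speeding up its construction can show up directly in QPS.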

Combined QPS and Latency Results (FUSED_PQ)

topK = 10

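| Overquery | QPS (main) | QPS (native) | Speedup | Latency (main, ms) | Latency (native, ms) | Latency change |
|---|---|---|---|---|---|---|
|  | 8,987 | 26,101 | 2.9× | 0.462 | 0.171 | −63% |
|  | 8,361 | 21,706 | 2.6× | 0.505 | 0.202 | −60% |
|  | 7,199 | 14,364 | 2.0× | 0.590 | 0.292 | −51% |
| 10× | 5,796 | 9,640 | 1.7× | 0.731 | 0.431 | −41% |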

topK = 100

| Overquery | QPS (main) | QPS (native) | Speedup | Latency (main, ms) | Latency (native, ms) | Latency change |
|---|---|---|---|---|---|---|
|  | 5,743 | 9,568 | 1.7× | 0.736 | 0.439 | −40% |
|  | 4,174 | 5,574 | 1.3× | 1.022 | 0.733 | −28% |

Summary of changes in this PR:

  • Wire calculatePartialSums in NativeVectorUtilSupport to a new Panama FFI downcall for the native calculate_partial_sums_f32_512 SIMD implementation.
  • Replace the icelake-server GCC target with skylake-avx512 in the build script (icelake-server isn't required to build our native code)
  • Remove global mutable state: eliminate initialIndexRegister, indexIncrement, maskSeventhBit, maskEighthBit globals and their constructor initializer; move mask constants (maskSeventhBit, maskEighthBit) to local scope inside lookup_partial_sums
  • Add shared reduce_add_128_ps and reduce_add_256_ps helper functions using proper horizontal-add sequences instead of store-to-array loops
  • Remove redundant if (length >= N) guards in all SIMD kernels — the loop body already handles the zero-iteration case correctly
  • Replace store-to-aligned-array horizontal reduction pattern with the new helpers across all 128- and 256-bit dot product and euclidean distance functions
  • Remove preferred_size parameter from dot_product_f32 and euclidean_f32; always dispatch to AVX-512 when length >= 16
  • Standardize inline annotations: replace __attribute__((always_inline)) inline with JV_FINLINE / JV_INLINE macros throughout
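
Two of the items above, the shuffle-based horizontal reduction and the removal of the length guards, can be illustrated with a minimal 128-bit sketch. This is not the PR's code: the `_sketch` names are hypothetical, only baseline SSE intrinsics are used, and a scalar fallback is included so the snippet also compiles on non-x86 targets.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h> /* baseline SSE intrinsics */

/* Horizontal sum of four packed floats using two shuffle+add steps,
 * instead of storing the register to an array and looping over it. */
static inline float reduce_add_128_ps_sketch(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);   /* [v2, v3, v2, v3] */
    __m128 sum2 = _mm_add_ps(v, hi);   /* lane0 = v0+v2, lane1 = v1+v3 */
    __m128 odd = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd)); /* (v0+v2) + (v1+v3) */
}

float dot_product_sketch(const float *a, const float *b, size_t length) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    /* No `if (length >= 4)` guard: when length < 4 the condition fails on
     * the first test, the body never runs, and acc stays zero. Writing the
     * bound as `i + 4 <= length` also avoids unsigned underflow. */
    for (; i + 4 <= length; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float result = reduce_add_128_ps_sketch(acc);
    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < length; i++) {
        result += a[i] * b[i];
    }
    return result;
}
#else
/* Scalar fallback so the sketch builds on targets without SSE. */
float dot_product_sketch(const float *a, const float *b, size_t length) {
    float result = 0.0f;
    for (size_t i = 0; i < length; i++) {
        result += a[i] * b[i];
    }
    return result;
}
#endif
```

A 256-bit reduction can follow the same idea by adding the upper and lower 128-bit halves first and then reusing the 128-bit helper; whether the PR's reduce_add_256_ps is written that way is not shown in this description.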

Raghuveer Devulapalli added 2 commits April 2, 2026 03:25
* Replace icelake-server gcc target with skylake-avx512 in build script

* Remove global mutable state: eliminate initialIndexRegister,
  indexIncrement, maskSeventhBit, maskEighthBit globals and their
  constructor initializer; move mask constants (maskSeventhBit,
  maskEighthBit) to local scope inside lookup_partial_sums

* Add shared reduce_add_128_ps and reduce_add_256_ps helper functions
  using proper horizontal-add sequences instead of store-to-array loops

* Remove redundant if (length >= N) guards in all SIMD kernels — the
  loop body already handles the zero-iteration case correctly

* Replace store-to-aligned-array horizontal reduction pattern with the
  new helpers across all 128- and 256-bit dot product and euclidean
  distance functions

* Remove preferred_size parameter from dot_product_f32 and
  euclidean_f32; always dispatch to AVX-512 when length >= 16

* Standardize inline annotations: replace __attribute__((always_inline))
  inline with JV_FINLINE / JV_INLINE macros throughout

github-actions bot commented Apr 2, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.
