Skip to content

Multi partition cagra search#2035

Draft
jamxia155 wants to merge 32 commits into
NVIDIA:mainfrom
jamxia155:multi-segment-cagra-search
Draft

Multi partition cagra search#2035
jamxia155 wants to merge 32 commits into
NVIDIA:mainfrom
jamxia155:multi-segment-cagra-search

Conversation

@jamxia155

@jamxia155 jamxia155 commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

This PR adds a CAGRA search API and associated implementation for batching workloads involving a single query and multiple indexes (e.g. index segments built from one dataset by cuvs-lucene). This promotes GPU utilization for non-batched queries by parallelizing work over multiple CTAs, while minimizing the number of concurrent kernels for potential host-side parallelism.

Addresses cuvs-lucene issue 124, issue 158, and cuvs issue 2166.

Associated cuvs-lucene PR: rapidsai/cuvs-lucene#133.

jamxia155 and others added 8 commits April 22, 2026 12:58
  Introduces a globally-optimized path for searching across multiple CAGRA
  index segments without per-segment device-to-host copies.

  - c/include/cuvs/selection/select_k.h, c/src/selection/select_k.cpp:
    new C API function cuvsSelectK that wraps cuvs::selection::select_k
    for selecting the k smallest float values from a flat device array

  - BufferedCagraSearch (internal interface): searchIntoBuffer() writes
    per-segment CAGRA results into a slice of a caller-owned device buffer
    without syncing the stream or copying to host

  - CagraIndexImpl: implements BufferedCagraSearch; searchIntoBuffer()
    computes byte offsets into the global buffer using segmentIdx * topK

  - SelectKHelper: Panama FFI binding for cuvsSelectK

  - MultiSegmentCagraSearch: orchestrates the full pipeline — queue all
    per-segment searches, sync once, run cuvsSelectK on GPU, sync again,
    single D2H copy, decode results

  - MultiSegmentSearchResults: simple result carrier with count,
    segmentIndices, ordinals, and distances arrays
  Implements concurrent per-segment GPU search that eliminates per-segment
  device-to-host copies and CPU blocking on workspace deallocation.
  Key changes:
  - CudaStreamPool: fixed-size pool of non-blocking CUDA streams (one
    cuvsResources_t per slot). Segments are assigned to slots via
    round-robin so searches on different slots run concurrently. Pool size
    defaults to 8 and is overridden via the system property
    com.nvidia.cuvs.streamPoolSize.
  - MultiSegmentCagraSearch: single-query search across N index segments
    using the stream pool. All per-segment CAGRA kernels write into a
    shared device buffer (no per-segment D2H copy or stream sync). A
    single cuvsSelectK call finds the global top-k entirely on GPU, then
    one D2H copy transfers the results.
  - BufferedCagraSearch / CagraIndexImpl: new searchIntoBuffer() method
    that queues a CAGRA search kernel on a caller-supplied stream and
    writes results at a given row offset into a pre-allocated device
    buffer.
  - cuvsRMMAsyncMemoryResourceEnable() (C API + Java bindings): switches
    the current device memory resource to cuda_async_memory_resource so
    that workspace deallocations issued by CAGRA's search plan destructor
    are stream-ordered and non-blocking. Without this, cudaFree serializes
    kernel launches across streams regardless of stream assignment,
    nullifying the stream pool benefit.
  CudaStreamPool was a static singleton shared across all threads. With
  multiple concurrent query threads, calls to MultiSegmentCagraSearch.search()
  would alias onto the same pool slots after Math.floorMod, causing concurrent
  cudaEventRecord and cudaStreamWaitEvent calls on the same event handle
  (undefined behavior) and cross-thread stream interference. Additionally,
  CudaStreamPool.closeInstance() was called from CuVSResourcesImpl.close(),
  so whichever thread closed its resources first would destroy the shared pool
  while other threads were still using it.

  Fix: make CudaStreamPool a per-CuVSResources instance rather than a static
  singleton. One pool is created and owned by each CuVSResourcesImpl; it is
  closed when that instance is closed. Since CuVSResources is thread-local in
  the Lucene integration, each query thread gets its own independent set of
  streams and events with no sharing or locking required.

  - CudaStreamPool: remove static singleton (getOrCreate, closeInstance, static
    volatile instance, static AtomicInteger slotCounter); add package-private
    constructor; replace static slotCounter with an instance int and a new
    nextSlot(int count) method.

  - CuVSResourcesImpl: add final CudaStreamPool streamPool field (sized from
    com.nvidia.cuvs.streamPoolSize system property); close it directly in
    close(); add static getStreamPool(CuVSResources) helper for
    MultiSegmentCagraSearch to retrieve the per-resources pool.

  - MultiSegmentCagraSearch: get pool via CuVSResourcesImpl.getStreamPool
    and advance via pool.nextSlot. Remove redundant cuvsStreamSync after
    cuvsSelectK — the D2H copies are enqueued on the same stream so CUDA
    ordering already serializes them. Replace three separate hostArena.allocate
    calls with one contiguous allocation (Long.BYTES-aligned) sliced into three
    typed views, reducing OS-level allocation overhead per query.
The persistent kernel runner was previously keyed on (dataset_desc,
graph, fixed search params), which forced a destroy/recreate cycle for
every segment when searching a multi-segment Lucene index: each segment
has a different graph pointer and potentially a different auto-computed
max_iterations, producing a different hash on every call.

C++ changes (search_single_cta_kernel-inl.cuh):
- Move dataset_desc_ptr, graph_ptr, and graph_degree from fixed runner
  state into per-job fields in job_desc_t. The persistent kernel reads
  them from the job descriptor, so one runner instance can serve any
  number of segments without being rebuilt.
- Remove dataset_desc and graph arguments from persistent_runner_t
  constructor and calculate_parameter_hash; the runner is now keyed
  only on fixed kernel parameters (block_size, smem_size, itopk, etc.).
- Update select_and_run to initialize the device descriptor on the
  caller's stream and synchronize before submission, then pass
  dd_dev_ptr, graph.data_handle(), and graph_degree to runner::launch.
- Remove dd_host from persistent_runner_t; dataset upload is now the
  caller's responsibility on each launch.

Java changes:
- Add persistent, persistentLifetime, and persistentDeviceUsage fields,
  getters, and Builder methods to CagraSearchParams.
- Wire the three persistent params through CuVSParamsHelper into the
  Panama-generated cuvsCagraSearchParams struct.
…raSearch

In persistent mode, `searchIntoBuffer` blocks on the CPU until the GPU
signals completion via a system-scope atomic. Previously, segments were
searched sequentially, so the GPU processed one segment at a time per
query, leaving its job queue mostly idle between segment dispatches.

Submit one async task per pool slot so all slots' segment searches are
in-flight simultaneously. The persistent runner's job queue can hold
all N segment jobs at once, allowing GPU workers to execute segments in
parallel (bounded by worker_queue_size).

Segments are grouped by pool slot rather than submitted one-per-segment
to prevent concurrent access to the same cuvsResources_t handle: the
descriptor_cache stored inside the RAFT resources object is not
thread-safe, and multiple threads calling cuvsCagraSearch with the same
handle causes a SIGSEGV. Grouping ensures each cuvsResources_t is
accessed by at most one thread at a time. Effective parallelism is
min(numSegments, pool.size()); increasing cuvsStreamPoolSize raises
the ceiling.

In non-persistent mode the existing sequential loop is unchanged:
kernel launches are asynchronous and return immediately, so Java-level
parallelism adds overhead without benefit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Multi-segment search (C++/C/Java):
- Add search_multi_segment() C++ function: builds a single search plan
  sized for the largest segment, packs per-segment descriptors, and
  launches search_kernel_ms — a new SINGLE_CTA kernel with grid
  dimensions (1, num_queries, num_segments) so each CTA independently
  searches one (query, segment) pair in a single kernel call.
- Add cuvsCagraSearchMultiSegment() C API wrapping the above.
- Simplify MultiSegmentCagraSearch.java to call
  cuvsCagraSearchMultiSegment() unconditionally, removing the previous
  per-segment stream pool, thread pool, CUDA event synchronization, and
  persistent/non-persistent branching. The search now completes in four
  phases: multi-segment kernel, GPU-side select-k, single D2H copy,
  result decoding.
- Add BufferedCagraSearch.getIndexHandle() to expose the raw
  cuvsCagraIndex_t handle needed by the multi-segment kernel dispatch.

Workspace pool (C/Java):
- Add cuvsResourcesSetWorkspacePool(): configures the per-resources
  temporary workspace as an uncapped RMM pool that grows without
  shrinking. After warmup, cuvsRMMAlloc/cuvsRMMFree hit the pool cache
  instead of calling cudaMallocAsync/cudaFreeAsync, eliminating CUDA
  context lock contention under concurrent query threads.
- Route cuvsRMMAlloc/cuvsRMMFree through the workspace resource so Java-
  side output buffer allocations also benefit from the pool.
- Expose setWorkspacePool() in the CuVSResources Java interface with
  implementations in CuVSResourcesImpl and SynchronizedCuVSResources.

Refactoring (search_single_cta_kernel-inl.cuh):
- Extract TopkVariant enum and select_topk_variant() helper, shared by
  search_kernel_config and search_kernel_config_ms, replacing duplicated
  if/else trees in both choose_itopk_and_mx_candidates() bodies.
- Extract kernel_dispatch_params::compute() to centralize max_candidates
  and max_itopk computation shared by select_and_run and
  select_and_run_multi_segment.
- Extract hashmap_element_count() static helper on the search struct,
  used by both set_params (single-segment) and run_multi_segment.
Add HALF(2) to the DataType enum and wire it through the Java layer so
callers can build and exchange float16 matrices without any Java-side
type conversion:

- CuVSMatrix.Builder: new addVector(short[]) overload; each element is
  a raw IEEE 754 binary16 bit pattern held in a short
- LinkerHelper: add C_SHORT via canonicalLayouts("short")
- CuVSMatrixBaseImpl: map HALF → C_SHORT in valueLayoutFromType();
  decode DLPack (kDLFloat, bits=16) → DataType.HALF in
  dataTypeFromTensor()
- CuVSMatrixInternal: HALF maps to the same kDLFloat DLPack type code
  as FLOAT; the bits field (16 vs 32) distinguishes them on the C side
- JDKProvider.MatrixBuilder: implement addVector(short[]) following the
  same MemorySegment.ofArray pattern as the other overloads
@jamxia155 jamxia155 requested review from a team as code owners April 22, 2026 23:02
@jamxia155 jamxia155 marked this pull request as draft April 22, 2026 23:15
@copy-pr-bot

copy-pr-bot Bot commented Apr 22, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

This pull request introduces multi-segment CAGRA search functionality, GPU-based selection, and RMM memory pool management. It adds C and C++ APIs for multi-segment approximate nearest neighbor search, async memory allocation, and select-K operations, alongside corresponding Java bindings and internal implementations supporting concurrent per-segment GPU processing and stream-based result aggregation.

Changes

Cohort / File(s) Summary
Build Configuration
c/CMakeLists.txt
Added src/selection/select_k.cpp to the C API library source list.
C API: RMM Memory Management
c/include/cuvs/core/c_api.h, c/src/core/c_api.cpp
Introduced cuvsResourcesSetWorkspacePool to configure per-resources uncapped device memory pools and cuvsRMMAsyncMemoryResourceEnable to switch RMM to stream-ordered async allocation. Extended cuvsRMMMemoryResourceReset to clear thread-local async resources.
C API: Multi-Segment CAGRA
c/include/cuvs/neighbors/cagra.h, c/src/neighbors/cagra.cpp
Added cuvsCagraSearchMultiSegment to perform concurrent ANN searches across multiple CAGRA index segments, accepting per-segment index pointers and query/result tensors.
C API: Select-K
c/include/cuvs/selection/select_k.h, c/src/selection/select_k.cpp
Introduced cuvsSelectK to select k smallest values from GPU tensors, returning values and column indices.
C++ API: Multi-Segment CAGRA
cpp/include/cuvs/neighbors/cagra.hpp, cpp/src/neighbors/cagra.cuh, cpp/src/neighbors/cagra_search_inst.cu.in
Added public search_multi_segment function overloads for float, half, int8_t, uint8_t index types with configurable output neighbor types (uint32_t or int64_t). Included explicit template instantiations for multi-segment variants.
C++ Implementation: Multi-Segment Search Logic
cpp/src/neighbors/detail/cagra/cagra_search.cuh
Implemented multi-segment CAGRA search orchestration, handling dataset validation, kernel configuration, per-segment device descriptor management, and distance postprocessing (cosine-expanded normalization).
C++ Implementation: Multi-Segment Kernel Infrastructure
cpp/src/neighbors/detail/cagra/search_single_cta_kernel.cuh, cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh, cpp/src/neighbors/detail/cagra/search_single_cta_inst.cuh
Added multi-segment kernel execution path with per-segment descriptor structure, concurrent CTA scheduling via blockIdx.z, hashmap sizing across segments, persistent runner refactoring to use job-specific descriptors, and select_and_run_multi_segment launcher.
C++ Implementation: Multi-Segment Search Helpers
cpp/src/neighbors/detail/cagra/search_single_cta.cuh
Introduced run_multi_segment and hashmap_element_count helpers to orchestrate concurrent segment processing and allocate global hashmap buffers.
Java: Minor Updates
java/cuvs-java/src/main/java/com/nvidia/cuvs/{CagraIndex,HnswIndex}.java, java/cuvs-java/src/main/java/com/nvidia/cuvs/{CuVSAceParams,HnswAceParams,HnswIndexParams}.java
Updated SPDX copyright headers and reformatted method signatures for consistency.
Java: Core API Extensions
java/cuvs-java/src/main/java/com/nvidia/cuvs/{CagraSearchParams,CuVSMatrix,CuVSResources,SynchronizedCuVSResources}.java
Added persistent kernel configuration options to CagraSearchParams, HALF data type support to CuVSMatrix, and setWorkspacePool method to CuVSResources interface and implementations.
Java: SPI
java/cuvs-java/src/main/java/com/nvidia/cuvs/spi/{CuVSProvider,UnsupportedProvider}.java
Added enableRMMAsyncMemory() method to provider interface and stub implementation.
Java: Multi-Segment Search API
java/cuvs-java/src/main/java22/com/nvidia/cuvs/{MultiSegmentSearchResults,MultiSegmentCagraSearch}.java
Introduced result class to represent decoded multi-segment search outputs and public API for multi-segment search with global top-k selection via GPU-resident select-K.
Java: Buffered Search Interface
java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/{BufferedCagraSearch,CagraIndexImpl}.java
Defined interface for GPU-side buffered search into caller-provided device buffers and implemented support in CagraIndexImpl for per-segment search without host synchronization.
Java: Internal Helpers & Bindings
java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/{SelectKHelper,CuVSParamsHelper,CudaStreamPool,LinkerHelper}.java
Added native FFM bindings for select-K, CAGRA search parameters builder, fixed-size CUDA stream/resource pool with round-robin slot allocation, and C_SHORT layout constant.
Java: Internal Matrix & Resources
java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/{CuVSMatrixBaseImpl,CuVSMatrixInternal,CuVSResourcesImpl,HnswIndexImpl}.java
Added half-precision float tensor handling, integrated stream pool ownership in resources, workspace pool implementation via native API, and formatting adjustments.
Java: Provider Implementation
java/cuvs-java/src/main/java22/com/nvidia/cuvs/spi/JDKProvider.java
Implemented enableRMMAsyncMemory() and added short[] vector support for matrix building.
Java: Test Utilities
java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java
Implemented setWorkspacePool delegation in test wrapper.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.39% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ⚠️ Warning The PR title 'Multi partition cagra search' does not accurately represent the main change. The changeset implements multi-segment (not multi-partition) CAGRA search across the full stack (C, C++, Java), with new APIs (cuvsSelectK, BufferedCagraSearch), memory management features, and kernel optimizations. Update the title to 'Add multi-segment CAGRA search with select-k and async memory support' or similar to accurately reflect the scope and main components of the changeset.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is directly related to the changeset, explaining the core feature (CAGRA multi-segment search API) and its purpose (batching workloads for single query, multiple indexes).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java (1)

293-322: ⚠️ Potential issue | 🟡 Minor

The toString() method does not include the new persistent fields.

The three new fields (persistent, persistentLifetime, persistentDeviceUsage) are not included in the toString() output. While not critical, this creates an inconsistency where debugging output won't show the full configuration.

🔧 Proposed fix to include persistent fields in toString()
         + ", randXORMask="
         + randXORMask
+        + ", persistent="
+        + persistent
+        + ", persistentLifetime="
+        + persistentLifetime
+        + ", persistentDeviceUsage="
+        + persistentDeviceUsage
         + "]";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java` around
lines 293 - 322, The toString() in class CagraSearchParams currently omits the
new fields persistent, persistentLifetime, and persistentDeviceUsage; update the
CagraSearchParams.toString() method to append these three fields (with their
names and values) into the returned string (same formatting style as the other
fields) so debugging output shows the full configuration.
🧹 Nitpick comments (3)
java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java (1)

59-62: Add destroyed-state guard in the new delegate method.

setWorkspacePool(...) should call checkNotDestroyed() before delegating, consistent with this wrapper’s defensive contract.

Proposed change
   `@Override`
   public void setWorkspacePool(long sizeBytes) {
+    checkNotDestroyed();
     inner.setWorkspacePool(sizeBytes);
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java`
around lines 59 - 62, The new delegate method setWorkspacePool(long sizeBytes)
is missing the wrapper's destroyed-state guard; modify
CheckedCuVSResources.setWorkspacePool to call checkNotDestroyed() at the start
(before delegating to inner.setWorkspacePool(sizeBytes)) so it matches the
class's defensive contract and uses the same destroyed-state check as other
methods.
java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java (1)

527-537: Consider adding validation for persistentDeviceUsage bounds.

The Javadoc states the value "must be greater than 0.0 and not greater than 1.0", but the setter doesn't validate this constraint. Invalid values would only fail at the native layer.

🛡️ Optional: Add validation in the builder
     public Builder withPersistentDeviceUsage(float persistentDeviceUsage) {
+      if (persistentDeviceUsage <= 0.0f || persistentDeviceUsage > 1.0f) {
+        throw new IllegalArgumentException(
+            "persistentDeviceUsage must be > 0.0 and <= 1.0, got: " + persistentDeviceUsage);
+      }
       this.persistentDeviceUsage = persistentDeviceUsage;
       return this;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java` around
lines 527 - 537, The builder setter withPersistentDeviceUsage currently assigns
persistentDeviceUsage without enforcing the Javadoc constraint; update
Builder.withPersistentDeviceUsage to validate that persistentDeviceUsage > 0.0f
and <= 1.0f and throw an IllegalArgumentException (including the invalid value
in the message) when the check fails so invalid values are rejected early before
reaching native code.
cpp/src/neighbors/detail/cagra/cagra_search.cuh (1)

388-417: Consider hoisting query_norms allocation outside the loop for CosineExpanded metric.

When multiple segments use CosineExpanded metric, query_norms is allocated and computed redundantly for each segment. Since queries are the same across segments (repeated query vector), the norms could be computed once.

This is a minor optimization opportunity. The current implementation is correct; the redundant computation is bounded by the number of segments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/cagra_search.cuh` around lines 388 - 417,
Hoist the allocation and computation of query_norms out of the segment loop and
reuse it for every segment whose indices[i]->metric() ==
cuvs::distance::DistanceType::CosineExpanded: allocate query_norms once (using
raft::make_device_vector) before the for-loop, run raft::linalg::reduce on the
shared queries data to compute the norms, then inside the loop call
raft::linalg::matrix_vector_op (as currently done) using the precomputed
query_norms for each CosineExpanded segment; after the loop free or let
query_norms go out of scope. Ensure you still call
cuvs::neighbors::ivf::detail::postprocess_distances for non-CosineExpanded
branches and keep all existing ops (raft::compose_op, raft::sq_op,
raft::div_const_op, raft::cast_op, raft::add_const_op, raft::div_checkzero_op)
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@c/include/cuvs/core/c_api.h`:
- Around line 229-240: The current implementation of
cuvsRMMAsyncMemoryResourceEnable stores cuda_async_memory_resource in a
thread_local async_mr and passes it to rmm::mr::set_current_device_resource(),
creating a lifetime mismatch vs the API doc that says the change is global;
change the implementation so the async memory resource is process-scoped (not
thread_local) by replacing thread_local async_mr with a static/process-global
instance (e.g., a static unique_ptr or static object) so it outlives the thread
and remains valid for rmm::mr::set_current_device_resource(), and ensure
cuvsRMMAsyncMemoryResourceEnable and any cleanup use that same process-global
symbol (cuda_async_memory_resource / async_mr) when setting or resetting the
current device resource; alternatively, if thread-local semantics are intended,
update the cuvsRMMAsyncMemoryResourceEnable documentation to state it only
affects the calling thread and keep async_mr thread_local.

In `@c/src/neighbors/cagra.cpp`:
- Around line 710-733: The loop currently casts every indices[i]->addr and
builds device views without validating each segment's types/devices; update the
loop that fills idx_vec, q_vec, n_vec, d_vec to perform the same per-segment
checks used in cuvsCagraSearch: assert indices[i] != nullptr and addr != 0
(already present), then verify indices[i]->dtype.code == kDLFloat &&
indices[i]->dtype.bits == 32 for every i, and validate that queries[i],
neighbors[i], and distances[i] are device-backed DLPack tensors with the
expected element types (float for queries/distances, uint32 for neighbors)
before calling reinterpret_cast<const IndexT*>(indices[i]->addr) and
cuvs::core::from_dlpack to populate q_vec[i], n_vec[i], d_vec[i]; replace the
blind casts with RAFT_EXPECTS that include i in the error messages so a bad
segment fails fast and clearly.
- Around line 726-736: The loop that builds idx_vec/q_vec/n_vec/d_vec must also
validate that all segment indices use the same distance metric as the first
segment to prevent mixing incompatible metrics; inside the for-loop in cagra.cpp
(after the existing RAFT_EXPECTS that checks indices[i] non-null) compare
indices[i]->metric (or the actual metric field name on your IndexT struct/class)
to indices[0]->metric and fail fast with RAFT_EXPECTS (or equivalent) and a
clear message like "Mixed distance metrics across segments: expected %s but got
%s at segment %u"; keep this check before pushing idx_vec[i] and before calling
cuvs::neighbors::cagra::search_multi_segment so the function only runs when all
segments share the same metric.

In `@c/src/selection/select_k.cpp`:
- Around line 14-40: cuvsSelectK dereferences shape[1] and casts buffers without
validating the DLPack tensors; add explicit validation at the top of cuvsSelectK
for in_val, out_val, out_idx (non-null), then check each
DLManagedTensor->dl_tensor for expected ndim (==2), shapes (rows match expected
1 or compatible), dtype (in_val/out_val float32, out_idx int64), device type
(CUDA) and device id, byte_offset == 0, and contiguous row-major
strides/compatibility before creating device views with
raft::make_device_matrix_view; if any check fails return an appropriate
cuvsError_t (or throw inside translate_exceptions) instead of proceeding to
casts and calling cuvs::selection::select_k so malformed callers cannot crash or
corrupt memory.

In `@cpp/include/cuvs/neighbors/cagra.hpp`:
- Around line 1752-1806: Add Doxygen documentation for each overloaded
search_multi_segment declaration so they appear in generated API docs; either
add brief doxygen blocks above each overload or use `@copydoc` to reference the
primary search_multi_segment doc block (e.g., use `@copydoc`
search_multi_segment(raft::resources const&,
cuvs::neighbors::cagra::search_params const&, const std::vector<const
cuvs::neighbors::cagra::index<float, uint32_t>*>&, const
std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>&,
const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>&,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>&))
for the overloads with half/int8_t/uint8_t and uint32_t/int64_t neighbor types
so each signature (the overloads of search_multi_segment) is documented.
- Around line 1744-1806: The header declares the template parameter order as <T,
OutputIdxT, IdxT> but the instantiation macro (used with <T, IdxT, OutputIdxT>
e.g. (data_t, uint32_t, int64_t)) expects <T, IdxT, OutputIdxT>, causing the
IdxT/OutputIdxT swap that trips the static_assert in cagra_search.cuh:276 and
breaks link-time overloads for int64_t; fix by changing the template parameter
order in the search_multi_segment declarations to <T, IdxT, OutputIdxT> (or
alternatively update the instantiation macro to match the declared order) so
types map correctly, and add missing Doxygen for overloads 2–8 by inserting a
`@copydoc` search_multi_segment (or equivalent documentation block) above each of
those overloaded search_multi_segment declarations to satisfy the public API
docs requirement.

In `@cpp/src/neighbors/cagra.cuh`:
- Around line 409-420: The wrapper search_multi_segment currently forwards
indices, queries, neighbors, and distances without validation; add the same
upfront shape checks used by search() before calling
cagra::detail::search_multi_segment: verify indices.size() == queries.size() ==
neighbors.size() == distances.size(), then for each segment i ensure
queries[i].n_rows == neighbors[i].n_rows, neighbors[i].n_cols ==
distances[i].n_cols (k matches), queries[i].n_cols equals the index dimension
for indices[i] (or indices[i]->dim() / appropriate accessor), and that k > 0 and
dims are consistent; if any check fails, return/throw a clear error (or use
RAFT/CUASSERT used elsewhere) rather than forwarding to the detail
implementation.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSResources.java`:
- Around line 60-76: The Java API should validate workspace pool sizes before
calling native code: update the Javadoc for setWorkspacePool to state the valid
range is > 0, and in the implementation of setWorkspacePool (the method that
currently invokes the native cuvsResourcesSetWorkspacePool) add a check that
rejects non-positive values (<= 0) by throwing an appropriate Java exception
(e.g., IllegalArgumentException) with a clear message; only call
cuvsResourcesSetWorkspacePool when the value is positive to avoid
signed->unsigned wraparound in native size_t.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/MultiSegmentSearchResults.java`:
- Around line 24-27: MultiSegmentSearchResults currently stores native uint32
ordinals in an int[] (field ordinals) which corrupts values > Integer.MAX_VALUE;
change ordinals from int[] to long[] (and any constructor/getter signatures) and
ensure code that decodes native uint32_t values writes unsigned values into the
long (e.g., value & 0xFFFFFFFFL) so ordinals remain non-negative; update any
consumers (notably MultiSegmentCagraSearch) to compare against a long sentinel
(e.g., -1L) or otherwise handle long ordinals instead of treating negative ints
as sentinels.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/SynchronizedCuVSResources.java`:
- Around line 43-46: The setWorkspacePool method in SynchronizedCuVSResources is
not using the shared lock and must be serialized like access(); modify
setWorkspacePool to acquire the same lock used by access() (e.g., wrap the call
to inner.setWorkspacePool(sizeBytes) in the synchronized block or lock guard
used by access()) so that mutations to workspace pool are protected by the same
synchronization as access().

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java`:
- Around line 391-396: The offset calculation uses segmentIdx * topK
(neighborByteOffset/distanceByteOffset) but the code builds tensors with shape
{numQueries, topK} (numQueries from queryVectors.size()), so when numQueries > 1
the buffer offsets are wrong; either enforce single-query by adding a guard (if
(numQueries != 1) throw new IllegalArgumentException(...)) near where numQueries
is computed (queryVectors) or update the offsets to multiply by numQueries
(neighborByteOffset = segmentIdx * numQueries * topK * C_INT_BYTE_SIZE and
distanceByteOffset = segmentIdx * numQueries * topK * Float.BYTES) before
creating neighborSlice/distanceSlice (globalNeighborsDP/globalDistancesDP).

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`:
- Around line 115-119: Make nextSlot atomic and wrap the returned slot index by
changing slotCounter to an AtomicInteger and using an atomic get-and-add plus
modulo; specifically, replace the non-atomic increment in nextSlot with
something like int start = slotCounter.getAndAdd(count); then return
Math.floorMod(start, poolSize) (or equivalent using your pool size field) so the
operation is thread-safe and the returned slot is bounded by the pool size.
- Around line 125-131: In CudaStreamPool.close(), wrap the cuvsResourcesDestroy
call with the same error validation used elsewhere by replacing the raw
cuvsResourcesDestroy(resources[i]) invocation with a call to
checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy") so
cleanup failures are logged/handled; keep the surrounding loop and existing
calls to checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy") and
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy") unchanged and
reference the close(), cuvsResourcesDestroy, checkCuVSError, events, resources,
streams, and size symbols to locate the change.
- Around line 59-85: The constructor CudaStreamPool currently leaks native
handles if a create call fails mid-loop; modify the CudaStreamPool(int size)
constructor to perform rollback cleanup on failure by tracking the current index
and, if any checkCudaError/checkCuVSError throws, iterating over
already-initialized entries in resources[], streams[], and events[] to call the
corresponding destroy functions (cudaStreamDestroy for streams[],
cudaEventDestroy for events[], and the cuvs resources destroy routine for
resources[]) before rethrowing the exception; implement this by wrapping the
allocation loop in try/catch (or try/finally with a success flag) and invoking
the same cleanup logic as close() for indices < currentIndex so no native
handles are leaked if construction aborts.

In
`@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java`:
- Around line 31-33: The CudaStreamPool is currently allocated in the field
initializer (streamPool) which can leak native resources if
CuVSResourcesImpl(Path) throws before construction completes; move creation of
the CudaStreamPool from the field initializer into the CuVSResourcesImpl(Path)
constructor (use Integer.getInteger(CudaStreamPool.SIZE_PROPERTY,
CudaStreamPool.DEFAULT_SIZE) to determine size), assign it to the streamPool
field there, and in the constructor failure path ensure you call
streamPool.close() (or otherwise tear it down) before rethrowing so native
resources are not leaked; keep the streamPool field declaration but initialize
it only in the constructor and ensure close() is reachable on exceptions.

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`:
- Around line 119-145: The code must enforce the single-query contract or size
outputs from the actual query row count: in MultiSegmentCagraSearch, after
obtaining var queryVectors = (CuVSMatrixInternal)
queries.get(i).getQueryVectors(), read its row count (e.g.,
queryVectors.getRowCount()/numRows()/size(0) — use the actual accessor on
CuVSMatrixInternal) into int nq; if nq > 1 throw an IllegalArgumentException
rejecting multi-row queries, or alternatively set segShape = new long[] {nq, k}
and compute neighbor/distance byte offsets and tensor sizes using nq*k (update
nByteOffset/dByteOffset and prepareTensor calls for
neighborsArray/distancesArray accordingly) so prepareTensor and the
globalNeighborsDP/globalDistancesDP slices match the query row count.
- Around line 96-111: The code in MultiSegmentCagraSearch currently uses
CagraSearchParams built from queries.get(0) (via
CuVSParamsHelper.buildCagraSearchParams) and thus drops per-segment settings
from subsequent CagraQuery entries; fix by either validating that all CagraQuery
instances in queries have identical search params and no per-segment filters
(throw IllegalArgumentException from MultiSegmentCagraSearch if any CagraQuery
differs from queries.get(0)), or plumb per-segment parameters through the native
call: extend the native wrapper (the cuvsCagraSearchMultiSegment binding) and
CuVSParamsHelper to accept/allocate an array of CagraSearchParams (build one
MemorySegment per CagraQuery) and pass that array/handle when invoking the
multi-segment search so each segment’s filters/params are applied.

---

Outside diff comments:
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java`:
- Around line 293-322: The toString() in class CagraSearchParams currently omits
the new fields persistent, persistentLifetime, and persistentDeviceUsage; update
the CagraSearchParams.toString() method to append these three fields (with their
names and values) into the returned string (same formatting style as the other
fields) so debugging output shows the full configuration.

---

Nitpick comments:
In `@cpp/src/neighbors/detail/cagra/cagra_search.cuh`:
- Around line 388-417: Hoist the allocation and computation of query_norms out
of the segment loop and reuse it for every segment whose indices[i]->metric() ==
cuvs::distance::DistanceType::CosineExpanded: allocate query_norms once (using
raft::make_device_vector) before the for-loop, run raft::linalg::reduce on the
shared queries data to compute the norms, then inside the loop call
raft::linalg::matrix_vector_op (as currently done) using the precomputed
query_norms for each CosineExpanded segment; after the loop free or let
query_norms go out of scope. Ensure you still call
cuvs::neighbors::ivf::detail::postprocess_distances for non-CosineExpanded
branches and keep all existing ops (raft::compose_op, raft::sq_op,
raft::div_const_op, raft::cast_op, raft::add_const_op, raft::div_checkzero_op)
unchanged.

In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java`:
- Around line 527-537: The builder setter withPersistentDeviceUsage currently
assigns persistentDeviceUsage without enforcing the Javadoc constraint; update
Builder.withPersistentDeviceUsage to validate that persistentDeviceUsage > 0.0f
and <= 1.0f and throw an IllegalArgumentException (including the invalid value
in the message) when the check fails so invalid values are rejected early before
reaching native code.

In `@java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java`:
- Around line 59-62: The new delegate method setWorkspacePool(long sizeBytes) is
missing the wrapper's destroyed-state guard; modify
CheckedCuVSResources.setWorkspacePool to call checkNotDestroyed() at the start
(before delegating to inner.setWorkspacePool(sizeBytes)) so it matches the
class's defensive contract and uses the same destroyed-state check as other
methods.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 626e7fa1-95fc-4f40-b30d-d7fbcb6521a6

📥 Commits

Reviewing files that changed from the base of the PR and between f2bffb6 and 49e5a14.

📒 Files selected for processing (40)
  • c/CMakeLists.txt
  • c/include/cuvs/core/c_api.h
  • c/include/cuvs/neighbors/cagra.h
  • c/include/cuvs/selection/select_k.h
  • c/src/core/c_api.cpp
  • c/src/neighbors/cagra.cpp
  • c/src/selection/select_k.cpp
  • cpp/include/cuvs/neighbors/cagra.hpp
  • cpp/src/neighbors/cagra.cuh
  • cpp/src/neighbors/cagra_search_inst.cu.in
  • cpp/src/neighbors/detail/cagra/cagra_search.cuh
  • cpp/src/neighbors/detail/cagra/search_single_cta.cuh
  • cpp/src/neighbors/detail/cagra/search_single_cta_inst.cuh
  • cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh
  • cpp/src/neighbors/detail/cagra/search_single_cta_kernel.cuh
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraIndex.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSAceParams.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSMatrix.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSResources.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/HnswAceParams.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/HnswIndex.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/HnswIndexParams.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/MultiSegmentSearchResults.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/SynchronizedCuVSResources.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/spi/CuVSProvider.java
  • java/cuvs-java/src/main/java/com/nvidia/cuvs/spi/UnsupportedProvider.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/BufferedCagraSearch.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSMatrixBaseImpl.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSMatrixInternal.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSParamsHelper.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/HnswIndexImpl.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/SelectKHelper.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/common/LinkerHelper.java
  • java/cuvs-java/src/main/java22/com/nvidia/cuvs/spi/JDKProvider.java
  • java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java

Comment thread c/include/cuvs/core/c_api.h Outdated
Comment on lines +229 to +240
/**
* @brief Switches the working memory resource to use stream-ordered asynchronous allocation
* (cudaMallocAsync / cudaFreeAsync). Unlike the pool resource, this resource returns memory to
* the stream immediately without blocking the CPU, eliminating device-wide synchronization on
* deallocation. This is especially beneficial when multiple CAGRA searches run concurrently on
* separate CUDA streams, because the internal workspace allocations no longer serialize kernel
* launches. Be aware that this function will change the memory resource for the whole process
* and the new memory resource will be used until explicitly changed.
*
* @return cuvsError_t
*/
cuvsError_t cuvsRMMAsyncMemoryResourceEnable();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "c_api.cpp" -path "*/c/src/core/*" | head -5

Repository: rapidsai/cuvs

Length of output: 79


🏁 Script executed:

cat -n c/src/core/c_api.cpp | sed -n '160,195p'

Repository: rapidsai/cuvs

Length of output: 1665


🏁 Script executed:

cd c/src/core && grep -n "thread_local" c_api.cpp

Repository: rapidsai/cuvs

Length of output: 269


🏁 Script executed:

cd c/include/cuvs/core && grep -n "cuvsRMMAsyncMemoryResourceEnable" c_api.h -A 15

Repository: rapidsai/cuvs

Length of output: 580


🏁 Script executed:

# Search for RMM set_current_device_resource usage patterns in the codebase
rg "set_current_device_resource" --type cpp -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 2101


🏁 Script executed:

# Check if there are any docs changes or migration guides in the PR
find . -name "*.md" -o -name "*.rst" | xargs grep -l "RMM\|AsyncMemory\|cuvsRMM" 2>/dev/null | head -10

Repository: rapidsai/cuvs

Length of output: 200


🏁 Script executed:

# Check for any deprecation warnings or migration documentation
find . -type f \( -name "*.h" -o -name "*.md" \) -path "*/docs/*" | xargs grep -l "cuvsRMM\|memory.*resource" 2>/dev/null | head -5

Repository: rapidsai/cuvs

Length of output: 89


🏁 Script executed:

# Check if there are any API change notes or migration guides in the docs
cat docs/source/api_basics.rst | head -100

Repository: rapidsai/cuvs

Length of output: 2518


🏁 Script executed:

# Check CHANGELOG for any RMM API changes mentioned
rg "Async|AsyncMemory|cuvsRMMAsync" CHANGELOG.md -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Check if reset_current_device_resource clears the async_mr
cat -n c/src/core/c_api.cpp | sed -n '192,210p'

Repository: rapidsai/cuvs

Length of output: 784


🏁 Script executed:

# Check how pool memory resource is handled - it passes rvalues
# Let's see if there's any documentation on RMM's set_current_device_resource behavior
rg "set_current_device_resource" --type h | head -20

Repository: rapidsai/cuvs

Length of output: 331


🏁 Script executed:

# Look for RMM in vendored code or requirements to understand its semantics
find . -name "*.txt" -o -name "*.cmake" | xargs grep -i "rmm" | head -10

Repository: rapidsai/cuvs

Length of output: 779


🏁 Script executed:

# Check if there's any documentation update about async memory resource being process-wide
rg "cuvsRMMAsyncMemoryResourceEnable" docs/ 2>/dev/null || echo "No docs match found"

Repository: rapidsai/cuvs

Length of output: 76


🏁 Script executed:

# Let's check how the pool resource implementation differs from async
# The pool passes a rvalue (temporary), while async passes a pointer to thread_local
# Search for any notes on RMM memory resource ownership in comments
rg "ownership|lifetime|thread" c/src/core/c_api.cpp | head -20

Repository: rapidsai/cuvs

Length of output: 275


🏁 Script executed:

# Check if cuvsRMMPoolMemoryResourceEnable has similar issues by checking if it's thread_local or static
rg "pool_mr" c/src/core/c_api.cpp -B 5 -A 5

Repository: rapidsai/cuvs

Length of output: 665


🏁 Script executed:

# Let's check the RMM version used and look for any behavior documentation
rg "rmm" vcpkg.json conanfile.txt requirements.txt 2>/dev/null | head -10

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Check if there are any thread safety comments or warnings in the C API header
cat -n c/include/cuvs/core/c_api.h | sed -n '220,250p'

Repository: rapidsai/cuvs

Length of output: 1728


The async-memory resource owner cannot be thread_local when this API changes the current resource globally.

The implementation at c/src/core/c_api.cpp:188 stores cuda_async_memory_resource in thread_local async_mr and passes it to rmm::mr::set_current_device_resource(), but the documentation explicitly states this function "will change the memory resource for the whole process" (line 235). This creates a critical lifetime mismatch:

  • If set_current_device_resource() is device-scoped (affecting all threads), then when the enabling thread exits, its thread_local async_mr is destroyed while still registered as the current resource, leaving RMM with a dangling pointer.
  • If thread-local semantics were intended, the documentation must be updated to reflect that this only affects the calling thread.

The pool resource avoids this issue by passing temporary rvalues to set_current_device_resource(), allowing RMM to manage the lifetime. Either make async_mr process/device-scoped (not thread_local), or narrow the documentation and implementation to clarify thread-local semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/include/cuvs/core/c_api.h` around lines 229 - 240, The current
implementation of cuvsRMMAsyncMemoryResourceEnable stores
cuda_async_memory_resource in a thread_local async_mr and passes it to
rmm::mr::set_current_device_resource(), creating a lifetime mismatch vs the API
doc that says the change is global; change the implementation so the async
memory resource is process-scoped (not thread_local) by replacing thread_local
async_mr with a static/process-global instance (e.g., a static unique_ptr or
static object) so it outlives the thread and remains valid for
rmm::mr::set_current_device_resource(), and ensure
cuvsRMMAsyncMemoryResourceEnable and any cleanup use that same process-global
symbol (cuda_async_memory_resource / async_mr) when setting or resetting the
current device resource; alternatively, if thread-local semantics are intended,
update the cuvsRMMAsyncMemoryResourceEnable documentation to state it only
affects the calling thread and keep async_mr thread_local.

Comment thread c/src/neighbors/cagra.cpp Outdated
Comment on lines +710 to +733
// Only float32 is supported for multi-segment search.
RAFT_EXPECTS(
indices[0]->dtype.code == kDLFloat && indices[0]->dtype.bits == 32,
"Multi-segment search only supports float32 indices");

using T = float;
using IdxT = uint32_t;
using OutIdxT = uint32_t;
using DistanceT = float;
using IndexT = cuvs::neighbors::cagra::index<T, IdxT>;

std::vector<const IndexT*> idx_vec(num_segments);
std::vector<raft::device_matrix_view<const T, int64_t, raft::row_major>> q_vec(num_segments);
std::vector<raft::device_matrix_view<OutIdxT, int64_t, raft::row_major>> n_vec(num_segments);
std::vector<raft::device_matrix_view<DistanceT, int64_t, raft::row_major>> d_vec(num_segments);

for (uint32_t i = 0; i < num_segments; i++) {
RAFT_EXPECTS(indices[i] != nullptr && indices[i]->addr != 0,
"Index at position %u is null or not built", i);
idx_vec[i] = reinterpret_cast<const IndexT*>(indices[i]->addr);
q_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(q_vec[i])>>(queries[i]);
n_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(n_vec[i])>>(neighbors[i]);
d_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(d_vec[i])>>(distances[i]);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Validate every segment before the reinterpret_cast.

Only indices[0] is checked for float32. The loop then blindly casts every indices[i]->addr to index<float, uint32_t>* and builds float32 device views from queries[i], neighbors[i], and distances[i]. If any later segment is half/int8/uint8 or any tensor is not device-backed, this path will feed the kernel mismatched types and can produce garbage results or GPU faults. Please mirror the per-input dtype/device/null checks from cuvsCagraSearch for each segment before from_dlpack.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/src/neighbors/cagra.cpp` around lines 710 - 733, The loop currently casts
every indices[i]->addr and builds device views without validating each segment's
types/devices; update the loop that fills idx_vec, q_vec, n_vec, d_vec to
perform the same per-segment checks used in cuvsCagraSearch: assert indices[i]
!= nullptr and addr != 0 (already present), then verify indices[i]->dtype.code
== kDLFloat && indices[i]->dtype.bits == 32 for every i, and validate that
queries[i], neighbors[i], and distances[i] are device-backed DLPack tensors with
the expected element types (float for queries/distances, uint32 for neighbors)
before calling reinterpret_cast<const IndexT*>(indices[i]->addr) and
cuvs::core::from_dlpack to populate q_vec[i], n_vec[i], d_vec[i]; replace the
blind casts with RAFT_EXPECTS that include i in the error messages so a bad
segment fails fast and clearly.

Comment thread c/src/neighbors/cagra.cpp Outdated
Comment on lines +726 to +736
for (uint32_t i = 0; i < num_segments; i++) {
RAFT_EXPECTS(indices[i] != nullptr && indices[i]->addr != 0,
"Index at position %u is null or not built", i);
idx_vec[i] = reinterpret_cast<const IndexT*>(indices[i]->addr);
q_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(q_vec[i])>>(queries[i]);
n_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(n_vec[i])>>(neighbors[i]);
d_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(d_vec[i])>>(distances[i]);
}

cuvs::neighbors::cagra::search_multi_segment(
*res_ptr, search_params, idx_vec, q_vec, n_vec, d_vec);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reject mixed-distance metrics across segments.

This API combines raw distances from all segments into one global ranking, so all indices must use the same metric. Right now nothing checks that indices[i] matches indices[0] on metric, which means mixing L2/IP/Cosine segments will return nonsensical top-k ordering even though the header says the distances are comparable across segments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/src/neighbors/cagra.cpp` around lines 726 - 736, The loop that builds
idx_vec/q_vec/n_vec/d_vec must also validate that all segment indices use the
same distance metric as the first segment to prevent mixing incompatible
metrics; inside the for-loop in cagra.cpp (after the existing RAFT_EXPECTS that
checks indices[i] non-null) compare indices[i]->metric (or the actual metric
field name on your IndexT struct/class) to indices[0]->metric and fail fast with
RAFT_EXPECTS (or equivalent) and a clear message like "Mixed distance metrics
across segments: expected %s but got %s at segment %u"; keep this check before
pushing idx_vec[i] and before calling
cuvs::neighbors::cagra::search_multi_segment so the function only runs when all
segments share the same metric.

Comment on lines +14 to +40
extern "C" cuvsError_t cuvsSelectK(cuvsResources_t res,
DLManagedTensor* in_val,
DLManagedTensor* out_val,
DLManagedTensor* out_idx)
{
return cuvs::core::translate_exceptions([=] {
auto* res_ptr = reinterpret_cast<raft::resources*>(res);

int64_t n = in_val->dl_tensor.shape[1];
int64_t k = out_val->dl_tensor.shape[1];

auto in_view = raft::make_device_matrix_view<const float, int64_t, raft::row_major>(
static_cast<const float*>(in_val->dl_tensor.data), 1, n);

auto out_val_view = raft::make_device_matrix_view<float, int64_t, raft::row_major>(
static_cast<float*>(out_val->dl_tensor.data), 1, k);

auto out_idx_view = raft::make_device_matrix_view<int64_t, int64_t, raft::row_major>(
static_cast<int64_t*>(out_idx->dl_tensor.data), 1, k);

cuvs::selection::select_k(
*res_ptr,
in_view,
std::nullopt, // implicit positions [0, n) as in_idx
out_val_view,
out_idx_view,
true); // select_min = true (smallest distance = nearest neighbor)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Validate the DLPack contract before dereferencing shape[1] and casting buffers.

cuvsSelectK currently assumes non-null 2D CUDA tensors with float32 / float32 / int64 dtypes, zero offset, and contiguous row-major layout. Without checking ndim, shape, device, dtype, byte_offset, and stride compatibility, malformed callers can crash here or feed corrupted views into select_k.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@c/src/selection/select_k.cpp` around lines 14 - 40, cuvsSelectK dereferences
shape[1] and casts buffers without validating the DLPack tensors; add explicit
validation at the top of cuvsSelectK for in_val, out_val, out_idx (non-null),
then check each DLManagedTensor->dl_tensor for expected ndim (==2), shapes (rows
match expected 1 or compatible), dtype (in_val/out_val float32, out_idx int64),
device type (CUDA) and device id, byte_offset == 0, and contiguous row-major
strides/compatibility before creating device views with
raft::make_device_matrix_view; if any check fails return an appropriate
cuvsError_t (or throw inside translate_exceptions) instead of proceeding to
casts and calling cuvs::selection::select_k so malformed callers cannot crash or
corrupt memory.

Comment thread cpp/include/cuvs/neighbors/cagra.hpp Outdated
Comment on lines +1744 to +1806
void search_multi_segment(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices,
const std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>& queries,
const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices,
const std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>& queries,
const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices,
const std::vector<raft::device_matrix_view<const half, int64_t, raft::row_major>>& queries,
const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices,
const std::vector<raft::device_matrix_view<const half, int64_t, raft::row_major>>& queries,
const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices,
const std::vector<raft::device_matrix_view<const int8_t, int64_t, raft::row_major>>& queries,
const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices,
const std::vector<raft::device_matrix_view<const int8_t, int64_t, raft::row_major>>& queries,
const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices,
const std::vector<raft::device_matrix_view<const uint8_t, int64_t, raft::row_major>>& queries,
const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

void search_multi_segment(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices,
const std::vector<raft::device_matrix_view<const uint8_t, int64_t, raft::row_major>>& queries,
const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Declaration in cpp/src/neighbors/detail/cagra/cagra_search.cuh"
rg -n -A8 -B2 'template <typename T, typename OutputIdxT = uint32_t, typename IdxT = uint32_t' \
  cpp/src/neighbors/detail/cagra/cagra_search.cuh

echo
echo "Explicit instantiation site in cpp/src/neighbors/cagra_search_inst.cu.in"
rg -n -A8 -B2 'CUVS_INST_CAGRA_SEARCH_MULTI_SEGMENT|search_multi_segment<' \
  cpp/src/neighbors/cagra_search_inst.cu.in

Repository: rapidsai/cuvs

Length of output: 2317


🏁 Script executed:

# Check for static_assert on IdxT in the implementation
rg -n "static_assert.*IdxT" cpp/src/neighbors/detail/cagra/cagra_search.cuh

echo "---"

# Read the header to see all overloads and their documentation
cat -n cpp/include/cuvs/neighbors/cagra.hpp | sed -n '1710,1810p'

Repository: rapidsai/cuvs

Length of output: 6679


Fix template parameter order mismatch and add missing documentation.

The instantiation macro calls the template with parameter order <T, IdxT, OutputIdxT>, but the template declares them as <T, OutputIdxT, IdxT>. When the macro instantiates with (data_t, uint32_t, int64_t), it produces <data_t, uint32_t, int64_t>, incorrectly mapping IdxT to int64_t. This violates the static_assert at cagra_search.cuh:276 (IdxT == uint32_t) and will fail at link time for the int64_t output overloads.

Additionally, overloads 2–8 (lines 1752–1806) lack Doxygen documentation. Per the coding guidelines, all public functions must be documented. Add a @copydoc search_multi_segment directive before each additional overload, or provide explicit documentation blocks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/neighbors/cagra.hpp` around lines 1744 - 1806, The header
declares the template parameter order as <T, OutputIdxT, IdxT> but the
instantiation macro (used with <T, IdxT, OutputIdxT> e.g. (data_t, uint32_t,
int64_t)) expects <T, IdxT, OutputIdxT>, causing the IdxT/OutputIdxT swap that
trips the static_assert in cagra_search.cuh:276 and breaks link-time overloads
for int64_t; fix by changing the template parameter order in the
search_multi_segment declarations to <T, IdxT, OutputIdxT> (or alternatively
update the instantiation macro to match the declared order) so types map
correctly, and add missing Doxygen for overloads 2–8 by inserting a `@copydoc`
search_multi_segment (or equivalent documentation block) above each of those
overloaded search_multi_segment declarations to satisfy the public API docs
requirement.

Comment on lines +115 to +119
public int nextSlot(int count) {
int start = slotCounter;
slotCounter += count;
return start;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Make nextSlot atomic and modulo-bounded.

At Line 116–118, slotCounter updates are racy and the returned value is not wrapped despite the method contract; this can cause slot collisions and eventual overflow issues.

🔧 Proposed fix
 import java.lang.foreign.Arena;
 import java.lang.foreign.MemorySegment;
+import java.util.concurrent.atomic.AtomicInteger;
@@
-  private int slotCounter;
+  private final AtomicInteger slotCounter = new AtomicInteger(0);
@@
   public int nextSlot(int count) {
-    int start = slotCounter;
-    slotCounter += count;
-    return start;
+    if (count <= 0 || count > size) {
+      throw new IllegalArgumentException("count must be in [1, size]");
+    }
+    return Math.floorMod(slotCounter.getAndAdd(count), size);
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
public int nextSlot(int count) {
int start = slotCounter;
slotCounter += count;
return start;
}
public int nextSlot(int count) {
if (count <= 0 || count > size) {
throw new IllegalArgumentException("count must be in [1, size]");
}
return Math.floorMod(slotCounter.getAndAdd(count), size);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`
around lines 115 - 119, Make nextSlot atomic and wrap the returned slot index by
changing slotCounter to an AtomicInteger and using an atomic get-and-add plus
modulo; specifically, replace the non-atomic increment in nextSlot with
something like int start = slotCounter.getAndAdd(count); then return
Math.floorMod(start, poolSize) (or equivalent using your pool size field) so the
operation is thread-safe and the returned slot is bounded by the pool size.

Comment on lines +125 to +131
@Override
public void close() {
for (int i = 0; i < size; i++) {
checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy");
cuvsResourcesDestroy(resources[i]);
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy");
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify all cuvsResourcesDestroy call sites are validated.
rg -n -C2 '\bcuvsResourcesDestroy\s*\(' --type=java

Repository: rapidsai/cuvs

Length of output: 1298


Wrap cuvsResourcesDestroy with error validation during cleanup.

Line 129 calls cuvsResourcesDestroy without checkCuVSError, allowing cleanup failures to be silently ignored. Other call sites in the codebase (e.g., CuVSResourcesImpl.java) properly validate the return code. Apply the same pattern here for consistency:

Proposed fix
-      cuvsResourcesDestroy(resources[i]);
+      checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@Override
public void close() {
for (int i = 0; i < size; i++) {
checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy");
cuvsResourcesDestroy(resources[i]);
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy");
}
`@Override`
public void close() {
for (int i = 0; i < size; i++) {
checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy");
checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy");
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy");
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`
around lines 125 - 131, In CudaStreamPool.close(), wrap the cuvsResourcesDestroy
call with the same error validation used elsewhere by replacing the raw
cuvsResourcesDestroy(resources[i]) invocation with a call to
checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy") so
cleanup failures are logged/handled; keep the surrounding loop and existing
calls to checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy") and
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy") unchanged and
reference the close(), cuvsResourcesDestroy, checkCuVSError, events, resources,
streams, and size symbols to locate the change.

Comment on lines +31 to +33
private final CudaStreamPool streamPool =
new CudaStreamPool(
Integer.getInteger(CudaStreamPool.SIZE_PROPERTY, CudaStreamPool.DEFAULT_SIZE));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Construct the stream pool inside the constructor.

Allocating CudaStreamPool in a field initializer leaks native resources if CuVSResourcesImpl(Path) throws later during cuvsResourcesCreate/cuvsDeviceIdGet, because the object never finishes construction and close() is never reachable. Build the pool inside the constructor and tear it down in the constructor's failure path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java`
around lines 31 - 33, The CudaStreamPool is currently allocated in the field
initializer (streamPool) which can leak native resources if
CuVSResourcesImpl(Path) throws before construction completes; move creation of
the CudaStreamPool from the field initializer into the CuVSResourcesImpl(Path)
constructor (use Integer.getInteger(CudaStreamPool.SIZE_PROPERTY,
CudaStreamPool.DEFAULT_SIZE) to determine size), assign it to the streamPool
field there, and in the constructor failure path ensure you call
streamPool.close() (or otherwise tear it down) before rethrowing so native
resources are not leaked; keep the streamPool field declaration but initialize
it only in the constructor and ensure close() is reachable on exceptions.

Comment on lines +96 to +111
CagraSearchParams searchParameters = queries.get(0).getCagraSearchParameters();

try (var resourcesAccessor = resources.access()) {
long cuvsRes = resourcesAccessor.handle();
var cuvsStream = getStream(cuvsRes);

try (var globalNeighborsDP = allocateRMMSegment(cuvsRes, neighborsBytes);
var globalDistancesDP = allocateRMMSegment(cuvsRes, distancesBytes);
var outIdxDP = allocateRMMSegment(cuvsRes, outIdxBytes);
var outValDP = allocateRMMSegment(cuvsRes, outValBytes)) {

// --- Phase 1: call cuvsCagraSearchMultiSegment ---
// Single kernel launch covers all segments; results land in globalNeighborsDP /
// globalDistancesDP on the same CUDA stream, so SelectK below sees them via ordering.
try (var arena = Arena.ofConfined()) {
MemorySegment sp = CuVSParamsHelper.buildCagraSearchParams(arena, searchParameters);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Per-segment query settings are being dropped.

This path always builds search params from queries.get(0), and the native multi-segment API you're calling has no filter argument, so any search params or prefilter carried by queries[1..] are silently ignored. Either validate that every CagraQuery is equivalent and unfiltered, or plumb per-segment filtering/params through the native API before exposing List<CagraQuery> here.

Also applies to: 147-156

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`
around lines 96 - 111, The code in MultiSegmentCagraSearch currently uses
CagraSearchParams built from queries.get(0) (via
CuVSParamsHelper.buildCagraSearchParams) and thus drops per-segment settings
from subsequent CagraQuery entries; fix by either validating that all CagraQuery
instances in queries have identical search params and no per-segment filters
(throw IllegalArgumentException from MultiSegmentCagraSearch if any CagraQuery
differs from queries.get(0)), or plumb per-segment parameters through the native
call: extend the native wrapper (the cuvsCagraSearchMultiSegment binding) and
CuVSParamsHelper to accept/allocate an array of CagraSearchParams (build one
MemorySegment per CagraQuery) and pass that array/handle when invoking the
multi-segment search so each segment’s filters/params are applied.

Comment on lines +119 to +145
long[] segShape = {1, k};
for (int i = 0; i < numSegments; i++) {
// Index handle
indexArray.setAtIndex(ValueLayout.ADDRESS, i, buffered[i].getIndexHandle());

// Query DLTensor
var queryVectors = (CuVSMatrixInternal) queries.get(i).getQueryVectors();
queriesArray.setAtIndex(ValueLayout.ADDRESS, i, queryVectors.toTensor(arena));

// Neighbors DLTensor — slice of global buffer
long nByteOffset = (long) i * k * Integer.BYTES;
MemorySegment nSlice =
MemorySegment.ofAddress(globalNeighborsDP.handle().address() + nByteOffset);
neighborsArray.setAtIndex(
ValueLayout.ADDRESS,
i,
prepareTensor(arena, nSlice, segShape, kDLUInt(), 32, kDLCUDA()));

// Distances DLTensor — slice of global buffer
long dByteOffset = (long) i * k * Float.BYTES;
MemorySegment dSlice =
MemorySegment.ofAddress(globalDistancesDP.handle().address() + dByteOffset);
distancesArray.setAtIndex(
ValueLayout.ADDRESS,
i,
prepareTensor(arena, dSlice, segShape, kDLFloat(), 32, kDLCUDA()));
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Enforce the single-query contract before sizing output buffers.

Each segment gets a [1, k] neighbors/distances tensor, but the input query tensor comes straight from queries.get(i).getQueryVectors(). If any caller passes nq > 1, the native search will try to write nq * k results into storage sized for only k, which is an out-of-bounds device write. Please reject multi-row query tensors here or size the output slices from the actual query row count.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`
around lines 119 - 145, The code must enforce the single-query contract or size
outputs from the actual query row count: in MultiSegmentCagraSearch, after
obtaining var queryVectors = (CuVSMatrixInternal)
queries.get(i).getQueryVectors(), read its row count (e.g.,
queryVectors.getRowCount()/numRows()/size(0) — use the actual accessor on
CuVSMatrixInternal) into int nq; if nq > 1 throw an IllegalArgumentException
rejecting multi-row queries, or alternatively set segShape = new long[] {nq, k}
and compute neighbor/distance byte offsets and tensor sizes using nq*k (update
nByteOffset/dByteOffset and prepareTensor calls for
neighborsArray/distancesArray accordingly) so prepareTensor and the
globalNeighborsDP/globalDistancesDP slices match the query row count.

@jamxia155

jamxia155 commented Apr 30, 2026

Copy link
Copy Markdown
Contributor Author

Pareto frontiers from initial benchmarking:
image
image

Hardware: 32/64-core/thread CPU, L40S 48GB
Dataset: 50M x 64 dim x 16-bit
Top-K: 100
CAGRA_HNSW (CAGRA index converted to HNSW, search on CPU) parameter sweep ranges:

  • "efSearch": [100, 150, 200]
  • "queryThreads": [1, 8, 64]
  • "cagraGraphDegree": [16, 32, 64, 128]
  • "cagraIntermediateGraphDegree": [32, 64, 128, 256, 512]
  • "cagraHnswLayers": [1, 2, 4, 6]

CAGRA_SEARCH (CAGRA index used directly for search on GPU) parameter sweep ranges:

  • "efSearch": [100, 150]
  • "queryThreads": [1, 8, 64]
  • "cagraGraphDegree": [16, 32, 64, 128]
  • "cagraIntermediateGraphDegree": [32, 64, 128, 256, 512]
  • "cagraSearchWidth": [4, 16, 128, 256, 1024]

@aamijar aamijar added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels May 1, 2026
@aamijar aamijar moved this to In Progress in Unstructured Data Processing May 1, 2026
@cjnolet cjnolet moved this to In Progress in Unstructured Data Processing May 10, 2026
Comment thread c/include/cuvs/neighbors/cagra.h Outdated
* @param[out] neighbors array of num_segments DLManagedTensor* (device, uint32, [nq, topk])
* @param[out] distances array of num_segments DLManagedTensor* (device, float32, [nq, topk])
*/
cuvsError_t cuvsCagraSearchMultiSegment(cuvsResources_t res,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to align on nomenclature a bit, I wonder if we can think of a more general name. Maybe "Partition"? Segment is pretty closely coupled to databases, and more specifically to LSM-based databases, but cuVS the library is more general that that. cuVS is at the level of "hash partitioning" or "blind sharding" (those are the terms we tend to use in this context). I think "MultiPartition" would be a more fitting name.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligning to "partition" for now. FYI, also considered: "MultiShard", "MultiIndex", "Federated", but these might come with unintended connotations.

Comment thread c/include/cuvs/selection/select_k.h Outdated
* @param[out] out_idx DLManagedTensor* shape [1, k], int64, device memory
* @return cuvsError_t
*/
cuvsError_t cuvsSelectK(cuvsResources_t res,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is great. I was just working on code examples for the new docs and realized we only have a C++ API for select_k. It'll be great to get the C APis, and later on the Python and other language wrappers for select-k.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the refactoring prompted by your other comment, select-k is no longer needed for this work. Leaving the C API intact in case it might be useful to others.

Comment thread cpp/include/cuvs/neighbors/cagra.hpp Outdated
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices,
const std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>& queries,
const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit odd to return a vector of outputs instead of a single output. Can't we perform the reduction for the final single neighborhood outputs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this API to return the global top-k results (i.e. after select_k) instead of returning per-partition top-k results which then have to be reduced via select_k.

jamxia155 added 6 commits May 11, 2026 06:59
When libcuvs_c.so is built with cudart_static, CUDA symbols are embedded
but not exported to the dynamic symbol table. Panama FFI's SymbolLookup
cannot resolve them, causing an UnsatisfiedLinkError on cudaMemcpyAsync
at initialization time.

Add a static initializer in Util that loads libcudart.so before any CUDA
symbol resolution occurs. The loader tries in order: java.library.path
(via System.loadLibrary), each entry in LD_LIBRARY_PATH, and
/usr/local/cuda/lib64 as a final fallback. If all attempts fail, a
descriptive warning is logged. Builds with dynamic CUDA linkage are
unaffected.
Extend cuvsCagraSearchMultiSegment with a cuvsFilter parameter so callers
can apply a bitset prefilter.

C/C++ API changes:
- Add MULTI_SEGMENT_BITSET to cuvsFilterType enum
- Define cuvsMultiSegmentBitsetFilter (combined_bitset + segment_offsets)
  in common.h alongside the rest of the filter API
- Add cuvsFilter parameter to cuvsCagraSearchMultiSegment in both the C
  and C++ headers and implement dispatch in cagra.cpp / cagra.cuh

Java changes:
- Add FilterBitsetHandle: packs per-segment host bitsets and manages a
  single shared device allocation uploaded lazily on first use via
  double-checked locking
- Update MultiSegmentCagraSearch to accept a FilterBitsetHandle and pass
  the corresponding cuvsMultiSegmentBitsetFilter to the native search API
Extends generate_groundtruth with optional bitset prefiltering so that
exact-NN ground truth can be computed over a filtered subset of the
dataset.

New function create_bitset_filter(n_samples, filter_reject_rate) returns
a packed uint32 numpy array (LSB-first, matching cuVS bitset layout) where
bit i is set iff vector i passes the filter.  Uses a modulo-1000 bucket
scheme: vector i passes when i % 1000 >= round(filter_reject_rate * 1000),
giving a reject rate within 0.1% of the requested value.

calc_truth gains an optional bitset parameter:
- GPU path: slices the packed words for each batch, transfers to device,
  and passes as a cuVS bitset prefilter to brute_force.search
- CPU path: unpacks the same words to a boolean accept_mask and masks
  rejected distances to inf before argpartition

Two new mutually exclusive CLI arguments:
- --bitset <path>: load a pre-saved .npy uint32 bitset from file
- --filter_reject_rate <float>: generate a bitset in memory via
  create_bitset_filter
Previously the public search_multi_partition API exposed an intermediate
result form: callers passed per-partition vectors of matrix views for
queries, neighbors, and distances, and were responsible for merging the
per-partition top-k themselves (the cuvs-lucene wrapper allocated a
concatenated [num_partitions × k] buffer, called select_k, and decoded
positions into (partition_index, ordinal) pairs).

This commit moves all of that into cuVS. The merged form expresses
"search this queries matrix against every partition; return the global
top-k per query" with no intermediate result visible to callers.

C++ API (8 overloads in cagra.hpp):
- Single raft::device_matrix_view<const T, ...> queries [n_queries, dim]
- New output partition_ids [n_queries, k] (uint32) identifying which
  partition each neighbor came from
- Single neighbors [n_queries, k] of OutputIdxT (ordinal in the
  corresponding partition's dataset)
- Single distances [n_queries, k] of post-processed values

C API (cuvsCagraSearchMultiPartition):
- DLManagedTensor* for queries/partition_ids/neighbors/distances
  (no more DLManagedTensor** arrays)

Detail implementation (cagra_search.cuh):
- Allocates an internal [num_partitions, n_queries, k] intermediate
  buffer for neighbors and distances from the workspace pool
- Runs per-partition kernel into this buffer
- Applies per-partition distance post-processing on the intermediate
  buffer (scale + metric transform, preserving per-partition metric)
- Transposes intermediate distances to [n_queries, num_partitions * k]
  via raft::linalg::map_offset for a batched raft::matrix::select_k
- Decodes select_k positions into partition_ids and neighbors using
  raft::linalg::map / map_offset

Kernel layer:
- multi_partition_desc_t slimmed to (dataset_desc, graph, graph_degree);
  queries_ptr and result pointers are now shared kernel parameters,
  removing redundant per-partition state

Java:
- MultiPartitionCagraSearch.search(...) takes a single CagraQuery
  (not List<CagraQuery>); body collapses to three small allocations
  and a single native call
- Slow-path BitSet-from-CagraQuery handling removed; FilterBitsetHandle
  is now the only filter mechanism
- BufferedCagraSearch interface slimmed to getIndexHandle(); the
  searchIntoBuffer implementation in CagraIndexImpl is removed
@jamxia155

jamxia155 commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Pareto frontiers from initial benchmarking with prefiltering:

image image

Hardware: 32/64-core/thread CPU, L40S 48GB
Dataset: 50M x 64 dim x 16-bit
Top-K: 512
Prefilter reject rate: 84%

CAGRA_HNSW (CAGRA index converted to HNSW, search on CPU) parameter sweep ranges:

"efSearch": [512, 1024]
"queryThreads": [8, 16, 64]
"cagraGraphDegree": [16, 32, 64, 128]
"cagraIntermediateGraphDegree": [32, 64, 128, 256]
"cagraHnswLayers": [1, 2, 4, 6]

CAGRA_SEARCH (CAGRA index used directly for search on GPU) parameter sweep ranges:

"efSearch": [512]
"queryThreads": [8, 16, 64]
"cagraGraphDegree": [16, 32, 64, 128]
"cagraIntermediateGraphDegree": [32, 64, 128, 256]
"cagraSearchWidth": [4, 16, 128, 256]
"cagraITopK": [32, 64, 128, 256, 512]

…_size

search_multi_partition previously asked each partition's SINGLE_CTA kernel
to write `topk` neighbors into the intermediate buffer, regardless of the
plan's itopk_size. When the caller requested topk > itopk_size, the kernel's
shared-memory result buffer (sized itopk_size + max_candidates) was smaller
than topk and the result-extraction loop in search_core read out of bounds,
producing undefined behavior. This made the existing
`plan.itopk_size * num_partitions >= topk` feasibility check load-bearing
only on paper — it permitted configurations the kernel could not actually
execute.

Decouple the kernel's per-partition output count from the caller's global
topk inside search_multi_partition:

- Introduce `per_partition_topk = min(topk, plan.itopk_size)`
- Size the [num_partitions, n_queries, *] intermediate buffer to
  per_partition_topk (not topk) so the kernel stays within its
  shared-memory budget
- Launch run_multi_partition with per_partition_topk
- Adjust post-processing slice widths, transpose row/partition strides,
  and the select_k input width to use per_partition_topk
- Update position-decode arithmetic: partition index is
  pos / per_partition_topk and intra-partition slot is
  pos % per_partition_topk; output stride remains the global topk

When topk <= itopk_size, per_partition_topk == topk and every buffer
size, stride, and decode expression reduces to the prior values — the
common case is bit-identical. When topk > itopk_size, each partition
emits up to itopk_size candidates and the cross-partition select_k
assembles the final global top-k from num_partitions * itopk_size
candidates, which the existing feasibility check guarantees is at
least topk.

No changes to search_core, search_kernel_mp, run_multi_partition's
signature, or the single-partition search() path. plan->check(topk)
still enforces topk <= itopk_size for non-multi-partition callers.
@jamxia155 jamxia155 changed the title Multi segment cagra search Multi partition cagra search Jun 1, 2026
@jamxia155

jamxia155 commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

Pareto frontiers unfiltered search with top-k = 1500:

image image

Hardware: 32/64-core/thread CPU, L40S 48GB
Dataset: 50M x 64 dim x 16-bit
Top-K: 1500

CAGRA_HNSW (CAGRA index converted to HNSW, search on CPU) parameter sweep ranges:

"efSearch": [1500, 3000]
"queryThreads": [8, 16, 64]
"cagraGraphDegree": [16, 32, 64, 128]
"cagraIntermediateGraphDegree": [32, 64, 128, 256]
"cagraHnswLayers": [1, 2, 4, 6]

CAGRA_SEARCH (CAGRA index used directly for search on GPU) parameter sweep ranges:

"efSearch": 1500
"queryThreads": [8, 16, 64]
"cagraGraphDegree": [16, 32, 64, 128]
"cagraIntermediateGraphDegree": [16, 32, 64, 256]
"cagraSearchWidth": [1, 4, 8, 16, 32]
"cagraITopK": [1500, 2000, 3000]

jamxia155 added 9 commits June 1, 2026 14:15
search_multi_partition previously rewrote params.algo to SINGLE_CTA
unconditionally, which silently downgraded MULTI_CTA and AUTO callers
regardless of their intent and made the multi-partition path's algo
selection invisible to the rest of the code. Replace the override with
explicit algo resolution:

- Reject MULTI_CTA with RAFT_FAIL. The multi-partition merge below
  assumes a single contiguous per-partition output slice from one plan
  type; MULTI_CTA's plan and result layout are not compatible.
- Resolve AUTO based on itopk_size. SINGLE_CTA's hard itopk cap is 512;
  above that the search would fail at plan construction, so AUTO routes
  to MULTI_KERNEL. At or below 512 it stays on SINGLE_CTA (the existing
  default for this path).
- Hold MULTI_KERNEL with RAFT_FAIL "not yet implemented." The fused
  multi-partition MULTI_KERNEL path is staged; the dispatch wiring
  lands here so the remaining stages only have to fill in behavior.
Stage 1 set up the algo dispatch in search_multi_partition with a
hard fail on MULTI_KERNEL. This commit lands the actual MULTI_KERNEL
path as a sequential per-partition shim, letting MULTI_KERNEL serve
as the multi-partition workhorse when the caller needs itopk_size
beyond SINGLE_CTA's 512 cap.

Refactor of search_multi_partition:
- Lift plan_desc construction, stream acquisition, and intermediate
  buffer allocation out of the SINGLE_CTA-specific block. Both algos
  share these now; the buffer layout [num_partitions, n_queries, topk]
  is algo-agnostic and the post-processing reads it the same way.
- Replace the linear body with an `if (SINGLE_CTA) ... else /*
  MULTI_KERNEL */ ...` dispatch. The SINGLE_CTA branch retains the
  existing plan + feasibility check + per-partition descriptor setup
  + fused kernel launch, bit-identical to before.

MULTI_KERNEL branch:
- constexpr if guards out the filtered case with a clear RAFT_FAIL.
  Filter support lands alongside kernel-fusion work, where
  multi_partition_bitset_filter falls into place via blockIdx.z
  (the same mechanism SINGLE_CTA already uses).
- For each partition, build a partition-specific dataset_desc via
  dataset_descriptor_init_with_cache and construct a fresh
  multi_kernel_search::search with that partition's dataset_size and
  graph_degree. Call operator() with the partition's graph view,
  writing into the partition's slice of the intermediate buffer.

Per-partition plan construction is the simplest path to a functionally
correct sequential implementation: each partition's distance compute
uses that partition's own dataset descriptor, which a single shared
plan cannot do (a shared plan stores one dataset_desc, and reusing it
across partitions reads OOB when dataset sizes differ). Cost is
O(num_partitions) buffer allocations per search call; the descriptor
cache absorbs repeat plan_desc construction on hot indexes, so
steady-state penalty is mostly first-query. This is throwaway shim
code — the eventual kernel-fusion path replaces it with a single
plan and partition-aware kernels.
The previous check, `itopk_size * num_partitions >= topk`, permitted
configurations where each partition's per-query SINGLE_CTA kernel
would be asked to write more topk values than its shared-memory
result buffer (sized for itopk_size + search_width * graph_degree
entries) can hold. The kernel's result-extraction loop reads past
the buffer in that case, producing undefined behavior — typically
illegal-address faults visible only mid-search.

Replace with `topk <= plan.itopk_size`, matching the check
single-partition CAGRA already enforces via plan->check(topk) at
search_plan.cuh:367 (same message wording). Configurations needing
larger topk should pick MULTI_KERNEL, which routes automatically
once itopk_size > 512.
Stage 1 of fusing the multi-partition MULTI_KERNEL search path. Adds
four partition-aware kernel variants alongside the existing single-
partition kernels — runtime path is unchanged (cagra_search.cuh's
MULTI_KERNEL branch still uses the stage-2 sequential per-partition
shim). The next commit replaces that shim with a single fused call
that drives these kernels via a new run_multi_partition method.

Each _mp variant launches with a 3D grid that adds num_partitions as
the partition dimension (typically blockIdx.z), reads per-partition
data from a multi_partition_desc_t array indexed by that dimension,
and uses row = partition_id * num_queries + query_id to index the
per-(query, partition) slice of result, hashmap, and parent buffers.

- multi_partition_desc_t: per-partition descriptor (dataset_desc,
  graph, graph_degree). Duplicated in the multi_kernel_search
  namespace rather than reusing single_cta_search's version, so each
  algo can evolve its descriptor independently. If more consumers
  emerge, fold into a shared base.
- random_pickup_mp_kernel + launcher.
- pickup_next_parents_mp_kernel + launcher. terminate_flag stays a
  single global scalar; same-value writes from multiple (query,
  partition) work units are race-safe.
- compute_distance_to_child_nodes_mp_kernel + launcher. Per-partition
  graph_degree means the kernel's launch grid is sized for the max
  across partitions; partitions with smaller graph_degree leave
  trailing teams inactive. blockIdx.z = partition_id satisfies
  multi_partition_bitset_filter's contract automatically — filter
  support falls in without further changes.
- remove_parent_bit_mp_kernel + launcher.
Validates two of the four mp expansion kernels by comparing their
per-(query, partition) outputs against running the existing single-
partition kernels sequentially over each partition's slice.

- remove_parent_bit_mp_matches_sequential: confirms the mp variant's
  row arithmetic clears MSBs in the correct per-(query, partition)
  rows and leaves trailing entries beyond num_topk untouched.
- pickup_next_parents_mp_matches_sequential: confirms parent_list,
  parent_candidates (with MSB updates marking used parents), and
  terminate_flag all match the sequential reference. Runs with
  small_hash_bitlen = 0 to keep the input setup minimal.

Tests for random_pickup_mp and compute_distance_to_child_nodes_mp
are deferred — those kernels rely on a fully-constructed
dataset_descriptor_host, and any row-arithmetic bug surfaces just as
clearly as a recall regression in the eventual integration validation
against SINGLE_CTA.

tests/CMakeLists.txt: registers NEIGHBORS_ANN_CAGRA_MULTI_KERNEL_MP_TEST
with cpp/src and cpp/build/src added to its include path so it can
reach search_multi_kernel.cuh and the generated compute-distance
extension header. Existing tests only include public API headers, so
this is the first test target to need internal-source access.
Replaces the per-partition shim (which constructed a fresh
multi_kernel_search::search per partition and ran them one at a time)
with a fused implementation that drives all partitions in one search
call. Each kernel in the iteration loop now launches with a 3D grid
(queries, partitions, …); per-(query, partition) state is keyed by
row = partition_id * num_queries + query_id.
Reverts the partition-aware MULTI_KERNEL search path. MULTI_KERNEL is
a reference implementation, not deeply optimized; single-partition
AUTO never selects it, and multi-partition MULTI_KERNEL runs ~10x
slower than SINGLE_CTA in practice. The path forward for topk > 512
in multi-partition search is MULTI_CTA, not MULTI_KERNEL.

Reverts:
  4db0941 Fuse MULTI_KERNEL multi-partition into a single search call
  aa93362 Add unit tests for partition-aware MULTI_KERNEL kernels
  56e99ee Add partition-aware MULTI_KERNEL expansion kernels
  37a65d7 Tighten SINGLE_CTA multi-partition feasibility check
  cf679b9 Enable MULTI_KERNEL in multi-partition CAGRA search
  6b4e34f Route multi-partition CAGRA search by params.algo

The SINGLE_CTA tightening from 37a65d7 will be re-applied in a
follow-up commit.
The previous check, `itopk_size * num_partitions >= topk`, permitted
configurations where each partition's per-query SINGLE_CTA kernel
would be asked to write more topk values than its shared-memory
result buffer (sized for itopk_size + search_width * graph_degree
entries) can hold. The kernel's result-extraction loop reads past
the buffer in that case, producing undefined behavior — typically
illegal-address faults visible only mid-search.

Replace with `topk <= plan.itopk_size`, matching the check
single-partition CAGRA already enforces via plan->check(topk) at
search_plan.cuh:367 (same message wording).
Comment on lines +1748 to +1835
void search_multi_partition(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices,
raft::device_matrix_view<const float, int64_t, raft::row_major> queries,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> neighbors,
raft::device_matrix_view<float, int64_t, raft::row_major> distances,
const cuvs::neighbors::filtering::base_filter& sample_filter =
cuvs::neighbors::filtering::none_sample_filter{});

void search_multi_partition(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices,
raft::device_matrix_view<const float, int64_t, raft::row_major> queries,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids,
raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors,
raft::device_matrix_view<float, int64_t, raft::row_major> distances,
const cuvs::neighbors::filtering::base_filter& sample_filter =
cuvs::neighbors::filtering::none_sample_filter{});

void search_multi_partition(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices,
raft::device_matrix_view<const half, int64_t, raft::row_major> queries,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> neighbors,
raft::device_matrix_view<float, int64_t, raft::row_major> distances,
const cuvs::neighbors::filtering::base_filter& sample_filter =
cuvs::neighbors::filtering::none_sample_filter{});

void search_multi_partition(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices,
raft::device_matrix_view<const half, int64_t, raft::row_major> queries,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids,
raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors,
raft::device_matrix_view<float, int64_t, raft::row_major> distances,
const cuvs::neighbors::filtering::base_filter& sample_filter =
cuvs::neighbors::filtering::none_sample_filter{});

void search_multi_partition(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices,
raft::device_matrix_view<const int8_t, int64_t, raft::row_major> queries,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> neighbors,
raft::device_matrix_view<float, int64_t, raft::row_major> distances,
const cuvs::neighbors::filtering::base_filter& sample_filter =
cuvs::neighbors::filtering::none_sample_filter{});

void search_multi_partition(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices,
raft::device_matrix_view<const int8_t, int64_t, raft::row_major> queries,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids,
raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors,
raft::device_matrix_view<float, int64_t, raft::row_major> distances,
const cuvs::neighbors::filtering::base_filter& sample_filter =
cuvs::neighbors::filtering::none_sample_filter{});

void search_multi_partition(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices,
raft::device_matrix_view<const uint8_t, int64_t, raft::row_major> queries,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> neighbors,
raft::device_matrix_view<float, int64_t, raft::row_major> distances,
const cuvs::neighbors::filtering::base_filter& sample_filter =
cuvs::neighbors::filtering::none_sample_filter{});

void search_multi_partition(
raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices,
raft::device_matrix_view<const uint8_t, int64_t, raft::row_major> queries,
raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids,
raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors,
raft::device_matrix_view<float, int64_t, raft::row_major> distances,
const cuvs::neighbors::filtering::base_filter& sample_filter =
cuvs::neighbors::filtering::none_sample_filter{});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than creating a new set of overloads specific for CAGRA, maybe we'd better introduce a new index type, e.g. multi_partition_cagra? This whole thing looks like it can be represented as a "virtual merge" index we've already introduced once.
This has couple benefits:

  • Public API stays the same homogeneous - no new search functions for a user to discover - just a new index type and its constructor.
  • the new index search would be composable with dynamic batching (which takes some index and calls search on it).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @achirkin, for the insightful comment. Assuming that composite_index is the "virtual merge" index you mentioned, it does look very attractive as an existing infrastructure for multi-partition search. There are two wrinkles that prevent immediate adoption:

  1. composite_index implements the per-index search in a for loop, which does not expose partition-level parallelism needed to maximize device utilization when the batch size is 1 as is always the case in cuVS-Lucene. For the current work, we actually had to create multi-partition versions of the entire call stack down to the kernels and device functions (e.g. search_multi_cta_mp_jit). Now, the multi-partition code can in theory entirely subsume the original single-partition code by essentially setting num_partition=1 as needed but I don't think we should make such a sweeping change in this PR.
  2. For cuVS-Lucene, we need to produce the partition-local ordinals of the neighbors instead of the global indices. However, composite_index internally consumes the ordinals to produce only the global indices.

Both issues above can possibly be resolved with additional configuration parameters and corresponding code paths in composite_index.

On the other hand, making multi_partition_cagra its own header probably involves easier changes. Do you think we should make a push for extending/customizing composite_index at this time?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than creating a new set of overloads specific for CAGRA, maybe we'd better introduce a new index type, e.g. multi_partition_cagra? This whole thing looks like it can be represented as a "virtual merge" index we've already introduced once.

I don't know that I agree with this. This needs to be a phyiscal merge, not a logical merge. This is critical for LSM-based databases and it'll be widely used outside of just Lucene. I'm not saying we can't (and shouldn't) improve the API experience wherever possible, but I'm not convinced the "logical merge" (or composite index) is not the way to do that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Corey, I totally agree with you comment. I meant to say structurally the code is very similar and I think it is a good pattern to use for this case too.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your thinking here for sure- if there's a good reusable abstraction for this that can consolidate APIs (ideally without introducing a complex class hierarchy on the cuvs impl side) then I'm all for it. I say "cuvs impl side" because abstractions like device_resources, while they have established a hierarchy on the user side, are just meant to be simple user convenience and we use raft::resources container on the implementation side).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following this discussion, I'm planning to refactor this work into a multi_partition_cagra index type that is constructed from a vector of CAGRA indices (like composite_index) and provides a search method that implements the current optimizations (unlike composite_index which doesn't really implement the search at a low-level). Does that sound about right? Does it defeat the purpose somewhat given that the signature of this search method can't conform to the signature common to other index types as we need an output array to indicate the id of the CAGRA index that a result comes from? Thanks for your thoughts.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I would not suggest this design @jamxia155. Let's keep this simple. Index shoukd also not have search methods (we use free functions in cuVS not methods on objects).

Again to reiterate- composite index is going away. We should not be using that as a guide for designs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After further discussions with @cjnolet, the plan is to keep things simple and essentially rename search_multi_partition as search for consistency, while allowing for necessary parameters in the signature even if they are not common to other search functions.

jamxia155 added 3 commits June 8, 2026 12:59
… stride

 The multi-partition post-processing in search_multi_partition assumed the
 kernel writes exactly `topk` candidates per partition (the SINGLE_CTA mp
 convention). Generalize this assumption so other algos can emit a
 different number of candidates per (query, partition) — useful for paths
 that skip the per-partition top-k merge and rely entirely on the
 cross-partition select_k.

 Introduces `per_partition_topk` (= `topk` for SINGLE_CTA, no functional
 change). Replaces topk-as-stride usages with per_partition_topk:

 - intermediate buffer sizing and per-partition slice width
 - per-partition distance post-processing slice shape
 - transpose row stride and per-partition stride arithmetic
 - batched select_k input width (num_partitions * per_partition_topk)
 - position decode: pos / per_partition_topk → partition_id,
   pos % per_partition_topk → slot within partition

 The caller-owned output buffers (neighbors, distances, partition_ids,
 positions_buf) keep stride `topk` because that's the requested output
 shape. The position-decode lambda now disambiguates the two strides
 explicitly (topk_i64 for output indexing, per_partition_topk_i64 for
 intermediate indexing).
Adds a fused multi-partition launcher to multi_cta_search::search and
wires it into cagra::detail::search_multi_partition alongside the
existing SINGLE_CTA path. MULTI_CTA spawns ceildiv(itopk_size, 32) CTAs
per query, so it can serve large itopk pools that SINGLE_CTA's
itopk_size <= 512 cap rejects (e.g. topk=1500).

Kernel: search_kernel_mp runs on a 3D grid
(num_cta_per_query, num_queries, num_partitions). blockIdx.z indexes
multi_partition_desc_t to fetch each partition's dataset_desc, graph,
and graph_degree; shared memory and the result buffer are sized for
the max graph_degree across partitions. The cross-CTA traversed
hashmap is sized for (num_queries * num_partitions) rows so that CTAs
serving the same query across different partitions never collide.

Output layout matches SINGLE_CTA multi-partition:
[num_partitions, num_queries, per_partition_topk] partition-major,
with per_partition_topk = num_cta_per_query * itopk_size for MULTI_CTA
(vs. topk for SINGLE_CTA). No per-partition top-k merge is performed
in the MULTI_CTA path -- the existing cross-partition select_k in
search_multi_partition picks the final global top-k in one pass.

Dispatch in search_multi_partition:
  - Forces persistent=false (multi-partition does not use persistent).
  - Rejects MULTI_KERNEL explicitly (reference impl, far slower).
  - Resolves AUTO to MULTI_CTA when itopk_size > 512, else SINGLE_CTA.
  - Branches on algo: each branch builds its own plan, per-partition
    descriptor array, and intermediate buffer, then calls
    run_multi_partition.

Shared post-processing (per-partition distance transform, cross-
partition select_k, partition_ids decode) operates uniformly on
per_partition_topk and serves both algos unchanged.
Multi-partition search_multi_partition's AUTO resolution previously
picked SINGLE_CTA whenever itopk_size <= 512, mirroring only half of
single-partition's heuristic in search_plan_impl_base. The other half
-- requiring max_queries >= num_sm * 2 before picking SINGLE_CTA --
guards against underfilling the GPU when there are too few queries to
saturate the SMs.

The same concern applies in multi-partition, but the partition axis
already pads the grid: SINGLE_CTA produces (1 * num_queries *
num_partitions) CTAs. Scale the gate by num_partitions accordingly:
AUTO picks SINGLE_CTA only when max_queries * num_partitions >=
num_sm * 2 (and itopk_size <= 512); otherwise it falls through to
MULTI_CTA, whose ceildiv(itopk_size, 32) CTAs per query restore
occupancy.
@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…earch

Adapt the branch's multi-segment / multi-partition CAGRA additions to
the new JIT-LTO kernel infrastructure landed in NVIDIA#1807. After the merge,
multi-partition search runs through JIT-linked fragments just like the
rest of CAGRA, with parity across filter types (none, bitset, mp_bitset).

- Port deleted *_kernel-inl.cuh contents into the JIT layout: device
  bodies in jit_lto_kernels/*_jit.cuh, .cu.in entry-points, matrix JSONs,
  fragment tags, planners, factory functions, host launchers, CMake
  registration.
- Introduce mp_bitset_filter_data_t + tag_filter_mp_bitset + matching
  sample_filter_mp_bitset_impl so multi_partition_bitset_filter is
  recognized end-to-end without coupling to the standard bitset POD.
- Add BitsetT template parameter to search_core so it accepts either
  cagra_bitset or mp_cagra_bitset without doubling instantiations.
- Add CUVS_EXPORT to four C entry points that were silently hidden:
  cuvsRMMAsyncMemoryResourceEnable, cuvsResourcesSetWorkspacePool,
  cuvsCagraSearchMultiPartition, cuvsSelectK.
- Update JDKProvider.java to drop the stale specific import for
  cudaStreamSynchronize that jextract has reshuffled into headers_h.
@jamxia155

Copy link
Copy Markdown
Contributor Author

Pareto frontiers 84% filtered search with top-k = 1500:

image image

Hardware: 32/64-core/thread CPU, L40S 48GB
Dataset: 50M x 64 dim x 16-bit
Top-K: 1500

CAGRA_HNSW (CAGRA index converted to HNSW, search on CPU) parameter sweep ranges:

"efSearch": [1500, 2000, 3000]
"queryThreads": [8, 16, 64]
"cagraGraphDegree": [16, 32, 64, 128]
"cagraIntermediateGraphDegree": [16, 32, 64, 128, 256]
"cagraHnswLayers": [1, 2]

CAGRA_SEARCH (CAGRA index used directly for search on GPU) parameter sweep ranges:

"efSearch": 1500
"queryThreads": [8, 16, 64]
"cagraGraphDegree": [16, 32, 64, 128]
"cagraIntermediateGraphDegree": [16, 32, 64, 128, 256]
"cagraSearchWidth": [1, 4, 8, 16, 32]
"cagraITopK": [1500, 2000, 3000]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants