Multi partition cagra search#2035
Conversation
Introduces a globally-optimized path for searching across multiple CAGRA
index segments without per-segment device-to-host copies.
- c/include/cuvs/selection/select_k.h, c/src/selection/select_k.cpp:
new C API function cuvsSelectK that wraps cuvs::selection::select_k
for selecting the k smallest float values from a flat device array
- BufferedCagraSearch (internal interface): searchIntoBuffer() writes
per-segment CAGRA results into a slice of a caller-owned device buffer
without syncing the stream or copying to host
- CagraIndexImpl: implements BufferedCagraSearch; searchIntoBuffer()
computes byte offsets into the global buffer using segmentIdx * topK
- SelectKHelper: Panama FFI binding for cuvsSelectK
- MultiSegmentCagraSearch: orchestrates the full pipeline — queue all
per-segment searches, sync once, run cuvsSelectK on GPU, sync again,
single D2H copy, decode results
- MultiSegmentSearchResults: simple result carrier with count,
segmentIndices, ordinals, and distances arrays
Implements concurrent per-segment GPU search that eliminates per-segment
device-to-host copies and CPU blocking on workspace deallocation.
Key changes:
- CudaStreamPool: fixed-size pool of non-blocking CUDA streams (one
cuvsResources_t per slot). Segments are assigned to slots via
round-robin so searches on different slots run concurrently. Pool size
defaults to 8 and is overridden via the system property
com.nvidia.cuvs.streamPoolSize.
- MultiSegmentCagraSearch: single-query search across N index segments
using the stream pool. All per-segment CAGRA kernels write into a
shared device buffer (no per-segment D2H copy or stream sync). A
single cuvsSelectK call finds the global top-k entirely on GPU, then
one D2H copy transfers the results.
- BufferedCagraSearch / CagraIndexImpl: new searchIntoBuffer() method
that queues a CAGRA search kernel on a caller-supplied stream and
writes results at a given row offset into a pre-allocated device
buffer.
- cuvsRMMAsyncMemoryResourceEnable() (C API + Java bindings): switches
the current device memory resource to cuda_async_memory_resource so
that workspace deallocations issued by CAGRA's search plan destructor
are stream-ordered and non-blocking. Without this, cudaFree serializes
kernel launches across streams regardless of stream assignment,
nullifying the stream pool benefit.
CudaStreamPool was a static singleton shared across all threads. With
multiple concurrent query threads, calls to MultiSegmentCagraSearch.search()
would alias onto the same pool slots after Math.floorMod, causing concurrent
cudaEventRecord and cudaStreamWaitEvent calls on the same event handle
(undefined behavior) and cross-thread stream interference. Additionally,
CudaStreamPool.closeInstance() was called from CuVSResourcesImpl.close(),
so whichever thread closed its resources first would destroy the shared pool
while other threads were still using it.
Fix: make CudaStreamPool a per-CuVSResources instance rather than a static
singleton. One pool is created and owned by each CuVSResourcesImpl; it is
closed when that instance is closed. Since CuVSResources is thread-local in
the Lucene integration, each query thread gets its own independent set of
streams and events with no sharing or locking required.
- CudaStreamPool: remove static singleton (getOrCreate, closeInstance, static
volatile instance, static AtomicInteger slotCounter); add package-private
constructor; replace static slotCounter with an instance int and a new
nextSlot(int count) method.
- CuVSResourcesImpl: add final CudaStreamPool streamPool field (sized from
com.nvidia.cuvs.streamPoolSize system property); close it directly in
close(); add static getStreamPool(CuVSResources) helper for
MultiSegmentCagraSearch to retrieve the per-resources pool.
- MultiSegmentCagraSearch: get pool via CuVSResourcesImpl.getStreamPool
and advance via pool.nextSlot. Remove redundant cuvsStreamSync after
cuvsSelectK — the D2H copies are enqueued on the same stream so CUDA
ordering already serializes them. Replace three separate hostArena.allocate
calls with one contiguous allocation (Long.BYTES-aligned) sliced into three
typed views, reducing OS-level allocation overhead per query.
The persistent kernel runner was previously keyed on (dataset_desc, graph, fixed search params), which forced a destroy/recreate cycle for every segment when searching a multi-segment Lucene index: each segment has a different graph pointer and potentially a different auto-computed max_iterations, producing a different hash on every call. C++ changes (search_single_cta_kernel-inl.cuh): - Move dataset_desc_ptr, graph_ptr, and graph_degree from fixed runner state into per-job fields in job_desc_t. The persistent kernel reads them from the job descriptor, so one runner instance can serve any number of segments without being rebuilt. - Remove dataset_desc and graph arguments from persistent_runner_t constructor and calculate_parameter_hash; the runner is now keyed only on fixed kernel parameters (block_size, smem_size, itopk, etc.). - Update select_and_run to initialize the device descriptor on the caller's stream and synchronize before submission, then pass dd_dev_ptr, graph.data_handle(), and graph_degree to runner::launch. - Remove dd_host from persistent_runner_t; dataset upload is now the caller's responsibility on each launch. Java changes: - Add persistent, persistentLifetime, and persistentDeviceUsage fields, getters, and Builder methods to CagraSearchParams. - Wire the three persistent params through CuVSParamsHelper into the Panama-generated cuvsCagraSearchParams struct.
…raSearch In persistent mode, `searchIntoBuffer` blocks on the CPU until the GPU signals completion via a system-scope atomic. Previously, segments were searched sequentially, so the GPU processed one segment at a time per query, leaving its job queue mostly idle between segment dispatches. Submit one async task per pool slot so all slots' segment searches are in-flight simultaneously. The persistent runner's job queue can hold all N segment jobs at once, allowing GPU workers to execute segments in parallel (bounded by worker_queue_size). Segments are grouped by pool slot rather than submitted one-per-segment to prevent concurrent access to the same cuvsResources_t handle: the descriptor_cache stored inside the RAFT resources object is not thread-safe, and multiple threads calling cuvsCagraSearch with the same handle causes a SIGSEGV. Grouping ensures each cuvsResources_t is accessed by at most one thread at a time. Effective parallelism is min(numSegments, pool.size()); increasing cuvsStreamPoolSize raises the ceiling. In non-persistent mode the existing sequential loop is unchanged: kernel launches are asynchronous and return immediately, so Java-level parallelism adds overhead without benefit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Multi-segment search (C++/C/Java): - Add search_multi_segment() C++ function: builds a single search plan sized for the largest segment, packs per-segment descriptors, and launches search_kernel_ms — a new SINGLE_CTA kernel with grid dimensions (1, num_queries, num_segments) so each CTA independently searches one (query, segment) pair in a single kernel call. - Add cuvsCagraSearchMultiSegment() C API wrapping the above. - Simplify MultiSegmentCagraSearch.java to call cuvsCagraSearchMultiSegment() unconditionally, removing the previous per-segment stream pool, thread pool, CUDA event synchronization, and persistent/non-persistent branching. The search now completes in four phases: multi-segment kernel, GPU-side select-k, single D2H copy, result decoding. - Add BufferedCagraSearch.getIndexHandle() to expose the raw cuvsCagraIndex_t handle needed by the multi-segment kernel dispatch. Workspace pool (C/Java): - Add cuvsResourcesSetWorkspacePool(): configures the per-resources temporary workspace as an uncapped RMM pool that grows without shrinking. After warmup, cuvsRMMAlloc/cuvsRMMFree hit the pool cache instead of calling cudaMallocAsync/cudaFreeAsync, eliminating CUDA context lock contention under concurrent query threads. - Route cuvsRMMAlloc/cuvsRMMFree through the workspace resource so Java- side output buffer allocations also benefit from the pool. - Expose setWorkspacePool() in the CuVSResources Java interface with implementations in CuVSResourcesImpl and SynchronizedCuVSResources. Refactoring (search_single_cta_kernel-inl.cuh): - Extract TopkVariant enum and select_topk_variant() helper, shared by search_kernel_config and search_kernel_config_ms, replacing duplicated if/else trees in both choose_itopk_and_mx_candidates() bodies. - Extract kernel_dispatch_params::compute() to centralize max_candidates and max_itopk computation shared by select_and_run and select_and_run_multi_segment. - Extract hashmap_element_count() static helper on the search struct, used by both set_params (single-segment) and run_multi_segment.
Add HALF(2) to the DataType enum and wire it through the Java layer so
callers can build and exchange float16 matrices without any Java-side
type conversion:
- CuVSMatrix.Builder: new addVector(short[]) overload; each element is
a raw IEEE 754 binary16 bit pattern held in a short
- LinkerHelper: add C_SHORT via canonicalLayouts("short")
- CuVSMatrixBaseImpl: map HALF → C_SHORT in valueLayoutFromType();
decode DLPack (kDLFloat, bits=16) → DataType.HALF in
dataTypeFromTensor()
- CuVSMatrixInternal: HALF maps to the same kDLFloat DLPack type code
as FLOAT; the bits field (16 vs 32) distinguishes them on the C side
- JDKProvider.MatrixBuilder: implement addVector(short[]) following the
same MemorySegment.ofArray pattern as the other overloads
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
📝 WalkthroughWalkthroughThis pull request introduces multi-segment CAGRA search functionality, GPU-based selection, and RMM memory pool management. It adds C and C++ APIs for multi-segment approximate nearest neighbor search, async memory allocation, and select-K operations, alongside corresponding Java bindings and internal implementations supporting concurrent per-segment GPU processing and stream-based result aggregation. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java (1)
293-322:⚠️ Potential issue | 🟡 MinorThe
toString()method does not include the new persistent fields.The three new fields (
persistent,persistentLifetime,persistentDeviceUsage) are not included in thetoString()output. While not critical, this creates an inconsistency where debugging output won't show the full configuration.🔧 Proposed fix to include persistent fields in toString()
+ ", randXORMask=" + randXORMask + + ", persistent=" + + persistent + + ", persistentLifetime=" + + persistentLifetime + + ", persistentDeviceUsage=" + + persistentDeviceUsage + "]";🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java` around lines 293 - 322, The toString() in class CagraSearchParams currently omits the new fields persistent, persistentLifetime, and persistentDeviceUsage; update the CagraSearchParams.toString() method to append these three fields (with their names and values) into the returned string (same formatting style as the other fields) so debugging output shows the full configuration.
🧹 Nitpick comments (3)
java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java (1)
59-62: Add destroyed-state guard in the new delegate method.
setWorkspacePool(...)should callcheckNotDestroyed()before delegating, consistent with this wrapper’s defensive contract.Proposed change
`@Override` public void setWorkspacePool(long sizeBytes) { + checkNotDestroyed(); inner.setWorkspacePool(sizeBytes); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java` around lines 59 - 62, The new delegate method setWorkspacePool(long sizeBytes) is missing the wrapper's destroyed-state guard; modify CheckedCuVSResources.setWorkspacePool to call checkNotDestroyed() at the start (before delegating to inner.setWorkspacePool(sizeBytes)) so it matches the class's defensive contract and uses the same destroyed-state check as other methods.java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java (1)
527-537: Consider adding validation forpersistentDeviceUsagebounds.The Javadoc states the value "must be greater than 0.0 and not greater than 1.0", but the setter doesn't validate this constraint. Invalid values would only fail at the native layer.
🛡️ Optional: Add validation in the builder
public Builder withPersistentDeviceUsage(float persistentDeviceUsage) { + if (persistentDeviceUsage <= 0.0f || persistentDeviceUsage > 1.0f) { + throw new IllegalArgumentException( + "persistentDeviceUsage must be > 0.0 and <= 1.0, got: " + persistentDeviceUsage); + } this.persistentDeviceUsage = persistentDeviceUsage; return this; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java` around lines 527 - 537, The builder setter withPersistentDeviceUsage currently assigns persistentDeviceUsage without enforcing the Javadoc constraint; update Builder.withPersistentDeviceUsage to validate that persistentDeviceUsage > 0.0f and <= 1.0f and throw an IllegalArgumentException (including the invalid value in the message) when the check fails so invalid values are rejected early before reaching native code.cpp/src/neighbors/detail/cagra/cagra_search.cuh (1)
388-417: Consider hoisting query_norms allocation outside the loop for CosineExpanded metric.When multiple segments use
CosineExpandedmetric,query_normsis allocated and computed redundantly for each segment. Since queries are the same across segments (repeated query vector), the norms could be computed once.This is a minor optimization opportunity. The current implementation is correct; the redundant computation is bounded by the number of segments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/cagra_search.cuh` around lines 388 - 417, Hoist the allocation and computation of query_norms out of the segment loop and reuse it for every segment whose indices[i]->metric() == cuvs::distance::DistanceType::CosineExpanded: allocate query_norms once (using raft::make_device_vector) before the for-loop, run raft::linalg::reduce on the shared queries data to compute the norms, then inside the loop call raft::linalg::matrix_vector_op (as currently done) using the precomputed query_norms for each CosineExpanded segment; after the loop free or let query_norms go out of scope. Ensure you still call cuvs::neighbors::ivf::detail::postprocess_distances for non-CosineExpanded branches and keep all existing ops (raft::compose_op, raft::sq_op, raft::div_const_op, raft::cast_op, raft::add_const_op, raft::div_checkzero_op) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@c/include/cuvs/core/c_api.h`:
- Around line 229-240: The current implementation of
cuvsRMMAsyncMemoryResourceEnable stores cuda_async_memory_resource in a
thread_local async_mr and passes it to rmm::mr::set_current_device_resource(),
creating a lifetime mismatch vs the API doc that says the change is global;
change the implementation so the async memory resource is process-scoped (not
thread_local) by replacing thread_local async_mr with a static/process-global
instance (e.g., a static unique_ptr or static object) so it outlives the thread
and remains valid for rmm::mr::set_current_device_resource(), and ensure
cuvsRMMAsyncMemoryResourceEnable and any cleanup use that same process-global
symbol (cuda_async_memory_resource / async_mr) when setting or resetting the
current device resource; alternatively, if thread-local semantics are intended,
update the cuvsRMMAsyncMemoryResourceEnable documentation to state it only
affects the calling thread and keep async_mr thread_local.
In `@c/src/neighbors/cagra.cpp`:
- Around line 710-733: The loop currently casts every indices[i]->addr and
builds device views without validating each segment's types/devices; update the
loop that fills idx_vec, q_vec, n_vec, d_vec to perform the same per-segment
checks used in cuvsCagraSearch: assert indices[i] != nullptr and addr != 0
(already present), then verify indices[i]->dtype.code == kDLFloat &&
indices[i]->dtype.bits == 32 for every i, and validate that queries[i],
neighbors[i], and distances[i] are device-backed DLPack tensors with the
expected element types (float for queries/distances, uint32 for neighbors)
before calling reinterpret_cast<const IndexT*>(indices[i]->addr) and
cuvs::core::from_dlpack to populate q_vec[i], n_vec[i], d_vec[i]; replace the
blind casts with RAFT_EXPECTS that include i in the error messages so a bad
segment fails fast and clearly.
- Around line 726-736: The loop that builds idx_vec/q_vec/n_vec/d_vec must also
validate that all segment indices use the same distance metric as the first
segment to prevent mixing incompatible metrics; inside the for-loop in cagra.cpp
(after the existing RAFT_EXPECTS that checks indices[i] non-null) compare
indices[i]->metric (or the actual metric field name on your IndexT struct/class)
to indices[0]->metric and fail fast with RAFT_EXPECTS (or equivalent) and a
clear message like "Mixed distance metrics across segments: expected %s but got
%s at segment %u"; keep this check before pushing idx_vec[i] and before calling
cuvs::neighbors::cagra::search_multi_segment so the function only runs when all
segments share the same metric.
In `@c/src/selection/select_k.cpp`:
- Around line 14-40: cuvsSelectK dereferences shape[1] and casts buffers without
validating the DLPack tensors; add explicit validation at the top of cuvsSelectK
for in_val, out_val, out_idx (non-null), then check each
DLManagedTensor->dl_tensor for expected ndim (==2), shapes (rows match expected
1 or compatible), dtype (in_val/out_val float32, out_idx int64), device type
(CUDA) and device id, byte_offset == 0, and contiguous row-major
strides/compatibility before creating device views with
raft::make_device_matrix_view; if any check fails return an appropriate
cuvsError_t (or throw inside translate_exceptions) instead of proceeding to
casts and calling cuvs::selection::select_k so malformed callers cannot crash or
corrupt memory.
In `@cpp/include/cuvs/neighbors/cagra.hpp`:
- Around line 1752-1806: Add Doxygen documentation for each overloaded
search_multi_segment declaration so they appear in generated API docs; either
add brief doxygen blocks above each overload or use `@copydoc` to reference the
primary search_multi_segment doc block (e.g., use `@copydoc`
search_multi_segment(raft::resources const&,
cuvs::neighbors::cagra::search_params const&, const std::vector<const
cuvs::neighbors::cagra::index<float, uint32_t>*>&, const
std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>&,
const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>&,
const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>&))
for the overloads with half/int8_t/uint8_t and uint32_t/int64_t neighbor types
so each signature (the overloads of search_multi_segment) is documented.
- Around line 1744-1806: The header declares the template parameter order as <T,
OutputIdxT, IdxT> but the instantiation macro (used with <T, IdxT, OutputIdxT>
e.g. (data_t, uint32_t, int64_t)) expects <T, IdxT, OutputIdxT>, causing the
IdxT/OutputIdxT swap that trips the static_assert in cagra_search.cuh:276 and
breaks link-time overloads for int64_t; fix by changing the template parameter
order in the search_multi_segment declarations to <T, IdxT, OutputIdxT> (or
alternatively update the instantiation macro to match the declared order) so
types map correctly, and add missing Doxygen for overloads 2–8 by inserting a
`@copydoc` search_multi_segment (or equivalent documentation block) above each of
those overloaded search_multi_segment declarations to satisfy the public API
docs requirement.
In `@cpp/src/neighbors/cagra.cuh`:
- Around line 409-420: The wrapper search_multi_segment currently forwards
indices, queries, neighbors, and distances without validation; add the same
upfront shape checks used by search() before calling
cagra::detail::search_multi_segment: verify indices.size() == queries.size() ==
neighbors.size() == distances.size(), then for each segment i ensure
queries[i].n_rows == neighbors[i].n_rows, neighbors[i].n_cols ==
distances[i].n_cols (k matches), queries[i].n_cols equals the index dimension
for indices[i] (or indices[i]->dim() / appropriate accessor), and that k > 0 and
dims are consistent; if any check fails, return/throw a clear error (or use
RAFT/CUASSERT used elsewhere) rather than forwarding to the detail
implementation.
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSResources.java`:
- Around line 60-76: The Java API should validate workspace pool sizes before
calling native code: update the Javadoc for setWorkspacePool to state the valid
range is > 0, and in the implementation of setWorkspacePool (the method that
currently invokes the native cuvsResourcesSetWorkspacePool) add a check that
rejects non-positive values (<= 0) by throwing an appropriate Java exception
(e.g., IllegalArgumentException) with a clear message; only call
cuvsResourcesSetWorkspacePool when the value is positive to avoid
signed->unsigned wraparound in native size_t.
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/MultiSegmentSearchResults.java`:
- Around line 24-27: MultiSegmentSearchResults currently stores native uint32
ordinals in an int[] (field ordinals) which corrupts values > Integer.MAX_VALUE;
change ordinals from int[] to long[] (and any constructor/getter signatures) and
ensure code that decodes native uint32_t values writes unsigned values into the
long (e.g., value & 0xFFFFFFFFL) so ordinals remain non-negative; update any
consumers (notably MultiSegmentCagraSearch) to compare against a long sentinel
(e.g., -1L) or otherwise handle long ordinals instead of treating negative ints
as sentinels.
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/SynchronizedCuVSResources.java`:
- Around line 43-46: The setWorkspacePool method in SynchronizedCuVSResources is
not using the shared lock and must be serialized like access(); modify
setWorkspacePool to acquire the same lock used by access() (e.g., wrap the call
to inner.setWorkspacePool(sizeBytes) in the synchronized block or lock guard
used by access()) so that mutations to workspace pool are protected by the same
synchronization as access().
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java`:
- Around line 391-396: The offset calculation uses segmentIdx * topK
(neighborByteOffset/distanceByteOffset) but the code builds tensors with shape
{numQueries, topK} (numQueries from queryVectors.size()), so when numQueries > 1
the buffer offsets are wrong; either enforce single-query by adding a guard (if
(numQueries != 1) throw new IllegalArgumentException(...)) near where numQueries
is computed (queryVectors) or update the offsets to multiply by numQueries
(neighborByteOffset = segmentIdx * numQueries * topK * C_INT_BYTE_SIZE and
distanceByteOffset = segmentIdx * numQueries * topK * Float.BYTES) before
creating neighborSlice/distanceSlice (globalNeighborsDP/globalDistancesDP).
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`:
- Around line 115-119: Make nextSlot atomic and wrap the returned slot index by
changing slotCounter to an AtomicInteger and using an atomic get-and-add plus
modulo; specifically, replace the non-atomic increment in nextSlot with
something like int start = slotCounter.getAndAdd(count); then return
Math.floorMod(start, poolSize) (or equivalent using your pool size field) so the
operation is thread-safe and the returned slot is bounded by the pool size.
- Around line 125-131: In CudaStreamPool.close(), wrap the cuvsResourcesDestroy
call with the same error validation used elsewhere by replacing the raw
cuvsResourcesDestroy(resources[i]) invocation with a call to
checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy") so
cleanup failures are logged/handled; keep the surrounding loop and existing
calls to checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy") and
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy") unchanged and
reference the close(), cuvsResourcesDestroy, checkCuVSError, events, resources,
streams, and size symbols to locate the change.
- Around line 59-85: The constructor CudaStreamPool currently leaks native
handles if a create call fails mid-loop; modify the CudaStreamPool(int size)
constructor to perform rollback cleanup on failure by tracking the current index
and, if any checkCudaError/checkCuVSError throws, iterating over
already-initialized entries in resources[], streams[], and events[] to call the
corresponding destroy functions (cudaStreamDestroy for streams[],
cudaEventDestroy for events[], and the cuvs resources destroy routine for
resources[]) before rethrowing the exception; implement this by wrapping the
allocation loop in try/catch (or try/finally with a success flag) and invoking
the same cleanup logic as close() for indices < currentIndex so no native
handles are leaked if construction aborts.
In
`@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java`:
- Around line 31-33: The CudaStreamPool is currently allocated in the field
initializer (streamPool) which can leak native resources if
CuVSResourcesImpl(Path) throws before construction completes; move creation of
the CudaStreamPool from the field initializer into the CuVSResourcesImpl(Path)
constructor (use Integer.getInteger(CudaStreamPool.SIZE_PROPERTY,
CudaStreamPool.DEFAULT_SIZE) to determine size), assign it to the streamPool
field there, and in the constructor failure path ensure you call
streamPool.close() (or otherwise tear it down) before rethrowing so native
resources are not leaked; keep the streamPool field declaration but initialize
it only in the constructor and ensure close() is reachable on exceptions.
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`:
- Around line 119-145: The code must enforce the single-query contract or size
outputs from the actual query row count: in MultiSegmentCagraSearch, after
obtaining var queryVectors = (CuVSMatrixInternal)
queries.get(i).getQueryVectors(), read its row count (e.g.,
queryVectors.getRowCount()/numRows()/size(0) — use the actual accessor on
CuVSMatrixInternal) into int nq; if nq > 1 throw an IllegalArgumentException
rejecting multi-row queries, or alternatively set segShape = new long[] {nq, k}
and compute neighbor/distance byte offsets and tensor sizes using nq*k (update
nByteOffset/dByteOffset and prepareTensor calls for
neighborsArray/distancesArray accordingly) so prepareTensor and the
globalNeighborsDP/globalDistancesDP slices match the query row count.
- Around line 96-111: The code in MultiSegmentCagraSearch currently uses
CagraSearchParams built from queries.get(0) (via
CuVSParamsHelper.buildCagraSearchParams) and thus drops per-segment settings
from subsequent CagraQuery entries; fix by either validating that all CagraQuery
instances in queries have identical search params and no per-segment filters
(throw IllegalArgumentException from MultiSegmentCagraSearch if any CagraQuery
differs from queries.get(0)), or plumb per-segment parameters through the native
call: extend the native wrapper (the cuvsCagraSearchMultiSegment binding) and
CuVSParamsHelper to accept/allocate an array of CagraSearchParams (build one
MemorySegment per CagraQuery) and pass that array/handle when invoking the
multi-segment search so each segment’s filters/params are applied.
---
Outside diff comments:
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java`:
- Around line 293-322: The toString() in class CagraSearchParams currently omits
the new fields persistent, persistentLifetime, and persistentDeviceUsage; update
the CagraSearchParams.toString() method to append these three fields (with their
names and values) into the returned string (same formatting style as the other
fields) so debugging output shows the full configuration.
---
Nitpick comments:
In `@cpp/src/neighbors/detail/cagra/cagra_search.cuh`:
- Around line 388-417: Hoist the allocation and computation of query_norms out
of the segment loop and reuse it for every segment whose indices[i]->metric() ==
cuvs::distance::DistanceType::CosineExpanded: allocate query_norms once (using
raft::make_device_vector) before the for-loop, run raft::linalg::reduce on the
shared queries data to compute the norms, then inside the loop call
raft::linalg::matrix_vector_op (as currently done) using the precomputed
query_norms for each CosineExpanded segment; after the loop free or let
query_norms go out of scope. Ensure you still call
cuvs::neighbors::ivf::detail::postprocess_distances for non-CosineExpanded
branches and keep all existing ops (raft::compose_op, raft::sq_op,
raft::div_const_op, raft::cast_op, raft::add_const_op, raft::div_checkzero_op)
unchanged.
In `@java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.java`:
- Around line 527-537: The builder setter withPersistentDeviceUsage currently
assigns persistentDeviceUsage without enforcing the Javadoc constraint; update
Builder.withPersistentDeviceUsage to validate that persistentDeviceUsage > 0.0f
and <= 1.0f and throw an IllegalArgumentException (including the invalid value
in the message) when the check fails so invalid values are rejected early before
reaching native code.
In `@java/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java`:
- Around line 59-62: The new delegate method setWorkspacePool(long sizeBytes) is
missing the wrapper's destroyed-state guard; modify
CheckedCuVSResources.setWorkspacePool to call checkNotDestroyed() at the start
(before delegating to inner.setWorkspacePool(sizeBytes)) so it matches the
class's defensive contract and uses the same destroyed-state check as other
methods.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 626e7fa1-95fc-4f40-b30d-d7fbcb6521a6
📒 Files selected for processing (40)
c/CMakeLists.txtc/include/cuvs/core/c_api.hc/include/cuvs/neighbors/cagra.hc/include/cuvs/selection/select_k.hc/src/core/c_api.cppc/src/neighbors/cagra.cppc/src/selection/select_k.cppcpp/include/cuvs/neighbors/cagra.hppcpp/src/neighbors/cagra.cuhcpp/src/neighbors/cagra_search_inst.cu.incpp/src/neighbors/detail/cagra/cagra_search.cuhcpp/src/neighbors/detail/cagra/search_single_cta.cuhcpp/src/neighbors/detail/cagra/search_single_cta_inst.cuhcpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuhcpp/src/neighbors/detail/cagra/search_single_cta_kernel.cuhjava/cuvs-java/src/main/java/com/nvidia/cuvs/CagraIndex.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/CagraSearchParams.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSAceParams.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSMatrix.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/CuVSResources.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/HnswAceParams.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/HnswIndex.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/HnswIndexParams.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/MultiSegmentSearchResults.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/SynchronizedCuVSResources.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/spi/CuVSProvider.javajava/cuvs-java/src/main/java/com/nvidia/cuvs/spi/UnsupportedProvider.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/BufferedCagraSearch.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSMatrixBaseImpl.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSMatrixInternal.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSParamsHelper.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/HnswIndexImpl.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/SelectKHelper.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/common/LinkerHelper.javajava/cuvs-java/src/main/java22/com/nvidia/cuvs/spi/JDKProvider.javajava/cuvs-java/src/test/java/com/nvidia/cuvs/CheckedCuVSResources.java
| /** | ||
| * @brief Switches the working memory resource to use stream-ordered asynchronous allocation | ||
| * (cudaMallocAsync / cudaFreeAsync). Unlike the pool resource, this resource returns memory to | ||
| * the stream immediately without blocking the CPU, eliminating device-wide synchronization on | ||
| * deallocation. This is especially beneficial when multiple CAGRA searches run concurrently on | ||
| * separate CUDA streams, because the internal workspace allocations no longer serialize kernel | ||
| * launches. Be aware that this function will change the memory resource for the whole process | ||
| * and the new memory resource will be used until explicitly changed. | ||
| * | ||
| * @return cuvsError_t | ||
| */ | ||
| cuvsError_t cuvsRMMAsyncMemoryResourceEnable(); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "c_api.cpp" -path "*/c/src/core/*" | head -5Repository: rapidsai/cuvs
Length of output: 79
🏁 Script executed:
cat -n c/src/core/c_api.cpp | sed -n '160,195p'Repository: rapidsai/cuvs
Length of output: 1665
🏁 Script executed:
cd c/src/core && grep -n "thread_local" c_api.cppRepository: rapidsai/cuvs
Length of output: 269
🏁 Script executed:
cd c/include/cuvs/core && grep -n "cuvsRMMAsyncMemoryResourceEnable" c_api.h -A 15Repository: rapidsai/cuvs
Length of output: 580
🏁 Script executed:
# Search for RMM set_current_device_resource usage patterns in the codebase
rg "set_current_device_resource" --type cpp -B 2 -A 2Repository: rapidsai/cuvs
Length of output: 2101
🏁 Script executed:
# Check if there are any docs changes or migration guides in the PR
find . -name "*.md" -o -name "*.rst" | xargs grep -l "RMM\|AsyncMemory\|cuvsRMM" 2>/dev/null | head -10Repository: rapidsai/cuvs
Length of output: 200
🏁 Script executed:
# Check for any deprecation warnings or migration documentation
find . -type f \( -name "*.h" -o -name "*.md" \) -path "*/docs/*" | xargs grep -l "cuvsRMM\|memory.*resource" 2>/dev/null | head -5Repository: rapidsai/cuvs
Length of output: 89
🏁 Script executed:
# Check if there are any API change notes or migration guides in the docs
cat docs/source/api_basics.rst | head -100Repository: rapidsai/cuvs
Length of output: 2518
🏁 Script executed:
# Check CHANGELOG for any RMM API changes mentioned
rg "Async|AsyncMemory|cuvsRMMAsync" CHANGELOG.md -B 2 -A 2Repository: rapidsai/cuvs
Length of output: 39
🏁 Script executed:
# Check if reset_current_device_resource clears the async_mr
cat -n c/src/core/c_api.cpp | sed -n '192,210p'Repository: rapidsai/cuvs
Length of output: 784
🏁 Script executed:
# Check how pool memory resource is handled - it passes rvalues
# Let's see if there's any documentation on RMM's set_current_device_resource behavior
rg "set_current_device_resource" --type h | head -20Repository: rapidsai/cuvs
Length of output: 331
🏁 Script executed:
# Look for RMM in vendored code or requirements to understand its semantics
find . -name "*.txt" -o -name "*.cmake" | xargs grep -i "rmm" | head -10Repository: rapidsai/cuvs
Length of output: 779
🏁 Script executed:
# Check if there's any documentation update about async memory resource being process-wide
rg "cuvsRMMAsyncMemoryResourceEnable" docs/ 2>/dev/null || echo "No docs match found"Repository: rapidsai/cuvs
Length of output: 76
🏁 Script executed:
# Let's check how the pool resource implementation differs from async
# The pool passes a rvalue (temporary), while async passes a pointer to thread_local
# Search for any notes on RMM memory resource ownership in comments
rg "ownership|lifetime|thread" c/src/core/c_api.cpp | head -20Repository: rapidsai/cuvs
Length of output: 275
🏁 Script executed:
# Check if cuvsRMMPoolMemoryResourceEnable has similar issues by checking if it's thread_local or static
rg "pool_mr" c/src/core/c_api.cpp -B 5 -A 5Repository: rapidsai/cuvs
Length of output: 665
🏁 Script executed:
# Let's check the RMM version used and look for any behavior documentation
rg "rmm" vcpkg.json conanfile.txt requirements.txt 2>/dev/null | head -10Repository: rapidsai/cuvs
Length of output: 39
🏁 Script executed:
# Check if there are any thread safety comments or warnings in the C API header
cat -n c/include/cuvs/core/c_api.h | sed -n '220,250p'Repository: rapidsai/cuvs
Length of output: 1728
The async-memory resource owner cannot be thread_local when this API changes the current resource globally.
The implementation at c/src/core/c_api.cpp:188 stores cuda_async_memory_resource in thread_local async_mr and passes it to rmm::mr::set_current_device_resource(), but the documentation explicitly states this function "will change the memory resource for the whole process" (line 235). This creates a critical lifetime mismatch:
- If
set_current_device_resource()is device-scoped (affecting all threads), then when the enabling thread exits, itsthread_local async_mris destroyed while still registered as the current resource, leaving RMM with a dangling pointer. - If thread-local semantics were intended, the documentation must be updated to reflect that this only affects the calling thread.
The pool resource avoids this issue by passing temporary rvalues to set_current_device_resource(), allowing RMM to manage the lifetime. Either make async_mr process/device-scoped (not thread_local), or narrow the documentation and implementation to clarify thread-local semantics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/include/cuvs/core/c_api.h` around lines 229 - 240, The current
implementation of cuvsRMMAsyncMemoryResourceEnable stores
cuda_async_memory_resource in a thread_local async_mr and passes it to
rmm::mr::set_current_device_resource(), creating a lifetime mismatch vs the API
doc that says the change is global; change the implementation so the async
memory resource is process-scoped (not thread_local) by replacing thread_local
async_mr with a static/process-global instance (e.g., a static unique_ptr or
static object) so it outlives the thread and remains valid for
rmm::mr::set_current_device_resource(), and ensure
cuvsRMMAsyncMemoryResourceEnable and any cleanup use that same process-global
symbol (cuda_async_memory_resource / async_mr) when setting or resetting the
current device resource; alternatively, if thread-local semantics are intended,
update the cuvsRMMAsyncMemoryResourceEnable documentation to state it only
affects the calling thread and keep async_mr thread_local.
| // Only float32 is supported for multi-segment search. | ||
| RAFT_EXPECTS( | ||
| indices[0]->dtype.code == kDLFloat && indices[0]->dtype.bits == 32, | ||
| "Multi-segment search only supports float32 indices"); | ||
|
|
||
| using T = float; | ||
| using IdxT = uint32_t; | ||
| using OutIdxT = uint32_t; | ||
| using DistanceT = float; | ||
| using IndexT = cuvs::neighbors::cagra::index<T, IdxT>; | ||
|
|
||
| std::vector<const IndexT*> idx_vec(num_segments); | ||
| std::vector<raft::device_matrix_view<const T, int64_t, raft::row_major>> q_vec(num_segments); | ||
| std::vector<raft::device_matrix_view<OutIdxT, int64_t, raft::row_major>> n_vec(num_segments); | ||
| std::vector<raft::device_matrix_view<DistanceT, int64_t, raft::row_major>> d_vec(num_segments); | ||
|
|
||
| for (uint32_t i = 0; i < num_segments; i++) { | ||
| RAFT_EXPECTS(indices[i] != nullptr && indices[i]->addr != 0, | ||
| "Index at position %u is null or not built", i); | ||
| idx_vec[i] = reinterpret_cast<const IndexT*>(indices[i]->addr); | ||
| q_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(q_vec[i])>>(queries[i]); | ||
| n_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(n_vec[i])>>(neighbors[i]); | ||
| d_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(d_vec[i])>>(distances[i]); | ||
| } |
There was a problem hiding this comment.
Validate every segment before the reinterpret_cast.
Only indices[0] is checked for float32. The loop then blindly casts every indices[i]->addr to index<float, uint32_t>* and builds float32 device views from queries[i], neighbors[i], and distances[i]. If any later segment is half/int8/uint8 or any tensor is not device-backed, this path will feed the kernel mismatched types and can produce garbage results or GPU faults. Please mirror the per-input dtype/device/null checks from cuvsCagraSearch for each segment before from_dlpack.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/neighbors/cagra.cpp` around lines 710 - 733, The loop currently casts
every indices[i]->addr and builds device views without validating each segment's
types/devices; update the loop that fills idx_vec, q_vec, n_vec, d_vec to
perform the same per-segment checks used in cuvsCagraSearch: assert indices[i]
!= nullptr and addr != 0 (already present), then verify indices[i]->dtype.code
== kDLFloat && indices[i]->dtype.bits == 32 for every i, and validate that
queries[i], neighbors[i], and distances[i] are device-backed DLPack tensors with
the expected element types (float for queries/distances, uint32 for neighbors)
before calling reinterpret_cast<const IndexT*>(indices[i]->addr) and
cuvs::core::from_dlpack to populate q_vec[i], n_vec[i], d_vec[i]; replace the
blind casts with RAFT_EXPECTS that include i in the error messages so a bad
segment fails fast and clearly.
| for (uint32_t i = 0; i < num_segments; i++) { | ||
| RAFT_EXPECTS(indices[i] != nullptr && indices[i]->addr != 0, | ||
| "Index at position %u is null or not built", i); | ||
| idx_vec[i] = reinterpret_cast<const IndexT*>(indices[i]->addr); | ||
| q_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(q_vec[i])>>(queries[i]); | ||
| n_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(n_vec[i])>>(neighbors[i]); | ||
| d_vec[i] = cuvs::core::from_dlpack<std::remove_reference_t<decltype(d_vec[i])>>(distances[i]); | ||
| } | ||
|
|
||
| cuvs::neighbors::cagra::search_multi_segment( | ||
| *res_ptr, search_params, idx_vec, q_vec, n_vec, d_vec); |
There was a problem hiding this comment.
Reject mixed-distance metrics across segments.
This API combines raw distances from all segments into one global ranking, so all indices must use the same metric. Right now nothing checks that indices[i] matches indices[0] on metric, which means mixing L2/IP/Cosine segments will return nonsensical top-k ordering even though the header says the distances are comparable across segments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/neighbors/cagra.cpp` around lines 726 - 736, The loop that builds
idx_vec/q_vec/n_vec/d_vec must also validate that all segment indices use the
same distance metric as the first segment to prevent mixing incompatible
metrics; inside the for-loop in cagra.cpp (after the existing RAFT_EXPECTS that
checks indices[i] non-null) compare indices[i]->metric (or the actual metric
field name on your IndexT struct/class) to indices[0]->metric and fail fast with
RAFT_EXPECTS (or equivalent) and a clear message like "Mixed distance metrics
across segments: expected %s but got %s at segment %u"; keep this check before
pushing idx_vec[i] and before calling
cuvs::neighbors::cagra::search_multi_segment so the function only runs when all
segments share the same metric.
| extern "C" cuvsError_t cuvsSelectK(cuvsResources_t res, | ||
| DLManagedTensor* in_val, | ||
| DLManagedTensor* out_val, | ||
| DLManagedTensor* out_idx) | ||
| { | ||
| return cuvs::core::translate_exceptions([=] { | ||
| auto* res_ptr = reinterpret_cast<raft::resources*>(res); | ||
|
|
||
| int64_t n = in_val->dl_tensor.shape[1]; | ||
| int64_t k = out_val->dl_tensor.shape[1]; | ||
|
|
||
| auto in_view = raft::make_device_matrix_view<const float, int64_t, raft::row_major>( | ||
| static_cast<const float*>(in_val->dl_tensor.data), 1, n); | ||
|
|
||
| auto out_val_view = raft::make_device_matrix_view<float, int64_t, raft::row_major>( | ||
| static_cast<float*>(out_val->dl_tensor.data), 1, k); | ||
|
|
||
| auto out_idx_view = raft::make_device_matrix_view<int64_t, int64_t, raft::row_major>( | ||
| static_cast<int64_t*>(out_idx->dl_tensor.data), 1, k); | ||
|
|
||
| cuvs::selection::select_k( | ||
| *res_ptr, | ||
| in_view, | ||
| std::nullopt, // implicit positions [0, n) as in_idx | ||
| out_val_view, | ||
| out_idx_view, | ||
| true); // select_min = true (smallest distance = nearest neighbor) |
There was a problem hiding this comment.
Validate the DLPack contract before dereferencing shape[1] and casting buffers.
cuvsSelectK currently assumes non-null 2D CUDA tensors with float32 / float32 / int64 dtypes, zero offset, and contiguous row-major layout. Without checking ndim, shape, device, dtype, byte_offset, and stride compatibility, malformed callers can crash here or feed corrupted views into select_k.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/selection/select_k.cpp` around lines 14 - 40, cuvsSelectK dereferences
shape[1] and casts buffers without validating the DLPack tensors; add explicit
validation at the top of cuvsSelectK for in_val, out_val, out_idx (non-null),
then check each DLManagedTensor->dl_tensor for expected ndim (==2), shapes (rows
match expected 1 or compatible), dtype (in_val/out_val float32, out_idx int64),
device type (CUDA) and device id, byte_offset == 0, and contiguous row-major
strides/compatibility before creating device views with
raft::make_device_matrix_view; if any check fails return an appropriate
cuvsError_t (or throw inside translate_exceptions) instead of proceeding to
casts and calling cuvs::selection::select_k so malformed callers cannot crash or
corrupt memory.
| void search_multi_segment( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices, | ||
| const std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>& queries, | ||
| const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors, | ||
| const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances); | ||
|
|
||
| void search_multi_segment( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices, | ||
| const std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>& queries, | ||
| const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors, | ||
| const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances); | ||
|
|
||
| void search_multi_segment( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices, | ||
| const std::vector<raft::device_matrix_view<const half, int64_t, raft::row_major>>& queries, | ||
| const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors, | ||
| const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances); | ||
|
|
||
| void search_multi_segment( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices, | ||
| const std::vector<raft::device_matrix_view<const half, int64_t, raft::row_major>>& queries, | ||
| const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors, | ||
| const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances); | ||
|
|
||
| void search_multi_segment( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices, | ||
| const std::vector<raft::device_matrix_view<const int8_t, int64_t, raft::row_major>>& queries, | ||
| const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors, | ||
| const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances); | ||
|
|
||
| void search_multi_segment( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices, | ||
| const std::vector<raft::device_matrix_view<const int8_t, int64_t, raft::row_major>>& queries, | ||
| const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors, | ||
| const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances); | ||
|
|
||
| void search_multi_segment( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices, | ||
| const std::vector<raft::device_matrix_view<const uint8_t, int64_t, raft::row_major>>& queries, | ||
| const std::vector<raft::device_matrix_view<uint32_t, int64_t, raft::row_major>>& neighbors, | ||
| const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances); | ||
|
|
||
| void search_multi_segment( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices, | ||
| const std::vector<raft::device_matrix_view<const uint8_t, int64_t, raft::row_major>>& queries, | ||
| const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors, | ||
| const std::vector<raft::device_matrix_view<float, int64_t, raft::row_major>>& distances); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Declaration in cpp/src/neighbors/detail/cagra/cagra_search.cuh"
rg -n -A8 -B2 'template <typename T, typename OutputIdxT = uint32_t, typename IdxT = uint32_t' \
cpp/src/neighbors/detail/cagra/cagra_search.cuh
echo
echo "Explicit instantiation site in cpp/src/neighbors/cagra_search_inst.cu.in"
rg -n -A8 -B2 'CUVS_INST_CAGRA_SEARCH_MULTI_SEGMENT|search_multi_segment<' \
cpp/src/neighbors/cagra_search_inst.cu.inRepository: rapidsai/cuvs
Length of output: 2317
🏁 Script executed:
# Check for static_assert on IdxT in the implementation
rg -n "static_assert.*IdxT" cpp/src/neighbors/detail/cagra/cagra_search.cuh
echo "---"
# Read the header to see all overloads and their documentation
cat -n cpp/include/cuvs/neighbors/cagra.hpp | sed -n '1710,1810p'Repository: rapidsai/cuvs
Length of output: 6679
Fix template parameter order mismatch and add missing documentation.
The instantiation macro calls the template with parameter order <T, IdxT, OutputIdxT>, but the template declares them as <T, OutputIdxT, IdxT>. When the macro instantiates with (data_t, uint32_t, int64_t), it produces <data_t, uint32_t, int64_t>, incorrectly mapping IdxT to int64_t. This violates the static_assert at cagra_search.cuh:276 (IdxT == uint32_t) and will fail at link time for the int64_t output overloads.
Additionally, overloads 2–8 (lines 1752–1806) lack Doxygen documentation. Per the coding guidelines, all public functions must be documented. Add a @copydoc search_multi_segment directive before each additional overload, or provide explicit documentation blocks.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/include/cuvs/neighbors/cagra.hpp` around lines 1744 - 1806, The header
declares the template parameter order as <T, OutputIdxT, IdxT> but the
instantiation macro (used with <T, IdxT, OutputIdxT> e.g. (data_t, uint32_t,
int64_t)) expects <T, IdxT, OutputIdxT>, causing the IdxT/OutputIdxT swap that
trips the static_assert in cagra_search.cuh:276 and breaks link-time overloads
for int64_t; fix by changing the template parameter order in the
search_multi_segment declarations to <T, IdxT, OutputIdxT> (or alternatively
update the instantiation macro to match the declared order) so types map
correctly, and add missing Doxygen for overloads 2–8 by inserting a `@copydoc`
search_multi_segment (or equivalent documentation block) above each of those
overloaded search_multi_segment declarations to satisfy the public API docs
requirement.
| public int nextSlot(int count) { | ||
| int start = slotCounter; | ||
| slotCounter += count; | ||
| return start; | ||
| } |
There was a problem hiding this comment.
Make nextSlot atomic and modulo-bounded.
At Line 116–118, slotCounter updates are racy and the returned value is not wrapped despite the method contract; this can cause slot collisions and eventual overflow issues.
🔧 Proposed fix
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
+import java.util.concurrent.atomic.AtomicInteger;
@@
- private int slotCounter;
+ private final AtomicInteger slotCounter = new AtomicInteger(0);
@@
public int nextSlot(int count) {
- int start = slotCounter;
- slotCounter += count;
- return start;
+ if (count <= 0 || count > size) {
+ throw new IllegalArgumentException("count must be in [1, size]");
+ }
+ return Math.floorMod(slotCounter.getAndAdd(count), size);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| public int nextSlot(int count) { | |
| int start = slotCounter; | |
| slotCounter += count; | |
| return start; | |
| } | |
| public int nextSlot(int count) { | |
| if (count <= 0 || count > size) { | |
| throw new IllegalArgumentException("count must be in [1, size]"); | |
| } | |
| return Math.floorMod(slotCounter.getAndAdd(count), size); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`
around lines 115 - 119, Make nextSlot atomic and wrap the returned slot index by
changing slotCounter to an AtomicInteger and using an atomic get-and-add plus
modulo; specifically, replace the non-atomic increment in nextSlot with
something like int start = slotCounter.getAndAdd(count); then return
Math.floorMod(start, poolSize) (or equivalent using your pool size field) so the
operation is thread-safe and the returned slot is bounded by the pool size.
| @Override | ||
| public void close() { | ||
| for (int i = 0; i < size; i++) { | ||
| checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy"); | ||
| cuvsResourcesDestroy(resources[i]); | ||
| checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy"); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify all cuvsResourcesDestroy call sites are validated.
rg -n -C2 '\bcuvsResourcesDestroy\s*\(' --type=javaRepository: rapidsai/cuvs
Length of output: 1298
Wrap cuvsResourcesDestroy with error validation during cleanup.
Line 129 calls cuvsResourcesDestroy without checkCuVSError, allowing cleanup failures to be silently ignored. Other call sites in the codebase (e.g., CuVSResourcesImpl.java) properly validate the return code. Apply the same pattern here for consistency:
Proposed fix
- cuvsResourcesDestroy(resources[i]);
+ checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy");📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @Override | |
| public void close() { | |
| for (int i = 0; i < size; i++) { | |
| checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy"); | |
| cuvsResourcesDestroy(resources[i]); | |
| checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy"); | |
| } | |
| `@Override` | |
| public void close() { | |
| for (int i = 0; i < size; i++) { | |
| checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy"); | |
| checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy"); | |
| checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy"); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CudaStreamPool.java`
around lines 125 - 131, In CudaStreamPool.close(), wrap the cuvsResourcesDestroy
call with the same error validation used elsewhere by replacing the raw
cuvsResourcesDestroy(resources[i]) invocation with a call to
checkCuVSError(cuvsResourcesDestroy(resources[i]), "cuvsResourcesDestroy") so
cleanup failures are logged/handled; keep the surrounding loop and existing
calls to checkCudaError(cudaEventDestroy(events[i]), "cudaEventDestroy") and
checkCudaError(cudaStreamDestroy(streams[i]), "cudaStreamDestroy") unchanged and
reference the close(), cuvsResourcesDestroy, checkCuVSError, events, resources,
streams, and size symbols to locate the change.
| private final CudaStreamPool streamPool = | ||
| new CudaStreamPool( | ||
| Integer.getInteger(CudaStreamPool.SIZE_PROPERTY, CudaStreamPool.DEFAULT_SIZE)); |
There was a problem hiding this comment.
Construct the stream pool inside the constructor.
Allocating CudaStreamPool in a field initializer leaks native resources if CuVSResourcesImpl(Path) throws later during cuvsResourcesCreate/cuvsDeviceIdGet, because the object never finishes construction and close() is never reachable. Build the pool inside the constructor and tear it down in the constructor's failure path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CuVSResourcesImpl.java`
around lines 31 - 33, The CudaStreamPool is currently allocated in the field
initializer (streamPool) which can leak native resources if
CuVSResourcesImpl(Path) throws before construction completes; move creation of
the CudaStreamPool from the field initializer into the CuVSResourcesImpl(Path)
constructor (use Integer.getInteger(CudaStreamPool.SIZE_PROPERTY,
CudaStreamPool.DEFAULT_SIZE) to determine size), assign it to the streamPool
field there, and in the constructor failure path ensure you call
streamPool.close() (or otherwise tear it down) before rethrowing so native
resources are not leaked; keep the streamPool field declaration but initialize
it only in the constructor and ensure close() is reachable on exceptions.
| CagraSearchParams searchParameters = queries.get(0).getCagraSearchParameters(); | ||
|
|
||
| try (var resourcesAccessor = resources.access()) { | ||
| long cuvsRes = resourcesAccessor.handle(); | ||
| var cuvsStream = getStream(cuvsRes); | ||
|
|
||
| try (var globalNeighborsDP = allocateRMMSegment(cuvsRes, neighborsBytes); | ||
| var globalDistancesDP = allocateRMMSegment(cuvsRes, distancesBytes); | ||
| var outIdxDP = allocateRMMSegment(cuvsRes, outIdxBytes); | ||
| var outValDP = allocateRMMSegment(cuvsRes, outValBytes)) { | ||
|
|
||
| // --- Phase 1: call cuvsCagraSearchMultiSegment --- | ||
| // Single kernel launch covers all segments; results land in globalNeighborsDP / | ||
| // globalDistancesDP on the same CUDA stream, so SelectK below sees them via ordering. | ||
| try (var arena = Arena.ofConfined()) { | ||
| MemorySegment sp = CuVSParamsHelper.buildCagraSearchParams(arena, searchParameters); |
There was a problem hiding this comment.
Per-segment query settings are being dropped.
This path always builds search params from queries.get(0), and the native multi-segment API you're calling has no filter argument, so any search params or prefilter carried by queries[1..] are silently ignored. Either validate that every CagraQuery is equivalent and unfiltered, or plumb per-segment filtering/params through the native API before exposing List<CagraQuery> here.
Also applies to: 147-156
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`
around lines 96 - 111, The code in MultiSegmentCagraSearch currently uses
CagraSearchParams built from queries.get(0) (via
CuVSParamsHelper.buildCagraSearchParams) and thus drops per-segment settings
from subsequent CagraQuery entries; fix by either validating that all CagraQuery
instances in queries have identical search params and no per-segment filters
(throw IllegalArgumentException from MultiSegmentCagraSearch if any CagraQuery
differs from queries.get(0)), or plumb per-segment parameters through the native
call: extend the native wrapper (the cuvsCagraSearchMultiSegment binding) and
CuVSParamsHelper to accept/allocate an array of CagraSearchParams (build one
MemorySegment per CagraQuery) and pass that array/handle when invoking the
multi-segment search so each segment’s filters/params are applied.
| long[] segShape = {1, k}; | ||
| for (int i = 0; i < numSegments; i++) { | ||
| // Index handle | ||
| indexArray.setAtIndex(ValueLayout.ADDRESS, i, buffered[i].getIndexHandle()); | ||
|
|
||
| // Query DLTensor | ||
| var queryVectors = (CuVSMatrixInternal) queries.get(i).getQueryVectors(); | ||
| queriesArray.setAtIndex(ValueLayout.ADDRESS, i, queryVectors.toTensor(arena)); | ||
|
|
||
| // Neighbors DLTensor — slice of global buffer | ||
| long nByteOffset = (long) i * k * Integer.BYTES; | ||
| MemorySegment nSlice = | ||
| MemorySegment.ofAddress(globalNeighborsDP.handle().address() + nByteOffset); | ||
| neighborsArray.setAtIndex( | ||
| ValueLayout.ADDRESS, | ||
| i, | ||
| prepareTensor(arena, nSlice, segShape, kDLUInt(), 32, kDLCUDA())); | ||
|
|
||
| // Distances DLTensor — slice of global buffer | ||
| long dByteOffset = (long) i * k * Float.BYTES; | ||
| MemorySegment dSlice = | ||
| MemorySegment.ofAddress(globalDistancesDP.handle().address() + dByteOffset); | ||
| distancesArray.setAtIndex( | ||
| ValueLayout.ADDRESS, | ||
| i, | ||
| prepareTensor(arena, dSlice, segShape, kDLFloat(), 32, kDLCUDA())); | ||
| } |
There was a problem hiding this comment.
Enforce the single-query contract before sizing output buffers.
Each segment gets a [1, k] neighbors/distances tensor, but the input query tensor comes straight from queries.get(i).getQueryVectors(). If any caller passes nq > 1, the native search will try to write nq * k results into storage sized for only k, which is an out-of-bounds device write. Please reject multi-row query tensors here or size the output slices from the actual query row count.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@java/cuvs-java/src/main/java22/com/nvidia/cuvs/MultiSegmentCagraSearch.java`
around lines 119 - 145, The code must enforce the single-query contract or size
outputs from the actual query row count: in MultiSegmentCagraSearch, after
obtaining var queryVectors = (CuVSMatrixInternal)
queries.get(i).getQueryVectors(), read its row count (e.g.,
queryVectors.getRowCount()/numRows()/size(0) — use the actual accessor on
CuVSMatrixInternal) into int nq; if nq > 1 throw an IllegalArgumentException
rejecting multi-row queries, or alternatively set segShape = new long[] {nq, k}
and compute neighbor/distance byte offsets and tensor sizes using nq*k (update
nByteOffset/dByteOffset and prepareTensor calls for
neighborsArray/distancesArray accordingly) so prepareTensor and the
globalNeighborsDP/globalDistancesDP slices match the query row count.
| * @param[out] neighbors array of num_segments DLManagedTensor* (device, uint32, [nq, topk]) | ||
| * @param[out] distances array of num_segments DLManagedTensor* (device, float32, [nq, topk]) | ||
| */ | ||
| cuvsError_t cuvsCagraSearchMultiSegment(cuvsResources_t res, |
There was a problem hiding this comment.
Just to align on nomenclature a bit, I wonder if we can think of a more general name. Maybe "Partition"? Segment is pretty closely coupled to databases, and more specifically to LSM-based databases, but cuVS the library is more general that that. cuVS is at the level of "hash partitioning" or "blind sharding" (those are the terms we tend to use in this context). I think "MultiPartition" would be a more fitting name.
There was a problem hiding this comment.
Aligning to "partition" for now. FYI, also considered: "MultiShard", "MultiIndex", "Federated", but these might come with unintended connotations.
| * @param[out] out_idx DLManagedTensor* shape [1, k], int64, device memory | ||
| * @return cuvsError_t | ||
| */ | ||
| cuvsError_t cuvsSelectK(cuvsResources_t res, |
There was a problem hiding this comment.
Oh this is great. I was just working on code examples for the new docs and realized we only have a C++ API for select_k. It'll be great to get the C APis, and later on the Python and other language wrappers for select-k.
There was a problem hiding this comment.
With the refactoring prompted by your other comment, select-k is no longer needed for this work. Leaving the C API intact in case it might be useful to others.
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices, | ||
| const std::vector<raft::device_matrix_view<const float, int64_t, raft::row_major>>& queries, | ||
| const std::vector<raft::device_matrix_view<int64_t, int64_t, raft::row_major>>& neighbors, |
There was a problem hiding this comment.
It seems a bit odd to return a vector of outputs instead of a single output. Can't we perform the reduction for the final single neighborhood outputs?
There was a problem hiding this comment.
Refactored this API to return the global top-k results (i.e. after select_k) instead of returning per-partition top-k results which then have to be reduced via select_k.
When libcuvs_c.so is built with cudart_static, CUDA symbols are embedded but not exported to the dynamic symbol table. Panama FFI's SymbolLookup cannot resolve them, causing an UnsatisfiedLinkError on cudaMemcpyAsync at initialization time. Add a static initializer in Util that loads libcudart.so before any CUDA symbol resolution occurs. The loader tries in order: java.library.path (via System.loadLibrary), each entry in LD_LIBRARY_PATH, and /usr/local/cuda/lib64 as a final fallback. If all attempts fail, a descriptive warning is logged. Builds with dynamic CUDA linkage are unaffected.
Extend cuvsCagraSearchMultiSegment with a cuvsFilter parameter so callers can apply a bitset prefilter. C/C++ API changes: - Add MULTI_SEGMENT_BITSET to cuvsFilterType enum - Define cuvsMultiSegmentBitsetFilter (combined_bitset + segment_offsets) in common.h alongside the rest of the filter API - Add cuvsFilter parameter to cuvsCagraSearchMultiSegment in both the C and C++ headers and implement dispatch in cagra.cpp / cagra.cuh Java changes: - Add FilterBitsetHandle: packs per-segment host bitsets and manages a single shared device allocation uploaded lazily on first use via double-checked locking - Update MultiSegmentCagraSearch to accept a FilterBitsetHandle and pass the corresponding cuvsMultiSegmentBitsetFilter to the native search API
Extends generate_groundtruth with optional bitset prefiltering so that exact-NN ground truth can be computed over a filtered subset of the dataset. New function create_bitset_filter(n_samples, filter_reject_rate) returns a packed uint32 numpy array (LSB-first, matching cuVS bitset layout) where bit i is set iff vector i passes the filter. Uses a modulo-1000 bucket scheme: vector i passes when i % 1000 >= round(filter_reject_rate * 1000), giving a reject rate within 0.1% of the requested value. calc_truth gains an optional bitset parameter: - GPU path: slices the packed words for each batch, transfers to device, and passes as a cuVS bitset prefilter to brute_force.search - CPU path: unpacks the same words to a boolean accept_mask and masks rejected distances to inf before argpartition Two new mutually exclusive CLI arguments: - --bitset <path>: load a pre-saved .npy uint32 bitset from file - --filter_reject_rate <float>: generate a bitset in memory via create_bitset_filter
Previously the public search_multi_partition API exposed an intermediate result form: callers passed per-partition vectors of matrix views for queries, neighbors, and distances, and were responsible for merging the per-partition top-k themselves (the cuvs-lucene wrapper allocated a concatenated [num_partitions × k] buffer, called select_k, and decoded positions into (partition_index, ordinal) pairs). This commit moves all of that into cuVS. The merged form expresses "search this queries matrix against every partition; return the global top-k per query" with no intermediate result visible to callers. C++ API (8 overloads in cagra.hpp): - Single raft::device_matrix_view<const T, ...> queries [n_queries, dim] - New output partition_ids [n_queries, k] (uint32) identifying which partition each neighbor came from - Single neighbors [n_queries, k] of OutputIdxT (ordinal in the corresponding partition's dataset) - Single distances [n_queries, k] of post-processed values C API (cuvsCagraSearchMultiPartition): - DLManagedTensor* for queries/partition_ids/neighbors/distances (no more DLManagedTensor** arrays) Detail implementation (cagra_search.cuh): - Allocates an internal [num_partitions, n_queries, k] intermediate buffer for neighbors and distances from the workspace pool - Runs per-partition kernel into this buffer - Applies per-partition distance post-processing on the intermediate buffer (scale + metric transform, preserving per-partition metric) - Transposes intermediate distances to [n_queries, num_partitions * k] via raft::linalg::map_offset for a batched raft::matrix::select_k - Decodes select_k positions into partition_ids and neighbors using raft::linalg::map / map_offset Kernel layer: - multi_partition_desc_t slimmed to (dataset_desc, graph, graph_degree); queries_ptr and result pointers are now shared kernel parameters, removing redundant per-partition state Java: - MultiPartitionCagraSearch.search(...) takes a single CagraQuery (not List<CagraQuery>); body collapses to three small allocations and a single native call - Slow-path BitSet-from-CagraQuery handling removed; FilterBitsetHandle is now the only filter mechanism - BufferedCagraSearch interface slimmed to getIndexHandle(); the searchIntoBuffer implementation in CagraIndexImpl is removed
…_size search_multi_partition previously asked each partition's SINGLE_CTA kernel to write `topk` neighbors into the intermediate buffer, regardless of the plan's itopk_size. When the caller requested topk > itopk_size, the kernel's shared-memory result buffer (sized itopk_size + max_candidates) was smaller than topk and the result-extraction loop in search_core read out of bounds, producing undefined behavior. This made the existing `plan.itopk_size * num_partitions >= topk` feasibility check load-bearing only on paper — it permitted configurations the kernel could not actually execute. Decouple the kernel's per-partition output count from the caller's global topk inside search_multi_partition: - Introduce `per_partition_topk = min(topk, plan.itopk_size)` - Size the [num_partitions, n_queries, *] intermediate buffer to per_partition_topk (not topk) so the kernel stays within its shared-memory budget - Launch run_multi_partition with per_partition_topk - Adjust post-processing slice widths, transpose row/partition strides, and the select_k input width to use per_partition_topk - Update position-decode arithmetic: partition index is pos / per_partition_topk and intra-partition slot is pos % per_partition_topk; output stride remains the global topk When topk <= itopk_size, per_partition_topk == topk and every buffer size, stride, and decode expression reduces to the prior values — the common case is bit-identical. When topk > itopk_size, each partition emits up to itopk_size candidates and the cross-partition select_k assembles the final global top-k from num_partitions * itopk_size candidates, which the existing feasibility check guarantees is at least topk. No changes to search_core, search_kernel_mp, run_multi_partition's signature, or the single-partition search() path. plan->check(topk) still enforces topk <= itopk_size for non-multi-partition callers.
…on itopk_size" This reverts commit 12d7f55.
search_multi_partition previously rewrote params.algo to SINGLE_CTA unconditionally, which silently downgraded MULTI_CTA and AUTO callers regardless of their intent and made the multi-partition path's algo selection invisible to the rest of the code. Replace the override with explicit algo resolution: - Reject MULTI_CTA with RAFT_FAIL. The multi-partition merge below assumes a single contiguous per-partition output slice from one plan type; MULTI_CTA's plan and result layout are not compatible. - Resolve AUTO based on itopk_size. SINGLE_CTA's hard itopk cap is 512; above that the search would fail at plan construction, so AUTO routes to MULTI_KERNEL. At or below 512 it stays on SINGLE_CTA (the existing default for this path). - Hold MULTI_KERNEL with RAFT_FAIL "not yet implemented." The fused multi-partition MULTI_KERNEL path is staged; the dispatch wiring lands here so the remaining stages only have to fill in behavior.
Stage 1 set up the algo dispatch in search_multi_partition with a hard fail on MULTI_KERNEL. This commit lands the actual MULTI_KERNEL path as a sequential per-partition shim, letting MULTI_KERNEL serve as the multi-partition workhorse when the caller needs itopk_size beyond SINGLE_CTA's 512 cap. Refactor of search_multi_partition: - Lift plan_desc construction, stream acquisition, and intermediate buffer allocation out of the SINGLE_CTA-specific block. Both algos share these now; the buffer layout [num_partitions, n_queries, topk] is algo-agnostic and the post-processing reads it the same way. - Replace the linear body with an `if (SINGLE_CTA) ... else /* MULTI_KERNEL */ ...` dispatch. The SINGLE_CTA branch retains the existing plan + feasibility check + per-partition descriptor setup + fused kernel launch, bit-identical to before. MULTI_KERNEL branch: - constexpr if guards out the filtered case with a clear RAFT_FAIL. Filter support lands alongside kernel-fusion work, where multi_partition_bitset_filter falls into place via blockIdx.z (the same mechanism SINGLE_CTA already uses). - For each partition, build a partition-specific dataset_desc via dataset_descriptor_init_with_cache and construct a fresh multi_kernel_search::search with that partition's dataset_size and graph_degree. Call operator() with the partition's graph view, writing into the partition's slice of the intermediate buffer. Per-partition plan construction is the simplest path to a functionally correct sequential implementation: each partition's distance compute uses that partition's own dataset descriptor, which a single shared plan cannot do (a shared plan stores one dataset_desc, and reusing it across partitions reads OOB when dataset sizes differ). Cost is O(num_partitions) buffer allocations per search call; the descriptor cache absorbs repeat plan_desc construction on hot indexes, so steady-state penalty is mostly first-query. This is throwaway shim code — the eventual kernel-fusion path replaces it with a single plan and partition-aware kernels.
The previous check, `itopk_size * num_partitions >= topk`, permitted configurations where each partition's per-query SINGLE_CTA kernel would be asked to write more topk values than its shared-memory result buffer (sized for itopk_size + search_width * graph_degree entries) can hold. The kernel's result-extraction loop reads past the buffer in that case, producing undefined behavior — typically illegal-address faults visible only mid-search. Replace with `topk <= plan.itopk_size`, matching the check single-partition CAGRA already enforces via plan->check(topk) at search_plan.cuh:367 (same message wording). Configurations needing larger topk should pick MULTI_KERNEL, which routes automatically once itopk_size > 512.
Stage 1 of fusing the multi-partition MULTI_KERNEL search path. Adds four partition-aware kernel variants alongside the existing single- partition kernels — runtime path is unchanged (cagra_search.cuh's MULTI_KERNEL branch still uses the stage-2 sequential per-partition shim). The next commit replaces that shim with a single fused call that drives these kernels via a new run_multi_partition method. Each _mp variant launches with a 3D grid that adds num_partitions as the partition dimension (typically blockIdx.z), reads per-partition data from a multi_partition_desc_t array indexed by that dimension, and uses row = partition_id * num_queries + query_id to index the per-(query, partition) slice of result, hashmap, and parent buffers. - multi_partition_desc_t: per-partition descriptor (dataset_desc, graph, graph_degree). Duplicated in the multi_kernel_search namespace rather than reusing single_cta_search's version, so each algo can evolve its descriptor independently. If more consumers emerge, fold into a shared base. - random_pickup_mp_kernel + launcher. - pickup_next_parents_mp_kernel + launcher. terminate_flag stays a single global scalar; same-value writes from multiple (query, partition) work units are race-safe. - compute_distance_to_child_nodes_mp_kernel + launcher. Per-partition graph_degree means the kernel's launch grid is sized for the max across partitions; partitions with smaller graph_degree leave trailing teams inactive. blockIdx.z = partition_id satisfies multi_partition_bitset_filter's contract automatically — filter support falls in without further changes. - remove_parent_bit_mp_kernel + launcher.
Validates two of the four mp expansion kernels by comparing their per-(query, partition) outputs against running the existing single- partition kernels sequentially over each partition's slice. - remove_parent_bit_mp_matches_sequential: confirms the mp variant's row arithmetic clears MSBs in the correct per-(query, partition) rows and leaves trailing entries beyond num_topk untouched. - pickup_next_parents_mp_matches_sequential: confirms parent_list, parent_candidates (with MSB updates marking used parents), and terminate_flag all match the sequential reference. Runs with small_hash_bitlen = 0 to keep the input setup minimal. Tests for random_pickup_mp and compute_distance_to_child_nodes_mp are deferred — those kernels rely on a fully-constructed dataset_descriptor_host, and any row-arithmetic bug surfaces just as clearly as a recall regression in the eventual integration validation against SINGLE_CTA. tests/CMakeLists.txt: registers NEIGHBORS_ANN_CAGRA_MULTI_KERNEL_MP_TEST with cpp/src and cpp/build/src added to its include path so it can reach search_multi_kernel.cuh and the generated compute-distance extension header. Existing tests only include public API headers, so this is the first test target to need internal-source access.
Replaces the per-partition shim (which constructed a fresh multi_kernel_search::search per partition and ran them one at a time) with a fused implementation that drives all partitions in one search call. Each kernel in the iteration loop now launches with a 3D grid (queries, partitions, …); per-(query, partition) state is keyed by row = partition_id * num_queries + query_id.
Reverts the partition-aware MULTI_KERNEL search path. MULTI_KERNEL is a reference implementation, not deeply optimized; single-partition AUTO never selects it, and multi-partition MULTI_KERNEL runs ~10x slower than SINGLE_CTA in practice. The path forward for topk > 512 in multi-partition search is MULTI_CTA, not MULTI_KERNEL. Reverts: 4db0941 Fuse MULTI_KERNEL multi-partition into a single search call aa93362 Add unit tests for partition-aware MULTI_KERNEL kernels 56e99ee Add partition-aware MULTI_KERNEL expansion kernels 37a65d7 Tighten SINGLE_CTA multi-partition feasibility check cf679b9 Enable MULTI_KERNEL in multi-partition CAGRA search 6b4e34f Route multi-partition CAGRA search by params.algo The SINGLE_CTA tightening from 37a65d7 will be re-applied in a follow-up commit.
The previous check, `itopk_size * num_partitions >= topk`, permitted configurations where each partition's per-query SINGLE_CTA kernel would be asked to write more topk values than its shared-memory result buffer (sized for itopk_size + search_width * graph_degree entries) can hold. The kernel's result-extraction loop reads past the buffer in that case, producing undefined behavior — typically illegal-address faults visible only mid-search. Replace with `topk <= plan.itopk_size`, matching the check single-partition CAGRA already enforces via plan->check(topk) at search_plan.cuh:367 (same message wording).
| void search_multi_partition( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices, | ||
| raft::device_matrix_view<const float, int64_t, raft::row_major> queries, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> neighbors, | ||
| raft::device_matrix_view<float, int64_t, raft::row_major> distances, | ||
| const cuvs::neighbors::filtering::base_filter& sample_filter = | ||
| cuvs::neighbors::filtering::none_sample_filter{}); | ||
|
|
||
| void search_multi_partition( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<float, uint32_t>*>& indices, | ||
| raft::device_matrix_view<const float, int64_t, raft::row_major> queries, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids, | ||
| raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors, | ||
| raft::device_matrix_view<float, int64_t, raft::row_major> distances, | ||
| const cuvs::neighbors::filtering::base_filter& sample_filter = | ||
| cuvs::neighbors::filtering::none_sample_filter{}); | ||
|
|
||
| void search_multi_partition( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices, | ||
| raft::device_matrix_view<const half, int64_t, raft::row_major> queries, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> neighbors, | ||
| raft::device_matrix_view<float, int64_t, raft::row_major> distances, | ||
| const cuvs::neighbors::filtering::base_filter& sample_filter = | ||
| cuvs::neighbors::filtering::none_sample_filter{}); | ||
|
|
||
| void search_multi_partition( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<half, uint32_t>*>& indices, | ||
| raft::device_matrix_view<const half, int64_t, raft::row_major> queries, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids, | ||
| raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors, | ||
| raft::device_matrix_view<float, int64_t, raft::row_major> distances, | ||
| const cuvs::neighbors::filtering::base_filter& sample_filter = | ||
| cuvs::neighbors::filtering::none_sample_filter{}); | ||
|
|
||
| void search_multi_partition( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices, | ||
| raft::device_matrix_view<const int8_t, int64_t, raft::row_major> queries, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> neighbors, | ||
| raft::device_matrix_view<float, int64_t, raft::row_major> distances, | ||
| const cuvs::neighbors::filtering::base_filter& sample_filter = | ||
| cuvs::neighbors::filtering::none_sample_filter{}); | ||
|
|
||
| void search_multi_partition( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<int8_t, uint32_t>*>& indices, | ||
| raft::device_matrix_view<const int8_t, int64_t, raft::row_major> queries, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids, | ||
| raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors, | ||
| raft::device_matrix_view<float, int64_t, raft::row_major> distances, | ||
| const cuvs::neighbors::filtering::base_filter& sample_filter = | ||
| cuvs::neighbors::filtering::none_sample_filter{}); | ||
|
|
||
| void search_multi_partition( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices, | ||
| raft::device_matrix_view<const uint8_t, int64_t, raft::row_major> queries, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> neighbors, | ||
| raft::device_matrix_view<float, int64_t, raft::row_major> distances, | ||
| const cuvs::neighbors::filtering::base_filter& sample_filter = | ||
| cuvs::neighbors::filtering::none_sample_filter{}); | ||
|
|
||
| void search_multi_partition( | ||
| raft::resources const& res, | ||
| cuvs::neighbors::cagra::search_params const& params, | ||
| const std::vector<const cuvs::neighbors::cagra::index<uint8_t, uint32_t>*>& indices, | ||
| raft::device_matrix_view<const uint8_t, int64_t, raft::row_major> queries, | ||
| raft::device_matrix_view<uint32_t, int64_t, raft::row_major> partition_ids, | ||
| raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors, | ||
| raft::device_matrix_view<float, int64_t, raft::row_major> distances, | ||
| const cuvs::neighbors::filtering::base_filter& sample_filter = | ||
| cuvs::neighbors::filtering::none_sample_filter{}); | ||
|
|
There was a problem hiding this comment.
Rather than creating a new set of overloads specific for CAGRA, maybe we'd better introduce a new index type, e.g. multi_partition_cagra? This whole thing looks like it can be represented as a "virtual merge" index we've already introduced once.
This has couple benefits:
- Public API stays the same homogeneous - no new search functions for a user to discover - just a new index type and its constructor.
- the new index search would be composable with dynamic batching (which takes some index and calls search on it).
There was a problem hiding this comment.
Thanks, @achirkin, for the insightful comment. Assuming that composite_index is the "virtual merge" index you mentioned, it does look very attractive as an existing infrastructure for multi-partition search. There are two wrinkles that prevent immediate adoption:
composite_indeximplements the per-index search in a for loop, which does not expose partition-level parallelism needed to maximize device utilization when the batch size is 1 as is always the case in cuVS-Lucene. For the current work, we actually had to create multi-partition versions of the entire call stack down to the kernels and device functions (e.g. search_multi_cta_mp_jit). Now, the multi-partition code can in theory entirely subsume the original single-partition code by essentially settingnum_partition=1as needed but I don't think we should make such a sweeping change in this PR.- For cuVS-Lucene, we need to produce the partition-local ordinals of the neighbors instead of the global indices. However,
composite_indexinternally consumes the ordinals to produce only the global indices.
Both issues above can possibly be resolved with additional configuration parameters and corresponding code paths in composite_index.
On the other hand, making multi_partition_cagra its own header probably involves easier changes. Do you think we should make a push for extending/customizing composite_index at this time?
There was a problem hiding this comment.
Rather than creating a new set of overloads specific for CAGRA, maybe we'd better introduce a new index type, e.g. multi_partition_cagra? This whole thing looks like it can be represented as a "virtual merge" index we've already introduced once.
I don't know that I agree with this. This needs to be a phyiscal merge, not a logical merge. This is critical for LSM-based databases and it'll be widely used outside of just Lucene. I'm not saying we can't (and shouldn't) improve the API experience wherever possible, but I'm not convinced the "logical merge" (or composite index) is not the way to do that.
There was a problem hiding this comment.
Hi Corey, I totally agree with you comment. I meant to say structurally the code is very similar and I think it is a good pattern to use for this case too.
There was a problem hiding this comment.
I like your thinking here for sure- if there's a good reusable abstraction for this that can consolidate APIs (ideally without introducing a complex class hierarchy on the cuvs impl side) then I'm all for it. I say "cuvs impl side" because abstractions like device_resources, while they have established a hierarchy on the user side, are just meant to be simple user convenience and we use raft::resources container on the implementation side).
There was a problem hiding this comment.
Following this discussion, I'm planning to refactor this work into a multi_partition_cagra index type that is constructed from a vector of CAGRA indices (like composite_index) and provides a search method that implements the current optimizations (unlike composite_index which doesn't really implement the search at a low-level). Does that sound about right? Does it defeat the purpose somewhat given that the signature of this search method can't conform to the signature common to other index types as we need an output array to indicate the id of the CAGRA index that a result comes from? Thanks for your thoughts.
There was a problem hiding this comment.
Sorry. I would not suggest this design @jamxia155. Let's keep this simple. Index shoukd also not have search methods (we use free functions in cuVS not methods on objects).
Again to reiterate- composite index is going away. We should not be using that as a guide for designs.
There was a problem hiding this comment.
After further discussions with @cjnolet, the plan is to keep things simple and essentially rename search_multi_partition as search for consistency, while allowing for necessary parameters in the signature even if they are not common to other search functions.
… stride The multi-partition post-processing in search_multi_partition assumed the kernel writes exactly `topk` candidates per partition (the SINGLE_CTA mp convention). Generalize this assumption so other algos can emit a different number of candidates per (query, partition) — useful for paths that skip the per-partition top-k merge and rely entirely on the cross-partition select_k. Introduces `per_partition_topk` (= `topk` for SINGLE_CTA, no functional change). Replaces topk-as-stride usages with per_partition_topk: - intermediate buffer sizing and per-partition slice width - per-partition distance post-processing slice shape - transpose row stride and per-partition stride arithmetic - batched select_k input width (num_partitions * per_partition_topk) - position decode: pos / per_partition_topk → partition_id, pos % per_partition_topk → slot within partition The caller-owned output buffers (neighbors, distances, partition_ids, positions_buf) keep stride `topk` because that's the requested output shape. The position-decode lambda now disambiguates the two strides explicitly (topk_i64 for output indexing, per_partition_topk_i64 for intermediate indexing).
Adds a fused multi-partition launcher to multi_cta_search::search and
wires it into cagra::detail::search_multi_partition alongside the
existing SINGLE_CTA path. MULTI_CTA spawns ceildiv(itopk_size, 32) CTAs
per query, so it can serve large itopk pools that SINGLE_CTA's
itopk_size <= 512 cap rejects (e.g. topk=1500).
Kernel: search_kernel_mp runs on a 3D grid
(num_cta_per_query, num_queries, num_partitions). blockIdx.z indexes
multi_partition_desc_t to fetch each partition's dataset_desc, graph,
and graph_degree; shared memory and the result buffer are sized for
the max graph_degree across partitions. The cross-CTA traversed
hashmap is sized for (num_queries * num_partitions) rows so that CTAs
serving the same query across different partitions never collide.
Output layout matches SINGLE_CTA multi-partition:
[num_partitions, num_queries, per_partition_topk] partition-major,
with per_partition_topk = num_cta_per_query * itopk_size for MULTI_CTA
(vs. topk for SINGLE_CTA). No per-partition top-k merge is performed
in the MULTI_CTA path -- the existing cross-partition select_k in
search_multi_partition picks the final global top-k in one pass.
Dispatch in search_multi_partition:
- Forces persistent=false (multi-partition does not use persistent).
- Rejects MULTI_KERNEL explicitly (reference impl, far slower).
- Resolves AUTO to MULTI_CTA when itopk_size > 512, else SINGLE_CTA.
- Branches on algo: each branch builds its own plan, per-partition
descriptor array, and intermediate buffer, then calls
run_multi_partition.
Shared post-processing (per-partition distance transform, cross-
partition select_k, partition_ids decode) operates uniformly on
per_partition_topk and serves both algos unchanged.
Multi-partition search_multi_partition's AUTO resolution previously picked SINGLE_CTA whenever itopk_size <= 512, mirroring only half of single-partition's heuristic in search_plan_impl_base. The other half -- requiring max_queries >= num_sm * 2 before picking SINGLE_CTA -- guards against underfilling the GPU when there are too few queries to saturate the SMs. The same concern applies in multi-partition, but the partition axis already pads the grid: SINGLE_CTA produces (1 * num_queries * num_partitions) CTAs. Scale the gate by num_partitions accordingly: AUTO picks SINGLE_CTA only when max_queries * num_partitions >= num_sm * 2 (and itopk_size <= 512); otherwise it falls through to MULTI_CTA, whose ceildiv(itopk_size, 32) CTAs per query restore occupancy.
…earch Adapt the branch's multi-segment / multi-partition CAGRA additions to the new JIT-LTO kernel infrastructure landed in NVIDIA#1807. After the merge, multi-partition search runs through JIT-linked fragments just like the rest of CAGRA, with parity across filter types (none, bitset, mp_bitset). - Port deleted *_kernel-inl.cuh contents into the JIT layout: device bodies in jit_lto_kernels/*_jit.cuh, .cu.in entry-points, matrix JSONs, fragment tags, planners, factory functions, host launchers, CMake registration. - Introduce mp_bitset_filter_data_t + tag_filter_mp_bitset + matching sample_filter_mp_bitset_impl so multi_partition_bitset_filter is recognized end-to-end without coupling to the standard bitset POD. - Add BitsetT template parameter to search_core so it accepts either cagra_bitset or mp_cagra_bitset without doubling instantiations. - Add CUVS_EXPORT to four C entry points that were silently hidden: cuvsRMMAsyncMemoryResourceEnable, cuvsResourcesSetWorkspacePool, cuvsCagraSearchMultiPartition, cuvsSelectK. - Update JDKProvider.java to drop the stale specific import for cudaStreamSynchronize that jextract has reshuffled into headers_h.








This PR adds a CAGRA search API and associated implementation for batching workloads involving a single query and multiple indexes (e.g. index segments built from one dataset by cuvs-lucene). This promotes GPU utilization for non-batched queries by parallelizing work over multiple CTAs, while minimizing the number of concurrent kernels for potential host-side parallelism.
Addresses cuvs-lucene issue 124, issue 158, and cuvs issue 2166.
Associated cuvs-lucene PR: rapidsai/cuvs-lucene#133.