Introduce Scality AI Connector by oliviergaraud · Pull Request #2 · scality/nixl

oliviergaraud · 2026-06-10T14:53:21Z

What?

Adds a new NIXL OBJ-plugin accelerated engine, "Scality AI Connector", that moves object data to and from a Scality endpoint over RDMA (cuObject DC transport), using a small HTTP request only to carry the command.

The PR includes:

- The engine (src/plugins/obj/rest_accel/scality_ai_connector/): a two-plane design where object bytes move over RDMA (GPU-direct capable) while a tiny HTTP PUT/GET with an empty body carries the cuObject RDMA descriptor in an x-scal-rdma header. The HTTP control-plane client (RestClient) drives all requests with libcurl's multi interface from a single poller thread, dispatching completion callbacks to a worker pool sized by num_threads.
- Build detection: discovers the version-suffixed cuobjclient-<ver> pkg-config name so the accelerated engines build without hard-coding a CUDA version.
- Unit tests: HTTP wire-format tests (correct PUT/GET/HEAD, header, empty body, against a loopback server, no RDMA/cuObject required) and concurrency tests (poller keeps more requests in flight than its thread count; teardown with outstanding requests fires every callback exactly once).
- nixlbench integration: wires up the OBJ accelerated path (flags, REST object pre-population/cleanup, kvbench args) so the engine can be benchmarked, with concurrent (OpenMP) object seeding/teardown.
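
To make the control-plane request concrete, here is a minimal sketch of what one header-only request might look like as a plain value. `ControlRequest` and `makeXferRequest` are hypothetical names for illustration, not the PR's `RestClient` API, and the descriptor is treated as an opaque string because its real format is not shown here.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical value type for one control-plane request (illustration only).
struct ControlRequest {
    std::string method;                         // "PUT" or "GET"
    std::string url;                            // endpoint plus object id
    std::map<std::string, std::string> headers; // carries x-scal-rdma
    std::string body;                           // always empty by design
};

// Build a header-only transfer request. `rdma_descriptor` stands in for the
// opaque cuObject descriptor; its real encoding is an assumption left open.
inline ControlRequest
makeXferRequest(const std::string &method,
                const std::string &endpoint,
                const std::string &object_id,
                const std::string &rdma_descriptor) {
    assert(method == "PUT" || method == "GET");
    ControlRequest req;
    req.method = method;
    req.url = endpoint + "/" + object_id;
    req.headers["x-scal-rdma"] = rdma_descriptor;
    return req; // body is left empty: object bytes travel over RDMA
}
```

A wire-format test in the spirit of the ones described above would then check that `body` is empty and that the `x-scal-rdma` header is present for every PUT and GET.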

Why?

Scality's AI Connector speaks its own REST dialect and is not S3-compatible, so the existing s3/s3_crt engines cannot target it. This engine lets NIXL move large buffers (often in GPU memory) to and from a Scality store over RDMA/GPUDirect, with HTTP used purely as a lightweight control plane. The nixlbench and test changes make the new path measurable and verifiable in CI without requiring RDMA hardware.

How?

- Two independent channels. Data plane is RDMA (zero-copy, GPU-direct); control plane is an HTTP request with an empty body whose x-scal-rdma header carries the RDMA token. Scality performs the actual RDMA transfer against the registered buffer.
- Concurrency. A single dedicated poller thread owns the curl multi handle and runs the non-blocking event loop, so in-flight request count is decoupled from the thread count (curl only carries the tiny command; bulk data is on RDMA). Completion callbacks are offloaded to a num_threads-sized worker pool so a slow callback cannot stall the loop.
- Multi-GPU correctness. registerMem/deregisterMem set the current CUDA device to the buffer's GPU before cuMemObjGet/PutDescriptor, so multi-GPU registrations do not fail cuFile's device-memory check.
- DC-only transport via a thin CuObjRdmaTokenClient wrapper over NVIDIA cuObjClient (CUOBJ_PROTO_RDMA_DC_V1), behind the iRdmaTokenClient seam (also the test-injection point). The cuObject stack is required at build and run time; without it the engine is not compiled in.
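
The concurrency split described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's code: the curl multi event loop is replaced by a plain loop that "completes" requests, and `WorkerPool` and `runDemo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal worker pool: the poller posts completion callbacks here so that a
// slow callback never blocks the poller itself.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t num_threads) {
        assert(num_threads > 0);
        for (std::size_t i = 0; i < num_threads; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Drains every queued callback, then joins the workers.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto &t : workers_) t.join();
    }

    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return; // stopping and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // runs outside the lock
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Stand-in for the poller: one thread "completes" `in_flight` requests and
// hands each callback to the pool. Returns how many callbacks actually ran.
inline int runDemo(int in_flight, std::size_t num_threads) {
    std::atomic<int> fired{0};
    {
        WorkerPool pool(num_threads);
        std::thread poller([&] {
            for (int i = 0; i < in_flight; ++i)
                pool.post([&fired] { fired.fetch_add(1); });
        });
        poller.join();
    } // pool destructor drains the queue before returning
    return fired.load();
}
```

Calling `runDemo(64, 2)` posts 64 callbacks through a two-thread pool. The number of requests handled by the single poller is independent of the pool size, which is the property the design relies on.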

borisfaure · 2026-06-10T15:05:52Z

+|---|---|---|---|
+| `accelerated` | yes | n/a | Must be `"true"` to request an accelerated engine |
+| `type` | yes | n/a | Must be `"scality_ai_connector"` to select this engine |
+| `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:81` |


Suggested change

| `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:81` |

| `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:10000` |

borisfaure · 2026-06-10T15:43:51Z

+| `rdma_token_client.h` | `iRdmaTokenClient`: the data-plane interface and the `getCtx()` helper. |
+| `cuobj_rdma_token_client.{h,cpp}` | `CuObjRdmaTokenClient`: the DC RDMA implementation, a thin wrapper over NVIDIA `cuObjClient` (`CUOBJ_PROTO_RDMA_DC_V1`). |
+
+This is a **DC-only** build. The `iRdmaTokenClient` abstraction is the seam


I don't understand this sentence.

I'll remove it.
I've kept the iRdmaTokenClient abstraction for our RC implementation that we don't want to expose (it's useful for local development but it doesn't scale).

Mrkgoud · 2026-06-11T08:02:10Z

+- **Data plane (RDMA):** the object bytes move directly between the local
+  buffer and Scality over RDMA, zero-copy and GPU-direct capable, so data in
+  GPU memory never needs a staging copy through host memory.
+- **Control plane (HTTP):** a tiny HTTP request tells Scality *what* to do


"Tiny" or "small"? Above it was "small".

Mrkgoud · 2026-06-11T08:51:11Z

+
+# Scality AI Connector
+
+A NIXL **OBJ-plugin engine** that moves object data to and from a Scality


Suggested change

A NIXL **OBJ-plugin engine** that moves object data to and from a Scality

A NIXL (NVIDIA Inference Xfer Library) **OBJ-plugin engine** that moves object data to and from a Scality

Mrkgoud · 2026-06-11T08:52:50Z

+The connector splits each transfer into two independent channels:
+
+- **Data plane (RDMA):** the object bytes move directly between the local
+  buffer and Scality over RDMA, zero-copy and GPU-direct capable, so data in


Suggested change

buffer and Scality over RDMA, zero-copy and GPU-direct capable, so data in

buffer and Scality over RDMA, zero-copy and GPUDirect-capable, so data in

Mrkgoud · 2026-06-11T08:59:46Z

+interface, so the in-flight request count does not depend on `num_threads` (the
+bulk data moves over RDMA; curl only carries the tiny command). Each completed
+request's callback runs in a worker pool sized by `num_threads`, so a slow
+callback cannot stall the poller. Size `num_threads` to how much work the


Suggested change

callback cannot stall the poller. Size `num_threads` to how much work the

callback cannot stall the poller. Size `num_threads` according to the amount of work each

alternative but feel free to dismiss

Mrkgoud · 2026-06-11T09:00:48Z

+bulk data moves over RDMA; curl only carries the tiny command). Each completed
+request's callback runs in a worker pool sized by `num_threads`, so a slow
+callback cannot stall the poller. Size `num_threads` to how much work the
+callbacks do; the default suits cheap callbacks.


Suggested change

callbacks do; the default suits cheap callbacks.

callback does. The default value is sufficient for lightweight callbacks.

Avoid "cheap"; it can read as pejorative.

Mrkgoud · 2026-06-11T09:13:43Z

+
+## Troubleshooting
+
+| Symptom (in logs) | Likely cause / fix |


GiorgioRegni · 2026-06-12T21:24:36Z

+
+# Scality AI Connector
+
+A NIXL **OBJ-plugin engine** that moves object data to and from a Scality


A plugin engine for the OBJ backend of NIXL, the NVIDIA Inference Xfer Library: one API for moving data between HBM, DRAM, NVMe, file, and object storage, with pluggable backends.

This engine connects NIXL to a Scality endpoint and splits each transfer into two planes: the data plane is RDMA; the control plane is a header-only HTTP request that names the operation and the object key. Object bytes never travel over HTTP.

GiorgioRegni · 2026-06-12T21:29:26Z

+
+## Prerequisites
+
+- **NVIDIA cuObject / GPUDirect Storage** installed and working. This provides


Is there a specific/minimum version we require?

Nothing special on our side. This will work with CUDA 13+ (which is what NIXL requires).

GiorgioRegni · 2026-06-12T21:30:56Z

+# Scality AI Connector
+
+A NIXL **OBJ-plugin engine** that moves object data to and from a Scality
+endpoint over **RDMA**, using a small HTTP request only to carry the command.


I sometimes see "small" and sometimes "tiny"; I propose "header-only HTTP request".

GiorgioRegni · 2026-06-12T21:32:22Z

+
+- **DC transport only.** There is no fallback to other RDMA transports.
+- **4 GiB** maximum per memory registration (a cuObject limit).
+- Data is **never** sent over HTTP; the HTTP body is always empty by design.


That's not a limitation; it's the reason this is being developed. It's more of a design guarantee.

GiorgioRegni · 2026-06-12T21:33:44Z

+- A reachable **Scality AI Connector REST endpoint**.
+- Build dependencies: `cuobjclient` and `libcurl`.
+
+## Configuration


We need a **Building** section to explain how to build it from source, no?

GiorgioRegni · 2026-06-12T21:37:33Z

+| `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:81` |
+| `num_threads` | no | `max(2, cpu_threads / 4)` | Size of the callback worker pool (see [Concurrency model](#concurrency-model)). |
+
+Requests are issued to `{endpoint_override}/{key}`.


Also, what about key generation? I thought it was by path, but it's by key?
If it's by key, we need to explain the key format.

Also, what are the parameters for object size? Is there a max size? What is our recommended object size? This should be explained as well.

It is by path. I've changed the term "key" to "object id", hoping to avoid confusion.

The max object size is 4 GiB (the size is encoded in 32 bits in the descriptor generated by cuObject).
I think on our side we're going to use this with the new striping ring driver, which avoids the 512 MiB limit that we have on the ring driver chord.
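
The size limit mentioned in this thread can be illustrated with a small guard. This is a sketch that assumes the only constraint is that the length must fit in an unsigned 32-bit field; the thread does not say whether a length of exactly 4 GiB is accepted, and the sketch rejects it. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True when `size_bytes` can be represented in a 32-bit length field.
// Assumption: the descriptor stores the length as an unsigned 32-bit value.
inline bool fitsDescriptorLength(std::uint64_t size_bytes) {
    return size_bytes <= std::numeric_limits<std::uint32_t>::max();
}

// Narrow a validated size to the 32-bit field; callers check the size first.
inline std::uint32_t toDescriptorLength(std::uint64_t size_bytes) {
    assert(fitsDescriptorLength(size_bytes));
    return static_cast<std::uint32_t>(size_bytes);
}
```

A caller would reject or split any buffer for which `fitsDescriptorLength` returns false before attempting a registration.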

cuObjClient ships its pkg-config file with a CUDA-version suffix (e.g. cuobjclient-13.3). Detect the highest versioned name so the OBJ accelerated engines build without hard-coding a version. Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
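
The PR performs this detection in the build system. The comparison itself can be sketched in C++ as follows, assuming candidate names end in a dotted numeric suffix after the last `-`. `versionOf` and `highestVersioned` are hypothetical helpers, not part of the PR. The point of the sketch is that versions are compared numerically, so 13.10 ranks above 13.3.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parse the dotted numeric suffix after the last '-'.
// "cuobjclient-13.3" -> {13, 3}; names without such a suffix -> {}.
inline std::vector<int> versionOf(const std::string &name) {
    std::vector<int> parts;
    const std::size_t dash = name.rfind('-');
    if (dash == std::string::npos) return parts;
    std::istringstream in(name.substr(dash + 1));
    std::string field;
    while (std::getline(in, field, '.')) {
        if (field.empty() ||
            field.find_first_not_of("0123456789") != std::string::npos)
            return {}; // not a purely numeric version
        parts.push_back(std::stoi(field));
    }
    return parts;
}

// Return the name with the highest version, or "" if none is versioned.
inline std::string highestVersioned(const std::vector<std::string> &names) {
    std::string best;
    std::vector<int> best_version;
    for (const auto &n : names) {
        const std::vector<int> v = versionOf(n);
        // std::vector compares lexicographically, element by element.
        if (!v.empty() && (best.empty() || v > best_version)) {
            best = n;
            best_version = v;
        }
    }
    assert(best.empty() == best_version.empty());
    return best;
}
```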

Add an OBJ-plugin accelerated engine that moves object data to/from a Scality AI Connector endpoint over RDMA (cuObject DC transport), using a small HTTP request to carry the cuObject descriptor in an x-scal-rdma header. The data plane is RDMA (GPU-direct capable); the HTTP control plane has an empty body. The HTTP control-plane client (RestClient) drives all requests with libcurl's multi interface from a single dedicated poller thread; completion callbacks are dispatched to a small worker pool. num_threads sizes that pool. registerMem/deregisterMem set the current CUDA device to the buffer's GPU before cuMemObjGetDescriptor/PutDescriptor so multi-GPU registrations don't fail cuFile's device-memory check. cuda_dep is linked for the runtime API. Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>

Wire-format tests assert the connector sends correct PUT/GET/HEAD requests (URL, x-scal-rdma header, empty body) against a loopback server, with no RDMA/cuObject needed. Concurrency tests prove the curl_multi poller keeps more requests in flight than its thread count and that teardown with outstanding requests fires every callback exactly once. Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>

Seed objects for READ benchmarks concurrently across an OpenMP region. Each putObj is one network-bound HTTP PUT. Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>

Wire up the OBJ accelerated path in nixlbench (flags, REST object pre-population/cleanup helpers, kvbench args) so the Scality AI Connector engine can be benchmarked. Allocate VRAM on the requested CUDA device so multi-GPU runs spread across devices. Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>

cuObject bakes a single primary device GID (one NIC) into each RDMA token, chosen from the current CUDA device at registration time. Add a targetCudaDevice() helper (device_select.h) and wire it into register / deregister so each MR binds to the right device: VRAM to the buffer's own GPU, DRAM round-robin across GPUs so host buffers fan across NICs. The CUDA device count is queried once at construction (gpuCount_, 0 when no GPU is present, in which case the current device is left unchanged). Add device_select_test.cpp covering the VRAM / DRAM / no-GPU cases. Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
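
The selection rule in this commit message can be sketched as below. This is an illustration only: `DeviceSelector`, `MemKind`, and `kKeepCurrent` are hypothetical names, the real `targetCudaDevice()` signature is not shown here, and the round-robin is assumed to be a simple counter. The sketch is not thread-safe.

```cpp
#include <cassert>

enum class MemKind { Vram, Dram };

// Sentinel meaning "leave the current CUDA device unchanged".
constexpr int kKeepCurrent = -1;

// Hypothetical selector mirroring the three cases in the commit message.
class DeviceSelector {
public:
    explicit DeviceSelector(int gpu_count) : gpu_count_(gpu_count) {
        assert(gpu_count >= 0);
    }

    // VRAM: the buffer's own GPU. DRAM: round-robin across GPUs so host
    // buffers fan out across NICs. No GPU: keep the current device.
    int target(MemKind kind, int buffer_gpu) {
        if (gpu_count_ == 0) return kKeepCurrent;
        if (kind == MemKind::Vram) return buffer_gpu;
        const int dev = next_dram_;
        next_dram_ = (next_dram_ + 1) % gpu_count_;
        return dev;
    }

private:
    int gpu_count_;
    int next_dram_ = 0;
};
```

With two GPUs, successive DRAM registrations would map to devices 0, 1, 0, and so on, while a VRAM buffer on GPU 1 always maps to device 1.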

- rename client.{h,cpp} to rest_client.{h,cpp} to match RestClient
- split RestClient::finishRequest into dispatchHeadResult / dispatchXferResult; guard empty callbacks before posting and warn (not swallow) on callback exceptions
- make iRestClient / iRdmaTokenClient destructors protected non-virtual (only ever owned via shared_ptr) and add [[nodiscard]] to query methods
- make ScalityObjEngineImpl::cuClient_ / connectorClient_ const, initialized in the member initializer list via makeConnectorClient
- pass connector_client by const reference
- drop the C-style typedef on rdma_ctx_t and redundant string initializers
- minor const / comment cleanups

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>

oliviergaraud force-pushed the scality-ai-connector-dc branch from 3ed2c14 to 9994385 Compare June 10, 2026 15:41

borisfaure reviewed Jun 10, 2026


oliviergaraud force-pushed the scality-ai-connector-dc branch from 9994385 to 76fbc75 Compare June 10, 2026 15:48

Mrkgoud reviewed Jun 11, 2026


GiorgioRegni reviewed Jun 12, 2026


oliviergaraud force-pushed the scality-ai-connector-dc branch from 76fbc75 to e484f58 Compare June 15, 2026 15:21

oliviergaraud force-pushed the main branch from 76db311 to 82d78d6 Compare June 15, 2026 15:21

oliviergaraud force-pushed the scality-ai-connector-dc branch 3 times, most recently from 2c5f4af to e22aa78 Compare June 15, 2026 17:26

oliviergaraud added 3 commits June 16, 2026 08:13

perf(nixlbench): parallelize OBJ object pre-population

0f97863


oliviergaraud force-pushed the scality-ai-connector-dc branch from e22aa78 to de07520 Compare June 16, 2026 06:14

oliviergaraud force-pushed the scality-ai-connector-dc branch from a724a32 to f5e3a26 Compare June 16, 2026 09:06

oliviergaraud added 2 commits June 25, 2026 17:17


Suggested change

| Symptom (in logs) | Likely cause / fix |

| Symptom (in logs) | Potential cause / fix |

