Introduce Scality AI Connector #2

Open
oliviergaraud wants to merge 7 commits into main from scality-ai-connector-dc
@oliviergaraud

What?

Adds a new NIXL OBJ-plugin accelerated engine, "Scality AI Connector", that moves object data to and from a Scality endpoint over RDMA (cuObject DC transport), using a small HTTP request only to carry the command.

The PR includes:

  • The engine (src/plugins/obj/rest_accel/scality_ai_connector/): a two-plane design where object bytes move over RDMA (GPU-direct capable) while a tiny HTTP PUT/GET with an empty body carries the cuObject RDMA descriptor in an x-scal-rdma header. The HTTP control-plane client (RestClient) drives all requests with libcurl's multi interface from a single poller thread, dispatching completion callbacks to a worker pool sized by num_threads.
  • Build detection: discovers the version-suffixed cuobjclient-<ver> pkg-config name so the accelerated engines build without hard-coding a CUDA version.
  • Unit tests: HTTP wire-format tests (correct PUT/GET/HEAD, header, empty body, against a loopback server, no RDMA/cuObject required) and concurrency tests (poller keeps more requests in flight than its thread count; teardown with outstanding requests fires every callback exactly once).
  • nixlbench integration: wires up the OBJ accelerated path (flags, REST object pre-population/cleanup, kvbench args) so the engine can be benchmarked, with concurrent (OpenMP) object seeding/teardown.

Why?

Scality's AI Connector speaks its own REST dialect and is not S3-compatible, so the existing s3/s3_crt engines cannot target it. This engine lets NIXL move large buffers (often in GPU memory) to and from a Scality store over RDMA/GPUDirect, with HTTP used purely as a lightweight control plane. The nixlbench and test changes make the new path measurable and verifiable in CI without requiring RDMA hardware.

How?

  • Two independent channels. Data plane is RDMA (zero-copy, GPU-direct); control plane is an HTTP request with an empty body whose x-scal-rdma header carries the RDMA token. Scality performs the actual RDMA transfer against the registered buffer.
  • Concurrency. A single dedicated poller thread owns the curl multi handle and runs the non-blocking event loop, so in-flight request count is decoupled from the thread count (curl only carries the tiny command; bulk data is on RDMA). Completion callbacks are offloaded to a num_threads-sized worker pool so a slow callback cannot stall the loop.
  • Multi-GPU correctness. registerMem/deregisterMem set the current CUDA device to the buffer's GPU before cuMemObjGet/PutDescriptor, so multi-GPU registrations do not fail cuFile's device-memory check.
  • DC-only transport via a thin CuObjRdmaTokenClient wrapper over NVIDIA cuObjClient (CUOBJ_PROTO_RDMA_DC_V1), behind the iRdmaTokenClient seam (also the test-injection point). The cuObject stack is required at build and run time; without it the engine is not compiled in.
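The concurrency bullet above can be sketched with the standard library alone. This is not the PR's `RestClient`: the libcurl multi-interface event loop is omitted, and `CallbackPool`, `post`, and `defaultNumThreads` are hypothetical names. Whichever thread calls `post()` plays the role of the poller, and the destructor illustrates one way to meet the "every callback exactly once" teardown property, assuming no callback is posted after destruction begins.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the callback worker pool; not the PR's code.
class CallbackPool {
public:
    // Starts at least one worker so posted callbacks can always run.
    explicit CallbackPool(std::size_t num_threads) {
        const std::size_t count = std::max<std::size_t>(1, num_threads);
        for (std::size_t i = 0; i < count; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    // Workers keep draining until the queue is empty, so callbacks posted
    // before destruction still run exactly once.
    ~CallbackPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto &worker : workers_) worker.join();
    }

    CallbackPool(const CallbackPool &) = delete;
    CallbackPool &operator=(const CallbackPool &) = delete;

    // Called by the poller when a request completes. It only enqueues, so a
    // slow callback never blocks the event loop.
    void post(std::function<void()> callback) {
        if (!callback) return; // ignore empty callbacks
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(callback));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return; // stopping and nothing left to run
                job = std::move(queue_.front());
                queue_.pop();
            }
            job(); // runs outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};

// Default pool size as quoted in the README table: max(2, cpu_threads / 4).
inline std::size_t defaultNumThreads(std::size_t cpu_threads) {
    return std::max<std::size_t>(2, cpu_threads / 4);
}
```

Because the pool size only bounds how many callbacks run at once, the number of requests in flight can exceed it, which is the decoupling the description refers to.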

@oliviergaraud oliviergaraud force-pushed the scality-ai-connector-dc branch from 3ed2c14 to 9994385 Compare June 10, 2026 15:41
|---|---|---|---|
| `accelerated` | yes | n/a | Must be `"true"` to request an accelerated engine |
| `type` | yes | n/a | Must be `"scality_ai_connector"` to select this engine |
| `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:81` |

Suggested change
| `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:81` |
| `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:10000` |

| `rdma_token_client.h` | `iRdmaTokenClient`: the data-plane interface and the `getCtx()` helper. |
| `cuobj_rdma_token_client.{h,cpp}` | `CuObjRdmaTokenClient`: the DC RDMA implementation, a thin wrapper over NVIDIA `cuObjClient` (`CUOBJ_PROTO_RDMA_DC_V1`). |

This is a **DC-only** build. The `iRdmaTokenClient` abstraction is the seam

I don't understand this sentence.

Author

I'll remove it.
I've kept the iRdmaTokenClient abstraction for our RC implementation that we don't want to expose (it's useful for local development but it doesn't scale).

@oliviergaraud oliviergaraud force-pushed the scality-ai-connector-dc branch from 9994385 to 76fbc75 Compare June 10, 2026 15:48
- **Data plane (RDMA):** the object bytes move directly between the local
buffer and Scality over RDMA, zero-copy and GPU-direct capable, so data in
GPU memory never needs a staging copy through host memory.
- **Control plane (HTTP):** a tiny HTTP request tells Scality *what* to do

"Tiny" or "small"? Above it was "small".


# Scality AI Connector

A NIXL **OBJ-plugin engine** that moves object data to and from a Scality

Suggested change
A NIXL **OBJ-plugin engine** that moves object data to and from a Scality
A NIXL (NVIDIA Inference Xfer Library) **OBJ-plugin engine** that moves object data to and from a Scality

The connector splits each transfer into two independent channels:

- **Data plane (RDMA):** the object bytes move directly between the local
buffer and Scality over RDMA, zero-copy and GPU-direct capable, so data in

Suggested change
buffer and Scality over RDMA, zero-copy and GPU-direct capable, so data in
buffer and Scality over RDMA, zero-copy and GPUDirect-capable, so data in

interface, so the in-flight request count does not depend on `num_threads` (the
bulk data moves over RDMA; curl only carries the tiny command). Each completed
request's callback runs in a worker pool sized by `num_threads`, so a slow
callback cannot stall the poller. Size `num_threads` to how much work the

Suggested change
callback cannot stall the poller. Size `num_threads` to how much work the
callback cannot stall the poller. Size `num_threads` according to the amount of work each

alternative but feel free to dismiss

bulk data moves over RDMA; curl only carries the tiny command). Each completed
request's callback runs in a worker pool sized by `num_threads`, so a slow
callback cannot stall the poller. Size `num_threads` to how much work the
callbacks do; the default suits cheap callbacks.

Suggested change
callbacks do; the default suits cheap callbacks.
callback does. The default value is sufficient for lightweight callbacks.

Avoid the word "cheap"; it can feel pejorative.


## Troubleshooting

| Symptom (in logs) | Likely cause / fix |

Suggested change
| Symptom (in logs) | Likely cause / fix |
| Symptom (in logs) | Potential cause / fix |


# Scality AI Connector

A NIXL **OBJ-plugin engine** that moves object data to and from a Scality

@GiorgioRegni Jun 12, 2026

A plugin engine for the OBJ backend of NIXL, the NVIDIA Inference Xfer
Library: one API for moving data between HBM, DRAM, NVMe, file, and object
storage, with pluggable backends.

This engine connects NIXL to a Scality endpoint and splits each transfer
into two planes: the data plane is RDMA; the control plane is a header-only
HTTP request that names the operation and the object key. Object bytes never
travel over HTTP.


## Prerequisites

- **NVIDIA cuObject / GPUDirect Storage** installed and working. This provides

Is there a specific/minimum version we require?

Author

Nothing special on our side. This will work with CUDA 13+ (which is what NIXL requires).

# Scality AI Connector

A NIXL **OBJ-plugin engine** that moves object data to and from a Scality
endpoint over **RDMA**, using a small HTTP request only to carry the command.

I sometimes see "small" and sometimes "tiny"; I propose "header-only HTTP request".


- **DC transport only.** There is no fallback to other RDMA transports.
- **4 GiB** maximum per memory registration (a cuObject limit).
- Data is **never** sent over HTTP; the HTTP body is always empty by design.

That's not a limitation; it's the reason the connector is being developed. It's more of a design guarantee.

- A reachable **Scality AI Connector REST endpoint**.
- Build dependencies: `cuobjclient` and `libcurl`.

## Configuration

We need a **Building** section to explain how to build it from source, no?

| `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:81` |
| `num_threads` | no | `max(2, cpu_threads / 4)` | Size of the callback worker pool (see [Concurrency model](#concurrency-model)). |

Requests are issued to `{endpoint_override}/{key}`.
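As a rough illustration of the request shape described in this PR (URL of the form `{endpoint_override}/{key}`, an `x-scal-rdma` header carrying the cuObject descriptor, and an empty body), here is a minimal sketch. `ControlRequest` and `makeXferRequest` are hypothetical names, the descriptor is treated as an opaque string, and mapping writes to PUT and reads to GET is an assumption based on the PUT/GET wording in the description. No HTTP library is involved; the struct only describes what would be sent.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one control-plane request; not the PR's types.
struct ControlRequest {
    std::string method; // "PUT" or "GET"
    std::string url;    // {endpoint_override}/{key}
    std::vector<std::pair<std::string, std::string>> headers;
    std::string body;   // always empty: object bytes travel over RDMA
};

// Builds the header-only request for one transfer. The RDMA descriptor is
// passed through untouched as the value of the x-scal-rdma header.
inline ControlRequest makeXferRequest(bool isWrite,
                                      const std::string &endpointOverride,
                                      const std::string &key,
                                      const std::string &rdmaDescriptor) {
    ControlRequest request;
    request.method = isWrite ? "PUT" : "GET"; // assumed mapping
    request.url = endpointOverride + "/" + key;
    request.headers.emplace_back("x-scal-rdma", rdmaDescriptor);
    // request.body is left empty on purpose.
    return request;
}
```

For example, `makeXferRequest(true, "http://10.0.0.1:81", "my-object", descriptor)` would describe a PUT to `http://10.0.0.1:81/my-object` with a single header and no body; `my-object` is a made-up identifier.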

Also, what about key generation? I thought it was by path, but it's by key?
If it's by key, we need to explain the key format.

Also, what are the parameters for object size? Is there a maximum size? What is our recommendation for object size? This should be explained as well.

Author

It is by path. I've changed the term "key" to "object id", hoping to avoid confusion.

The maximum object size is 4 GiB (the size is encoded on 32 bits in the descriptor generated by cuObject).
I think on our side we're going to use this with the new striping ring driver, which avoids the 512 MiB limit that we have on the ring driver chord.

cuObjClient ships its pkg-config file with a CUDA-version suffix
(e.g. cuobjclient-13.3). Detect the highest versioned name so the OBJ
accelerated engines build without hard-coding a version.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
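The "highest versioned name" selection in this commit lives in the build scripts, not in C++. The sketch below only illustrates the comparison idea: version fields are compared numerically, so `13.10` ranks above `13.3`. `parseVersion` and `highestVersionedName` are hypothetical helpers, and every candidate name other than `cuobjclient-13.3` is invented for the example.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Parses "13.3" into {13, 3}. Returns an empty vector if any field is empty
// or contains a non-digit character.
inline std::vector<unsigned long> parseVersion(const std::string &text) {
    std::vector<unsigned long> parts;
    std::size_t pos = 0;
    while (pos <= text.size()) {
        std::size_t dot = text.find('.', pos);
        if (dot == std::string::npos) dot = text.size();
        const std::string field = text.substr(pos, dot - pos);
        if (field.empty() ||
            field.find_first_not_of("0123456789") != std::string::npos) {
            return {};
        }
        parts.push_back(std::stoul(field));
        pos = dot + 1;
    }
    return parts;
}

// Returns the name with the given prefix whose numeric suffix is highest,
// or std::nullopt if no candidate has a valid versioned suffix.
inline std::optional<std::string> highestVersionedName(
    const std::vector<std::string> &names, const std::string &prefix) {
    std::optional<std::string> best;
    std::vector<unsigned long> bestVersion;
    for (const auto &name : names) {
        if (name.compare(0, prefix.size(), prefix) != 0) continue;
        auto version = parseVersion(name.substr(prefix.size()));
        if (version.empty()) continue;
        // std::vector compares lexicographically, field by field.
        if (!best || version > bestVersion) {
            best = name;
            bestVersion = std::move(version);
        }
    }
    return best;
}
```

A plain string comparison would rank `cuobjclient-13.3` above `cuobjclient-13.10`, which is why the fields are parsed as numbers first.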
@oliviergaraud oliviergaraud force-pushed the scality-ai-connector-dc branch from 76fbc75 to e484f58 Compare June 15, 2026 15:21
@oliviergaraud oliviergaraud force-pushed the scality-ai-connector-dc branch 3 times, most recently from 2c5f4af to e22aa78 Compare June 15, 2026 17:26
Add an OBJ-plugin accelerated engine that moves object data to/from a
Scality AI Connector endpoint over RDMA (cuObject DC transport), using a
small HTTP request to carry the cuObject descriptor in an x-scal-rdma
header. The data plane is RDMA (GPU-direct capable); the HTTP control
plane has an empty body.

The HTTP control-plane client (RestClient) drives all requests with
libcurl's multi interface from a single dedicated poller thread;
completion callbacks are dispatched to a small worker pool.
num_threads sizes that pool.

registerMem/deregisterMem set the current CUDA device to the buffer's GPU
before cuMemObjGetDescriptor/PutDescriptor so multi-GPU registrations
don't fail cuFile's device-memory check.
cuda_dep is linked for the runtime API.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
Wire-format tests assert the connector sends correct PUT/GET/HEAD requests
(URL, x-scal-rdma header, empty body) against a loopback server, with no
RDMA/cuObject needed. Concurrency tests prove the curl_multi poller keeps
more requests in flight than its thread count and that teardown with
outstanding requests fires every callback exactly once.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
Seed objects for READ benchmarks concurrently across an OpenMP region.
Each putObj is one network-bound HTTP PUT.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
@oliviergaraud oliviergaraud force-pushed the scality-ai-connector-dc branch from e22aa78 to de07520 Compare June 16, 2026 06:14
Wire up the OBJ accelerated path in nixlbench (flags, REST object
pre-population/cleanup helpers, kvbench args) so the Scality AI Connector
engine can be benchmarked. Allocate VRAM on the requested CUDA device so
multi-GPU runs spread across devices.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
@oliviergaraud oliviergaraud force-pushed the scality-ai-connector-dc branch from a724a32 to f5e3a26 Compare June 16, 2026 09:06
cuObject bakes a single primary device GID (one NIC) into each RDMA
token, chosen from the current CUDA device at registration time. Add a
targetCudaDevice() helper (device_select.h) and wire it into register /
deregister so each MR binds to the right device: VRAM to the buffer's
own GPU, DRAM round-robin across GPUs so host buffers fan across NICs.
The CUDA device count is queried once at construction (gpuCount_, 0 when
no GPU is present, in which case the current device is left unchanged).

Add device_select_test.cpp covering the VRAM / DRAM / no-GPU cases.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
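The selection policy from this commit message (VRAM binds to the buffer's own GPU, DRAM is spread round-robin across GPUs, and nothing changes when no GPU is present) can be sketched as follows. This is not the PR's `targetCudaDevice()` from `device_select.h`: `DeviceSelector`, `MemKind`, and the choice to start the round-robin at device 0 are assumptions, and no CUDA call is made. A `std::nullopt` result stands for "leave the current device unchanged".

```cpp
#include <atomic>
#include <optional>

// Hypothetical memory classification for the sketch.
enum class MemKind { Vram, Dram };

class DeviceSelector {
public:
    // gpuCount is queried once by the caller; 0 means no GPU is present.
    explicit DeviceSelector(int gpuCount) : gpuCount_(gpuCount) {}

    // Returns the CUDA device a registration should bind to, or std::nullopt
    // when there is no GPU and the current device should be left alone.
    std::optional<int> target(MemKind kind, int bufferDevice) {
        if (gpuCount_ <= 0) return std::nullopt;
        if (kind == MemKind::Vram) return bufferDevice; // buffer's own GPU
        // Host memory: rotate across GPUs so buffers fan out across NICs.
        const unsigned turn = next_.fetch_add(1, std::memory_order_relaxed);
        return static_cast<int>(turn % static_cast<unsigned>(gpuCount_));
    }

private:
    int gpuCount_;
    std::atomic<unsigned> next_{0};
};
```

With two GPUs, three consecutive DRAM registrations would map to devices 0, 1, and 0 under this sketch, while a VRAM buffer on device 1 always maps to device 1.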
- rename client.{h,cpp} to rest_client.{h,cpp} to match RestClient
- split RestClient::finishRequest into dispatchHeadResult /
  dispatchXferResult; guard empty callbacks before posting and warn
  (not swallow) on callback exceptions
- make iRestClient / iRdmaTokenClient destructors protected non-virtual
  (only ever owned via shared_ptr) and add [[nodiscard]] to query methods
- make ScalityObjEngineImpl::cuClient_ / connectorClient_ const,
  initialized in the member initializer list via makeConnectorClient
- pass connector_client by const reference
- drop the C-style typedef on rdma_ctx_t and redundant string initializers
- minor const / comment cleanups

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>