Introduce Scality AI Connector #2
3ed2c14 to 9994385
|---|---|---|---|
| `accelerated` | yes | n/a | Must be `"true"` to request an accelerated engine |
| `type` | yes | n/a | Must be `"scality_ai_connector"` to select this engine |
| `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:81` |
Suggested change:
- | `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:81` |
+ | `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:10000` |
| `rdma_token_client.h` | `iRdmaTokenClient`: the data-plane interface and the `getCtx()` helper. |
| `cuobj_rdma_token_client.{h,cpp}` | `CuObjRdmaTokenClient`: the DC RDMA implementation, a thin wrapper over NVIDIA `cuObjClient` (`CUOBJ_PROTO_RDMA_DC_V1`). |

This is a **DC-only** build. The `iRdmaTokenClient` abstraction is the seam
I'll remove it.
I've kept the iRdmaTokenClient abstraction for our RC implementation that we don't want to expose (it's useful for local development but it doesn't scale).
9994385 to 76fbc75
- **Data plane (RDMA):** the object bytes move directly between the local
buffer and Scality over RDMA, zero-copy and GPU-direct capable, so data in
GPU memory never needs a staging copy through host memory.
- **Control plane (HTTP):** a tiny HTTP request tells Scality *what* to do

# Scality AI Connector

A NIXL **OBJ-plugin engine** that moves object data to and from a Scality
Suggested change:
- A NIXL **OBJ-plugin engine** that moves object data to and from a Scality
+ A NIXL (NVIDIA Inference Xfer Library) **OBJ-plugin engine** that moves object data to and from a Scality
The connector splits each transfer into two independent channels:

- **Data plane (RDMA):** the object bytes move directly between the local
buffer and Scality over RDMA, zero-copy and GPU-direct capable, so data in
Suggested change:
- buffer and Scality over RDMA, zero-copy and GPU-direct capable, so data in
+ buffer and Scality over RDMA, zero-copy and GPUDirect-capable, so data in
interface, so the in-flight request count does not depend on `num_threads` (the
bulk data moves over RDMA; curl only carries the tiny command). Each completed
request's callback runs in a worker pool sized by `num_threads`, so a slow
callback cannot stall the poller. Size `num_threads` to how much work the
Suggested change:
- callback cannot stall the poller. Size `num_threads` to how much work the
+ callback cannot stall the poller. Size `num_threads` according to the amount of work each
An alternative, but feel free to dismiss.
bulk data moves over RDMA; curl only carries the tiny command). Each completed
request's callback runs in a worker pool sized by `num_threads`, so a slow
callback cannot stall the poller. Size `num_threads` to how much work the
callbacks do; the default suits cheap callbacks.
Suggested change:
- callbacks do; the default suits cheap callbacks.
+ callback does. The default value is sufficient for lightweight callbacks.
Avoid the use of "cheap"; it can feel pejorative.
## Troubleshooting

| Symptom (in logs) | Likely cause / fix |
Suggested change:
- | Symptom (in logs) | Likely cause / fix |
+ | Symptom (in logs) | Potential cause / fix |
# Scality AI Connector

A NIXL **OBJ-plugin engine** that moves object data to and from a Scality
A plugin engine for the OBJ backend of
NIXL, the NVIDIA Inference Xfer
Library: one API for moving data between HBM, DRAM, NVMe, file, and object
storage, with pluggable backends.
This engine connects NIXL to a Scality endpoint and splits each transfer
into two planes: the data plane is RDMA; the control
plane is a header-only HTTP request that names the operation and the
object key. Object bytes never travel over HTTP.
## Prerequisites

- **NVIDIA cuObject / GPUDirect Storage** installed and working. This provides
Is there a specific/minimum version we require?
Nothing special on our side. This will work with CUDA 13+ (which is what NIXL requires).
# Scality AI Connector

A NIXL **OBJ-plugin engine** that moves object data to and from a Scality
endpoint over **RDMA**, using a small HTTP request only to carry the command.
I sometimes see "small" and sometimes "tiny"; I propose "header-only HTTP request".
- **DC transport only.** There is no fallback to other RDMA transports.
- **4 GiB** maximum per memory registration (a cuObject limit).
- Data is **never** sent over HTTP; the HTTP body is always empty by design.
That's not a limitation; it's the reason this is being developed. It's more of a design guarantee.
- A reachable **Scality AI Connector REST endpoint**.
- Build dependencies: `cuobjclient` and `libcurl`.

## Configuration
We need a **Building** section to explain how to build it from source, no?
| `endpoint_override` | yes | n/a | Base URL of the connector, e.g. `http://10.0.0.1:81` |
| `num_threads` | no | `max(2, cpu_threads / 4)` | Size of the callback worker pool (see [Concurrency model](#concurrency-model)). |

Requests are issued to `{endpoint_override}/{key}`.
Also, what about key generation? I thought it was by path, but it's by key?
If it is by key, we need to explain the key format.
Also, what are the parameters for object size? Is there a max size? What is our recommendation for object size? This should be explained as well.
It is by path. I've changed the term "key" to "object id", hoping to avoid confusion.
The max size of an object is 4 GiB (the size is coded on 32 bits in the descriptor generated by cuObject).
I think on our side we're going to use this with the new striping ring driver, which avoids the 512 MiB limit that we have on ring driver chord.
cuObjClient ships its pkg-config file with a CUDA-version suffix (e.g. cuobjclient-13.3). Detect the highest versioned name so the OBJ accelerated engines build without hard-coding a version.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
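The selection rule described in this commit can be illustrated with a small sketch. The real detection presumably lives in the build configuration, which is not shown here; the C++ below only illustrates the rule. `parseVersion`, `highestVersionedName`, and the candidate list are hypothetical names, and the sketch assumes the suffix is a dotted decimal version that should be compared numerically (so `13.10` ranks above `13.3`).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: parse the dotted numeric suffix that follows `prefix`,
// e.g. "cuobjclient-13.3" -> {13, 3}. Returns nullopt if the name does not match.
std::optional<std::vector<int>> parseVersion(const std::string &name,
                                             const std::string &prefix) {
    if (name.size() <= prefix.size() ||
        name.compare(0, prefix.size(), prefix) != 0) {
        return std::nullopt;
    }
    std::vector<int> parts;
    int current = 0;
    bool have_digit = false;
    for (std::size_t i = prefix.size(); i < name.size(); ++i) {
        const char c = name[i];
        if (c >= '0' && c <= '9') {
            current = current * 10 + (c - '0');
            have_digit = true;
        } else if (c == '.' && have_digit) {
            parts.push_back(current);
            current = 0;
            have_digit = false;
        } else {
            return std::nullopt; // anything else is not a version suffix
        }
    }
    if (!have_digit) return std::nullopt; // trailing '.' or empty component
    parts.push_back(current);
    return parts;
}

// Hypothetical helper: pick the candidate with the highest numeric version.
std::optional<std::string> highestVersionedName(
    const std::vector<std::string> &names,
    const std::string &prefix = "cuobjclient-") {
    std::optional<std::string> best;
    std::vector<int> best_version;
    for (const auto &name : names) {
        const auto version = parseVersion(name, prefix);
        if (!version) continue;
        // std::vector compares lexicographically, component by component.
        if (!best || *version > best_version) {
            best = name;
            best_version = *version;
        }
    }
    return best;
}
```

Comparing parsed components rather than raw strings matters here: a plain string comparison would rank `cuobjclient-13.3` above `cuobjclient-13.10`.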
76fbc75 to e484f58
2c5f4af to e22aa78
Add an OBJ-plugin accelerated engine that moves object data to/from a Scality AI Connector endpoint over RDMA (cuObject DC transport), using a small HTTP request to carry the cuObject descriptor in an x-scal-rdma header. The data plane is RDMA (GPU-direct capable); the HTTP control plane has an empty body.

The HTTP control-plane client (RestClient) drives all requests with libcurl's multi interface from a single dedicated poller thread; completion callbacks are dispatched to a small worker pool. num_threads sizes that pool.

registerMem/deregisterMem set the current CUDA device to the buffer's GPU before cuMemObjGetDescriptor/PutDescriptor so multi-GPU registrations don't fail cuFile's device-memory check. cuda_dep is linked for the runtime API.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
Wire-format tests assert the connector sends correct PUT/GET/HEAD requests (URL, x-scal-rdma header, empty body) against a loopback server, with no RDMA/cuObject needed. Concurrency tests prove the curl_multi poller keeps more requests in flight than its thread count and that teardown with outstanding requests fires every callback exactly once.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
Seed objects for READ benchmarks concurrently across an OpenMP region. Each putObj is one network-bound HTTP PUT.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
e22aa78 to de07520
Wire up the OBJ accelerated path in nixlbench (flags, REST object pre-population/cleanup helpers, kvbench args) so the Scality AI Connector engine can be benchmarked. Allocate VRAM on the requested CUDA device so multi-GPU runs spread across devices.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
a724a32 to f5e3a26
cuObject bakes a single primary device GID (one NIC) into each RDMA token, chosen from the current CUDA device at registration time. Add a targetCudaDevice() helper (device_select.h) and wire it into register / deregister so each MR binds to the right device: VRAM to the buffer's own GPU, DRAM round-robin across GPUs so host buffers fan across NICs. The CUDA device count is queried once at construction (gpuCount_, 0 when no GPU is present, in which case the current device is left unchanged).

Add device_select_test.cpp covering the VRAM / DRAM / no-GPU cases.

Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
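The device-selection rule in this commit message can be sketched as below. The actual `targetCudaDevice()` signature in `device_select.h` is not shown in this conversation, so `DeviceSelector`, `MemKind`, and the `std::optional<int>` return convention are hypothetical stand-ins; no CUDA call is made, and the caller is assumed to set the returned device itself.

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the memory type of the buffer being registered.
enum class MemKind { Dram, Vram };

// Hypothetical selector: returns the CUDA device to make current before
// registering a buffer, or nullopt to leave the current device unchanged.
class DeviceSelector {
public:
    // gpu_count is queried once by the caller; 0 means no GPU is present.
    explicit DeviceSelector(int gpu_count) : gpu_count_(gpu_count) {}

    std::optional<int> target(MemKind kind, int buffer_device) {
        if (gpu_count_ <= 0) {
            return std::nullopt; // no GPU: do not touch the current device
        }
        if (kind == MemKind::Vram) {
            return buffer_device; // VRAM binds to the buffer's own GPU
        }
        // DRAM: round-robin so successive host buffers spread across GPUs,
        // and therefore across the NICs associated with them.
        const std::uint64_t n = next_dram_.fetch_add(1, std::memory_order_relaxed);
        return static_cast<int>(n % static_cast<std::uint64_t>(gpu_count_));
    }

private:
    const int gpu_count_;
    std::atomic<std::uint64_t> next_dram_{0};
};
```

With three GPUs, successive DRAM registrations in this sketch would map to devices 0, 1, 2, 0, and so on, while a VRAM buffer on device 2 always maps to device 2. The atomic counter keeps the round-robin consistent if registrations come from several threads.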
- rename client.{h,cpp} to rest_client.{h,cpp} to match RestClient
- split RestClient::finishRequest into dispatchHeadResult /
dispatchXferResult; guard empty callbacks before posting and warn
(not swallow) on callback exceptions
- make iRestClient / iRdmaTokenClient destructors protected non-virtual
(only ever owned via shared_ptr) and add [[nodiscard]] to query methods
- make ScalityObjEngineImpl::cuClient_ / connectorClient_ const,
initialized in the member initializer list via makeConnectorClient
- pass connector_client by const reference
- drop the C-style typedef on rdma_ctx_t and redundant string initializers
- minor const / comment cleanups
Signed-off-by: Olivier Garaud <olivier.garaud@scality.com>
What?
Adds a new NIXL OBJ-plugin accelerated engine, "Scality AI Connector", that moves object data to and from a Scality endpoint over RDMA (cuObject DC transport), using a small HTTP request only to carry the command.
The PR includes:
- The engine (`src/plugins/obj/rest_accel/scality_ai_connector/`): a two-plane design where object bytes move over RDMA (GPU-direct capable) while a tiny HTTP `PUT`/`GET` with an empty body carries the cuObject RDMA descriptor in an `x-scal-rdma` header. The HTTP control-plane client (`RestClient`) drives all requests with libcurl's multi interface from a single poller thread, dispatching completion callbacks to a worker pool sized by `num_threads`.
- Detection of the highest versioned `cuobjclient-<ver>` pkg-config name so the accelerated engines build without hard-coding a CUDA version.
- Wire-format tests (`PUT`/`GET`/`HEAD`, `x-scal-rdma` header, empty body, against a loopback server, no RDMA/cuObject required) and concurrency tests (poller keeps more requests in flight than its thread count; teardown with outstanding requests fires every callback exactly once).

Why?
Scality's AI Connector speaks its own REST dialect and is not S3-compatible, so the existing `s3`/`s3_crt` engines cannot target it. This engine lets NIXL move large buffers (often in GPU memory) to and from a Scality store over RDMA/GPUDirect, with HTTP used purely as a lightweight control plane. The nixlbench and test changes make the new path measurable and verifiable in CI without requiring RDMA hardware.

How?
- The `x-scal-rdma` header carries the RDMA token. Scality performs the actual RDMA transfer against the registered buffer.
- Completion callbacks are dispatched to a `num_threads`-sized worker pool so a slow callback cannot stall the loop.
- `registerMem`/`deregisterMem` set the current CUDA device to the buffer's GPU before `cuMemObjGet/PutDescriptor`, so multi-GPU registrations do not fail cuFile's device-memory check.
- A `CuObjRdmaTokenClient` wrapper over NVIDIA `cuObjClient` (`CUOBJ_PROTO_RDMA_DC_V1`), behind the `iRdmaTokenClient` seam (also the test-injection point). The `cuObject` stack is required at build and run time; without it the engine is not compiled in.