Merged
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,6 +1,8 @@
.idea/
.claude/
cmake-build-debug*/
.claude/
CLAUDE.md

*.onnx

80 changes: 76 additions & 4 deletions README.md
@@ -150,7 +150,7 @@ sudo cmake --install build --prefix /usr/local/onnxruntime-server
| `--workers` | `ONNX_SERVER_WORKERS` | Worker thread pool size.<br/>Default: `4` |
| `--request-payload-limit` | `ONNX_SERVER_REQUEST_PAYLOAD_LIMIT` | HTTP/HTTPS request payload size limit.<br />Default: `1024 * 1024 * 10` (10MB) |
| `--model-dir` | `ONNX_SERVER_MODEL_DIR` | Model directory path<br/>The onnx model files must be located in the following path:<br/>`${model_dir}/${model_name}/${model_version}/model.onnx` or<br/>`${model_dir}/${model_name}/${model_version}.onnx`<br/>Default: `models` |
| `--prepare-model` | `ONNX_SERVER_PREPARE_MODEL` | Pre-create some model sessions at server startup.<br/><br/>Format as a space-separated list of `model_name:model_version` or `model_name:model_version(session_options, ...)`.<br/><br/>Available session_options are<br/>- cuda=device_id`[ or true or false]`<br/><br/>eg) `model1:v1 model2:v9`<br/>`model1:v1(cuda=true) model2:v9(cuda=1)` |
| `--prepare-model` | `ONNX_SERVER_PREPARE_MODEL` | Pre-create some model sessions at server startup.<br/><br/>Format as a space-separated list of `model_name:model_version` or `model_name:model_version(opt1=val1, opt2=val2, ...)`. Option keys may use dotted notation to address nested groups (e.g. `cuda.device_id`, `session_options.intra_op_num_threads`). Repeating the `extensions` key accumulates a deduplicated array. Option entries that do not match the grammar are skipped silently rather than failing the whole list.<br/><br/>Examples:<br/>- `model1:v1 model2:v9`<br/>- `model1:v1(cuda=true) model2:v9(cuda=1)`<br/>- `bert:v1(cuda.device_id=0, cuda.gpu_mem_limit=2147483648)`<br/>- `bert:v1(session_options.intra_op_num_threads=4, session_options.graph_optimization_level=all)`<br/>- `bert:v1(extensions=/usr/local/lib/libortextensions.so)` |
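
The grammar described in the `--prepare-model` row can be illustrated with a small parser. The Python sketch below is not the server's implementation; the regular expressions, the `parse_prepare_model` name, and the value typing (`true`/`false` become booleans, digit strings become integers, everything else stays a string) are assumptions made for illustration only.

```python
import re

# Assumed shape of one entry: name:version or name:version(k=v, k=v, ...)
ENTRY_RE = re.compile(r"([^\s:()]+):([^\s:()]+)(?:\(([^)]*)\))?")
# Assumed shape of one option: a (possibly dotted) key, "=", and a non-empty value
OPTION_RE = re.compile(r"^\s*([A-Za-z_][\w.]*)\s*=\s*(.+?)\s*$")


def coerce(text):
    """Assumed value typing: booleans and integers are recognized, the rest stay strings."""
    if text in ("true", "false"):
        return text == "true"
    if re.fullmatch(r"-?\d+", text):
        return int(text)
    return text


def set_dotted(target, key, value):
    """Store value under a dotted key, creating nested groups as needed."""
    *groups, leaf = key.split(".")
    for group in groups:
        child = target.get(group)
        if not isinstance(child, dict):
            child = target[group] = {}
        target = child
    target[leaf] = value


def parse_prepare_model(spec):
    """Parse a space-separated prepare-model list into session-create payloads."""
    sessions = []
    for name, version, raw_options in ENTRY_RE.findall(spec):
        option = {}
        for item in raw_options.split(","):
            match = OPTION_RE.match(item)
            if match is None:
                continue  # entries outside the grammar are skipped, not fatal
            key, text = match.groups()
            if key == "extensions":
                # Repeated extensions keys accumulate into a deduplicated list.
                paths = option.setdefault("extensions", [])
                if text not in paths:
                    paths.append(text)
            else:
                set_dotted(option, key, coerce(text))
        sessions.append({"model": name, "version": version, "option": option})
    return sessions
```

Under these assumptions, `parse_prepare_model("bert:v1(cuda.device_id=0, cuda.gpu_mem_limit=2147483648)")` returns one entry whose `option` is `{"cuda": {"device_id": 0, "gpu_mem_limit": 2147483648}}`.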

### Backend options

@@ -223,20 +223,92 @@ docker run --name onnxruntime_server_container -d --rm --gpus all \

## ONNXRuntime Extensions Support

To use the [onnxruntime-extensions](https://github.com/microsoft/onnxruntime-extensions)(Custom Ops Library), set the
options as follows when creating a session.
To use the [onnxruntime-extensions](https://github.com/microsoft/onnxruntime-extensions) (Custom Ops Library), supply
one or more library paths through the `extensions` array. The server registers each path with ORT in order and
deduplicates entries.

```json
{
  "model": "string",
  "version": "string",
  "option": {
    "cuda": ...,
    "ortextensions_path": "/absolute/path/to/libonnxruntime_extensions.so"
    "extensions": [
      "/absolute/path/to/libonnxruntime_extensions.so"
    ]
  }
}
```

The legacy `ortextensions_path` (single string) is still accepted for backward compatibility; it is normalized into the
`extensions` array on the server side and the response always echoes the normalized form.
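
The normalization of the legacy key can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the server's code: the function name `normalize_extensions` is hypothetical, and placing the legacy path after the entries already in `extensions` is an assumption, since the ordering between the two keys is not specified here.

```python
def normalize_extensions(option):
    """Return a copy of option with ortextensions_path folded into extensions."""
    normalized = {k: v for k, v in option.items() if k != "ortextensions_path"}
    paths = list(option.get("extensions", []))
    legacy = option.get("ortextensions_path")
    if legacy is not None:
        paths.append(legacy)  # assumption: the legacy path is appended last
    # dict.fromkeys keeps the first occurrence of each path and preserves order
    deduplicated = list(dict.fromkeys(paths))
    if deduplicated:
        normalized["extensions"] = deduplicated
    return normalized
```

With this sketch, `{"ortextensions_path": "/a.so"}` becomes `{"extensions": ["/a.so"]}`, which mirrors the normalized form the response is described as echoing.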

## Session-level options

The optional `session_options` object on a session-create request forwards the listed keys to the underlying
onnxruntime `SessionOptions`. Only the JSON shape (types and our enum-string mapping) is validated on the server side;
the actual value validation is delegated to ORT, and the response echoes only the values ORT accepted.

```json
{
  "model": "string",
  "version": "string",
  "option": {
    "session_options": {
      "intra_op_num_threads": 4,
      "inter_op_num_threads": 1,
      "execution_mode": "sequential",
      "graph_optimization_level": "all",
      "enable_cpu_mem_arena": true,
      "enable_mem_pattern": true,
      "log_severity_level": 2,
      "logid": "my-model",
      "enable_profiling": false,
      "profile_file_prefix": "/var/log/onnx/profile-",
      "optimized_model_filepath": "/cache/optimized.onnx",
      "free_dimension_overrides": { "batch": 1 },
      "config_entries": {
        "session.disable_prepacking": "1"
      }
    }
  }
}
```
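
A client can run a similar shape check before sending a request. The Python sketch below is a hypothetical client-side helper, not the server's validator: the `shape_errors` name and the error strings are made up, the enum strings are taken from the OpenAPI schema in this change, and `free_dimension_overrides` and `config_entries` are left unchecked for brevity.

```python
# Enum strings as listed in the OpenAPI schema of this change.
ENUMS = {
    "execution_mode": {"sequential", "parallel"},
    "graph_optimization_level": {"disable", "basic", "extended", "all"},
}
# Expected JSON types for the scalar keys.
TYPES = {
    "intra_op_num_threads": int,
    "inter_op_num_threads": int,
    "enable_cpu_mem_arena": bool,
    "enable_mem_pattern": bool,
    "log_severity_level": int,
    "logid": str,
    "enable_profiling": bool,
    "profile_file_prefix": str,
    "optimized_model_filepath": str,
}


def shape_errors(session_options):
    """Return a list of shape problems; value ranges are left for ORT to judge."""
    errors = []
    for key, value in session_options.items():
        if key in ENUMS:
            if not isinstance(value, str) or value not in ENUMS[key]:
                errors.append(f"{key}: expected one of {sorted(ENUMS[key])}")
        elif key in TYPES:
            expected = TYPES[key]
            # bool is a subclass of int in Python, so reject it for integer keys
            is_bool_as_int = expected is int and isinstance(value, bool)
            if not isinstance(value, expected) or is_bool_as_int:
                errors.append(f"{key}: expected {expected.__name__}")
    return errors
```

An empty list means the payload has the expected shape; whether ORT then accepts each value is a separate question that only the response can answer.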

`config_entries` is round-tripped through `GetSessionConfigEntry`, so the response shows what ORT actually stored
(string values; `true`/`42` become `"1"`/`"42"`).
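
The stringification can be sketched in Python as follows. This is an assumption-laden illustration rather than the server's code: the helper name `stringify_config_entries` is hypothetical, and mapping `false` to `"0"` is an assumption, since only the `true` to `"1"` and `42` to `"42"` cases are stated above.

```python
def stringify_config_entries(entries):
    """Convert config_entries values to the strings ORT would store."""
    result = {}
    for key, value in entries.items():
        if isinstance(value, bool):
            # Check bool before int: bool is a subclass of int in Python.
            result[key] = "1" if value else "0"  # "0" for false is an assumption
        elif isinstance(value, int):
            result[key] = str(value)
        elif isinstance(value, str):
            result[key] = value
        else:
            raise TypeError(f"{key}: expected string, boolean, or integer")
    return result
```

A client that compares its request against the echoed response should therefore compare against the stringified values, not the original JSON types.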

## CUDA execution provider options

When CUDA is enabled, the `cuda` field accepts either a boolean / integer (legacy shorthand) or an object that maps to
[CUDA Execution Provider options](https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html). The
server forwards the object to ORT via `UpdateCUDAProviderOptions` in a single batched call (per-key calls trigger a
sibling-reset quirk in ORT V2). If any key is rejected by ORT, session creation fails with the ORT error message
identifying the offending key. The response is built from `GetCUDAProviderOptionsAsString` readback, so it reflects
exactly what ORT stored.

```json
{
  "model": "string",
  "version": "string",
  "option": {
    "cuda": {
      "device_id": 0,
      "gpu_mem_limit": 2147483648,
      "arena_extend_strategy": "kNextPowerOfTwo",
      "cudnn_conv_algo_search": "EXHAUSTIVE",
      "cudnn_conv_use_max_workspace": true,
      "do_copy_in_default_stream": true,
      "enable_cuda_graph": false
    }
  }
}
```

Backward-compatible shortcuts:
- `"cuda": true` — enable CUDA with all defaults (`device_id=0`).
- `"cuda": 1` — enable CUDA on `device_id=1`.
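
The shortcuts can be read as a normalization step that turns every accepted form of `cuda` into an options object. The Python sketch below illustrates that reading; it is not the server's implementation, the `normalize_cuda` name is hypothetical, and returning `None` for a disabled provider is an assumption about representation.

```python
def normalize_cuda(value):
    """Expand the cuda shorthand into an options object, or None when CUDA is off."""
    if value is None or value is False:
        return None  # assumption: an absent or false value means CUDA is not used
    if value is True:
        return {"device_id": 0}  # "cuda": true uses the defaults
    if isinstance(value, int):
        return {"device_id": value}  # "cuda": 1 selects device 1
    if isinstance(value, dict):
        return dict(value)  # the object form is passed through unchanged
    raise TypeError("cuda must be a boolean, an integer, or an object")
```

The identity checks against `True` and `False` run before the integer check because Python treats booleans as integers; without that ordering `true` would be read as `device_id=1`.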

For more details on the session creation request, please refer to
the [API documentation](https://kibae.github.io/onnxruntime-server/swagger/#/ONNX%20Runtime%20Session/createSession).

161 changes: 155 additions & 6 deletions docs/swagger/openapi.yaml
@@ -269,15 +269,30 @@ components:
$ref: '#/components/schemas/ONNXSessionOption'
    ONNXSessionOption:
      type: object
      description: |
        Normalized echo of the options applied to the session. The server only includes
        keys whose corresponding ORT calls succeeded; values reflect what ORT actually
        stored (read back via GetCUDAProviderOptionsAsString and GetSessionConfigEntry
        where applicable).
      nullable: true
      properties:
        cuda:
          nullable: true
          required: false
          oneOf:
            - type: boolean
              description: Use CUDA
              description: CUDA disabled (false) — present for backward compatibility.
            - $ref: '#/components/schemas/ONNXSessionOptionCUDA'
        extensions:
          type: array
          description: Registered onnxruntime-extensions library paths in registration order, deduplicated.
          required: false
          items:
            type: string
          example:
            - /absolute/path/to/libonnxruntime_extensions.so
        session_options:
          $ref: '#/components/schemas/ONNXSessionOptionsGroup'
    ONNXSessionOptionRequest:
      type: object
      nullable: true
@@ -287,11 +302,13 @@ components:
          required: false
          oneOf:
            - type: boolean
              description: Use CUDA
              description: Enable CUDA with all defaults (device_id=0).
            - type: integer
              description: Enable CUDA on the given device_id.
            - $ref: '#/components/schemas/ONNXSessionOptionCUDA'
        input_shape:
          type: object
          description: Input shape
          description: Input shape overrides keyed by input name.
          nullable: false
          required: false
          example: {
@@ -301,25 +318,157 @@ components:
          }
        output_shape:
          type: object
          description: Output shape
          description: Output shape overrides keyed by output name.
          nullable: false
          required: false
          example: {
            "output": [ 1, 1 ]
          }
        extensions:
          type: array
          description: |
            One or more absolute paths to onnxruntime-extensions custom-ops libraries.
            Each path is registered with ORT in array order; duplicate paths are deduplicated.
          nullable: false
          required: false
          items:
            type: string
          example:
            - /absolute/path/to/libonnxruntime_extensions.so
        ortextensions_path:
          type: string
          description: To use the ONNXRuntime Extension (Custom Ops Library), you must provide the library path.
          description: |
            Deprecated alias for `extensions`. A single library path. The server normalizes
            it into the `extensions` array on input and the response always echoes the
            normalized form.
          deprecated: true
          nullable: false
          required: false
          example: /absolute/path/to/libonnxruntime_extensions
          example: /absolute/path/to/libonnxruntime_extensions.so
        session_options:
          $ref: '#/components/schemas/ONNXSessionOptionsGroup'
    ONNXSessionOptionCUDA:
      type: object
      description: |
        CUDA Execution Provider V2 options. The server forwards every supplied key to
        UpdateCUDAProviderOptions in a single batched call; if ORT rejects any key the
        whole session creation fails with the ORT error message. The response is built
        from GetCUDAProviderOptionsAsString readback, so it shows exactly what ORT
        stored (which may differ from the requested value if ORT normalized it).
      properties:
        device_id:
          type: integer
          description: CUDA device ID
          nullable: false
        gpu_mem_limit:
          type: integer
          description: Per-session GPU memory limit, in bytes.
          nullable: false
        arena_extend_strategy:
          type: string
          description: Arena extension strategy, e.g. "kNextPowerOfTwo" or "kSameAsRequested".
          nullable: false
        cudnn_conv_algo_search:
          type: string
          description: cuDNN convolution algorithm search policy. Accepted values are ORT-defined enum names.
          nullable: false
        cudnn_conv_use_max_workspace:
          type: boolean
          nullable: false
        do_copy_in_default_stream:
          type: boolean
          nullable: false
        enable_cuda_graph:
          type: boolean
          description: Capture and replay a CUDA graph (requires static input shapes).
          nullable: false
        tunable_op_enable:
          type: boolean
          nullable: false
        tunable_op_tuning_enable:
          type: boolean
          nullable: false
        cudnn_conv1d_pad_to_nc1d:
          type: boolean
          nullable: false
      additionalProperties:
        description: |
          Any additional CUDA Execution Provider V2 key understood by your ORT build is
          forwarded as-is. Refer to the ORT CUDA EP documentation for the full list of
          accepted keys.
    ONNXSessionOptionsGroup:
      type: object
      description: |
        Session-level options forwarded to onnxruntime SessionOptions. The server only
        validates JSON shape (types and our enum-string mapping); ORT decides whether the
        value itself is acceptable. Keys whose ORT setter throws are silently dropped from
        the echoed response. The `config_entries` object is round-tripped through
        GetSessionConfigEntry so the echo shows what ORT actually stored (always strings).
      nullable: false
      required: false
      properties:
        intra_op_num_threads:
          type: integer
          description: Number of threads used for parallelizing operators. 0 means ORT default.
          nullable: false
        inter_op_num_threads:
          type: integer
          description: Number of threads used for parallelizing the graph. 0 means ORT default.
          nullable: false
        execution_mode:
          type: string
          enum: [sequential, parallel]
          nullable: false
        graph_optimization_level:
          type: string
          enum: [disable, basic, extended, all]
          nullable: false
        enable_cpu_mem_arena:
          type: boolean
          nullable: false
        enable_mem_pattern:
          type: boolean
          nullable: false
        log_severity_level:
          type: integer
          description: ORT log severity level (0=verbose ... 4=fatal).
          nullable: false
        logid:
          type: string
          nullable: false
        enable_profiling:
          type: boolean
          description: Enable profiling. When true, profile_file_prefix must also be supplied.
          nullable: false
        profile_file_prefix:
          type: string
          nullable: false
        optimized_model_filepath:
          type: string
          description: Filepath where ORT writes the optimized model after graph transformations.
          nullable: false
        free_dimension_overrides:
          type: object
          description: Map of free dimension name to a fixed integer size.
          additionalProperties:
            type: integer
          nullable: false
          example:
            batch: 1
        config_entries:
          type: object
          description: |
            Generic passthrough to AddSessionConfigEntry (e.g. "session.disable_prepacking").
            Booleans and integers are stringified before being passed to ORT; values in the
            response are always strings (round-tripped through GetSessionConfigEntry).
          additionalProperties:
            oneOf:
              - type: string
              - type: boolean
              - type: integer
          nullable: false
          example:
            session.disable_prepacking: "1"
    ONNXSessionCreateRequest:
      type: object
      properties: