Triton is an inference server that enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real-time, batched, ensemble, and audio/video streaming. Triton Inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.
This is a tree summary of all Triton documents; the main documentation is here.
Details
This tutorial provides a quick-start process for beginners to start using Triton through many examples and does not go into too much technical detail.
Discussion
User guide
Architecture
Decoupled models
Model configuration
Model management
Model repository
Ragged batching
Optimization
Perf analyzer
Performance tuning
Customization guide
Perf Analyzer
GRPC Protocol
Ensemble image client
GRPC client
GRPC byte content client
GRPC explicit int8 content client
GRPC explicit int content client
GRPC image client
Image client
Memory growth test
Reuse infer objects client
Simple GRPC AIO infer client
Simple GRPC AIO sequence stream
Simple GRPC async infer client
Simple GRPC cudashm client
Simple GRPC custom args client
Simple GRPC custom repeat
Simple GRPC health metadata
Simple GRPC infer client
Simple GRPC keepalive client
Simple GRPC model control
GRPC sequence stream infer client
GRPC sequence sync infer client
Simple GRPC shm client
Simple GRPC shm string client
Simple GRPC string infer client
HTTP Protocol
Simple HTTP aio infer client
Simple HTTP async infer client
Simple HTTP cudashm client
Simple HTTP health metadata
Simple HTTP infer client
Simple HTTP model control
Simple HTTP sequence sync infer client
Simple HTTP shm client
Simple HTTP shm string client
Simple HTTP string infer client
Details
Install
CLI
Config
Config search
Ensemble quick start
Kubernetes Deploy
Launch mode
Metrics
Reports
Multi-model quick start
BLS model quick start
Checkpointing in Model Analyzer
An inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs. The Triton Model Navigator streamlines the process of moving models and pipelines implemented in PyTorch, TensorFlow, and/or ONNX to TensorRT.
A Triton backend is the implementation that executes a model. A backend can be a wrapper around a deep-learning framework, like PyTorch, TensorFlow, TensorRT or ONNX Runtime. Or a backend can be custom C/C++ logic performing any operation (for example, image pre-processing).
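As a rough illustration of the backend idea, a minimal Python-backend model.py could look like the sketch below. The tensor names INPUT0/OUTPUT0 and the doubling logic are illustrative assumptions; they would have to match the model's config.pbtxt.

# model.py -- minimal sketch of a Triton Python backend model.
# Assumes a config.pbtxt declaring an FP32 input "INPUT0" and an FP32 output "OUTPUT0".
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the input tensor as a numpy array
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            # Trivial "custom logic": multiply by 2
            output0 = pb_utils.Tensor("OUTPUT0", input0 * 2)
            responses.append(pb_utils.InferenceResponse(output_tensors=[output0]))
        return responses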
Python Backend
Business logic scripting
Add sub
BLS decoupled
Custom metrics
ONNX Runtime Backend
ONNX Runtime with TensorRT EP
ONNX Runtime with CUDA EP
ONNX Runtime with OpenVINO
Other optimization with ONNX
TensorRT Backend
Intro to Notebooks
Semantic segmentation
Deploy to Triton
Quantization tutorial
Torch-TensorRT with Triton
DALI Backend
Training to inference
Examples
DALI plugin
Efficient net
Inception ensemble
Perf Analyzer
ResNet50 TRT
Pytorch Backend
Paddle Paddle Backend
FasterTransformer Backend
A Flask/FastAPI-like framework designed to streamline the use of NVIDIA's Triton Inference Server within Python environments. PyTriton enables serving Machine Learning models with ease, supporting direct deployment from Python.
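For example, a minimal PyTriton sketch following its documented bind/serve pattern; the model name, tensor names, and shapes below are illustrative:

import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def infer_fn(input_0):
    # Toy inference function: double the input batch
    return {"output_0": input_0 * 2}


with Triton() as triton:
    # Bind the Python function as a Triton model named "Doubler"
    triton.bind(
        model_name="Doubler",
        infer_func=infer_fn,
        inputs=[Tensor(name="input_0", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="output_0", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=128),
    )
    # Blocks and serves the model over Triton's HTTP/gRPC endpoints
    triton.serve()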
Quick Start
Pull the NVIDIA Triton server image
docker pull nvcr.io/nvidia/tritonserver:24.06-py3
Pull the NVIDIA Triton client image for inference
docker pull nvcr.io/nvidia/tritonserver:24.06-py3-sdk
Create and run a container for the Triton server
docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/dev/triton/model-repository:/models nvcr.io/nvidia/tritonserver:24.06-py3 tritonserver --model-repository=/models
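Once the container is up, readiness can be verified against Triton's standard /v2/health/ready HTTP endpoint, for example from Python:

import requests

# Triton exposes /v2/health/ready on the HTTP port (8000 above);
# a 200 response means the server is ready to accept inference requests.
response = requests.get("http://localhost:8000/v2/health/ready")
print("Server ready:", response.status_code == 200)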
We recommend using a docker-compose file instead:
services:
  triton-server:
    image: nvcr.io/nvidia/tritonserver:24.06-py3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: tritonserver --model-repository=/models --model-control-mode=explicit --load-model=densenet_onnx
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    volumes:
      - ../model_repository:/models
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
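Because this compose file starts Triton with --model-control-mode=explicit, models can also be loaded and unloaded at runtime through the model repository endpoints; a minimal sketch (model name taken from the compose file above):

import requests

BASE = "http://localhost:8000"

# Explicit model control: ask Triton to (re)load a model from the repository by name
requests.post(f"{BASE}/v2/repository/models/densenet_onnx/load")

# ... and later unload it to free resources
requests.post(f"{BASE}/v2/repository/models/densenet_onnx/unload")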
Then enter bash, or attach a shell in VS Code, to follow the logs:
docker-compose ps
docker-compose exec triton-server bash
Create and run a container for the Triton client:
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.06-py3-sdk
Then, inside the container shell, run the prebuilt image_client example:
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
However, we can also create our own client using the HTTP or gRPC protocol:
import numpy as np
import requests

# Define the server URL
url = "http://localhost:8000/v2/models/densenet_onnx/infer"

# Create input data (example: an array of zeros)
input_data = np.zeros((3, 224, 224), dtype=np.float32)

# Prepare the data in JSON format
inputs = [
    {
        "name": "data_0",
        "shape": list(input_data.shape),
        "datatype": "FP32",
        "data": input_data.tolist()
    }
]
outputs = [
    {
        "name": "fc6_1"
    }
]
request_payload = {
    "inputs": inputs,
    "outputs": outputs
}

# Send the request to the Triton server
response = requests.post(url, json=request_payload)

# Check the response status
if response.status_code == 200:
    response_json = response.json()
    print(response_json.keys())
    output_data = np.array(response_json["outputs"][0]["data"]).reshape(
        response_json["outputs"][0]["shape"]
    )
    print("Output Data: ", output_data)
else:
    print("Request failed with status code: ", response.status_code)
    print("Response: ", response.text)
Then run this docker-compose.yml in the client directory:
services:
  triton-client:
    image: nvcr.io/nvidia/tritonserver:24.06-py3-sdk
    network_mode: host
    tty: true
    stdin_open: true
    restart: unless-stopped
    volumes:
      - ../:/workspace/inference/
Create the output directory first to avoid an error:
mkdir -p model-output/output
Run the Triton server with the docker-compose file above, then start the client container so it automatically connects to the Triton server:
docker run -it --gpus all -v /var/run/docker.sock:/var/run/docker.sock -v d/Documents/GitHub/MY-REPO/triton/model_analyzer:/workspace/model_analyzer --net=host nvcr.io/nvidia/tritonserver:24.06-py3-sdk
Now, inside the container, run Triton Model Analyzer with this:
model-analyzer profile \
    --model-repository /workspace/model_analyzer/ \
    --profile-models densenet_onnx --triton-launch-mode=remote \
    --output-model-repository-path /workspace/model_analyzer/model-output/output \
    --export-path /workspace/model_analyzer/profile_results \
    --override-output-model-repository
If you just want to test with limited experiments, add these flags:
--run-config-search-max-concurrency 2
--run-config-search-max-model-batch-size 2
--run-config-search-max-instance-count 2