A distributed inference benchmarking tool for NVIDIA Triton Inference Server. The project showcases distributed systems design, deep learning inference optimization, and production-grade AI infrastructure practices.
- Distributed inference using Ray for parallel processing (see the sketch after this list)
- TensorRT FP16 optimization support
- Robust error handling and retry mechanisms
- Comprehensive performance metrics (latency, throughput, P95, P99)
- Real-time visualization of latency distribution
- Containerized deployment with Docker
- CI/CD pipeline with GitHub Actions
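A minimal sketch of how the Ray-based request fan-out could look. The server URL, model name (`resnet50_fp16`), tensor names (`input__0`/`output__0`), and input shape below are assumptions for illustration, not the project's actual configuration:

```python
import time

import numpy as np
import ray
import tritonclient.http as httpclient

# Assumed server/model settings -- replace with the values used in benchmark.py.
TRITON_URL = "localhost:8000"
MODEL_NAME = "resnet50_fp16"            # hypothetical TensorRT FP16 model
INPUT_NAME, OUTPUT_NAME = "input__0", "output__0"
INPUT_SHAPE = [1, 3, 224, 224]

@ray.remote
def run_requests(num_requests: int) -> list[float]:
    """Each Ray worker opens its own Triton client and records per-request latency in ms."""
    client = httpclient.InferenceServerClient(url=TRITON_URL)
    data = np.random.rand(*INPUT_SHAPE).astype(np.float16)
    latencies = []
    for _ in range(num_requests):
        inp = httpclient.InferInput(INPUT_NAME, INPUT_SHAPE, "FP16")
        inp.set_data_from_numpy(data)
        out = httpclient.InferRequestedOutput(OUTPUT_NAME)
        start = time.perf_counter()
        client.infer(MODEL_NAME, inputs=[inp], outputs=[out])
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

if __name__ == "__main__":
    ray.init()
    # Fan out 4 workers x 50 requests each and gather all latency samples.
    futures = [run_requests.remote(50) for _ in range(4)]
    all_latencies = [lat for batch in ray.get(futures) for lat in batch]
    print(f"collected {len(all_latencies)} latency samples")
```

Each worker keeps its own Triton client, so measured request latency is independent of Ray's task-scheduling overhead.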
- Python 3.10+
- NVIDIA Triton Inference Server
- Ray for distributed computing
- TensorRT for optimized inference
- Docker for containerization
- GitHub Actions for CI/CD
- Average Latency (ms)
- P95/P99 Latency (ms), computed as sketched after this list
- Throughput (inferences/second)
- Success/Error Rate
- Latency Distribution Visualization
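The summary metrics can be derived from the raw per-request latencies roughly as follows; the function and field names here are illustrative, not the project's actual API:

```python
import numpy as np

def summarize(latencies_ms: list[float], errors: int, wall_time_s: float) -> dict:
    """Compute headline metrics from successful-request latencies (illustrative sketch)."""
    arr = np.asarray(latencies_ms)
    total = len(arr) + errors
    return {
        "avg_latency_ms": float(arr.mean()),
        "p95_latency_ms": float(np.percentile(arr, 95)),
        "p99_latency_ms": float(np.percentile(arr, 99)),
        "throughput_ips": len(arr) / wall_time_s,   # successful inferences per second
        "success_rate": len(arr) / total,
        "error_rate": errors / total,
    }
```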
- NVIDIA GPU with CUDA support
- Docker with NVIDIA Container Runtime
- Python 3.10+
- Clone the repository:
git clone https://github.com/yourusername/triton-inference-benchmark.git
cd triton-inference-benchmark
- Install dependencies:
pip install -r requirements.txt
- Build the Docker image:
docker build -t triton-benchmark .
- Run the benchmark in the container:
docker run --gpus all --network host triton-benchmark
- Or run it directly on the host:
python benchmark.py
The tool generates:
- JSON files with detailed metrics (see the output sketch after this list)
- Latency distribution plots
- Console logs with key performance indicators
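A sketch of how those artifacts might be written; the output file names are assumptions:

```python
import json

import matplotlib
matplotlib.use("Agg")  # headless rendering, e.g. inside the Docker container
import matplotlib.pyplot as plt

def save_results(metrics: dict, latencies_ms: list[float], prefix: str = "benchmark") -> None:
    # Detailed metrics as JSON (file name is an assumption).
    with open(f"{prefix}_metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)
    # Latency distribution as a histogram plot.
    plt.hist(latencies_ms, bins=50)
    plt.xlabel("Latency (ms)")
    plt.ylabel("Requests")
    plt.title("Latency distribution")
    plt.savefig(f"{prefix}_latency_distribution.png")
```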
- Configurable number of concurrent requests
- Customizable retry mechanisms (sketched after this list)
- Support for different model architectures
- Real-time performance monitoring
- Distributed load generation
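A sketch of what a retry wrapper with exponential backoff could look like; `max_retries` and `backoff_s` are illustrative parameters, not the project's actual configuration:

```python
import time

from tritonclient.utils import InferenceServerException

def infer_with_retries(client, model_name, inputs, outputs,
                       max_retries: int = 3, backoff_s: float = 0.5):
    """Retry a Triton request with exponential backoff between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return client.infer(model_name, inputs=inputs, outputs=outputs)
        except InferenceServerException:
            if attempt == max_retries:
                raise
            time.sleep(backoff_s * (2 ** attempt))
```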
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License