
# Distributed Inference Benchmarking Tool

## 🚀 Overview

A distributed inference benchmarking tool for NVIDIA Triton Inference Server. It uses Ray to generate parallel load against a Triton endpoint, collects detailed latency and throughput statistics, and supports TensorRT-optimized models and containerized deployment.

## ✨ Key Features

- Distributed inference using Ray for parallel load generation (see the sketch after this list)
- TensorRT FP16 optimization support
- Robust error handling and retry mechanisms
- Comprehensive performance metrics (average latency, P95/P99 latency, throughput)
- Visualization of the latency distribution
- Containerized deployment with Docker
- CI/CD pipeline with GitHub Actions
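
The core pattern behind the first feature is to fan requests out across Ray actors, each holding its own Triton client. Below is a minimal sketch of that idea, assuming a Triton HTTP endpoint on `localhost:8000`; the model name, input name, and shape are illustrative placeholders, not this tool's actual configuration:

```python
import time

import numpy as np
import ray
import tritonclient.http as httpclient


@ray.remote
class BenchmarkWorker:
    """One load-generation actor; each holds its own Triton HTTP client."""

    def __init__(self, url: str = "localhost:8000"):
        self.client = httpclient.InferenceServerClient(url=url)

    def run(self, n_requests: int, model: str = "resnet50") -> list:
        # Random FP32 input; name and shape are placeholders for the real model.
        data = np.random.rand(1, 3, 224, 224).astype(np.float32)
        inp = httpclient.InferInput("input", list(data.shape), "FP32")
        inp.set_data_from_numpy(data)
        latencies = []
        for _ in range(n_requests):
            start = time.perf_counter()
            self.client.infer(model, inputs=[inp])
            latencies.append((time.perf_counter() - start) * 1000.0)  # ms
        return latencies


if __name__ == "__main__":
    ray.init()
    workers = [BenchmarkWorker.remote() for _ in range(4)]
    per_worker = ray.get([w.run.remote(100) for w in workers])
    samples = [ms for worker in per_worker for ms in worker]
    print(f"collected {len(samples)} latency samples")
```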

## 🛠 Technical Stack

- Python 3.10+
- NVIDIA Triton Inference Server
- Ray for distributed computing
- TensorRT for optimized inference
- Docker for containerization
- GitHub Actions for CI/CD

## 📊 Performance Metrics

- Average latency (ms)
- P95/P99 latency (computed as sketched below)
- Throughput (inferences/second)
- Success/error rate
- Latency distribution visualization
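
For reference, a minimal sketch of how these statistics can be reduced from raw latency samples (NumPy's `percentile` gives P95/P99 directly; the function and field names here are illustrative, not the tool's exact output schema):

```python
import numpy as np


def summarize(latencies_ms, wall_time_s, n_errors=0):
    """Reduce raw per-request latencies (ms) to summary statistics."""
    arr = np.asarray(latencies_ms)
    return {
        "avg_latency_ms": float(arr.mean()),
        "p95_latency_ms": float(np.percentile(arr, 95)),
        "p99_latency_ms": float(np.percentile(arr, 99)),
        "throughput_ips": arr.size / wall_time_s,  # inferences/second
        "error_rate": n_errors / (arr.size + n_errors),
    }
```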

## 🔧 Setup & Installation

### Prerequisites

- NVIDIA GPU with CUDA support
- Docker with NVIDIA Container Runtime
- Python 3.10+

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/triton-inference-benchmark.git
   cd triton-inference-benchmark
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Build the Docker image:

   ```bash
   docker build -t triton-benchmark .
   ```

## 🚀 Usage

### Running with Docker

```bash
docker run --gpus all --network host triton-benchmark
```

### Running locally

```bash
python benchmark.py
```

## 📈 Output

The tool generates:

- JSON files with detailed metrics
- Latency distribution plots (see the sketch below)
- Console logs with key performance indicators
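
A hedged sketch of that output step, assuming illustrative file names (`benchmark_results.json`, `latency_distribution.png`) rather than the tool's actual output paths:

```python
import json

import matplotlib
matplotlib.use("Agg")  # headless backend, e.g. inside the container
import matplotlib.pyplot as plt


def save_results(metrics, latencies_ms):
    # Detailed metrics as JSON (file name is an assumption).
    with open("benchmark_results.json", "w") as f:
        json.dump(metrics, f, indent=2)
    # Latency distribution plot.
    plt.hist(latencies_ms, bins=50)
    plt.xlabel("Latency (ms)")
    plt.ylabel("Requests")
    plt.title("Latency Distribution")
    plt.savefig("latency_distribution.png")
```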

## 🔍 Advanced Features

- Configurable number of concurrent requests
- Customizable retry mechanisms (see the backoff sketch below)
- Support for different model architectures
- Real-time performance monitoring
- Distributed load generation
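
As an illustration of the retry mechanism, a common pattern is exponential backoff around the inference call. The sketch below uses tritonclient's real `InferenceServerException`, but the helper name and retry parameters are assumptions, not this tool's exact implementation:

```python
import time

from tritonclient.utils import InferenceServerException


def infer_with_retries(client, model, inputs, max_retries=3, base_delay=0.1):
    """Retry a single inference with exponential backoff (0.1s, 0.2s, ...)."""
    for attempt in range(max_retries + 1):
        try:
            return client.infer(model, inputs=inputs)
        except InferenceServerException:
            if attempt == max_retries:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
```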

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📝 License

MIT License
