A C++ Implementation of YoloV9 using TensorRT
Supports object detection.
Demo video: YOLOv9 (top) vs YOLOv8 (bottom), comparing yolov9-e-converted with yolov8x.
This project is actively seeking maintainers to help guide its growth and improvement. If you're passionate about this project and interested in contributing, I’d love to hear from you!
Please feel free to reach out via LinkedIn to discuss how you can get involved.
This project demonstrates how to use the TensorRT C++ API to run GPU inference for YoloV9. It uses my other project, tensorrt-cpp-api, to run inference behind the scenes, so make sure you are familiar with that project.
- Tested and working on Ubuntu 20.04 & 22.04 (Windows is not supported at this time)
- Install CUDA, instructions here.
- Recommended >= 12.0
- Install cuDNN, instructions here.
- Recommended >= 8
sudo apt install build-essential
sudo apt install python3-pip
pip3 install cmake
- Install OpenCV with CUDA support. To compile OpenCV from source, run the `build_opencv.sh` script provided here.
- Recommended >= 4.8
- Download TensorRT 10 from here.
- Required >= 10.0
- Extract, and then navigate to the `CMakeLists.txt` file and replace the `TODO` with the path to your TensorRT installation.
git clone https://github.com/cyrusbehr/YOLOv9-TensorRT-CPP --recursive
- Note: Be sure to use the `--recursive` flag as this repo makes use of git submodules.
- Navigate to the official YoloV9 repository and download your desired version of the model (ex. YOLOv9-C).
- Clone the official YoloV9 repository.
- From within the YoloV9 repository, run the following:
python3 export.py --weights <path to your pt file> --include onnx
- After running this command, your PyTorch weights should have been converted to an ONNX model.
- Note: If converting the model using a different script, be sure that `end2end` is disabled. This flag adds bbox decoding and NMS directly to the model, whereas my implementation does these steps external to the model using good old C++.
- Move the exported ONNX model into the `YOLOv9-TensorRT-CPP/models/` directory.
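To make the note about `end2end` more concrete, here is a minimal sketch of the kind of post-processing that has to happen outside the model when NMS is not baked into the ONNX graph. It is not the code used in this repository: the `Detection` struct, the function names, and the class-aware suppression rule are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical detection record; the project's own types may differ.
struct Detection {
    float x, y, w, h; // top-left corner, width and height in pixels
    float score;      // confidence after bbox decoding
    int classId;
};

// Intersection over union of two axis-aligned boxes.
float iou(const Detection &a, const Detection &b) {
    const float x1 = std::max(a.x, b.x);
    const float y1 = std::max(a.y, b.y);
    const float x2 = std::min(a.x + a.w, b.x + b.w);
    const float y2 = std::min(a.y + a.h, b.y + b.h);
    const float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
    const float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: drop low-score boxes, then keep the highest-scoring box and
// suppress same-class boxes that overlap it by more than iouThresh.
// Returns the indices of the kept detections, best score first.
std::vector<int> nms(const std::vector<Detection> &dets, float scoreThresh, float iouThresh) {
    std::vector<int> order;
    for (int i = 0; i < static_cast<int>(dets.size()); ++i) {
        if (dets[i].score >= scoreThresh) {
            order.push_back(i);
        }
    }
    std::sort(order.begin(), order.end(),
              [&dets](int a, int b) { return dets[a].score > dets[b].score; });

    std::vector<int> keep;
    for (int idx : order) {
        bool suppressed = false;
        for (int k : keep) {
            if (dets[k].classId == dets[idx].classId && iou(dets[k], dets[idx]) > iouThresh) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) {
            keep.push_back(idx);
        }
    }
    return keep;
}
```

If the model were exported with `end2end` enabled, these steps would already be part of the graph and running them again in C++ would be incorrect, which is why the flag must be off for this project.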
mkdir build
cd build
cmake ..
make -j
- Note: the first time you run any of the scripts, it may take quite a long time (5 mins+) as TensorRT must generate an optimized TensorRT engine file from the onnx model. This is then saved to disk and loaded on subsequent runs.
- To run the benchmarking script, run:
./benchmark --model /path/to/your/onnx/model.onnx --input /path/to/your/benchmark/image.png
- To run inference on an image and save the annotated image to disk, run:
./detect_object_image --model /path/to/your/onnx/model.onnx --input /path/to/your/image.jpg
- You can use the images in the `images/` directory for testing.
- To run inference using your webcam and display the results in real time, run:
./detect_object_video --model /path/to/your/onnx/model.onnx --input 0
- For a full list of arguments, run any of the executables without providing any arguments.
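The build-once, load-afterwards behaviour described in the note on first-run time can be pictured with the sketch below. It is only an illustration of the idea: the file naming scheme and the helper names `cachedEnginePath` and `fileExists` are hypothetical, and the actual logic lives in tensorrt-cpp-api and may differ.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical naming scheme: "<model path without extension>.<precision>.engine".
// Including the precision in the name means a change of precision triggers a
// rebuild instead of silently reusing an engine built with other settings.
std::string cachedEnginePath(const std::string &onnxPath, const std::string &precision) {
    const std::string::size_type dot = onnxPath.find_last_of('.');
    const std::string::size_type slash = onnxPath.find_last_of('/');
    const bool hasExtension =
        dot != std::string::npos && (slash == std::string::npos || dot > slash);
    const std::string stem = hasExtension ? onnxPath.substr(0, dot) : onnxPath;
    return stem + "." + precision + ".engine";
}

// True if the file can be opened for reading.
bool fileExists(const std::string &path) {
    std::ifstream f(path.c_str(), std::ios::binary);
    return f.good();
}
```

A caller would build and serialize the engine only when `fileExists(cachedEnginePath(...))` is false, and deserialize the saved file otherwise, which is why only the first run is slow.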
Enabling INT8 precision can further speed up inference, at the cost of some accuracy due to the reduced dynamic range. INT8 precision requires calibration data that is representative of the real data the model will see. It is advised to use 1K+ calibration images. To enable INT8 inference with the YoloV9 model, take the following steps:
- Download and extract the COCO validation dataset, or procure data representative of your inference data:
wget http://images.cocodataset.org/zips/val2017.zip
- Provide the additional command line arguments when running the executables:
--precision INT8 --calibration-data /path/to/your/calibration/data
- If you get an "out of memory in function allocate" error, then you must reduce `Options.calibrationBatchSize` so that the entire batch can fit in your GPU memory.
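As a rough illustration of why INT8 trades accuracy for speed and why the calibration data needs to be representative, the sketch below quantizes floats to 8-bit integers using a scale derived from sample values. This is a simplified max-abs scheme written for this README, with hypothetical function names; TensorRT's own calibrators choose ranges in a more sophisticated way.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Derive a scale from calibration samples so that the largest absolute value
// seen maps to 127. If the samples do not cover the real value range, real
// inputs get clipped, which is why representative data matters.
float calibrateScale(const std::vector<float> &samples) {
    float maxAbs = 0.0f;
    for (float v : samples) {
        maxAbs = std::max(maxAbs, std::fabs(v));
    }
    return maxAbs > 0.0f ? maxAbs / 127.0f : 1.0f;
}

// Symmetric quantization: round to the nearest step and clip to [-127, 127].
std::int8_t quantize(float value, float scale) {
    const float q = std::round(value / scale);
    return static_cast<std::int8_t>(std::max(-127.0f, std::min(127.0f, q)));
}

// Map the 8-bit value back to a float; the rounding error stays within half a
// step unless the value was clipped.
float dequantize(std::int8_t q, float scale) {
    return static_cast<float>(q) * scale;
}
```

Values inside the calibrated range come back with an error of at most half a quantization step, while values outside it are clipped to the range limit.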
- Before running benchmarks, ensure your GPU is idle (not running other workloads).
- Run the executable `benchmark` using the `/images/640_640.jpg` image.
- If you'd like to benchmark each component (`preprocess`, `inference`, `postprocess`), recompile setting the `ENABLE_BENCHMARKS` flag to `ON`: `cmake -DENABLE_BENCHMARKS=ON ..`.
- You can then rerun the executable.
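For readers who want to time a stage themselves, per-component timing of the kind listed above can be sketched with `std::chrono`. This is a generic helper written for illustration, not the code enabled by `ENABLE_BENCHMARKS`; the name `meanMillis` is hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `stage` `iterations` times and returns the mean wall-clock time per
// call in milliseconds. Assumption: `stage` blocks until its work is done.
// GPU work is launched asynchronously, so a GPU stage would need to
// synchronize (for example with cudaStreamSynchronize) before returning,
// otherwise only the launch overhead is measured.
double meanMillis(const std::function<void()> &stage, int iterations) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        stage();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

Averaging over many iterations, after a few warm-up runs, smooths out one-off costs such as lazy initialization on the first call.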
Benchmarks were run on an NVIDIA GeForce RTX 3080 Laptop GPU and an Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz, using a 640x640 BGR image in GPU memory at the precision listed in each row.
Model | Precision | Total Time | Preprocess Time | Inference Time | Postprocess Time |
---|---|---|---|---|---|
yolov9-e-converted | FP32 | 27.745 ms | 0.091 ms | 25.293 ms | 2.361 ms |
yolov9-e-converted | FP16 | 12.74 ms | 0.085 ms | 10.167 ms | 2.488 ms |
yolov9-e-converted | INT8 | 10.775 ms | 0.084 ms | 8.285 ms | 2.406 ms |
TODO: Need to improve postprocessing time using CUDA kernel.
- If you have issues creating the TensorRT engine file from the ONNX model, navigate to `libs/tensorrt-cpp-api/src/engine.cpp` and change the log level by changing the severity level to `kVERBOSE`, then rebuild and rerun. This should give you more information on where exactly the build process is failing.
If this project was helpful to you, I would appreciate it if you could give it a star. That will encourage me to keep it up to date and solve issues quickly.