[Bug]: Jetson Nano Slow Performance With GPU/CUDA #2766
Comments
I don't know what's going on in face_recognition. |
Is there any logging I could turn on that would help me determine if the slowdown is from face_recognition or dlib? What kind of timings SHOULD I be expecting? Is it safe to say 500-800ms is way too slow for GPU? |
It depends on your GPU, how big the image is, and what settings and code path that library uses, so I can't say. It could be anything. |
Ok, I appreciate the help. I opened a ticket with the face_recognition project once before and never got a reply. I'll give it another try. Thanks. |
After some debugging of the face_recognition code I've traced through it enough to see that the code is performing quickly up until the call to dlib.face_recognition_model_v1(face_recognition_model), where face_recognition_model points to the file dlib_face_recognition_resnet_model_v1.dat. That call takes over 800ms to return on a Jetson Nano. Should I be calling a different function/model? |
Are you sure that model is running on CUDA? I am not familiar with the Jetson Nano speeds, but those timings you get look a lot like CPU inference. It should be really fast, since it's a slimmed version of ResNet34 that operates on 150×150 images. |
When I run tegrastats I see GPU usage at 99%, so as far as I can tell it's running on the GPU. I also made sure I compiled dlib with CUDA support. Also, if I don't specify that model="cnn" then the whole operation on CPU takes 500ms. |
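A quick sanity check of the Python build, as a minimal sketch (the `DLIB_USE_CUDA` flag is part of dlib's Python bindings; the device-count helper is guarded because it may not exist in every version):

```python
import dlib

# DLIB_USE_CUDA is baked in at compile time; it confirms whether the imported
# wheel was actually built against CUDA/cuDNN.
print("Compiled with CUDA:", dlib.DLIB_USE_CUDA)

# Device-count helper, guarded because not every dlib version exposes it.
if dlib.DLIB_USE_CUDA and hasattr(dlib, "cuda") and hasattr(dlib.cuda, "get_num_devices"):
    print("CUDA devices visible:", dlib.cuda.get_num_devices())
```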
Hmm, can you run the inference on the network twice in a row and measure only the second time? Maybe your measurements include the allocation on the GPU. |
I actually have my test program running the inference 30 times in a row in a loop. The first call takes a LONG time (16-20s), and after that the time is consistently 840ms (give or take a millisecond or two). |
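For reference, a minimal sketch of that measurement pattern, timing the detector directly through dlib so the one-time CUDA allocation in the first call is kept out of the numbers (file names are just the ones mentioned in this thread):

```python
import time
import dlib

# Paths as used elsewhere in this thread; substitute your own files.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
img = dlib.load_rgb_image("humans_1.jpg")

detector(img, 0)  # warm-up: the first call pays for CUDA allocation/initialization

times = []
for _ in range(30):
    t0 = time.perf_counter()
    dets = detector(img, 0)
    times.append((time.perf_counter() - t0) * 1000)

print(f"median over 30 runs: {sorted(times)[len(times) // 2]:.1f} ms, faces found: {len(dets)}")
```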
Oh, you mean the GPU utilization? it stays at 99% while the 30 inference loop runs. |
Then I don't know what's going on. For reference, I timed the inference of one image using the dlib C++ example dnn_face_recognition_ex on a 12th Gen Intel® Core™ i7-1260P × 16 CPU, and it takes about 50 ms. |
Another thing, are you certain it's that model that's causing the latency? Not the face detector? How big are your images? |
The test image is 585x388. How do I determine if it's the face detector or not? The dlib call that I put timing measurements around was the call to dlib.face_recognition_model_v1(face_recognition_model). Does that do the detection and the recognition or did I misunderstand this and miss another dlib call somewhere? That call was the one that took 840ms. |
No, you're right, I'm looking at the wrong function... dlib.cnn_face_detection_model_v1(cnn_face_detection_model) is the function that is taking 840ms. The model it is using is mmod_human_face_detector.dat |
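For context, dlib's Python API keeps detection and recognition in separate objects; a rough sketch of the full pipeline (the model files are the standard dlib downloads, and the 68-point landmark model here is just one option):

```python
import dlib

# Face detection: finds bounding boxes (the expensive step discussed above).
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

# Landmarks + recognition: run only on the detected boxes, on small aligned crops.
shape_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

img = dlib.load_rgb_image("humans_1.jpg")
for det in detector(img, 0):
    shape = shape_predictor(img, det.rect)
    descriptor = face_encoder.compute_face_descriptor(img, shape)  # 128-D embedding
    print(len(descriptor))
```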
Ah, that makes more sense. That image doesn't seem that big, though. Try downscaling it (at the risk of missing the smaller faces: any face smaller than 80×80 pixels won't be detected). |
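A sketch of that downscaling step, using PIL for the resize (the scale factor is illustrative, and mapping the boxes back to original coordinates is only needed if you draw or crop on the full-size image):

```python
import numpy as np
from PIL import Image
import dlib

detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
img = dlib.load_rgb_image("humans_1.jpg")

# Shrink before detection: smaller input is faster, but any face that ends up
# under roughly 80x80 pixels after scaling will no longer be found.
scale = 0.5  # illustrative; tune for your images
h, w = img.shape[:2]
small = np.asarray(Image.fromarray(img).resize((int(w * scale), int(h * scale))))

dets = detector(small, 0)
# Map detections back to the original image's coordinate system.
boxes = [(int(d.rect.left() / scale), int(d.rect.top() / scale),
          int(d.rect.right() / scale), int(d.rect.bottom() / scale))
         for d in dets]
print(boxes)
```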
That did make a difference, dropping 840ms down to 160ms when I shrunk the file to 1/3 of its size, but it no longer detected any faces. Also, that still doesn't account for the face recognition that will have to be done. I'm confused as to why face detection/recognition is taking so much longer than something like human pose estimation, which I'm able to do on the Jetson in about 60ms. I've also been able to do face detection/recognition (using different code, admittedly) on a Raspberry Pi with Intel's Neural Compute Stick 2 in around 50-75ms, and the Jetson is many times more powerful than the NCS2. The detection model I'm using there is https://docs.openvino.ai/latest/omz_models_model_face_detection_retail_0004.html. |
Yeah, I don't know what's going on. That model should be really fast, even on relatively large images, since it only has 7 convolutional layers... There must be something else wrong. |
Can you just run the face detector model using the official dlib examples (cnn_face_detector.py and the C++ dnn_mmod_face_detection_ex)?
Maybe that other library you're using does something we are not aware of. Let's try to isolate the problem. |
Good idea. I will try both of those and let you know. I appreciate all of the help. Thanks very much! |
cnn_face_detector.py times ranged from 840ms to over 4s depending on the size of the image passed in. None of the measurements were under 840ms and all were utilizing the GPU. I'll try the C++ next, but this seems excessively long for GPU-based face detection. Edit: C++ results were similar. |
Ok, so I ran the C++ example (dnn_mmod_face_detection_ex) on an NVIDIA Quadro RTX 5000 and, with the default settings, the second inference on each image (to avoid measuring the memory allocation) took about 250 ms, which might seem quite slow at first.
However, if we look at the code, we can see that we're upscaling the images until they have at least about 1800×1800 pixels, which means we are doing inference on images that are about 4000×3000 pixels (they are actually larger because the network will create a tiled pyramid, but let's ignore that). If I change the code to upscale only to about 900×900 pixels, the runtime goes down to 70 ms, and the images end up at about 2000×1500 pixels. For the images in the dlib examples, that is enough to detect all faces. If I try 450×450 pixels, the inference time goes down to 20 ms, but I start to get false negatives (the smallest faces are no longer detected). So, for the images in that dataset, the optimal size is somewhere between 450×450 and 900×900. For reference, here are the modifications I made:

```diff
diff --git a/examples/dnn_mmod_face_detection_ex.cpp b/examples/dnn_mmod_face_detection_ex.cpp
index 3cdf4fcc..92988540 100644
--- a/examples/dnn_mmod_face_detection_ex.cpp
+++ b/examples/dnn_mmod_face_detection_ex.cpp
@@ -88,7 +88,7 @@ int main(int argc, char** argv) try
 
         // Upsampling the image will allow us to detect smaller faces but will cause the
         // program to use more RAM and run longer.
-        while(img.size() < 1800*1800)
+        while(img.size() < 900*900)
             pyramid_up(img);
 
         // Note that you can process a bunch of images in a std::vector at once and it runs
@@ -97,6 +97,10 @@ int main(int argc, char** argv) try
         // the same size. To avoid this requirement on images being the same size we
         // process them individually in this example.
         auto dets = net(img);
+        const auto t0 = chrono::steady_clock::now();
+        dets = net(img);
+        const auto t1 = chrono::steady_clock::now();
+        cout << "size: " << img.nc() << "×" << img.nr() << ", elapsed: " << chrono::duration_cast<chrono::duration<float, milli>>(t1 - t0).count() << " ms\n";
         win.clear_overlay();
         win.set_image(img);
         for (auto&& d : dets)
```
This means that you need to find a trade-off between speed and accuracy for your use case. |
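If the code stays inside the face_recognition wrapper instead of patching the C++ example, the same speed/accuracy knob is exposed through its `number_of_times_to_upsample` argument; a sketch, assuming that wrapper is in use:

```python
import face_recognition

image = face_recognition.load_image_file("humans_1.jpg")

# number_of_times_to_upsample=0 skips the internal pyramid_up, roughly the
# Python-level analogue of lowering the 1800*1800 target above: faster, but
# small faces may be missed.
fast = face_recognition.face_locations(image, number_of_times_to_upsample=0, model="cnn")

# Upsampling once doubles the resolution the detector effectively sees.
thorough = face_recognition.face_locations(image, number_of_times_to_upsample=1, model="cnn")

print(len(fast), len(thorough))
```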
I'll try calling dlib directly then and see. Just out of curiosity, what command do you use to rebuild the example code without rebuilding everything? |
Assuming you are at the top-level directory of the dlib repository:

```sh
cd examples
cmake -B build -G Ninja
cmake --build build -t dnn_mmod_face_detection_ex
```

If you don't specify a target with `-t`, it will build all of the examples. |
I tried the C++ example on a Jetson Nano and got similar results. |
There is also an issue (in CUDA, not dlib): if you just upscale instead of resizing to a fixed target size (I know about letterboxing and the like), CUDA memory has to be reallocated whenever the input size changes, which makes it nearly as slow as the notorious first pass on CUDA. |
I've noticed the slowdown with CUDA and the reallocation needed when image sizes differ. I'm processing a bunch of images, and it's easy for me to sort them by size. My question is: in Python, is there a way to release the memory held by CUDA without just exiting out? |
Letting all the objects go out of scope or deleting them will free the memory. But the CUDA runtime itself likes to hold onto memory and as far as I am aware there isn't any way to tell it to not do that. |
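A sketch of what letting the objects go out of scope can look like in practice (file names as earlier in the thread; as noted, the CUDA runtime may still keep its own pool even after dlib's buffers are freed):

```python
import gc
import dlib

def detect(image_path, model_path="mmod_human_face_detector.dat"):
    # Creating the detector inside the function means its GPU-side buffers
    # become unreachable, and can be freed, as soon as the function returns.
    detector = dlib.cnn_face_detection_model_v1(model_path)
    img = dlib.load_rgb_image(image_path)
    return [d.rect for d in detector(img, 0)]

boxes = detect("humans_1.jpg")
print(boxes)

# An explicit collection only drops lingering Python references; dlib cannot
# force the CUDA runtime itself to hand pooled memory back to the system.
gc.collect()
```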
@davisking thanks. I did realize that letting it go out of scope clears most of it, and that seems to work most of the time. If it hits an out-of-memory error and throws, though, it seems to hold the memory forever (I need to exit Python). Is there a way to calculate the largest image size it can take without maxing out? I've noticed that if I shrink large images (4000x3000) down a bit (0.625) and then upsample once (when calling the detector), it finds more faces than keeping the image at its native size and not upsampling. I don't particularly care about speed, just trying to optimize for the best results. |
I don't have any smarter way on hand to do it other than to test out a few sizes and measure to see what you get. |
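One way to "test out a few sizes", as a rough sketch: walk down from the native resolution until a run stops throwing CUDA out-of-memory errors (dlib surfaces those as RuntimeError, and, per the observation above, once one is thrown the memory may not come back without restarting Python, so each size may need its own process):

```python
import numpy as np
from PIL import Image
import dlib

detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
img = dlib.load_rgb_image("big_photo.jpg")  # hypothetical 4000x3000 input
h, w = img.shape[:2]

# Try a few scale factors and keep the largest one that runs without
# exhausting GPU memory.
for scale in (1.0, 0.8, 0.625, 0.5, 0.4):
    trial = img if scale == 1.0 else np.asarray(
        Image.fromarray(img).resize((int(w * scale), int(h * scale))))
    try:
        dets = detector(trial, 1)  # upsample once, as described above
        print(f"scale {scale}: ok, {len(dets)} faces")
        break
    except RuntimeError as err:
        print(f"scale {scale}: failed ({err})")
```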
What Operating System(s) are you seeing this problem on?
Other (please specify in the Steps to Reproduce)
dlib version
19.24
Python version
3.6
Compiler
gcc 7.5
Expected Behavior
I am attempting to create a facial detection/recognition component for a system I'm working on, and I am unable to get dlib/face_recognition to perform at better than 2 FPS under any circumstances.
The system is a Jetson Nano 4GB running Ubuntu 18.04 with JetPack 4.6 installed.
I built dlib from scratch (using this helper script: https://github.com/JpnTr/Jetson-Nano-Install-Dlib-Library) and verified that suggested Jetson specific patches were made (as per https://medium.com/@ageitgey/build-a-hardware-based-face-recognition-system-for-150-with-the-nvidia-jetson-nano-and-python-a25cb8c891fd).
The test code I am running (against a single picture at 585x388 resolution with 5 people in it) looks like:
```python
#!/usr/bin/python3.6
import face_recognition
import time

def current_milli_time():
    return round(time.time() * 1000)

for i in range(0, 30):
    t1 = current_milli_time()
    image = face_recognition.load_image_file("humans_1.jpg")
    t2 = current_milli_time()
    face_locations = face_recognition.face_locations(image, model="cnn")
    t3 = current_milli_time()
    print(face_locations)
    print("load: ", t2 - t1)
    print("detect: ", t3 - t2)
    print("Total: ", t3 - t1)
```
With no model specified (so the CPU is being used, I believe), the normal face detection time is about 500ms, give or take. When I specify model="cnn", that number actually INCREASES to over 800ms.
tegrastats verifies that my GPU utilization is 99%.
I've seen this issue reported by other people but I have yet to see a solution. Shouldn't this be a reasonably fast operation (under 100ms) on a GPU? I've seen other (c/c++ based) face detection methods that suggest that detection can take as little as 20-50ms.
Current Behavior
Current behavior is that face detection takes 500ms on the CPU and even longer (800+ms) when using CUDA/GPU.
Steps to Reproduce
Nothing fancy, just run the code I provided.
Anything else?
No response