[Bug]: Jetson Nano Slow Performance With GPU/CUDA #2766
Comments
I don't know what's going on in face_recognition. |
Is there any logging I could turn on that would help me determine if the slowdown is from face_recognition or dlib? What kind of timings SHOULD I be expecting? Is it safe to say 500-800ms is way too slow for GPU? |
It depends on your GPU, how big the image is, and what settings and code path that library uses, so I can't say. It could be anything. |
Ok, I appreciate the help. I opened a ticket with the face_recognition project once before and never got a reply. I'll give it another try. Thanks. |
After some debugging of the face_recognition code I've traced through it enough to see that the code is performing quickly up until the call to dlib.face_recognition_model_v1(face_recognition_model), where face_recognition_model points to the file dlib_face_recognition_resnet_model_v1.dat. That call takes over 800ms to return on a Jetson Nano. Should I be calling a different function/model? |
Are you sure that model is running on CUDA? I am not familiar with the Jetson Nano speeds, but those timings you get look a lot like CPU inference. It should be really fast, since it's a slimmed version of ResNet34 that operates on 150×150 images. |
When I run tegrastats I see GPU usage at 99%, so as far as I can tell it's running on the GPU. I also made sure I compiled dlib with CUDA support. Also, if I don't specify that model="cnn" then the whole operation on CPU takes 500ms. |
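A quick sanity check of the Python build, as a minimal sketch (the `DLIB_USE_CUDA` flag is part of dlib's Python bindings; the device-count helper is guarded because it may not exist in every version):

```python
import dlib

# DLIB_USE_CUDA is baked in at compile time; it confirms whether the imported
# wheel was actually built against CUDA/cuDNN.
print("Compiled with CUDA:", dlib.DLIB_USE_CUDA)

# Device-count helper, guarded because not every dlib version exposes it.
if dlib.DLIB_USE_CUDA and hasattr(dlib, "cuda") and hasattr(dlib.cuda, "get_num_devices"):
    print("CUDA devices visible:", dlib.cuda.get_num_devices())
```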
Hmm, can you run the inference on the network twice in a row and measure only the second time? Maybe your measurements include the allocation on the GPU. |
I actually have my test program running the inference 30 times in a row in a loop. The first call takes a LONG time (16-20s), and after that the time is consistently 840ms (give or take a millisecond or two). |
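For reference, a minimal sketch of that measurement pattern, timing the detector directly through dlib so the one-time CUDA allocation in the first call is kept out of the numbers (file names are just the ones mentioned in this thread):

```python
import time
import dlib

# Paths as used elsewhere in this thread; substitute your own files.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
img = dlib.load_rgb_image("humans_1.jpg")

detector(img, 0)  # warm-up: the first call pays for CUDA allocation/initialization

times = []
for _ in range(30):
    t0 = time.perf_counter()
    dets = detector(img, 0)
    times.append((time.perf_counter() - t0) * 1000)

print(f"median over 30 runs: {sorted(times)[len(times) // 2]:.1f} ms, faces found: {len(dets)}")
```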
Oh, you mean the GPU utilization? it stays at 99% while the 30 inference loop runs. |
Then I don't know what's going on. For reference, I timed the inference of one image using the dlib C++ example dnn_face_recognition_ex on a 12th Gen Intel® Core™ i7-1260P × 16 CPU, and it takes about 50 ms. |
Another thing, are you certain it's that model that's causing the latency? Not the face detector? How big are your images? |
The test image is 585x388. How do I determine if it's the face detector or not? The dlib call that I put timing measurements around was the call to dlib.face_recognition_model_v1(face_recognition_model). Does that do the detection and the recognition or did I misunderstand this and miss another dlib call somewhere? That call was the one that took 840ms. |
No, you're right, I'm looking at the wrong function... dlib.cnn_face_detection_model_v1(cnn_face_detection_model) is the function that is taking 840ms. The model it is using is mmod_human_face_detector.dat |
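For context, dlib's Python API keeps detection and recognition in separate objects; a rough sketch of the full pipeline (the model files are the standard dlib downloads, and the 68-point landmark model here is just one option):

```python
import dlib

# Face detection: finds bounding boxes (the expensive step discussed above).
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

# Landmarks + recognition: run only on the detected boxes, on small aligned crops.
shape_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

img = dlib.load_rgb_image("humans_1.jpg")
for det in detector(img, 0):
    shape = shape_predictor(img, det.rect)
    descriptor = face_encoder.compute_face_descriptor(img, shape)  # 128-D embedding
    print(len(descriptor))
```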
Ah, that makes more sense. That image doesn't seem that big, though. Try downscaling it (at the risk of missing the smaller faces: any face smaller than 80×80 pixels won't be detected). |
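A sketch of that downscaling step, using PIL for the resize (the scale factor is illustrative, and mapping the boxes back to original coordinates is only needed if you draw or crop on the full-size image):

```python
import numpy as np
from PIL import Image
import dlib

detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
img = dlib.load_rgb_image("humans_1.jpg")

# Shrink before detection: smaller input is faster, but any face that ends up
# under roughly 80x80 pixels after scaling will no longer be found.
scale = 0.5  # illustrative; tune for your images
h, w = img.shape[:2]
small = np.asarray(Image.fromarray(img).resize((int(w * scale), int(h * scale))))

dets = detector(small, 0)
# Map detections back to the original image's coordinate system.
boxes = [(int(d.rect.left() / scale), int(d.rect.top() / scale),
          int(d.rect.right() / scale), int(d.rect.bottom() / scale))
         for d in dets]
print(boxes)
```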
That did make a difference, dropping 840ms down to 160ms when I shrunk the file to 1/3 of its size, but it no longer detected any faces. Also, that still doesn't account for the face recognition that will have to be done. I'm confused as to why face detection/recognition is taking so much longer than something like human pose estimation, which I'm able to do on the Jetson in about 60ms. I've also been able to do face detection/recognition (using different code, admittedly) on a Raspberry Pi with Intel's Neural Compute Stick 2 in around 50-75ms, and the Jetson is many times more powerful than the NCS2. The detection model I'm using there is https://docs.openvino.ai/latest/omz_models_model_face_detection_retail_0004.html. |
Yeah, I don't know what's going on. That model should be really fast, even on relatively large images, since it only has 7 convolutional layers... There must be something else wrong. |
Can you just run the face detector model using the official dlib examples (cnn_face_detector.py and the C++ dnn_mmod_face_detection_ex)?
Maybe that other library you're using does something we are not aware of. Let's try to isolate the problem. |
Good idea. I will try both of those and let you know. I appreciate all of the help. Thanks very much! |
cnn_face_detector.py times ranged from 840ms to over 4s depending on the size of the image passed in. None of the measurements were under 840ms and all were utilizing the GPU. I'll try the C++ next, but this seems excessively long for GPU-based face detection. Edit: C++ results were similar. |
Ok, so I ran the C++ example (dnn_mmod_face_detection_ex) on an NVIDIA Quadro RTX 5000 and, with the default settings, the second inference on each image (to avoid measuring the memory allocation) took about 250 ms, which might seem quite slow at first.
However, if we look at the code, we can see that we're upscaling the images until they have at least about 1800×1800 pixels, which means we are doing inference on images that are about 4000×3000 pixels (they are actually larger because the network will create a tiled pyramid, but let's ignore that). If I change the code to upscale only to about 900×900 pixels, the runtime goes down to 70 ms, and the images end up at about 2000×1500 pixels. For the images in the dlib examples, that is enough to detect all faces. If I try 450×450 pixels, the inference time goes down to 20 ms, but I start to get false negatives (the smallest faces are no longer detected). So, for the images in that dataset, the optimal size is somewhere between 450×450 and 900×900. For reference, here are the modifications I made:

```diff
diff --git a/examples/dnn_mmod_face_detection_ex.cpp b/examples/dnn_mmod_face_detection_ex.cpp
index 3cdf4fcc..92988540 100644
--- a/examples/dnn_mmod_face_detection_ex.cpp
+++ b/examples/dnn_mmod_face_detection_ex.cpp
@@ -88,7 +88,7 @@ int main(int argc, char** argv) try
 
         // Upsampling the image will allow us to detect smaller faces but will cause the
         // program to use more RAM and run longer.
-        while(img.size() < 1800*1800)
+        while(img.size() < 900*900)
             pyramid_up(img);
 
         // Note that you can process a bunch of images in a std::vector at once and it runs
@@ -97,6 +97,10 @@ int main(int argc, char** argv) try
         // the same size. To avoid this requirement on images being the same size we
         // process them individually in this example.
         auto dets = net(img);
+        const auto t0 = chrono::steady_clock::now();
+        dets = net(img);
+        const auto t1 = chrono::steady_clock::now();
+        cout << "size: " << img.nc() << "×" << img.nr() << ", elapsed: " << chrono::duration_cast<chrono::duration<float, milli>>(t1 - t0).count() << " ms\n";
         win.clear_overlay();
         win.set_image(img);
         for (auto&& d : dets)
```
This means that you need to find a trade-off between speed and accuracy for your use case. |
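If the code stays inside the face_recognition wrapper instead of patching the C++ example, the same speed/accuracy knob is exposed through its `number_of_times_to_upsample` argument; a sketch, assuming that wrapper is in use:

```python
import face_recognition

image = face_recognition.load_image_file("humans_1.jpg")

# number_of_times_to_upsample=0 skips the internal pyramid_up, roughly the
# Python-level analogue of lowering the 1800*1800 target above: faster, but
# small faces may be missed.
fast = face_recognition.face_locations(image, number_of_times_to_upsample=0, model="cnn")

# Upsampling once doubles the resolution the detector effectively sees.
thorough = face_recognition.face_locations(image, number_of_times_to_upsample=1, model="cnn")

print(len(fast), len(thorough))
```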
I'll try calling dlib directly then and see. Just out of curiosity, what command do you use to rebuild the example code without rebuilding everything? |
Assuming you are at the top-level directory of the dlib repository:

```sh
cd examples
cmake -B build -G Ninja
cmake --build build -t dnn_mmod_face_detection_ex
```

If you don't specify a target with `-t`, it will build all of the examples. |
I tried the C++ example on a Jetson Nano and got similar results. |
There is also an issue (in CUDA, not dlib): if you just upscale instead of resizing to a fixed target size (I know about letterboxing and the like), CUDA memory has to be reallocated whenever the input size changes, which makes it nearly as slow as the notorious first pass on CUDA. |
I've noticed the slowdown with CUDA and the reallocation needed when image sizes differ. I'm processing a bunch of images, and it's easy for me to sort them by size. My question is: in Python, is there a way to release the memory held by CUDA without just exiting out? |
Letting all the objects go out of scope or deleting them will free the memory. But the CUDA runtime itself likes to hold onto memory and as far as I am aware there isn't any way to tell it to not do that. |
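A sketch of what letting the objects go out of scope can look like in practice (file names as earlier in the thread; as noted, the CUDA runtime may still keep its own pool even after dlib's buffers are freed):

```python
import gc
import dlib

def detect(image_path, model_path="mmod_human_face_detector.dat"):
    # Creating the detector inside the function means its GPU-side buffers
    # become unreachable, and can be freed, as soon as the function returns.
    detector = dlib.cnn_face_detection_model_v1(model_path)
    img = dlib.load_rgb_image(image_path)
    return [d.rect for d in detector(img, 0)]

boxes = detect("humans_1.jpg")
print(boxes)

# An explicit collection only drops lingering Python references; dlib cannot
# force the CUDA runtime itself to hand pooled memory back to the system.
gc.collect()
```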
@davisking thanks. I did realize that letting it go out of scope clears most of it, and that seems to work most of the time. If it hits an out-of-memory error and throws, though, it seems to hold the memory forever (I need to exit Python). Is there a way to calculate the largest image size it can take without maxing out? I've noticed that if I shrink large images (4000x3000) down a bit (0.625) and then upsample once (when calling the detector), it finds more faces than keeping the image at its native size and not upsampling. I don't particularly care about speed, just trying to optimize for the best results. |
I don't have any smarter way on hand to do it other than to test out a few sizes and measure to see what you get. |
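One way to "test out a few sizes", as a rough sketch: walk down from the native resolution until a run stops throwing CUDA out-of-memory errors (dlib surfaces those as RuntimeError, and, per the observation above, once one is thrown the memory may not come back without restarting Python, so each size may need its own process):

```python
import numpy as np
from PIL import Image
import dlib

detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
img = dlib.load_rgb_image("big_photo.jpg")  # hypothetical 4000x3000 input
h, w = img.shape[:2]

# Try a few scale factors and keep the largest one that runs without
# exhausting GPU memory.
for scale in (1.0, 0.8, 0.625, 0.5, 0.4):
    trial = img if scale == 1.0 else np.asarray(
        Image.fromarray(img).resize((int(w * scale), int(h * scale))))
    try:
        dets = detector(trial, 1)  # upsample once, as described above
        print(f"scale {scale}: ok, {len(dets)} faces")
        break
    except RuntimeError as err:
        print(f"scale {scale}: failed ({err})")
```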
What Operating System(s) are you seeing this problem on?
Other (please specify in the Steps to Reproduce)
dlib version
19.24
Python version
3.6
Compiler
gcc 7.5
Expected Behavior
I am attempting to create a facial detection/recognition component for a system I'm working on, and I am unable to get dlib/face_recognition to perform at better than 2 FPS under any circumstances.
The system is a Jetson Nano 4GB running Ubuntu 18.04 with JetPack 4.6 installed.
I built dlib from scratch (using this helper script: https://github.com/JpnTr/Jetson-Nano-Install-Dlib-Library) and verified that suggested Jetson specific patches were made (as per https://medium.com/@ageitgey/build-a-hardware-based-face-recognition-system-for-150-with-the-nvidia-jetson-nano-and-python-a25cb8c891fd).
The test code I am running (against a single picture at 585x388 resolution with 5 people in it) looks like:
```python
#!/usr/bin/python3.6
import face_recognition
import time

def current_milli_time():
    return round(time.time() * 1000)

for i in range(0, 30):
    t1 = current_milli_time()
    image = face_recognition.load_image_file("humans_1.jpg")
    t2 = current_milli_time()
    face_locations = face_recognition.face_locations(image, model="cnn")
    t3 = current_milli_time()
    print(face_locations)
    print("load: ", t2 - t1)
    print("detect: ", t3 - t2)
    print("Total: ", t3 - t1)
```
With no model specified (so the CPU is being used, I believe), the normal face detection time is about 500ms, give or take. When I specify model="cnn", that number actually INCREASES to over 800ms.
tegrastats verifies that my GPU utilization is 99%.
I've seen this issue reported by other people but I have yet to see a solution. Shouldn't this be a reasonably fast operation (under 100ms) on a GPU? I've seen other (c/c++ based) face detection methods that suggest that detection can take as little as 20-50ms.
Current Behavior
Current behavior is that face detection takes 500ms on the CPU and even longer (800+ms) when using CUDA/GPU.
Steps to Reproduce
Nothing fancy, just run the code I provided.
Anything else?
No response