Optimize for multiple streams (drop frames, reduce delays, reduce memory usage) #257
Hi @mfoglio
Any time you start optimizing software which utilizes NVIDIA HW, a good place to start is profiling. Below you can see a screenshot of an application which is clearly bottlenecked by CUDA core performance. As long as your application isn't limited by Nvdec performance, just decode all video frames one by one and discard the frames you don't need. Also, I don't recommend a single-threaded approach if you're aiming for top performance. Split the work into 2 threads (e.g. a decoding producer and a consuming processor), roughly as in the sketch below:
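A minimal sketch of that two-thread layout (not the author's code; the RTSP URL and queue size are placeholders, while `PyNvDecoder` / `DecodeSingleSurface` are the standard VPF calls):

```python
import queue
import threading

import PyNvCodec as nvc


def decode_worker(url, gpu_id, out_q):
    """Producer: decode every frame, but keep only the newest one in the queue."""
    dec = nvc.PyNvDecoder(url, gpu_id)
    while True:
        surf = dec.DecodeSingleSurface()
        if surf.Empty():
            break  # end of stream / decode error
        # NOTE: decoded surfaces come from the decoder's internal pool, so
        # copy or convert them before holding on to them for long.
        try:
            out_q.get_nowait()  # drop the stale frame nobody consumed
        except queue.Empty:
            pass
        out_q.put(surf)


frames = queue.Queue(maxsize=1)
threading.Thread(
    target=decode_worker, args=("rtsp://camera/stream", 0, frames), daemon=True
).start()

# Consumer: runs at its own pace; intermediate frames are simply discarded.
while True:
    surface = frames.get()
    # ... color-convert / run inference on `surface` here ...
```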
I can only help you to track the memory allocations happening in VPF, not in PyTorch.
Thank you @rarzumanyan for the …
Hi @mfoglio
There's no need to guess; there are lots of VPF profiling options:
Each decoded frame has a PTS, which is its presentation timestamp. It increases monotonically, and by its value you can estimate how "fresh" a decoded frame is. Take a look at #253 for more information on this topic.
SW design is a topic of its own, so I can't help you with anything more substantial than advice, but there are ways to mitigate this problem, e.g. a signal / slot connection between your consumer and producer. Since …
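A small sketch of the PTS check mentioned above (assuming a VPF build that exposes `PacketData`; the URL is a placeholder):

```python
import PyNvCodec as nvc

gpu_id = 0
dec = nvc.PyNvDecoder("rtsp://camera/stream", gpu_id)
pkt_data = nvc.PacketData()

surf = dec.DecodeSingleSurface(pkt_data)
if not surf.Empty():
    # pkt_data.pts increases monotonically; comparing it against the newest
    # PTS you have observed tells you how stale this particular frame is.
    print("decoded surface with pts =", pkt_data.pts)
```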
Thank you for your detailed response.
Hi @mfoglio. Generally, the entry point to any investigation is the same: compile VPF with all diagnostic options possible and use existing CUDA profiling tools. E.g. the Nsight Systems profiler can track all the CUDA API calls, and VPF uses those to allocate memory for video frames. Hence, by looking at the application timeline, you will see exactly what's happening and when. Sometimes Nsight struggles to collect the application timeline for multithreaded Python scripts, so a simpler decoding script (such as one of the VPF samples) is probably a good place to start.
Hello @rarzumanyan , could you provide more details about the difference between …
Hi @mfoglio. Nvdec is async by its nature and there's a delay between encoded frame submission and the moment it's ready for display. This latency is hidden when … However, one can use an external demuxer like GStreamer, PyAV or any other demuxer which produces an Annex.B elementary bitstream. In such case, …
Regarding …
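For reference, a sketch of that standalone-demuxer flow, following the pattern of VPF's SampleDemuxDecode.py (the input file name is a placeholder; an external demuxer producing Annex.B packets would feed `DecodeSurfaceFromPacket` the same way):

```python
import numpy as np
import PyNvCodec as nvc

gpu_id = 0
demux = nvc.PyFFmpegDemuxer("input.mp4")
dec = nvc.PyNvDecoder(demux.Width(), demux.Height(),
                      demux.Format(), demux.Codec(), gpu_id)

packet = np.ndarray(shape=(0,), dtype=np.uint8)
while demux.DemuxSinglePacket(packet):
    # Decide here whether to feed or skip the packet; remember that skipping
    # reference frames corrupts the frames that depend on them.
    surf = dec.DecodeSurfaceFromPacket(packet)
    if not surf.Empty():
        pass  # consume the surface

# Nvdec is asynchronous, so flush the delayed frames at the end.
while True:
    surf = dec.FlushSingleSurface()
    if surf.Empty():
        break
```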
Hello @rarzumanyan , this is really interesting! I really appreciate your help. I have a few things in mind to try, as well as a few other questions... Sorry! And thanks!

Is there any way to set a maximum size for the queue that you mentioned above? This way I could avoid "wasting" GPU memory by keeping in memory frames that I would still drop later on because of a slow consumer. Otherwise I guess I could use a standalone ffmpeg demuxer and drop packets until my consumer is ready; at that point I could resume decoding packets using …

It seems that …

Thank you, thank you, thank you!
Hi @mfoglio,
There are 2 places where the memory for decoded surfaces is allocated.

First is the decoded surfaces pool size (see VideoProcessingFramework/PyNvCodec/inc/PyNvCodec.hpp, lines 241 to 247 at ba47dca). You can slightly reduce the memory consumption by changing the …

Second is the decoder initialization stage (see VideoProcessingFramework/PyNvCodec/TC/src/NvDecoder.cpp, lines 186 to 194 at ba47dca). You can get a better estimation of the required amount of surfaces by going through …

VPF accepts a dictionary that is converted to … (see VideoProcessingFramework/PyNvCodec/TC/src/FFmpegDemuxer.cpp, lines 406 to 418 at ba47dca).
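For example, a sketch of passing such a dictionary (the keys are ordinary FFmpeg/libavformat option names and the values shown are illustrative):

```python
import PyNvCodec as nvc

gpu_id = 0
opts = {
    "rtsp_transport": "tcp",  # prefer TCP over UDP for RTSP sources
    "max_delay": "5000000",   # microseconds
}

# The dictionary is forwarded to the built-in FFmpeg-based demuxer.
dec = nvc.PyNvDecoder("rtsp://camera/stream", gpu_id, opts)

# The standalone demuxer accepts the same kind of dictionary.
demux = nvc.PyFFmpegDemuxer("rtsp://camera/stream", opts)
```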
Thanks @rarzumanyan .
At the moment I am initializing the decoder with:
How can I provide the parameter …? Possibly OT: it seems that the demuxer keeps disconnecting from the RTSP stream. The following is a log captured over about a minute:
It ended with:
However, ffprobe seems to find the stream:
Sometimes the demuxer …
If instead I try to initialize the decoder with …
Hi @mfoglio
This happens because the pool size isn't exported to Python land; you have to change the hard-coded value in C++ and recompile VPF. Honestly, the queue size was never exported to Python simply because nobody has ever asked ))
Reading input from RTSP cameras is the single most painful thing to do.
EDIT: the gstreamer pipeline seems to work. It wasn't working because of a typo. The ffmpeg pipeline does not work. Hello @rarzumanyan , and thanks again for following me through my journey. Code to reproduce the issue:
Output for ffmpeg:
I am looking for a public rtsp stream that you can use to replicate on your side (or to check that the code works on some streams on my side).
I could test that the stream works fine with a …
Also, if I run the same ffmpeg command in the console, the console starts printing binary data:
This is expected behavior.
It reads data from the RTSP source and applies … Nvdec HW can't work with video containers like AVI, MKV or any other. It expects an Annex.B elementary stream which consists of NAL Units. This is binary input which only contains compressed video without any extra information (like audio tracks or subtitles), because video codec standards only describe the video coding essentials and don't cover any video containers.
@rarzumanyan yes, I added that comment to confirm that ffmpeg was actually working on my machine. So it seems that ffmpeg returns data but the decoder cannot parse any frame.
Let's start from something simpler.
I saved about a minute of video using … EDIT: not sure if it's useful, but here's the output of ffmpeg when saving the video (I stopped it with Ctrl + C):
@rarzumanyan I can confirm that the gstreamer pipeline works. There was a typo (missing space). The FFmpeg pipeline does not work, but that's not an issue as long as gstreamer works. Going back to the memory optimization, can instances of …
Hi @mfoglio
There are 2 types of constructors for most VPF classes which use CUDA:
In such case, the given CUDA context and stream references will be saved within the class instance and used later on when doing CUDA work. You can either rely on … (see VideoProcessingFramework/SampleDemuxDecode.py, lines 47 to 59 at ba47dca), or … (see VideoProcessingFramework/SampleEncode.py, lines 49 to 55 at ba47dca).
Choose the option you find most suitable. One option is not better or worse than another, they are just different. To the best of my knowledge, you shall have no issues using CUDA memory objects created in a single context in operations submitted to different streams. Speaking in VPF terms: you can pass … As a rule of thumb, I recommend using a single CUDA context per GPU and retaining the primary CUDA context instead of creating your own. This is illustrated in …
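A sketch of the two constructor flavors (patterned on the cited samples; PyCUDA is used here to retain the primary context, and the resolution is a placeholder):

```python
import pycuda.driver as cuda
import PyNvCodec as nvc

gpu_id = 0

# Flavor 1: pass a GPU ordinal and let VPF pick context and stream itself.
dec = nvc.PyNvDecoder("input.mp4", gpu_id)

# Flavor 2: create/retain a context and stream yourself and hand the raw
# handles to VPF; they are stored and reused for this instance's CUDA work.
cuda.init()
ctx = cuda.Device(gpu_id).retain_primary_context()
ctx.push()
stream = cuda.Stream()
ctx.pop()

conv = nvc.PySurfaceConverter(1920, 1080, nvc.PixelFormat.NV12,
                              nvc.PixelFormat.RGB, ctx.handle, stream.handle)
```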
But what do you think, besides the CUDA streams? For instance, if they run asynchronously I'll probably need to put a thread lock around them to prevent the possibility of having surfaces switched across consumers. Also, I am not sure if a call to an instance of … However, taking a step back, I am facing a bigger issue.
It works, but it seems to be affected by a memory leak. I am running the code on a video stream and I have already reached 8 GB of GPU memory used, and it keeps increasing. As far as I understand, there's no way to fix this using plain Python VPF. Do you see any possible solution? You mentioned this #257 (comment), but these possible solutions would require editing the C++ code that performs video decoding, which is unfortunately quite far from my domain knowledge. I am happy to try to dive into it, but I would like to know if you see any easier solution.
Update: the code above reached …
Not sure if that matters, but VPF was compiled with …
I think you should be able to reproduce the memory leak using this public rtsp stream: rtsp://wowzaec2demo.streamlock.net/vod/mp4:BigBuckBunny_115k.mov and the following code:
Let me know if you can reproduce it.
I don't have gstreamer installed on my machine, and even if I install it with a package manager, it's not going to be the same as yours. I can decode the mentioned rtsp stream using …
Don't get me wrong with this, but I will not do the job for you. Please isolate the issue and make sure it lies inside VPF.
Thank you for your help!
Thank you for your work! It's very significant! Could you share your final code which decodes as many RTSP streams as possible on a single GPU? Originally posted by @mfoglio in #257 (comment)
Hi @stu-github , I am waiting for @rarzumanyan to finish up some fixes. I will share the code as soon as we have something more stable working.
Hi @stu-github and @mfoglio. Just a note: I'm in the process of developing a feature that shall allow passing open file handles to the demuxer, which shall make RTSP camera access easier; it's not a fix. This is taking longer than expected. Meanwhile you can use the code sample from the project wiki which illustrates how to read frames from an RTSP camera with an ffmpeg process. There's one caveat which I'd like to address with the mentioned new feature: right now one can only read from the ffmpeg output in fixed-size chunks, but actual compressed frames may be of different sizes, which requires you to fine-tune the speed at which VPF reads from the ffmpeg pipe.
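A rough sketch of that ffmpeg-process approach (simplified; the ffmpeg flags, resolution and chunk size are illustrative, and the chunk size is exactly the knob that needs tuning as described above):

```python
import subprocess

import numpy as np
import PyNvCodec as nvc

url = "rtsp://camera/stream"
gpu_id = 0

# Ask ffmpeg to demux the stream and emit an Annex.B H.264 elementary stream
# on stdout; VPF's standalone decoder consumes raw NAL units.
proc = subprocess.Popen(
    ["ffmpeg", "-i", url, "-c:v", "copy",
     "-bsf:v", "h264_mp4toannexb", "-f", "h264", "pipe:1"],
    stdout=subprocess.PIPE)

# Resolution must be known up front (e.g. from ffprobe); 1920x1080 is a placeholder.
dec = nvc.PyNvDecoder(1920, 1080, nvc.PixelFormat.NV12,
                      nvc.CudaVideoCodec.H264, gpu_id)

chunk_size = 4096  # fixed read size; compressed frames vary in size
while True:
    chunk = proc.stdout.read(chunk_size)
    if not chunk:
        break
    packet = np.frombuffer(bytearray(chunk), dtype=np.uint8)
    surf = dec.DecodeSurfaceFromPacket(packet)
    if not surf.Empty():
        pass  # consume the decoded surface
```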
Thanks for bringing this up, I think this is the reason indeed. Anyway, I was planning to merge to …
@rarzumanyan , I am still testing your code. Meanwhile I have a question: I think that before I could start a demuxer using …
Hi @mfoglio. Yes, I've removed that functionality; stream support between C++ and Python turned out to be a huge pain in the back. I'd rather parse ffprobe output for that reason than support streams (users will inevitably try to use streams for actual demuxing). One possible way to work around this and get stream properties would be to use the PyAV project, because parsing ffprobe output is, in my opinion, not a great idea.
Take a look at this sample: https://github.com/NVIDIA/VideoProcessingFramework/blob/pyav_support/SamplePyav.py
Hi @mfoglio, I've modified …
Thank you very much @rarzumanyan . What should I do if …
YUVJ420P basically means YUV420P with JPEG colour range (0–255); it's the same as YUV420P in terms of Nvdec settings and shall also correspond to nvc.PixelFormat.NV12. I shall add this to the sample.
Thanks. Last question about video parameters: from this example https://github.com/NVIDIA/VideoProcessingFramework/blob/master/SampleDecodeMultiThread.py it looked like color space and range are needed to determine the correct pipeline to obtain RGB frames.
I am not sure color range and color space can be accessed from pyav: PyAV-Org/PyAV#686
Hi @mfoglio
In VPF, there are 2 different color spaces supported (bt601, bt709) and 2 color ranges (mpeg, jpeg), which gives 4 possible ways of nv12 > rgb color conversion. If you provide the converter with wrong parameters, it will do the conversion anyway, but the colors will be slightly off. The pictures below illustrate this case (taken from #226); they were converted with different color spaces. I don't know from personal experience, but those VPF users who use NNs for live inference often say that color rendition accuracy is an important aspect of inference prediction accuracy. This is the sole reason behind the overcomplicated color conversion API (you can't "just" convert nv12 or yuv420 to rgb). I'll investigate color space and color range extraction with PyAV; as a plan B we always have ffprobe. Regarding the yuvj420p pixel format, it's a clear indicator of yuv420p + jpeg color range; this is what's said in the FFmpeg pixel format description.
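For illustration, a sketch of a conversion chain with the color space and range passed explicitly (following the pattern of the VPF samples; the bt601 / mpeg values and the input file are placeholders and should match the actual stream):

```python
import PyNvCodec as nvc

gpu_id = 0
dec = nvc.PyNvDecoder("input.mp4", gpu_id)
w, h = dec.Width(), dec.Height()

to_rgb = nvc.PySurfaceConverter(w, h, nvc.PixelFormat.NV12,
                                nvc.PixelFormat.RGB, gpu_id)

# Wrong space/range still "works" but shifts the colors slightly,
# so pick the values that actually describe the input stream.
cc_ctx = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601,
                                         nvc.ColorRange.MPEG)

surf_nv12 = dec.DecodeSingleSurface()
if not surf_nv12.Empty():
    surf_rgb = to_rgb.Execute(surf_nv12, cc_ctx)
```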
Hi @mfoglio
Now there are fewer dependencies and more useful information ) Please check out …
Using vpf to decode an rtmp url and spawning an FFmpeg sub-process to decode, …
Please elaborate on that: what is the "wrong result"?
Hello @rarzumanyan , after several attempts, I think I managed to fix my code and VPF seems to be stable right now. I will try to optimize my code to reduce VRAM usage and I will let you know if I find any issue. Thanks!
Thanks for the update, @mfoglio. Glad to hear you were able to fix it; please LMK if you need further assistance.
Thanks @rarzumanyan . I have a question regarding optimization. I have a process running my main application, and a few other processes running VPF to decode one video stream each. Now that with VPF we use processes instead of threads, is there any benefit in using CUDA streams with VPF? In other words, when we put VPF decoding of a video stream into a …
Hi @mfoglio. Honestly, I'm running out of depth here. To the best of my knowledge, the primary CUDA context is created per device per process, so I don't know the answer to this question right now regarding the thread vs. process aspect. I'll investigate it and will update you as soon as I find something. Meanwhile I can only recommend testing and observing the actual behavior.
Thanks @rarzumanyan . After decoding the packet, in SampleDecodeRTSP.py, …
I save to jpeg like this: …
Is it correct?
If it works, it's correct ;)
It works. I want to find (or write) more efficient code, but I can't achieve that right now :( Thank you!
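For readers following along, a sketch of one common way to implement the save-to-JPEG step discussed above: convert NV12 to RGB, download to host memory, then encode with OpenCV (the URL and the color space/range values are illustrative, not the poster's actual snippet):

```python
import cv2
import numpy as np
import PyNvCodec as nvc

gpu_id = 0
dec = nvc.PyNvDecoder("rtsp://camera/stream", gpu_id)
w, h = dec.Width(), dec.Height()

to_rgb = nvc.PySurfaceConverter(w, h, nvc.PixelFormat.NV12,
                                nvc.PixelFormat.RGB, gpu_id)
dl = nvc.PySurfaceDownloader(w, h, nvc.PixelFormat.RGB, gpu_id)
cc = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601,
                                     nvc.ColorRange.MPEG)

surf = dec.DecodeSingleSurface()
if not surf.Empty():
    rgb = to_rgb.Execute(surf, cc)
    frame = np.ndarray(shape=(h * w * 3,), dtype=np.uint8)
    if dl.DownloadSingleSurface(rgb, frame):
        img = frame.reshape(h, w, 3)
        # OpenCV writes BGR, so swap the channel order before encoding.
        cv2.imwrite("frame.jpg", cv2.cvtColor(img, cv2.COLOR_RGB2BGR))
```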
Hi @stu-github
Could you start a new issue on that topic? This one is getting chunky.
Hi @mfoglio, what's the current status of this issue? Do you see improvements / can we close it?
Hi @mfoglio, it would be very helpful if you could share your code, or some tips you've learned along the way. Also, I see that PytorchNvCodec.cpp was recently updated to support specifying a CUDA stream* and asynchronous copying when creating a PyTorch tensor. This is useful. * In the current implementation, the …
Hi @jeshels
Thanks for bringing this up. P.S. …
@rarzumanyan, sure thing 👍 The PyTorch API is different: instead of accepting …, it … Since I'm a novice in this subject myself, I'm not sure which way is more appropriate for this case. Note that if you'd like to go with a C++-based solution, the function …
@mfoglio , hello, I have the same question as you. I want to drop some frames, so that not every frame is decoded. How did you solve this problem? What should I do based on VPF? Thanks a lot.
@rarzumanyan can you explain how to do this?
I want to decode as many RTSP streams as possible on a single GPU. Since my application is incapable of processing 30 FPS per stream, it wouldn't be an issue if some of the frames were dropped; I probably won't need more than 5 FPS per stream. I am assuming there could be a way to reduce the workload by dropping data at some unknown-to-me step in the pipeline.

I would also need to process the streams in real time. When following the PyTorch tutorial from the wiki I found some kind of delay: if I stopped my application for a while (e.g. time.sleep(30)) and then resumed it, the pipeline was returning me frames from 30 seconds ago. I would like the pipeline to always return real-time frames. I believe this would also imply using less memory, since older data could be dropped. Memory is particularly important for me since I want to decode many streams.

I just know the high-level details of h264 video decoding. I know that P, B, and I frames mean that you cannot simply drop some data and then start decoding without possibly encountering corrupted frames. However, I have encountered similar issues before with gstreamer on CPU (high CPU usage, more frames decoded than needed, delays and high memory usage) and I came up with a pipeline that was able to reduce delays (therefore also saving memory) while always returning me real-time (present) frames.

How can I achieve my goal? Is there any argument I could pass to the PyNvDecoder? I see it can receive a dict as argument but I couldn't find more details. Here's the code that I am using so far; it is basically the PyTorch wiki tutorial:

Any hint on where to start would be really appreciated. This project is fantastic!
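For reference, a sketch of the kind of decode-to-tensor loop referred to above, patterned after the public VPF PyTorch sample (not the author's exact script; the URL and color parameters are placeholders):

```python
import torch  # imported before PytorchNvCodec, as in the VPF samples
import PyNvCodec as nvc
import PytorchNvCodec as pnvc

gpu_id = 0
dec = nvc.PyNvDecoder("rtsp://camera/stream", gpu_id)
w, h = dec.Width(), dec.Height()

to_rgb = nvc.PySurfaceConverter(w, h, nvc.PixelFormat.NV12,
                                nvc.PixelFormat.RGB, gpu_id)
to_pln = nvc.PySurfaceConverter(w, h, nvc.PixelFormat.RGB,
                                nvc.PixelFormat.RGB_PLANAR, gpu_id)
cc = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601,
                                     nvc.ColorRange.MPEG)

while True:
    surf = dec.DecodeSingleSurface()
    if surf.Empty():
        break
    surf = to_pln.Execute(to_rgb.Execute(surf, cc), cc)

    # Wrap the decoded plane in a torch tensor without a host round trip.
    plane = surf.PlanePtr()
    tensor = pnvc.makefromDevicePtrUint8(plane.GpuMem(), plane.Width(),
                                         plane.Height(), plane.Pitch(),
                                         plane.ElemSize())
    frame = tensor.view(3, h, w)  # planar RGB: one plane per channel
    # ... hand `frame` to the model here ...
```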