This repository has been archived by the owner on Jun 10, 2024. It is now read-only.

Optimize for multiple streams (drop frames, reduce delays, reduce memory usage) #257

Open
mfoglio opened this issue Oct 19, 2021 · 71 comments

Comments

@mfoglio

mfoglio commented Oct 19, 2021

I want to decode as many RTSP streams as possible on a single GPU. Since my application is incapable of processing 30 FPS per stream, it wouldn't be an issue if some frames were dropped. I probably won't need more than 5 FPS per stream. I am assuming there could be a way to reduce the workload by dropping data at some step of the pipeline that is unknown to me.
I would also need to process the streams in real time. When following the PyTorch tutorial from the wiki I noticed some kind of delay: if I stopped my application for a while (e.g. time.sleep(30)) and then resumed it, the pipeline returned frames from 30 seconds ago. I would like the pipeline to always return real-time frames. I believe this would also imply using less memory, since older data could be dropped. Memory is particularly important for me since I want to decode many streams.
I only know the high-level details of H.264 video decoding. I know that P, B, and I frames mean that you cannot simply drop some data and then start decoding without possibly encountering corrupted frames. However, I have run into similar issues with gstreamer on CPU before (high CPU usage, more frames decoded than needed, delays and high memory usage) and I came up with a pipeline that was able to reduce delays (therefore also saving memory) while always returning real-time (present) frames.
How can I achieve my goal? Is there any argument I could pass to the PyNvDecoder? I see it can receive a dict as an argument but I couldn't find more details.
Here's the code that I am using so far. It is basically the PyTorch wiki tutorial:

import torch
import PyNvCodec as nvc
import PytorchNvCodec as pnvc

gpu_id = 0
input_file = "rtsp_stream_url"

nvDec = nvc.PyNvDecoder(input_file, gpu_id)
target_h, target_w = nvDec.Height(), nvDec.Width()

cspace, crange = nvDec.ColorSpace(), nvDec.ColorRange()
if nvc.ColorSpace.UNSPEC == cspace:
    cspace = nvc.ColorSpace.BT_601
if nvc.ColorRange.UDEF == crange:
    crange = nvc.ColorRange.MPEG
cc_ctx = nvc.ColorspaceConversionContext(cspace, crange)

to_rgb = nvc.PySurfaceConverter(nvDec.Width(), nvDec.Height(), nvc.PixelFormat.NV12, nvc.PixelFormat.RGB, gpu_id)
to_planar = nvc.PySurfaceConverter(nvDec.Width(), nvDec.Height(), nvc.PixelFormat.RGB, nvc.PixelFormat.RGB_PLANAR, gpu_id)

while True:
    # Obtain NV12 decoded surface from decoder;
    rawSurface = nvDec.DecodeSingleSurface()
    if (rawSurface.Empty()):
        break

    rgb_byte = to_rgb.Execute(rawSurface, cc_ctx)
    rgb_planar = to_planar.Execute(rgb_byte, cc_ctx)

    surfPlane = rgb_planar.PlanePtr()
    surface_tensor = pnvc.makefromDevicePtrUint8(
        surfPlane.GpuMem(), surfPlane.Width(), surfPlane.Height(), surfPlane.Pitch(), surfPlane.ElemSize()
    )
    surface_tensor = surface_tensor.reshape(3, target_h, target_w)  # TODO: check that we are not copying data
    surface_tensor = surface_tensor.permute((1, 2, 0))  # TODO: check that we are not copying data
    # DO SLOW STUFF

Any hint on where to start would be really appreciated. This project is fantastic!

@rarzumanyan
Contributor

rarzumanyan commented Oct 20, 2021

Hi @mfoglio
I assume this issue is a best practice request.

Since my application is incapable of processing 30 FPS per streams, it wouldn't be an issue if some of the frames would be dropped. I probably won't need more than 5 FPS per streams.

Any time you start optimizing SW which utilizes NVIDIA HW, a good place to start is the nvidia-smi dmon CLI utility. It will show you the Nvdec and Nvenc load levels, CUDA core load, clocks and much more useful information.

Below you can see a screenshot of an application which is clearly bottlenecked by CUDA core performance:
[screenshot: nvidia-smi dmon output]

As long as your application isn't limited by Nvdec performance, just decode all video frames one by one and discard frames you don't need.

Also, I don't recommend a single-threaded approach if you're aiming for top performance. Split it into 2 threads (a minimal sketch follows the list):

  • Producer thread decodes video frames, converts them to tensors and pushes them to a thread-safe queue.
  • Consumer thread takes tensors from the queue and does the processing.
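A minimal sketch of that split, assuming the built-in-demuxer PyNvDecoder from the first post. The bounded queue.Queue and the keep-every-Nth frame skip are plain Python illustrations, not VPF features, and in real code you would convert the surface to a tensor (as in the tutorial) or copy it before queueing, since the returned Surface may be backed by the decoder's internal pool:

import queue
import threading
import PyNvCodec as nvc

frame_queue = queue.Queue(maxsize=4)  # bounded, so a slow consumer can't pile up frames
KEEP_EVERY_NTH = 6                    # e.g. 30 FPS source -> roughly 5 FPS processed

def producer(rtsp_url, gpu_id=0):
    nv_dec = nvc.PyNvDecoder(rtsp_url, gpu_id)
    frame_no = 0
    while True:
        surf = nv_dec.DecodeSingleSurface()
        if surf.Empty():
            break
        frame_no += 1
        if frame_no % KEEP_EVERY_NTH:
            continue  # decode every frame, discard the ones you don't need
        try:
            frame_queue.put_nowait(surf)  # convert to a tensor / copy first in real code
        except queue.Full:
            pass  # consumer is behind: drop the frame instead of queueing it

def consumer():
    while True:
        surf = frame_queue.get()
        # DO SLOW STUFF here

threading.Thread(target=producer, args=("rtsp_stream_url",), daemon=True).start()
consumer()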

Memory is particularly important for me since I want to decode many streams

I can only help you to track the memory allocations happening in VPF, not in PyTorch.

@mfoglio
Author

mfoglio commented Oct 20, 2021

Thank you @rarzumanyan for the nvidia-smi dmon tip. I think my application is limited by Nvdec memory (GPU memory) usage rather than by the number of FPS processed by the decoder. I would like to decode about 30 video streams, but the consumer probably can't process more than 100-200 frames per second.
In my actual application I am adopting a thread-safe queue approach like the one you described. For this reason, I would like to know whether, while the decoder waits (after it has pushed a frame onto the queue), it can go into some kind of sleep, or at least drop unnecessary data to free up GPU memory.
But here's the most important point: suppose there is a Queue of size=1. How can I guarantee that the queue always contains a close-to-real-time frame (i.e. no delays)? The code above returns old frames (from seconds to minutes ago) if the consumer can't keep up with it. I know that one solution from a generic Python perspective would be to have some kind of buffer (replacing the queue) that always keeps the last frame in memory by discarding older ones. But this approach seems to waste lots of resources, and more importantly, I would like to know if there is any parameter that would guarantee that the frames returned by the decoder represent the present, not the past.
E.g. in gstreamer there are some parameters that pause the pipeline (and, I guess, discard data) if the application can't consume frames fast enough. This works better than the approach where gstreamer decodes frames as fast as possible and discards all of them but the last.
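For reference, here is a minimal sketch of the "buffer that only keeps the last frame" idea described above; it is plain Python (not a VPF feature), assumes a single producer and a single consumer, and the decoder still decodes every frame - it just never accumulates more than one:

import queue

class LatestFrame:
    """Single-slot buffer: put() replaces a stale frame, get() blocks for the newest one."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)

    def put(self, frame):
        try:
            self._slot.put_nowait(frame)
        except queue.Full:
            try:
                self._slot.get_nowait()  # discard the stale frame
            except queue.Empty:
                pass                     # consumer grabbed it first, which is fine
            self._slot.put_nowait(frame)

    def get(self):
        return self._slot.get()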

@rarzumanyan
Contributor

rarzumanyan commented Oct 20, 2021

Hi @mfoglio

I think my application is limited by Nvdec memory (GPU memory) usage rather than by the number of FPS processed by the decoder

There's no need to guess, there are lots of VPF profiling options:

  1. Use nvidia-smi dmon.
  2. Build VPF with the USE_NVTX option and launch it under Nsight Systems to collect an application timeline (example commands after this list).
  3. Analyze the extra data Nsight Systems may give you, such as CPU-side profiling.
  4. Use gprof and callgrind to inspect CPU-side performance.
  5. And so on.
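For item 2, a rough illustration of the commands involved (hedged: I'm assuming the CMake option is spelled USE_NVTX as referenced in this thread, that the Nsight Systems CLI nsys is installed, and my_decode_script.py stands in for your own script):

# rebuild VPF with NVTX ranges enabled (option name taken from this thread)
cmake .. -DUSE_NVTX=ON && make

# watch per-engine load (dec/enc), clocks and memory while the script runs
nvidia-smi dmon

# collect a CUDA + NVTX timeline for Nsight Systems
nsys profile --trace=cuda,nvtx -o vpf_timeline python3 my_decode_script.py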

How can I guarantee that the queue always contains a close-to-real-time frame

Each decoded frame has a PTS (presentation timestamp). It increases monotonically, and by its value you can estimate how "fresh" a decoded frame is. Take a look at #253 for more information on this topic.
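A minimal sketch of reading that timestamp, assuming your VPF build exposes PacketData with a pts field (the standalone-demuxer path later in this thread passes a PacketData into DecodeSurfaceFromPacket in the same way; whether DecodeSingleSurface has an overload that fills PacketData depends on your version):

import PyNvCodec as nvc

nv_dec = nvc.PyNvDecoder("rtsp_stream_url", 0)

pkt_data = nvc.PacketData()
surf = nv_dec.DecodeSingleSurface(pkt_data)  # assumed overload that also fills pkt_data
if not surf.Empty():
    # pts is in the stream's time base; compare it with the newest pts you have
    # seen to estimate how far behind (how "stale") this frame is.
    print("frame pts:", pkt_data.pts)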

This works better compared to the approach where gstreamer decodes frames as fast as possible and discard all of them but the last.

SW design is a topic of its own so I can't help you with anything more substantial than advice, but there are ways to mitigate this problem.

E.g. a signal / slot connection between your consumer and producer. Since PyNvDecoder usually starts RTSP stream decoding not from the beginning and doesn't go all the way to the end (the camera is just broadcasting data over the network), your consumer may tell the decoder when to start decoding the next frame (e.g. when the decoded-frames queue is close to depletion). It may cause some delays in decoding and / or data corruption, but that may happen any time you take data from the network.

@mfoglio
Author

mfoglio commented Oct 20, 2021

Thank you for your detailed response.
Without getting into much detail, I can see from high-level nvidia-smi output that I am using 3298 MiB to decode 16 1080p streams. Is there a way to reduce the memory used? I don't need an exact answer; I am just wondering what parameters I can start playing with to do that: the decoder? The demultiplexer?

@rarzumanyan
Contributor

Hi @mfoglio

Generally, the entry point to any investigation is the same - compile VPF with all possible diagnostic options and use the existing CUDA profiling tools.

E.g. the Nsight Systems profiler can track all CUDA API calls, and VPF uses those to allocate memory for video frames. Hence, by looking at the application timeline, you will see exactly what's happening and when.

Sometimes Nsight struggles to collect an application timeline for multithreaded Python scripts, so a simpler decoding script (such as one of the VPF samples) is probably a good place to start.

@mfoglio
Author

mfoglio commented Oct 23, 2021

Hello @rarzumanyan, could you provide more details about the difference between FlushSingleSurface and DecodeSingleSurface? Does the first one allow me to discard old video frames / data without decoding them? I am still trying to reduce the GPU memory used by the decoding pipeline.
Also, what should I do when I want to delete a decoder? For instance, in the code above, how would you proceed to clean/flush/release all the necessary resources when you don't need to decode the video anymore?

@rarzumanyan
Contributor

rarzumanyan commented Oct 23, 2021

Hi @mfoglio

Nvdec is async by its nature and there's a delay between encoded frame submission and the moment it's ready for display. This latency is hidden when the PyNvDecoder class is created in built-in mode (with a PyFFmpegDemuxer class within).

However, one can use an external demuxer like Gstreamer, PyAV or any other demuxer which produces an Annex.B elementary bitstream. In that case, PyNvDecoder acts asynchronously and after your input is over, there are still some frames left in the Nvdec queue.

FlushSingleSurface is used to flush one such frame from the queue. Take a closer look at SampleDemuxDecode.py for reference.

Regarding PyNvDecoder class deletion - it behaves just the same as any other Python class. When its lifetime is over, it cleans up its resources.
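A minimal sketch of the standalone demux / decode / flush flow described above, loosely following SampleDemuxDecode.py (treat method names not mentioned in this thread, such as DemuxSinglePacket, as assumptions about your VPF version):

import numpy as np
import PyNvCodec as nvc

gpu_id = 0
nv_dmx = nvc.PyFFmpegDemuxer("input.mp4")
nv_dec = nvc.PyNvDecoder(nv_dmx.Width(), nv_dmx.Height(), nv_dmx.Format(), nv_dmx.Codec(), gpu_id)

packet = np.ndarray(shape=(0,), dtype=np.uint8)

# Feed demuxed packets to the async decoder; it may not return a surface on every call.
while nv_dmx.DemuxSinglePacket(packet):
    surf = nv_dec.DecodeSurfaceFromPacket(packet)
    if not surf.Empty():
        pass  # process surf

# Input is over, but Nvdec still holds frames: flush them one by one.
while True:
    surf = nv_dec.FlushSingleSurface()
    if surf.Empty():
        break
    # process surf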

@mfoglio
Author

mfoglio commented Oct 24, 2021

Hello @rarzumanyan , this is really interesting! I really appreciate your help. I have a few things in mind to try, as well as a few other questions... Sorry! And thanks!

Is there any way to set a maximum size for the queue that you mentioned above? This way I could avoid "wasting" GPU memory by keeping frames in memory that I would still drop later on because of a slow consumer. Otherwise I guess I could use a standalone ffmpeg demuxer and drop packets until my consumer is ready; at that point I could resume decoding packets using nvDec.DecodeSurfaceFromPacket(packet) until a valid surface is returned. I am not sure if this would return corrupted frames or if it would just create a small delay (because the decoder would wait until it has a valid frame before returning a surface).

It seems that PyFFmpegDemuxer can receive a dictionary as its second parameter. I guess this can be used to forward arguments to ffmpeg. Is this correct? If yes, what are the parameters that can be used? I am not sure what ffmpeg "object" is actually used by PyFFmpegDemuxer, so I don't know where to look in the ffmpeg documentation.

Thank you, thank you, thank you!

@rarzumanyan
Contributor

Hi @mfoglio,

Is there any way to set a maximum size for the queue that you mentioned above?

There are 2 places where the memory for decoded surfaces is allocated.

First is the decoded surface pool size:

class PyNvDecoder {
  std::unique_ptr<DemuxFrame> upDemuxer;
  std::unique_ptr<NvdecDecodeFrame> upDecoder;
  std::unique_ptr<PySurfaceDownloader> upDownloader;
  uint32_t gpuID;
  static uint32_t const poolFrameSize = 4U;
  Pixel_Format format;

You can slightly reduce the memory consumption by changing the poolFrameSize variable value.

Second is the decoder initialization stage:

int NvDecoder::HandleVideoSequence(CUVIDEOFORMAT *pVideoFormat) noexcept {
  try {
    CudaCtxPush ctxPush(p_impl->m_cuContext);
    CudaStrSync strSync(p_impl->m_cuvidStream);
    int nDecodeSurface =
        GetNumDecodeSurfaces(pVideoFormat->codec, pVideoFormat->coded_width,
                             pVideoFormat->coded_height);

The GetNumDecodeSurfaces() function is used to determine how many surfaces Nvdec needs to ensure proper DPB operation. It allocates memory a bit generously in some cases but keeps the code simple.

You can get a better estimate of the required number of surfaces by going through the ff_nvdec_decode_init() function in the libavcodec/nvdec.c file, which is part of FFmpeg. It uses a more sophisticated approach to DPB size determination for various codecs. I'm not saying it's ideal, but it's publicly available and it shows reasonable decoding memory consumption.

It seems that PyFFmpegDemuxer can receive a dictionary as its second parameter. I guess this can be used to forward arguments to ffmpeg. Is this correct? If yes, what are the parameters that can be used? I am not sure what ffmpeg "object" is actually used by PyFFmpegDemuxer, so I don't know where to look in the ffmpeg documentation.

VPF accepts a dictionary that is converted to an AVDictionary structure and passed to the avformat_open_input() function, which initializes the AVFormatContext structure:

// Set up format context options;
AVDictionary *options = NULL;
for (auto &pair : ffmpeg_options) {
  auto err =
      av_dict_set(&options, pair.first.c_str(), pair.second.c_str(), 0);
  if (err < 0) {
    cerr << "Can't set up dictionary option: " << pair.first << " "
         << pair.second << ": " << AvErrorToString(err) << "\n";
    return nullptr;
  }
}
auto err = avformat_open_input(&ctx, nullptr, nullptr, &options);
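So the dictionary keys are standard FFmpeg / libavformat options for the input, the same names you would pass to the ffmpeg CLI for the given protocol or demuxer. As a sketch for an RTSP source (rtsp_transport and max_delay are standard FFmpeg options, not VPF-specific, the URL is a placeholder, and whether a particular option helps depends on your stream):

import PyNvCodec as nvc

ffmpeg_options = {
    "rtsp_transport": "tcp",  # force TCP instead of UDP for the RTSP session
    "max_delay": "500000",    # demuxer reordering delay, in microseconds
}
nv_dmx = nvc.PyFFmpegDemuxer("rtsp://camera/stream", ffmpeg_options)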

@mfoglio
Author

mfoglio commented Oct 25, 2021

Thanks @rarzumanyan .
I can only see the following compatible constructor arguments:

    1. PyNvCodec.PyNvDecoder(arg0: int, arg1: int, arg2: PyNvCodec.PixelFormat, arg3: PyNvCodec.CudaVideoCodec, arg4: int)
    2. PyNvCodec.PyNvDecoder(arg0: str, arg1: int, arg2: Dict[str, str])
    3. PyNvCodec.PyNvDecoder(arg0: str, arg1: int)
    4. PyNvCodec.PyNvDecoder(arg0: int, arg1: int, arg2: PyNvCodec.PixelFormat, arg3: PyNvCodec.CudaVideoCodec, arg4: int, arg5: int)
    5. PyNvCodec.PyNvDecoder(arg0: str, arg1: int, arg2: int, arg3: Dict[str, str])
    6. PyNvCodec.PyNvDecoder(arg0: str, arg1: int, arg2: int)

At the moment I am initializing the decoder with:

        # Initialize standalone demuxer.
        self.nvDmx = nvc.PyFFmpegDemuxer(encFile)  # {"latency": "0", "drop-on-latency": "true"}
        # Initialize decoder.
        self.nvDec = nvc.PyNvDecoder(
            self.nvDmx.Width(), self.nvDmx.Height(), self.nvDmx.Format(), self.nvDmx.Codec(), self.ctx.handle, self.str.handle
        )

How can I provide the parameter poolFrameSize?

Possible OT: it seems that the demuxer keeps disconnecting from the RTSP stream. The following is a log captured over about a minute:

i-01f5ae3961a12c713 Thread-5 2021-10-25 17:16:19,137 - __main__ - INFO - FPS 0.0
[hls,applehttp @ 0x6e1a6c0] Opening 'http://localhost:8081/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/00dd93c3-1236-4c63-8a5f-4c5b452430f5/chunks.m3u8?nimblesessionid=339' for reading
[hls,applehttp @ 0x6e1a6c0] Opening 'http://localhost:8081/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/00dd93c3-1236-4c63-8a5f-4c5b452430f5/l_4116_0_0.ts?nimblesessionid=339' for reading
[AVBSFContext @ 0x6e32e80] Invalid NAL unit 0, skipping.
# other output
[hls,applehttp @ 0x6e1a6c0] Opening 'http://localhost:8081/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/00dd93c3-1236-4c63-8a5f-4c5b452430f5/chunks.m3u8?nimblesessionid=339' for reading
[hls,applehttp @ 0x6e1a6c0] Opening 'http://localhost:8081/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/00dd93c3-1236-4c63-8a5f-4c5b452430f5/chunks.m3u8?nimblesessionid=339' for reading
[hls,applehttp @ 0x6e1a6c0] Opening 'http://localhost:8081/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/00dd93c3-1236-4c63-8a5f-4c5b452430f5/chunks.m3u8?nimblesessionid=339' for reading
[hls,applehttp @ 0x6e1a6c0] Opening 'http://localhost:8081/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/00dd93c3-1236-4c63-8a5f-4c5b452430f5/chunks.m3u8?nimblesessionid=339' for reading
[hls,applehttp @ 0x6e1a6c0] Opening 'http://localhost:8081/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/00dd93c3-1236-4c63-8a5f-4c5b452430f5/chunks.m3u8?nimblesessionid=339' for reading

It ended with:

[hls,applehttp @ 0x6e1a6c0] Opening 'http://localhost:8081/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/00dd93c3-1236-4c63-8a5f-4c5b452430f5/chunks.m3u8?nimblesessionid=339' for reading
[http @ 0x7ff285bdfde0] HTTP error 404 Not Found

However, ffprobe seems to find the stream:

ffprobe rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/00dd93c3-1236-4c63-8a5f-4c5b452430f5
ffprobe version N-104411-gcf0881bcfc Copyright (c) 2007-2021 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/home/ubuntu/pycharm/libs/FFmpeg/build_x64_release_shared --disable-static --disable-stripping --disable-doc --enable-shared --enable-openssl --enable-network --enable-protocol=tcp --enable-demuxer=rtsp --enable-decoder=h264
  libavutil      57.  7.100 / 57.  7.100
  libavcodec     59. 12.100 / 59. 12.100
  libavformat    59.  6.100 / 59.  6.100
  libavdevice    59.  0.101 / 59.  0.101
  libavfilter     8. 15.100 /  8. 15.100
  libswscale      6.  1.100 /  6.  1.100
  libswresample   4.  0.100 /  4.  0.100
[rtsp @ 0x556be1c4ecc0] method SETUP failed: 461 Unsupported transport
Input #0, rtsp, from 'rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/00dd93c3-1236-4c63-8a5f-4c5b452430f5':
  Duration: N/A, start: 0.016667, bitrate: N/A
  Stream #0:0: Video: h264 (High), yuvj420p(pc, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], 60 fps, 60 tbr, 90k tbn

Sometimes the demuxer self.nvDmx = nvc.PyFFmpegDemuxer(encFile) fails directly with:

ValueError: FFmpegDemuxer: no AVFormatContext provided.

If instead I try to initialize the demuxer with nvc.PyFFmpegDemuxer(encFile, {"rtsp_transport": "tcp"}), using an RTSP URL instead of the m3u8, I encounter a Failed to read frame: End of file error after a few seconds, and sometimes a ValueError: Unsupported FFmpeg pixel format upon initialization.

@rarzumanyan
Contributor

rarzumanyan commented Oct 25, 2021

Hi @mfoglio

I can only see the following compatible constructor arguments:
How can I provide the parameter poolFrameSize?

This happens because the pool size isn't exported to Python land; you have to change the hard-coded value in C++ and recompile VPF. Honestly, the queue size was never exported to Python simply because nobody has ever asked ))

However, ffprobe seems to find the stream:

Reading input from RTSP cameras is the single most painful thing to do.
I'd say 90% of user issues are about missing connections and such. There are multiple ways of mitigating this, including demuxing with an external demuxer (see the project's wiki) or PyAV. Unfortunately, the required PyAV functionality was never merged to the PyAV main branch, so this problem stays half-solved.

@mfoglio
Author

mfoglio commented Oct 25, 2021

EDIT: the gstreamer pipeline seems to work; it wasn't working because of a typo. The ffmpeg pipeline does not work.

Hello @rarzumanyan , and thanks again for following me through my journey.
I tried the example from the wiki without success. The Gstreamer option seems to be stuck without doing anything. FFmpeg does not return any frame.

Code to reproduce the issue:

import pycuda.driver as cuda
import ffmpeg
import subprocess
import numpy as np
import PyNvCodec as nvc


rtsp_url = "rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/0d943055-d4f2-49d2-a8fa-189176228ae1"

# Retain primary CUDA device context and create separate stream per thread.
cuda.init()
ctx = cuda.Device(0).retain_primary_context()
ctx.push()
cuda_str = cuda.Stream()  # renamed from str to avoid shadowing the built-in
ctx.pop()


# Option 1: Gstreamer with pipeline from wiki

pipeline = \
    f"rtspsrc location={rtsp_url} " +\
    "protocols=tcp ! " + \
    "queue ! " + \
    "'application/x-rtp,media=video' ! " + \
    "rtph264depay ! " + \
    "h264parse ! " + \
    "video/x-h264, stream-format='byte-stream' ! " + \
    "filesink location=/dev/stdout"

proc = subprocess.Popen(
    f"/opt/intel/openvino_2021.1.110/data_processing/gstreamer/bin/gst-launch-1.0 {pipeline}",
    shell=True,
    stdout=subprocess.PIPE
)

# Option 2: FFmpeg (from wiki, not sure if it applies to rtsp streams)
args = (ffmpeg
        .input(rtsp_url)
        .output('pipe:', vcodec='copy', **{'bsf:v': 'h264_mp4toannexb'}, format='h264')
        .compile())
proc = subprocess.Popen(args, stdout=subprocess.PIPE)


# Decoder parameters (found by trying to initialize the demuxer multiple times until initialization succeeded)
video_width = 1920
video_height = 1080
video_format = nvc.PixelFormat.NV12
video_codec = nvc.CudaVideoCodec.H264
video_color_space = nvc.ColorSpace.BT_709
video_color_range = nvc.ColorRange.JPEG

# Initialize decoder.
nvDec = nvc.PyNvDecoder(
    video_width, video_height, video_format, video_codec, ctx.handle, cuda_str.handle
)
print("nvDec")

while True:

    # Read 4Kb of data as this is most common mem page size
    bits = proc.stdout.read(4096)
    if not len(bits):
        print("Empty page")
        continue

    # Decode
    packet = np.frombuffer(buffer=bits, dtype=np.uint8)

    # Decoder is async by design.
    # As it consumes packets from demuxer one at a time it may not return
    # decoded surface every time the decoding function is called.
    rawSurface = nvDec.DecodeSurfaceFromPacket(packet)
    if (rawSurface.Empty()):
        print("No more video frames")
        continue

    print("Surface decoded")  # never printed

Output for ffmpeg:

ffmpeg version N-104411-gcf0881bcfc Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/home/ubuntu/pycharm/libs/FFmpeg/build_x64_release_shared --disable-static --disable-stripping --disable-doc --enable-shared --enable-openssl --enable-network --enable-protocol=tcp --enable-demuxer=rtsp --enable-decoder=h264
  libavutil      57.  7.100 / 57.  7.100
  libavcodec     59. 12.100 / 59. 12.100
  libavformat    59.  6.100 / 59.  6.100
  libavdevice    59.  0.101 / 59.  0.101
  libavfilter     8. 15.100 /  8. 15.100
  libswscale      6.  1.100 /  6.  1.100
  libswresample   4.  0.100 /  4.  0.100
[rtsp @ 0x562c6c4e6400] method SETUP failed: 461 Unsupported transport
Input #0, rtsp, from 'rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/0d943055-d4f2-49d2-a8fa-189176228ae1':
  Duration: N/A, start: 0.016656, bitrate: N/A
  Stream #0:0: Video: h264 (High), yuvj420p(pc, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], 60 fps, 60 tbr, 90k tbn
Output #0, h264, to 'pipe:':
  Metadata:
    encoder         : Lavf59.6.100
  Stream #0:0: Video: h264 (High), yuvj420p(pc, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], q=2-31, 60 fps, 60 tbr, 60 tbn
Stream mapping:
  Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
No more video frames
No more video frames
No more video frames
...
No more video frames

I am looking for a public RTSP stream that you can use to replicate this on your side (or that I can use to check that the code works with some streams on my side).

ffprobe output:

 ffprobe rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/0d943055-d4f2-49d2-a8fa-189176228ae1
ffprobe version N-104411-gcf0881bcfc Copyright (c) 2007-2021 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/home/ubuntu/pycharm/libs/FFmpeg/build_x64_release_shared --disable-static --disable-stripping --disable-doc --enable-shared --enable-openssl --enable-network --enable-protocol=tcp --enable-demuxer=rtsp --enable-decoder=h264
  libavutil      57.  7.100 / 57.  7.100
  libavcodec     59. 12.100 / 59. 12.100
  libavformat    59.  6.100 / 59.  6.100
  libavdevice    59.  0.101 / 59.  0.101
  libavfilter     8. 15.100 /  8. 15.100
  libswscale      6.  1.100 /  6.  1.100
  libswresample   4.  0.100 /  4.  0.100
[rtsp @ 0x564cb71e9cc0] method SETUP failed: 461 Unsupported transport
Input #0, rtsp, from 'rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/0d943055-d4f2-49d2-a8fa-189176228ae1':
  Duration: N/A, start: 0.016667, bitrate: N/A
  Stream #0:0: Video: h264 (High), yuvj420p(pc, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], 60 fps, 60 tbr, 90k tbn

I could verify that the stream works fine with:

import cv2
from matplotlib import pyplot as plt
cap = cv2.VideoCapture("rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/0d943055-d4f2-49d2-a8fa-189176228ae1")
for _ in range(10):
    status, frame = cap.read()
    plt.imshow(frame)
    plt.show()
    

Also, if I write the ffmpeg args to the console, the console starts printing binary data:

ffmpeg -i rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/0d943055-d4f2-49d2-a8fa-189176228ae1 -f h264 -bsf:v h264_mp4toannexb -vcodec copy pipe:

@rarzumanyan
Contributor

@mfoglio

Also, if I write the ffmpeg args to the console, the console start printing binary data

This is expected behavior.
If you take a closer look at command line:

ffmpeg -i rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/0d943055-d4f2-49d2-a8fa-189176228ae1 -f h264 -bsf:v h264_mp4toannexb -vcodec copy pipe:

It reads data from the RTSP source and applies the h264_mp4toannexb bitstream filter. This filter extracts an Annex.B elementary video stream from the incoming data and puts the output into the pipe, which is then fed to Nvdec by VPF. Basically this is what a demuxer does - it demultiplexes incoming data (video, audio, subtitle tracks, etc.) into separate data streams. A "pure" video stream is binary data formed in a special way; it's called an Annex.B elementary bitstream because it conforms to the binary syntax described in Annex B of the H.264 / H.265 video codec standards.

Nvdec HW can't work with video containers like AVI, MKV or any other. It expects an Annex.B elementary stream which consists of NAL units. This is binary input which only contains compressed video without any extra information (like audio tracks or subtitles), because the video codec standards only describe the video coding essentials and don't cover any video containers.

@mfoglio
Author

mfoglio commented Oct 25, 2021

@rarzumanyan yes, I added that comment to confirm that ffmpeg was actually working on my machine. So it seems that ffmpeg returns data but the decoder cannot parse any frame.
In fact, the packets are not empty (I tried printing them). However, rawSurface = nvDec.DecodeSurfaceFromPacket(packet) always returns an empty surface.

@rarzumanyan
Contributor

Let's start from something simpler.
Just save your ffmpeg output to a local file and decode it with SampleDecode.py. We need to make sure that the RTSP part is the culprit.

@mfoglio
Author

mfoglio commented Oct 25, 2021

I saved about a minute of video using ffmpeg -i rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/0d943055-d4f2-49d2-a8fa-189176228ae1 -acodec copy -vcodec copy rtsp_stream.mp4. I restarted the code after replacing the RTSP URL with the local file path. It decoded 1566 surfaces.
I used the ffmpeg demuxer provided in my example above (using a subprocess).

EDIT: not sure if it's useful, but here's the output of ffmpeg when saving the video (I stopped it with Ctrl + C):

ffmpeg -i rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/0d943055-d4f2-49d2-a8fa-189176228ae1 -acodec copy -vcodec copy rtsp_stream.mp4
ffmpeg version N-104411-gcf0881bcfc Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/home/ubuntu/pycharm/libs/FFmpeg/build_x64_release_shared --disable-static --disable-stripping --disable-doc --enable-shared --enable-openssl --enable-network --enable-protocol=tcp --enable-demuxer=rtsp --enable-decoder=h264
  libavutil      57.  7.100 / 57.  7.100
  libavcodec     59. 12.100 / 59. 12.100
  libavformat    59.  6.100 / 59.  6.100
  libavdevice    59.  0.101 / 59.  0.101
  libavfilter     8. 15.100 /  8. 15.100
  libswscale      6.  1.100 /  6.  1.100
  libswresample   4.  0.100 /  4.  0.100
[rtsp @ 0x5571f2668440] method SETUP failed: 461 Unsupported transport
Input #0, rtsp, from 'rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/0d943055-d4f2-49d2-a8fa-189176228ae1':
  Duration: N/A, start: 0.016667, bitrate: N/A
  Stream #0:0: Video: h264 (High), yuvj420p(pc, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], 60 fps, 60 tbr, 90k tbn
Output #0, mp4, to 'rtsp_stream.mp4':
  Metadata:
    encoder         : Lavf59.6.100
  Stream #0:0: Video: h264 (High) (avc1 / 0x31637661), yuvj420p(pc, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], q=2-31, 60 fps, 60 tbr, 90k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
[mp4 @ 0x5571f2673340] Timestamps are unset in a packet for stream 0. This is deprecated and will stop working in the future. Fix your code to set the timestamps properly
[mp4 @ 0x5571f2673340] Non-monotonous DTS in output stream 0:0; previous: 0, current: 0; changing to 1. This may result in incorrect timestamps in the output file.
frame= 1568 fps= 25 q=-1.0 Lsize=   69677kB time=00:01:01.38 bitrate=9299.0kbits/s speed=0.962x    
video:69668kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.013563%
Exiting normally, received signal 2.

@mfoglio
Author

mfoglio commented Oct 25, 2021

@rarzumanyan I can confirm that the gstreamer pipeline works. There was a typo (a missing space). The FFmpeg pipeline does not work, but that's not an issue as long as gstreamer works.

Going back to the memory optimization: can instances of PySurfaceConverter and PySurfaceResizer be shared among multiple video and CUDA streams safely? I guess I might have to fight a little bit against CUDA streams to be able to share data among them. But can it theoretically be done, or do the objects have internal attributes that would not allow that to be done safely?

@rarzumanyan
Contributor

rarzumanyan commented Oct 26, 2021

Hi @mfoglio

can instances of PySurfaceConverter and PySurfaceResizer be shared among multiple video and cuda streams safely?

There are 2 types of constructors for most VPF classes which use CUDA:

  1. Those which accept a GPU ID. In that case, CudaResMgr will provide the class constructor with a CUDA context and stream:

     PySurfaceConverter::PySurfaceConverter(uint32_t width, uint32_t height,
                                            Pixel_Format inFormat,
                                            Pixel_Format outFormat, uint32_t gpuID)
         : outputFormat(outFormat) {
       upConverter.reset(ConvertSurface::Make(
           width, height, inFormat, outFormat, CudaResMgr::Instance().GetCtx(gpuID),
           CudaResMgr::Instance().GetStream(gpuID)));
       upCtxBuffer.reset(Buffer::MakeOwnMem(sizeof(ColorspaceConversionContext)));
     }

CudaResMgr retains the primary CUDA context for any given device and creates a single CUDA stream for any given device. So all VPF classes which are instantiated with the same GPU ID will share the same context (the primary CUDA context for that GPU ID) and the same CUDA stream (not the default CUDA stream, but the one created by CudaResMgr).

  2. Those which take a given CUDA context and stream as constructor arguments:

     PySurfaceConverter::PySurfaceConverter(uint32_t width, uint32_t height,
                                            Pixel_Format inFormat,
                                            Pixel_Format outFormat, CUcontext ctx,
                                            CUstream str)
         : outputFormat(outFormat) {
       upConverter.reset(ConvertSurface::Make(
           width, height, inFormat, outFormat, ctx, str));
       upCtxBuffer.reset(Buffer::MakeOwnMem(sizeof(ColorspaceConversionContext)));
     }

In that case, the given CUDA context and stream references are saved within the class instance and used later on when doing CUDA work.

You can either rely on CudaResMgr and pass a GPU ID, or provide a context and stream explicitly for more flexibility using pycuda. Both options are illustrated in the samples:

def decode(gpuID, encFilePath, decFilePath):
    cuda.init()
    cuda_ctx = cuda.Device(gpuID).retain_primary_context()
    cuda_ctx.push()
    cuda_str = cuda.Stream()
    cuda_ctx.pop()
    decFile = open(decFilePath, "wb")
    nvDmx = nvc.PyFFmpegDemuxer(encFilePath)
    nvDec = nvc.PyNvDecoder(nvDmx.Width(), nvDmx.Height(), nvDmx.Format(), nvDmx.Codec(), cuda_ctx.handle, cuda_str.handle)
    nvCvt = nvc.PySurfaceConverter(nvDmx.Width(), nvDmx.Height(), nvDmx.Format(), nvc.PixelFormat.YUV420, cuda_ctx.handle, cuda_str.handle)
    nvDwn = nvc.PySurfaceDownloader(nvDmx.Width(), nvDmx.Height(), nvCvt.Format(), cuda_ctx.handle, cuda_str.handle)

def encode(gpuID, decFilePath, encFilePath, width, height):
    decFile = open(decFilePath, "rb")
    encFile = open(encFilePath, "wb")
    res = str(width) + 'x' + str(height)
    nvEnc = nvc.PyNvEncoder({'preset': 'P5', 'tuning_info': 'high_quality', 'codec': 'h264',
                             'profile': 'high', 's': res, 'bitrate': '10M'}, gpuID)

Choose the option you find most suitable. One option is not better or worse than the other; they are just different.

To the best of my knowledge, you shall have no issues using CUDA memory objects created in a single context in operations submitted to different streams. Speaking in VPF terms: you can pass a Surface and CudaBuffer to VPF classes which use different streams, as long as those Surface and CudaBuffer objects were created in the same CUDA context.

As a rule of thumb, I recommend using a single CUDA context per GPU and retaining the primary CUDA context instead of creating your own. This is illustrated in SampleDemuxDecode.py. If you're aiming at minimizing memory consumption, I also don't recommend creating any additional CUDA contexts, as there are driver-internal objects stored in vRAM associated with each active context.

@mfoglio
Author

mfoglio commented Oct 26, 2021

But what do you think, besides the CUDA streams? For instance, if they run asynchronously I'll probably need to put a thread lock around them to prevent surfaces from getting switched across consumers. Also, I am not sure whether a call to an instance of PySurfaceConverter or PySurfaceResizer is affected by a previous call to the same object: for instance, you wouldn't be able to feed multiple videos to the same decoder because it wouldn't be able to decode the frames.

However, taking a step back, I am facing a bigger issue.
This is the code that I have so far:

import subprocess
import numpy as np
import PyNvCodec as nvc


rtsp_url = "rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/57a77f83-6fed-4316-a2c9-2c8813a49fe1"

pipeline = \
    f"rtspsrc location={rtsp_url} " +\
    "protocols=tcp ! " + \
    "queue ! " + \
    "'application/x-rtp,media=video' ! " + \
    "rtph264depay ! " + \
    "h264parse ! " + \
    "video/x-h264, stream-format='byte-stream' ! " + \
    "filesink location=/dev/stdout"

proc = subprocess.Popen(
    f"/opt/intel/openvino_2021.1.110/data_processing/gstreamer/bin/gst-launch-1.0 {pipeline}",
    shell=True,
    stdout=subprocess.PIPE
)

# Decoder parameters (found by trying to initialize the demuxer multiple times until initialization succeeded)
video_width = int(1920)
video_height = int(1080)
video_format = nvc.PixelFormat.NV12
video_codec = nvc.CudaVideoCodec.H264

# Initialize decoder.
nvDec = nvc.PyNvDecoder(
    video_width, video_height, video_format, video_codec, 0
)

c = 0
while True:

    bits = proc.stdout.read(4090)
    if not len(bits):
        continue

    packet = np.frombuffer(buffer=bits, dtype=np.uint8)
    rawSurface = nvDec.DecodeSurfaceFromPacket(packet)
    if rawSurface.Empty():
        continue

    print(f"Surface decoded {c}")
    c = c + 1

It works but it seems to be affected by a memory leak. I am running the code on a video stream and I have already reached 8 GB of GPU memory used, and it keeps increasing. As far as I understand, there's no way to fix this using plain Python VPF. Do you see any possible solution? You mentioned this #257 (comment) but those possible solutions would require editing the C++ code that performs video decoding, which is unfortunately quite far from my domain knowledge. I am happy to try to dive into it, but I would like to know if you see any easier solution.

@mfoglio
Author

mfoglio commented Oct 26, 2021

Update: the code above reached an out-of-memory error on a T4.

Surface decoded 30061
/home/ubuntu/pycharm/libs/VideoProcessingFramework/PyNvCodec/TC/src/NvDecoder.cpp:526
CUDA error: CUDA_ERROR_OUT_OF_MEMORY
out of memory
/home/ubuntu/pycharm/libs/VideoProcessingFramework/PyNvCodec/TC/src/NvDecoder.cpp:526
CUDA error: CUDA_ERROR_OUT_OF_MEMORY
out of memory
/home/ubuntu/pycharm/libs/VideoProcessingFramework/PyNvCodec/TC/src/NvDecoder.cpp:488
CUDA error: CUDA_ERROR_MAP_FAILED
mapping of buffer object failed
Surface decoded 30062
Cuvid parser faced error.

Not sure if it matters, but VPF was compiled with USE_NVTX.

@rarzumanyan
Contributor

@mfoglio

It works but it seems to be affected by a memory leak.

I can't reproduce this on my machine with SampleDecode.py or SampleDemuxDecode.py.
I run a command like this:

python3 ./SampleDecode.py 0 ~/Videos/bbb_sunflower_1080p_30fps_normal.mp4 ./tmp.nv12

Memory consumption during decoding:

python3 uses 195 MB constantly.

Memory consumption when decoding is over:

Same thing with SampleDemuxDecode.py

Why not do an incremental analysis? If SampleDemuxDecode.py doesn't show memory leaks, add another layer on top of that and take input from the pipeline.

P. S.
From what I see in the HW description (Tesla T4), it looks like you're using VPF in production.
Let's establish contact via email and discuss what could be done. My work email is in my profile info.
Honestly, to me this thread doesn't look like a VPF issue but rather like SW development consulting, so let's bring it to a new level ))

@mfoglio
Author

mfoglio commented Oct 26, 2021

I think you should be able to reproduce the memory leak using this public rtsp stream: rtsp://wowzaec2demo.streamlock.net/vod/mp4:BigBuckBunny_115k.mov and the following code:

import subprocess
import numpy as np
import PyNvCodec as nvc


# rtsp_url = "rtsp://localhost/cbb48b92-f74d-4ad5-b8a9-3affbefcc17e_default/57a77f83-6fed-4316-a2c9-2c8813a49fe1"
rtsp_url = "rtsp://wowzaec2demo.streamlock.net/vod/mp4:BigBuckBunny_115k.mov"

pipeline = \
    f"rtspsrc location={rtsp_url} " +\
    "protocols=tcp ! " + \
    "queue ! " + \
    "'application/x-rtp,media=video' ! " + \
    "rtph264depay ! " + \
    "h264parse ! " + \
    "video/x-h264, stream-format='byte-stream' ! " + \
    "filesink location=/dev/stdout"

proc = subprocess.Popen(
    f"/opt/intel/openvino_2021.1.110/data_processing/gstreamer/bin/gst-launch-1.0 {pipeline}",
    shell=True,
    stdout=subprocess.PIPE
)

# Decoder parameters (found by trying to initialize the demuxer multiple times until initialization succeeded)
video_width = int(1920)
video_height = int(1080)
video_format = nvc.PixelFormat.NV12
video_codec = nvc.CudaVideoCodec.H264

# Initialize decoder.
nvDec = nvc.PyNvDecoder(
    video_width, video_height, video_format, video_codec, 0
)

c = 0
while True:

    bits = proc.stdout.read(4090)
    if not len(bits):
        continue

    packet = np.frombuffer(buffer=bits, dtype=np.uint8)
    rawSurface = nvDec.DecodeSurfaceFromPacket(packet)
    if rawSurface.Empty():
        continue

    print(f"Surface decoded {c}")
    c = c + 1

Let me know if you can reproduce it.
I launched the gstreamer pipeline in a console and, as expected, it does not use any GPU memory. So I would assume the memory leak is caused by VPF.
I'll contact you ;)

@rarzumanyan
Contributor

@mfoglio

I don't have gstreamer installed on my machine, and even if I install it with a package manager, it's not going to be the same as yours: /opt/intel/openvino_2021.1.110/data_processing/gstreamer/bin/gst-launch-1.0

I can decode the mentioned RTSP stream using SampleDemuxDecode.py with constant GPU memory consumption:

python3 ./SampleDemuxDecode.py 0 rtsp://wowzaec2demo.streamlock.net/vod/mp4:BigBuckBunny_115k.mov ./tmp.yuv


Don't get me wrong with this, but I will not do the job for you. Please isolate the issue and make sure it lies inside VPF.

@mfoglio
Author

mfoglio commented Oct 26, 2021

Thank you for your help!
In order to verify whether it was a VPF issue, I decided to start from a fresh, clean Ubuntu installation.
The Gstreamer pipeline now works without any memory leak.
As for ffmpeg, it does not work with every RTSP stream (it gets stuck for some), but I found a possible fix. The example in the wiki (https://github.com/NVIDIA/VideoProcessingFramework/wiki/Decoding-video-from-RTSP-camera) with ffmpeg works with problematic streams if we replace {'bsf:v': 'h264_mp4toannexb'} with {'bsf:v': 'h264_mp4toannexb,dump_extra'}.
Happy to make a PR for this extremely small but hopefully useful fix.
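For clarity, applied to the ffmpeg-python snippet from the wiki quoted earlier in this thread, the change is just the extra dump_extra bitstream filter (which re-inserts the SPS/PPS extradata into the stream); a sketch, reusing the rtsp_url variable from that snippet:

import ffmpeg

args = (ffmpeg
        .input(rtsp_url)
        .output('pipe:', vcodec='copy', **{'bsf:v': 'h264_mp4toannexb,dump_extra'}, format='h264')
        .compile())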

@stu-github

@mfoglio

Thank you for your work! It's very significant!

Could you share your final code which decodes as many RTSP streams as possible on a single GPU?

Originally posted by @mfoglio in #257 (comment)

@mfoglio
Author

mfoglio commented Nov 19, 2021

Hi @stu-github , I am waiting for @rarzumanyan to finish up some fixes. I will share the code as soon as we have something more stable working

@rarzumanyan
Contributor

Hi @stu-github and @mfoglio

Just to notice: I’m in process of development of feature that shall allow to pass open file handles to demuxer which shall make RTSP cameras access easier, it’s not a fix. This is taking longer then expected.

Meanwhile you can use the code sample from project wiki which illustrates how to read frames from RTSP camera with ffmpeg process.

There’s one caviar which I’d like to address with mentioned new feature: right now one can only read from ffmpeg output in fixed size chunks. But actual compressed frames may be of different size which requires you to fine tune the speed VPF reads from ffmpeg pipe.

@rarzumanyan
Contributor

rarzumanyan commented Nov 23, 2021

@mfoglio

I noticed you both have a branch and a tag called v1.1.1

Thanks for bringing this up, I think this is the reason indeed.
I've got to learn a thing or two about scheduling releases in GitHub!

Anyway, I was planning to merge to master before the v1.1.1 branch diverges too far.
So please find the latest changes in master ToT.

@mfoglio
Author

mfoglio commented Nov 23, 2021

@rarzumanyan, I am still testing your code. Meanwhile I have a question: I think that before, I could start a demuxer using self.nvDmx = nvc.PyFFmpegDemuxer(self.proc.stdout). Would there be a way to make this available again? I am not sure it would work, but maybe it would be possible to read the height, width, format and codec from this demuxer. Again, not sure if that would make this information-retrieval part more stable, but maybe it's worth an attempt.

@rarzumanyan
Contributor

Hi @mfoglio

Yes, I've removed that functionality; stream support between C++ and Python turned out to be a huge pain in the back.

I'd rather parse ffprobe output for that purpose than support streams (users will inevitably try to use streams for actual demuxing).

One possible way to work around this and get stream properties would be to use the PyAV project, because parsing ffprobe output is, in my opinion, not a great idea.

@rarzumanyan
Contributor

@mfoglio

Take a look at this sample: https://github.com/NVIDIA/VideoProcessingFramework/blob/pyav_support/SamplePyav.py
It was developed when I was experimenting with PyAV bitstream filters and shows how to get stream properties with PyAV.

@rarzumanyan
Contributor

hi @mfoglio

I've modified SampleDecodeRTSP.py in the master branch; it now uses PyAV to get video stream properties, please take a look.

@mfoglio
Author

mfoglio commented Nov 23, 2021

Thank you very much @rarzumanyan . What should I do if in_stream.codec_context.pix_fmt is equal to yuvj420p?

@rarzumanyan
Contributor

rarzumanyan commented Nov 23, 2021

@mfoglio

YUVJ420P basically means YUV420P with JPEG colour range (0;255). It's the same as YUV420P in terms of Nvdec settings and shall also correspond to nvc.PixelFormat.NV12.

I shall add this to the sample.

@mfoglio
Author

mfoglio commented Nov 23, 2021

Thanks. Last question about video parameters: from this example https://github.com/NVIDIA/VideoProcessingFramework/blob/master/SampleDecodeMultiThread.py it looked like color space and range are needed to determine the correct pipeline to obtain RGB frames.
Is this true, or is it sufficient to know the format (YUVJ420P, YUV420, YUV444, etc.)?

@mfoglio
Author

mfoglio commented Nov 23, 2021

I am not sure color range and color space can be accessed from pyav: PyAV-Org/PyAV#686

@rarzumanyan
Contributor

rarzumanyan commented Nov 24, 2021

Hi @mfoglio

color space and range are needed to determine the correct pipeline to obtain RGB frames

In VPF there are 2 different color spaces supported (bt601, bt709) and 2 color ranges (mpeg, jpeg), which gives 4 possible nv12 > rgb color conversions.

If you provide the converter with the wrong parameters, it will do the conversion anyway, but the colors will be slightly off. The pictures below illustrate this case (taken from #226); they were converted with different color spaces:
[two screenshots of the same frame converted with different color space settings]

I don't know from personal experience, but those VPF users who use NNs for live inference often say that color rendition accuracy is an important aspect of inference prediction accuracy. This is the sole reason behind the over-complicated color conversion API (you can't "just" convert nv12 or yuv420 to rgb).

I'll investigate color space and color range extraction with PyAV. As a plan B we always have ffprobe. Regarding the yuvj420p pixel format, it's a clear indicator of yuv420p + JPEG color range; this is what the FFmpeg pixel format description says.

@rarzumanyan
Copy link
Contributor

Hi @mfoglio

ffprobe can actually produce JSON output, so parsing is easy and clean.
I've replaced PyAV with ffprobe; all necessary stream parameters are now extracted, including color space and color range. These are optional - not all streams have them - so I've used BT601 and MPEG as default values.

Now there are fewer dependencies and more useful information )

Please check out SampleDecodeRTSP.py in master ToT.
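A minimal sketch of that approach, assuming ffprobe is on PATH and using its standard -print_format json / -show_streams output; the mapping dictionaries and the BT601/MPEG fallbacks mirror what is described above, but the exact field values a given camera reports can vary:

import json
import subprocess
import PyNvCodec as nvc

def probe_color_params(url):
    out = subprocess.check_output([
        "ffprobe", "-v", "quiet", "-print_format", "json",
        "-show_streams", "-select_streams", "v:0", url])
    stream = json.loads(out)["streams"][0]

    space_map = {"bt709": nvc.ColorSpace.BT_709, "bt470bg": nvc.ColorSpace.BT_601,
                 "smpte170m": nvc.ColorSpace.BT_601}
    range_map = {"tv": nvc.ColorRange.MPEG, "mpeg": nvc.ColorRange.MPEG,
                 "pc": nvc.ColorRange.JPEG, "jpeg": nvc.ColorRange.JPEG}

    # Fall back to BT601 / MPEG when the stream doesn't report these fields.
    cspace = space_map.get(stream.get("color_space"), nvc.ColorSpace.BT_601)
    crange = range_map.get(stream.get("color_range"), nvc.ColorRange.MPEG)
    return nvc.ColorspaceConversionContext(cspace, crange)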

@stepstep123

yuvj420p
@mfoglio @rarzumanyan
We got the same problem.

We use VPF to decode an RTMP URL and spawn an FFmpeg sub-process to decode, but VPF gets a wrong result and FFmpeg gets no result.

@rarzumanyan
Contributor

@stepstep123

Please elaborate on that, what is "wrong result"?

@mfoglio
Author

mfoglio commented Nov 30, 2021

Hello @rarzumanyan , after several attempts, I think I managed to fix my code and VPF seems to be stable right now. I will try to optimize my code to reduce VRAM usage and I will let you know if I find any issue. Thanks!

@rarzumanyan
Contributor

Thanks for the update, @mfoglio

Glad to hear you were able to fix it; please LMK if you need further assistance.
After we resolve this issue, please consider sharing your findings regarding massive RTSP processing; I'm sure it will be extremely helpful to other VPF users.

@mfoglio
Author

mfoglio commented Nov 30, 2021

Thanks @rarzumanyan. I have a question regarding optimization. I have a process running my main application, and a few other processes running VPF to decode one video stream each. Now that with VPF we use processes instead of threads, is there any benefit in using CUDA streams with VPF? In other words, when we put VPF decoding of a video stream into a Process instead of a Thread, do we still need to use streams to avoid conflicts across different processes?

@rarzumanyan
Contributor

rarzumanyan commented Nov 30, 2021

Hi @mfoglio

Honestly, I'm running out of depth here.

To the best of my knowledge, a primary CUDA context is created per device per process, so I don't know the answer to this question right now regarding the thread vs. process aspect.

I'll investigate it and update you when I find something. Meanwhile I can only recommend testing and observing the actual behavior.

@stu-github

stu-github commented Dec 6, 2021

Hi @mfoglio

Honestly I'm running out of depth here.

To my best knowledge, primary CUDA context is created per device per process, so I don't know the answer to this question right now regarding the thread vs. process aspect.

I'll investigate on it and will update you as I find something. Meanwhile I can only recommend to test and observe actual behavior.

Thanks @rarzumanyan .

After decoding the packet in SampleDecodeRTSP.py:

    # Decode
    enc_packet = np.frombuffer(buffer=bits, dtype=np.uint8)
    pkt_data = nvc.PacketData()
    try:
        surf = nvdec.DecodeSurfaceFromPacket(enc_packet, pkt_data)
        if not surf.Empty():
            fd += 1
            # Shifts towards underflow to avoid increasing vRAM consumption.
            if pkt_data.bsl < read_size:
                read_size = pkt_data.bsl
            # Print process ID every second or so.
            fps = int(params['framerate'])
            if not fd % fps:
                print(name)

I save to JPEG like this:

            yuv = to_yuv.Execute(surf, cc2)
            rgb24 = to_rgb.Execute(yuv, cc2)
            rgb24.PlanePtr().Export(surface_tensor.data_ptr(), w * 3, gpu_id)

            # PROCESS YOUR TENSOR HERE.
            # THIS DUMMY PROCESSING WILL JUST MAKE VIDEO FRAMES DARKER.
            dark_frame = torch.floor_divide(surface_tensor, 2)

            pil = Image.fromarray(surface_tensor.cpu().numpy())
            pil.save('output/%d.jpg' % index)

Is it correct?

@rarzumanyan
Contributor

I save to jpeg like this
Is it correct?

If it works, it's correct ;)
If not - inspect raw RGB frame with OpenCV to see what's happening.

@stu-github

I save to jpeg like this
Is it correct?

If it works, it's correct ;) If not - inspect raw RGB frame with OpenCV to see what's happening.

It works.

I want to find (or write) more efficient code, but I can't achieve that right now :(

Thank you!

@rarzumanyan
Contributor

Hi @stu-github

I want to find(or write) the more efficient code, but I can't achieve now :(

Could you start a new issue on that topic? This one is getting chunky.

@rarzumanyan
Contributor

Hi @mfoglio

What's the current status of this issue? Do you see improvements / can we close it?

@jeshels

jeshels commented Apr 14, 2022

Hi @stu-github , I am waiting for @rarzumanyan to finish up some fixes. I will share the code as soon as we have something more stable working

Hi @mfoglio, it would be very helpful if you can share your code, or some tips you've learned along the way.

Also, I see that PytorchNvCodec.cpp was recently updated to support specifying a CUDA stream* and asynchronous copying when creating a PyTorch tensor. This is useful.

* In the current implementation, torch::full ignores the user-provided CUDA stream and operates on PyTorch's globally set CUDA stream, but this can be fixed.

@rarzumanyan
Contributor

rarzumanyan commented Apr 14, 2022

Hi @jeshels

In current implementation, the torch::full ignores the user provided CUDA stream

Thanks for bringing this up.
I'm not an expert in torch, so if you find things like that please feel free to submit an issue.

P. S.
As far as I understand the torch C++ tensor creation API, torch::full doesn't accept a CUDA stream as an argument. Am I missing something?

@jeshels

jeshels commented Apr 17, 2022

@rarzumanyan, sure thing 👍

The PyTorch API is different. Instead of accepting a stream as a parameter for every function, one needs to set the CUDA stream separately. Then everything which executes afterwards runs in the context of that CUDA stream until a new CUDA stream is set. This can be controlled from both Python and C++. As far as I understand, setting a CUDA stream in one thread doesn't affect other threads.
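On the Python side this is the stream context manager; a minimal sketch using the standard PyTorch API (nothing VPF-specific):

import torch

s = torch.cuda.Stream()        # a non-default CUDA stream
with torch.cuda.stream(s):     # ops issued inside this block run on stream s
    t = torch.full((3, 1080, 1920), 0, dtype=torch.uint8, device="cuda")
# after the block, work goes back to the previously current stream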

Since I'm a novice in this subject myself, I'm not sure which way is more appropriate for this case.

Note that if you'd like to go with a C++-based solution, the function getStreamFromExternal() may be useful if you're having trouble passing a PyTorch CUDA stream object from Python to C++. However, note that due to this issue, this solution would require PyTorch version >= 1.11.0.

@pzyang613

pzyang613 commented Apr 27, 2022

Hello @mfoglio, I have the same question as you. I want to drop some frames so that not every frame is decoded. How did you solve this problem? What should I do based on VPF? Thanks a lot.

@timsainb

  • Build VPF with USE_NVTX option and launch it under Nsight Systems to collect application timeline.

@rarzumanyan can you explain how to do this?
