@lbhm lbhm commented Jan 18, 2021

Training ResNet50 on ImageNet sometimes crashes during the validation phase when I use the recommended V100 + AMP batch size of 256. The error message indicates a memory allocation problem, so I ran the DALI ImageDecoder in the ValPipe with memory_stats=True and added the reported numbers as device and host memory padding.

RuntimeError: Critical error in pipeline:
Error when executing Mixed operator ImageDecoder encountered:
Error in thread 1: [/opt/dali/dali/operators/decoder/nvjpeg/decoupled_api/nvjpeg_decoder_decoupled_api.h:809] NVJPEG error "5" : NVJPEG_STATUS_ALLOCATOR_FAILURE n02130308/ILSVRC2012_val_00033687.JPEG
Stacktrace (7 entries):
[frame 0]: /home/lbhm/venv/lib/python3.8/site-packages/nvidia/dali/libdali_operators.so(+0x401dee) [0x7f7594e77dee]
[frame 1]: /home/lbhm/venv/lib/python3.8/site-packages/nvidia/dali/libdali_operators.so(+0x7ad134) [0x7f7595223134]
[frame 2]: /home/lbhm/venv/lib/python3.8/site-packages/nvidia/dali/libdali_operators.so(+0x7adb04) [0x7f7595223b04]
[frame 3]: /home/lbhm/venv/lib/python3.8/site-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x217) [0x7f7593cca647]
[frame 4]: /home/lbhm/venv/lib/python3.8/site-packages/nvidia/dali/libdali.so(+0x8a6b5f) [0x7f7594414b5f]
[frame 5]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f77605696ba]
[frame 6]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f775f74c4dd]
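
As a rough illustration of the approach described above, the sketch below turns the peak allocation sizes reported by a `memory_stats=True` run into the `device_memory_padding` and `host_memory_padding` arguments of the decoder. The helper name `padding_kwargs`, the rounding to a 1 MiB boundary, and the byte counts in the example are assumptions made for illustration; they are not the values used in this pull request.

```python
def padding_kwargs(device_bytes, host_bytes, align=1 << 20):
    """Build padding arguments for the DALI image decoder.

    device_bytes / host_bytes: largest allocation sizes (in bytes) taken
    from the decoder's memory statistics output.
    align: rounding granularity. Rounding up is an assumption of this
    sketch, not something DALI requires.
    """
    if device_bytes < 0 or host_bytes < 0:
        raise ValueError("allocation sizes must be non-negative")
    if align <= 0:
        raise ValueError("align must be positive")

    def round_up(n):
        # Ceiling division, then scale back up to a multiple of `align`.
        return -(-n // align) * align

    return {
        "device_memory_padding": round_up(device_bytes),
        "host_memory_padding": round_up(host_bytes),
    }


if __name__ == "__main__":
    # Hypothetical placeholder numbers, not measurements from this run.
    kwargs = padding_kwargs(device_bytes=200_000_000, host_bytes=140_000_000)
    print(kwargs)
    # With DALI installed, the arguments could then be passed on, e.g.:
    #   ops.ImageDecoder(device="mixed", output_type=types.RGB, **kwargs)
```

Preallocating the buffers this way is meant to avoid reallocations inside nvJPEG while the validation images are decoded; whether it removes the allocator failure on a given setup still has to be checked by rerunning validation.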

@nvpstr nvpstr requested a review from asulecki February 16, 2021 12:58
@nvpstr nvpstr assigned asulecki and nvpstr and unassigned nvpstr Feb 16, 2021
@lbhm lbhm closed this Jul 8, 2025