[BraTS 2021/PyTorch] Model not properly training #1304

Open
DanielNajarian opened this issue Jun 12, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@DanielNajarian

When running the training section of the BraTS 2021 notebook (located at PyTorch/Segmentation/nnUNet/notebooks/BraTS21.ipynb), the model does not train properly even though it steps through the epochs, as seen in the image below. The Dice score is stuck at an extremely low value, and neither it nor the loss changes at all across epochs. The warning "DALI iterator does not support resetting while epoch is not finished" appears on every epoch, although I have not modified anything related to it.

image
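
For readers without access to the screenshot, the symptom described above (loss and Dice that never change between epochs) can be checked from logged values. The following is only a minimal sketch: the function name `is_flat`, the tolerance, and the sample numbers are hypothetical and are not taken from the notebook or its logs.

```python
def is_flat(values, tol=1e-6):
    """Return True if the spread (max - min) of the values is at most `tol`."""
    if len(values) < 2:
        return False  # a single epoch is not enough to judge
    return max(values) - min(values) <= tol


# Hypothetical per-epoch losses, for illustration only.
stuck_losses = [0.93, 0.93, 0.93, 0.93, 0.93]
improving_losses = [0.93, 0.71, 0.55, 0.48, 0.44]

print("stuck run flat?", is_flat(stuck_losses))
print("improving run flat?", is_flat(improving_losses))
```

A completely flat curve usually suggests that the weights are not being updated or that the inputs are degenerate, rather than that the model is merely learning slowly.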

To Reproduce
Steps to reproduce the behavior:

  1. Clone the DeepLearningExamples repo and install the dependencies
  2. Download the BraTS 2021 dataset
  3. Change paths in the BraTS 2021 notebook to point to file locations
  4. Run all of the steps up to and including the training stage

Expected behavior
I expected the model to train and reach a Dice of at least 70 after 5 epochs.

Environment
Please provide at least:

  • PyTorch version: 1.13.1+cu116
  • GPUs in the system: 2x Tesla V100-SXM2-16GB
  • CUDA driver version: 515.86.01
DanielNajarian added the bug label Jun 12, 2023
@michal2409
Contributor

Are you running the notebook inside a Docker container? This looks like a dependency issue (running the notebook with different versions of its dependencies). Please see https://ploomber.io/blog/notebook-to-docker/ for how to run a Jupyter Notebook inside a container.

@DanielNajarian
Author

I'm running it from the command line, and I built the environment from their requirements files.

@michal2409
Contributor

What versions of PyTorch and NVIDIA DALI are you using?

@DanielNajarian
Author

DanielNajarian commented Jun 20, 2023

I am using torch 1.13.1+cu116 and nvidia-dali-cuda110 1.26.0. Looking at it now, DALI should be cuda116, correct? But there doesn't seem to be a cuda116 version of it.
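
A quick way to collect the versions asked about here is to query the installed package metadata. This is a sketch using only the Python standard library; the helper name `installed_versions` is hypothetical, and the package names are the ones mentioned in this thread (the DALI wheel name may differ in other environments).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names taken from this thread; adjust them to your environment.
    names = ["torch", "torchvision", "nvidia-dali-cuda110"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this in the same environment that launches the notebook matters, since a Jupyter kernel can point at a different interpreter than the shell.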

@michal2409
Contributor

michal2409 commented Jun 20, 2023

The 22.11 container ships DALI version 1.18.0 (see here). Did you reinstall it manually?

@DanielNajarian
Author

I had to manually reinstall a few packages because the torch and torchvision CUDA versions didn't match, and I had trouble getting 117 to work for both, so I went down to 116 and changed a few other things as a result.

Should I be focusing on the 22.02 container, since it lines up with CUDA 11.6, which matches my torch version? That would be DALI 1.10.

@michal2409
Contributor

You can experiment with different versions. I would start with DALI 1.18.0 (or not reinstall it inside the container at all).

What error log did you get when running the container without any modifications?
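
To see at a glance how far an environment has drifted from the container, the installed versions can be compared against the expected ones. The sketch below assumes the only known reference point is the one stated in this thread (DALI 1.18.0 in the 22.11 container); the helper name `find_mismatches` and the assumption that the wheel is named nvidia-dali-cuda110 in the container are illustrative.

```python
def find_mismatches(installed, expected):
    """Return {package: (installed_version, expected_version)} for every
    package whose installed version differs from the expected one."""
    mismatches = {}
    for name, want in expected.items():
        have = installed.get(name)  # None if the package is missing
        if have != want:
            mismatches[name] = (have, want)
    return mismatches


# Versions reported in this thread.
installed = {"torch": "1.13.1+cu116", "nvidia-dali-cuda110": "1.26.0"}
# Expected DALI version for the 22.11 container, as stated above.
expected = {"nvidia-dali-cuda110": "1.18.0"}

for name, (have, want) in find_mismatches(installed, expected).items():
    print(f"{name}: installed {have}, container has {want}")
```

The comparison is an exact string match on purpose: for reproducing a container's behavior, even a minor-version difference is worth flagging.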

@Luffy03

Luffy03 commented Oct 17, 2023

Hi, have you figured it out?


3 participants