vgl 3.0.90-20221122 crashing torch on ubuntu 22.04 #227

muratmaga opened this issue Feb 23, 2023 · 13 comments

@muratmaga

We are using VGL on a cloud server with an A100 GPU (GRID driver 525.60.13), running 3D Slicer with the EGL back end.

If we invoke Slicer without vglrun, our torch-based application works fine. If we start Slicer via
vglrun -d egl ./Slicer

the application crashes a little after torch starts loading the trained model. There is no error message (neither in Slicer nor in vglrun) that I can locate.

I am assuming it is a weird combination of the driver being used and VGL, as we can't replicate this crash on our local machine (though we use Ubuntu 20.04, not 22.04, and we have an RTX A4000, not an A100).

I would appreciate some pointers on how to troubleshoot this. Thanks.

@dcommander
Member

Does it segfault? If so, can you get a stack trace to determine where it is crashing? (NOTE: Running the application with vglrun +de will cause VirtualGL to pause so you can attach GDB.)
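
For example, something along these lines (a sketch only, assuming Slicer's actual crashing process is SlicerApp-real; adjust the process name and workflow as needed):

# terminal 1: vglrun +de makes VirtualGL pause so a debugger can be attached
vglrun +de -d egl ./Slicer

# terminal 2: attach to the paused process, resume it, then grab a backtrace after the crash
gdb -p $(pgrep -f SlicerApp-real)
(gdb) continue
# ... once the crash occurs:
(gdb) bt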

Does Slicer otherwise work with vglrun -d egl? In other words, is it only a specific Slicer workflow (your torch-based application) that fails? I don't know what "torch" or a "torch-based application" is.

It's been my experience that the GPU hardware rarely matters when diagnosing VirtualGL-related crashes, but the driver version might. Have you tested the non-GRID version of 525.60.xx on the local machine?

Have you tried building VirtualGL from source on the cloud server? Our official binaries are built on CentOS 7, so perhaps something changed in one of the ABIs that is causing difficulties.
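
A rough outline of a source build, assuming the usual CMake-based procedure described in BUILDING.md in the VirtualGL repository (the required -dev packages vary by distribution):

git clone https://github.com/VirtualGL/virtualgl.git
cd virtualgl && mkdir build && cd build
cmake -G"Unix Makefiles" ..
make
sudo make install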

Have you verified that you can run other applications, such as GLXspheres, using vglrun -d egl on the cloud server?
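
For example (assuming VirtualGL is installed in the default /opt/VirtualGL location):

vglrun -d egl /opt/VirtualGL/bin/glxspheres64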

@dcommander
Member

It would also be worth testing whether this same crash occurs with VirtualGL 3.0.2. That will tell me whether it is caused by a new feature in VirtualGL 3.1 beta.

@muratmaga
Author

Torch is an ML framework in Python (also called PyTorch): https://pytorch.org/

Yes, Slicer works completely fine with vglrun -d egl (i.e. for regular 3D rendering using the GPU), and all other vglrun sanity checks (glxspheres, etc.) are fine too. Unfortunately I cannot install the non-GRID driver, as those are baked into the image the cloud provider provides (I tried once, and everything went haywire).

I will try the remaining suggestions and report back later.

@dcommander
Member

I suggested that you try the non-GRID driver on the local machine, not on the cloud server. Since you are unable to reproduce the issue on the local machine, perhaps upgrading the driver to a similar version as the driver on the cloud server will cause the issue to occur. This is all in the interest of collecting data points to isolate the issue's cause. If upgrading the driver on the local machine causes the issue to occur, then it is likely that I will also be able to reproduce the issue by upgrading the driver on my machine. If the issue always occurs with VirtualGL 3.1 beta and not with 3.0.2, then it is likely a regression caused by a feature in 3.1 beta, and it is also likely that I will be able to reproduce it. If, however, the issue is likely specific to the cloud server, then there is no sense in me wasting my time trying to repro it locally.

@dcommander
Member

Also, please specify exactly how the application is crashing. Searching the issues, I came across another one related to PyTorch and VGL that has to do with certain libraries not being picked up if they are linked with an RPATH of $ORIGIN.
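
You can check for that by inspecting the dynamic section of the PyTorch shared libraries, e.g. (the path and library name here are placeholders; any of the .so files under torch/lib is worth checking):

readelf -d /path/to/site-packages/torch/lib/libtorch_cuda.so | grep -E 'RPATH|RUNPATH'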

@zacharygraber

Hi there! @dcommander -- I work with the cloud system in question here (Jetstream2), and I figured I might be able to jump in and give a little context about the setup. We have a number of A100s in our GPU compute nodes that are provisioned to users through VMs; because not everyone is interested in using an entire A100, though, we use NVIDIA's proprietary GRID drivers to effectively "slice up" an A100 into 5 parts.

For example, when a user spins up a VM, they may choose a small GPU "flavor." In that case, the hypervisor will "slice up" a physical GPU and pass through to the VM a smaller virtual GPU (vGPU); on the backend this vGPU only represents ~1/5th of an A100's compute and VRAM, but the virtual machine effectively sees it as one physical card on a PCI interface. This is why it is impossible to install a non-GRID driver on the cloud machine (VM); NVIDIA's proprietary GRID drivers will recognize the vGPU and understand how to properly use it, while a non-GRID driver is only really intended for communicating with an entire card that's physically connected to the machine.

Unfortunately, NVIDIA's licensing terms for the GRID drivers and vGPU solutions in general are extremely restrictive. Our sysadmins receive driver installer packages directly from NVIDIA and we are unable to distribute them to anyone else; in other words, we can provide an OS image with the drivers pre-installed, but we cannot give a user the .deb or .rpm file directly. This makes troubleshooting matters like this a bit difficult, especially if you fear the issue might be in the drivers (for example, the user has no way of setting up the GRID drivers on their local machine).

Hopefully this gives a bit of context. Since the setup on Jetstream2 differs so much from what a local machine would have (one GPU directly hooked into one machine with a "normal" driver), unless anyone is able to reproduce the issue on their local machine, comparing functionality on local machines versus the cloud VM is a bit of an "apples to oranges" scenario.

@dcommander
Member

@zacharygraber I understand all of that. I've been developing VirtualGL for nearly 20 years, and it is my experience that, when nVidia drivers cause a problem with VirtualGL, it is because of something in nVidia's libGL implementation, which is abstracted from the hardware. Thus, it might be a useful data point to test a non-GRID installation of 525.60.xx on the local machine.

However, that assertion is coming from a place of having no information regarding how the application is crashing, because @muratmaga has not yet provided me with that information. It's entirely possible that this is the same issue as #107, in which case it is a known issue with PyTorch that has to be worked around when using that framework with VGL. (There's even an application recipe for it in the VirtualGL User's Guide: https://rawcdn.githack.com/VirtualGL/virtualgl/3.1beta1/doc/index.html#hd0015.)

In other words, if you guys want more intelligent suggestions from me, then please provide more information about the problem. Otherwise, I'm shooting in the dark.

@muratmaga
Author

muratmaga commented Feb 28, 2023

@dcommander I didn't have time to try some of the more involved things you asked for, such as trying the GRID driver on our local machine (which I can't do, as we constantly use it for many tasks) and debugging. But I did try VGL 3.0.2, and the crash replicates, so it is not something recently introduced with the beta version.

I also tried the recipe for issue #107 and this is what I get:

exouser@memos:~$ vglrun -d egl -nodl ./Slicer/Slicer 
[VGL] ERROR: Could not load EGL functions
[VGL]    /lib/libvglfaker-nodl.so: undefined symbol: eglGetProcAddress

exouser@memos:~$ vglrun -d egl -ld /home/exouser/Slicer/lib/Python/lib/python3.9/site-packages/torch/lib ./Slicer/Slicer 
libpng warning: iCCP: profile 'ICC Profile': 'CMYK': invalid ICC profile color space
libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: too many profiles
libpng warning: iCCP: known incorrect sRGB profile
Switch to module:  "Welcome"
Loaded volume from file: /home/exouser/Slicer/NA-MIC/Extensions-31317/SlicerMorph/lib/Slicer-5.2/qt-scripted-modules/Resources/Icons/Mouse_CT.png. Dimensions: 64x64x1. Number of components: 3. Pixel type: unsigned char.


Loaded volume from file: /home/exouser/Slicer/NA-MIC/Extensions-31317/SlicerMorph/lib/Slicer-5.2/qt-scripted-modules/Resources/Icons/VolrenRed_8bit.png. Dimensions: 64x64x1. Number of components: 4. Pixel type: unsigned char.


Loaded volume from file: /home/exouser/Slicer/NA-MIC/Extensions-31317/SlicerMorph/lib/Slicer-5.2/qt-scripted-modules/Resources/Icons/VolrenRed_16bit.png. Dimensions: 64x64x1. Number of components: 4. Pixel type: unsigned char.


Loaded volume from file: /home/exouser/Downloads/undeterminedSex_AAPN_K1026-1-e15.5_Cbx4.nrrd. Dimensions: 259x258x421. Number of components: 1. Pixel type: unsigned char.


"Volume" Reader has successfully read the file "/home/exouser/Downloads/undeterminedSex_AAPN_K1026-1-e15.5_Cbx4.nrrd" "[0.19s]"
Switch to module:  "MEMOS"
"Color" Reader has successfully read the file "/home/exouser/Slicer/NA-MIC/Extensions-31317/MEMOS/lib/Slicer-5.2/qt-scripted-modules/Resources/Support/KOMP2.ctbl" "[0.01s]"
Generic Warning: In /work/Stable/Slicer-0/Libs/MRML/Core/vtkDataFileFormatHelper.cxx, line 237
vtkDataFileFormatHelper::GetFileExtensionFromFormatString: please update deprecated extension-only format specifier to 'File format name (.ext)' format! Current format string: .nii.gz


Using device:  0
MONAI version: 0.9.0
Numpy version: 1.23.4
Pytorch version: 1.13.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: af0e0e9f757558d144b655c63afcea3a4e0a06f5
MONAI __file__: /home/exouser/Slicer/lib/Python/lib/python3.9/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 5.0.1
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 9.2.0
Tensorboard version: NOT INSTALLED or UNKNOWN VERSION.
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: NOT INSTALLED or UNKNOWN VERSION.
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: NOT INSTALLED or UNKNOWN VERSION.
pandas version: 1.5.2
einops version: 0.6.0
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: 1.0.0

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

Using device:  cuda
error: [/home/exouser/Slicer/bin/SlicerApp-real] exit abnormally - Report the problem.

So the crash happened with this as well.

@dcommander
Member

If I had any confidence that it would be reproducible on my machine, I would try it, but I can't afford to spend several hours on a wild goose chase. The best suggestion I have at the moment is to pay my hourly consulting rate and let me log in and diagnose it remotely using a clone of your cloud computing image.

@dcommander
Member

Is there any new information on this issue?

@muratmaga
Author

All I can say is that it continues to crash. This is with VGL 3.1-20230315, NVIDIA driver version 525.85.05, and torch 2.1.0+cu118.

When we need to run models, we start Slicer without vglrun.

exouser@tvnc:~/Slicer-5.4.0-linux-amd64$ vglrun -d egl ./Slicer 
libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: too many profiles
Switch to module:  "Welcome"
Switch to module:  "MEMOS"
"Volume" Reader has successfully read the file "/home/exouser/Downloads/IMPC_sample_data.nrrd" "[0.13s]"
Checking python dependencies
Requirement already satisfied: pillow in ./lib/Python/lib/python3.9/site-packages (10.0.0)

[notice] A new release of pip is available: 23.1.2 -> 23.3.1
[notice] To update, run: python-real -m pip install --upgrade pip
"Color" Reader has successfully read the file "/home/exouser/Slicer-5.4.0-linux-amd64/slicer.org/Extensions-31938/MEMOS/lib/Slicer-5.4/qt-scripted-modules/Resources/Support/KOMP2.ctbl" "[0.01s]"
Using device:  0
MONAI version: 0.9.0
Numpy version: 1.25.1
Pytorch version: 2.1.0+cu118
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: af0e0e9f757558d144b655c63afcea3a4e0a06f5
MONAI __file__: /home/exouser/Slicer-5.4.0-linux-amd64/lib/Python/lib/python3.9/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 5.1.0
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 10.0.0
Tensorboard version: NOT INSTALLED or UNKNOWN VERSION.
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.16.0+cu118
tqdm version: NOT INSTALLED or UNKNOWN VERSION.
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: NOT INSTALLED or UNKNOWN VERSION.
pandas version: NOT INSTALLED or UNKNOWN VERSION.
einops version: 0.7.0
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: 1.0.0

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

Using device:  cuda
error: [/home/exouser/Slicer-5.4.0-linux-amd64/bin/SlicerApp-real] exit abnormally - Report the problem.

@muratmaga
Author

muratmaga commented Aug 22, 2024

Slight update: with the 535-series driver, we now get an error message instead of Slicer exiting abnormally.

Unable to load any of {libcudnn_graph.so.9.1.0, libcudnn_graph.so.9.1, libcudnn_graph.so.9, libcudnn_graph.so}
Invalid handle. Cannot load symbol cudnnCreate
error: [/home/exouser/Slicer/bin/./python-real] exit abnormally - Report the problem.

Another change is that this used to be unique to the GRID driver, but now it replicates with the 535 drivers provided by Ubuntu. Specifics of the system:

Ubuntu 22.04
Nvidia driver version: 535.183.01
vgl version: 3.1.1
cuda version: 11.8
torch version: 2.4.0

@dcommander
Member

I wonder if it's related to #209.
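
In the meantime, a few generic checks against the cuDNN loading error might help isolate it (these are diagnostics, not a known fix; the paths and the PythonSlicer launcher are assumptions about your Slicer install):

ldconfig -p | grep libcudnn
find /home/exouser/Slicer -name 'libcudnn_graph*' 2>/dev/null
/home/exouser/Slicer/bin/PythonSlicer -c "import torch; print(torch.backends.cudnn.version()); print(torch.backends.cudnn.is_available())"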
