vgl 3.0.90-20221122 crashing torch on ubuntu 22.04 #227
Comments
Does it segfault? If so, can you get a stack trace to determine where it is crashing? (NOTE: Running the application with …)

Does Slicer otherwise work with vglrun -d egl?

It's been my experience that the GPU hardware rarely matters when diagnosing VirtualGL-related crashes, but the driver version might. Have you tested the non-GRID version of 525.60.xx on the local machine?

Have you tried building VirtualGL from source on the cloud server? Our official binaries are built on CentOS 7, so perhaps something changed in one of the ABIs that is causing difficulties.

Have you verified that you can run other applications, such as GLXspheres, using vglrun -d egl?
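A minimal sketch of how a stack trace could be captured on the affected VM, assuming gdb is available; the Slicer path and the glxspheres64 location are typical defaults, not confirmed for this setup:

```bash
# Sanity check: confirm basic EGL back-end rendering works at all
vglrun -d egl /opt/VirtualGL/bin/glxspheres64

# Run Slicer under gdb inside the VirtualGL environment and reproduce the crash
vglrun -d egl gdb --args ./Slicer
# (gdb) run
# ...load the trained model until it crashes...
# (gdb) bt           # backtrace at the point of the crash
# (gdb) info shared  # which shared libraries were loaded

# If ./Slicer is only a launcher script, attaching to the real process by PID may be easier:
# gdb -p <pid of the real Slicer process>
```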
It would also be worth testing whether this same crash occurs with VirtualGL 3.0.2. That will tell me whether it is caused by a new feature in VirtualGL 3.1 beta.
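A hedged sketch of how a specific VirtualGL release could be built from source on the cloud server (to rule out the CentOS 7-built official binaries and to compare 3.0.2 against the 3.1 beta); the tag name and the Ubuntu package list are assumptions, so see VirtualGL's BUILDING.md for the authoritative prerequisites:

```bash
# Build a specific VirtualGL release from source instead of using the official binaries
sudo apt-get install -y build-essential cmake git libturbojpeg0-dev \
    libgl-dev libglu1-mesa-dev libegl-dev libx11-dev libxext-dev libxtst-dev libxv-dev
git clone --branch 3.0.2 https://github.com/VirtualGL/virtualgl.git
cd virtualgl && mkdir build && cd build
cmake -G "Unix Makefiles" -DCMAKE_INSTALL_PREFIX=/opt/VirtualGL-3.0.2 ..
make -j"$(nproc)" && sudo make install

# Then test the locally built copy:
/opt/VirtualGL-3.0.2/bin/vglrun -d egl ./Slicer
```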
Torch is an ML framework in Python (also called PyTorch): https://pytorch.org/

Yes, Slicer otherwise works completely fine with vglrun -d egl (i.e. for regular 3D rendering using the GPU), and all the other vglrun sanity checks (glxspheres etc.) are fine too. Unfortunately I cannot install the non-GRID driver, as the GRID drivers are baked into the image the cloud provider provides (I tried once, and everything went haywire). I will try the remaining suggestions and report back later.
I suggested that you try the non-GRID driver on the local machine, not on the cloud server. Since you are unable to reproduce the issue on the local machine, perhaps upgrading the driver to a similar version as the driver on the cloud server will cause the issue to occur. This is all in the interest of collecting data points to isolate the issue's cause. If upgrading the driver on the local machine causes the issue to occur, then it is likely that I will also be able to reproduce the issue by upgrading the driver on my machine. If the issue always occurs with VirtualGL 3.1 beta and not with 3.0.2, then it is likely a regression caused by a feature in 3.1 beta, and it is also likely that I will be able to reproduce it. If, however, the issue is likely specific to the cloud server, then there is no sense in me wasting my time trying to repro it locally.
Also, please specify exactly how the application is crashing. Searching the issues, I came across another one related to PyTorch and VGL that has to do with certain libraries not being picked up if they are linked with an RPATH of $ORIGIN.
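As a way to gather that data point, a sketch of how the $ORIGIN/RPATH question could be checked against the torch libraries that Slicer uses; the paths are hypothetical and need to be adjusted to the real install:

```bash
# Hypothetical location of torch's bundled shared libraries inside Slicer's Python environment
TORCH_LIB=/path/to/Slicer/lib/Python/lib/python3.9/site-packages/torch/lib

# Is the library linked with an RPATH/RUNPATH of $ORIGIN?
readelf -d "$TORCH_LIB/libtorch_cuda.so" | grep -Ei 'rpath|runpath'

# Do all of its dependencies resolve from that directory?
ldd "$TORCH_LIB/libtorch_cuda.so" | grep -i 'not found'
```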
Hi there! @dcommander -- I work with the cloud system in question here (Jetstream2), and I figured I might be able to jump in and give a little context about the setup.

We have a number of A100s in our GPU compute nodes that are provisioned to users through VMs; because not everyone is interested in using an entire A100, though, we use NVIDIA's proprietary GRID drivers to effectively "slice up" an A100 into 5 parts. For example, when a user spins up a VM, they may choose a small GPU "flavor." In that case, the hypervisor will "slice up" a physical GPU and pass through to the VM a smaller virtual GPU (vGPU); on the backend this vGPU only represents ~1/5th of an A100's compute and VRAM, but the virtual machine effectively sees it as one physical card on a PCI interface. This is why it is impossible to install a non-GRID driver on the cloud machine (VM): NVIDIA's proprietary GRID drivers will recognize the vGPU and understand how to properly use it, while a non-GRID driver is only really intended for communicating with an entire card that's physically connected to the machine.

Unfortunately, NVIDIA's licensing terms for the GRID drivers and vGPU solutions in general are extremely restrictive. Our sysadmins receive driver installer packages directly from NVIDIA, and we are unable to distribute them to anyone else; in other words, we can provide an OS image with the drivers pre-installed, but we cannot give a user the installer packages themselves.

Hopefully this gives a bit of context. Since the setup on Jetstream2 differs so much from what a local machine would have (one GPU directly hooked into one machine with a "normal" driver), unless anyone is able to reproduce the issue on their local machine, comparing functionality on local machines versus the cloud VM is a bit of an "apples to oranges" scenario.
@zacharygraber I understand all of that. I've been developing VirtualGL for nearly 20 years, and it is my experience that, when nVidia drivers cause a problem with VirtualGL, it is because of something in nVidia's libGL implementation, which is abstracted from the hardware. Thus, it might be a useful data point to test a non-GRID installation of 525.60.xx on the local machine. However, that assertion is coming from a place of having no information regarding how the application is crashing, because @muratmaga has not yet provided me with that information. It's entirely possible that this is the same issue as #107, in which case it is a known issue with PyTorch that has to be worked around when using that framework with VGL. (There's even an application recipe for it in the VirtualGL User's Guide: https://rawcdn.githack.com/VirtualGL/virtualgl/3.1beta1/doc/index.html#hd0015.) In other words, if you guys want more intelligent suggestions from me, then please provide more information about the problem. Otherwise, I'm shooting in the dark.
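If it does turn out to be the same class of problem as #107, one generic shape of a workaround (a sketch only, not a quote of the User's Guide recipe; the torch path is hypothetical) is to put the bundled libraries on the normal search path before launching under vglrun:

```bash
# Hypothetical torch library directory inside the Slicer install; adjust to the real path
export LD_LIBRARY_PATH=/path/to/Slicer/lib/Python/lib/python3.9/site-packages/torch/lib:$LD_LIBRARY_PATH
vglrun -d egl ./Slicer

# vglrun's -ld option can be used to the same effect:
# vglrun -ld /path/to/Slicer/lib/Python/lib/python3.9/site-packages/torch/lib -d egl ./Slicer
```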
@dcommander I didn't have the time to try some of the more involved things you asked, such as trying the GRID driver on our local machine (which I can't do, as we constantly use it for many tasks) and debugging. But I did try things with vgl 3.0.2, and the crash replicates, so it is not something recently introduced with the beta version. I also tried the recipe for issue #107, and this is what I get:
So the crash happened with that as well.
If I had any confidence that it would be reproducible on my machine, I would try it, but I can't afford to spend several hours on a wild goose chase. The best suggestion I have at the moment is to pay my hourly consulting rate and let me log in and diagnose it remotely using a clone of your cloud computing image.
Is there any new information on this issue?
All I can say is that it continues to crash. This is with vgl 3.1-20230315, NVIDIA driver version 525.85.05, and torch 2.1.0+cu118. When we need to run models, we start Slicer without vglrun.
Slight update: with the 535 series driver, we now get an error message instead of Slicer just exiting abnormally.
Another change is that this used to be unique to the GRID driver, but now it replicates with the 535 drivers provided by Ubuntu.
Specifics of the system: Ubuntu 22.04
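Since the 535 driver now produces an error message, a quick sketch of capturing it together with VirtualGL's own diagnostics (assuming Slicer is launched from a terminal):

```bash
# Enable VirtualGL's verbose output and capture everything Slicer prints to a log file
vglrun +v -d egl ./Slicer > slicer-vgl.log 2>&1
```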
I wonder if it's related to #209.
We are using vgl on a cloud server with an A100 GPU (GRID driver 525.60.13), running 3D Slicer with the EGL backend.
If we invoke Slicer without vglrun, our torch-based application works fine. If we start Slicer via
vglrun -d egl ./Slicer
then a little after torch starts loading the trained model, the application crashes. There is no error message (neither in Slicer nor in vglrun) that I can locate.
I am assuming it is a weird combination of the driver being used and VGL, as we can't replicate this crash on our local machine (though we use Ubuntu 20.04, not 22.04, and we have an RTX A4000, not an A100).
I would appreciate some pointers on how to troubleshoot this. Thanks.
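One way to narrow this down further would be to take Slicer out of the picture and exercise the same step (loading a trained model onto the GPU) with torch alone under vglrun; a minimal sketch, with the interpreter and model path as placeholders:

```bash
# Minimal repro attempt: does torch alone crash under the VirtualGL faker?
vglrun -d egl python3 -c "
import torch
print(torch.__version__, torch.cuda.is_available())
model = torch.load('/path/to/trained_model.pt', map_location='cuda')  # placeholder path
print('model loaded OK')
"
```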