Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPU support with CUDA build #114

Closed
sirmarcel opened this issue Apr 29, 2024 · 10 comments
Closed

CPU support with CUDA build #114

sirmarcel opened this issue Apr 29, 2024 · 10 comments
Assignees

Comments

@sirmarcel
Copy link

Hi! I've built sphericart with CUDA support (with pip install .[torch] in a clone of this repo). Now, whenever I run tests on a compute node without GPU, the CPU fallback appears to not be working, and the program crashes with

CUDA error at /home/langer/software/sphericart/sphericart-torch/sphericart/src/sphericart_cuda.cu:42 - no CUDA-capable device is detected

Apparently, sphericart is trying to go through the GPU code path despite all involved tensors being on the cpu device. This in in contrast with the docs, which state "Depending on the device the tensor is stored on, [...], the calculations will be performed [...] using the CPU or CUDA implementation."

It'd be nice to fix this, as this makes it rather annoying to debug when a GPU is not readily available.

Cheers!

@frostedoyster
Copy link
Collaborator

I checked and I managed to reproduce the bug

@nickjbrowning
Copy link
Collaborator

I’ll try to get this fixed within the next few hours, will keep you updated.

@nickjbrowning
Copy link
Collaborator

nickjbrowning commented Apr 29, 2024

@sirmarcel Could you check that this PR resolves this particular issue for you?

#118

@sirmarcel
Copy link
Author

Yep, I will check tomorrow. Thanks for the fast turnaround!

@sirmarcel
Copy link
Author

sirmarcel commented Apr 30, 2024

Okay, this is a weird one. If I directly import the calculator from sphericart, this fix works.

In my own class, which wraps it, I get strange behaviour. My tests fail with stuff like:

double free or corruption (out)
Aborted (core dumped)

Then I tried to write a minimal example:

import torch
from torch.nn import Module

from sphericart.torch import SphericalHarmonics as Sph

class SphericalHarmonics(Module):
 
    def __init__(self, max_angular):
        """Initialise SphericalHarmonics.

        Args:
            max_angular (int): The maximum spherical harmonic order ``l`` to compute.
        """
        super().__init__()

        self.max_angular = max_angular

        self.m_per_l = [2 * l + 1 for l in range(max_angular + 1)]

        self.sph = Sph(l_max=self.max_angular, normalized=True)


    def forward(self, R):
        """Compute spherical harmonics for input vectors.

        Args:
            R (Tensor): Input vectors of shape ``[pair, 3]``.

        Returns:
            Spherical harmonics of shape
                ``[pair, sum([2*l+1 for i in range(max_angular+1)])]``.

        """
        # R: [pair, 3 (x,y,z)]
        # todo: make special case for "mps" backend (need to move to `cpu`)
        return self.sph.compute(R)  # -> [pair, (l=0 m=0) (l=1 m=-1) (l=1 m=0) ...]

og = Sph(3)

sph = SphericalHarmonics(3)

a = torch.rand(10, 3)

print(f"og out: {og.compute(a)}")
print(f"wrapped out: {sph(a)}")

This, fascinatingly, prints the correct output but then fails:

og out: ... (correct-looking)
wrapped out: ... (correct-looking)
CUDA error at /home/langer/software/sphericart/sphericart-torch/sphericart/src/sphericart_cuda.cu:84 - no CUDA-capable device is detected

I'd guess something weird happens if you store a reference to the calculator inside some other python construct? No idea.

EDIT:

This also happens on GPU (i.e. moving a to cuda above). I get errors looking like:

free(): double free detected in tcache 2
Aborted (core dumped)

@nickjbrowning
Copy link
Collaborator

OK I'll take another look at this tomorrow and do a deeper dive on whats happening. Thanks for the detailed bug report!

@sirmarcel
Copy link
Author

Thanks @nickjbrowning !

@nickjbrowning
Copy link
Collaborator

nickjbrowning commented May 3, 2024

Hey @sirmarcel

I've just made another commit to #118 that should fix this issue. There was some constructor/destructor juggling that was happening previously since we instantiate the spherical harmonics (cuda) class with a default constructor, and the destructor was creating some oddities.

I've now changed this to be pointer-based which allows much cleaner memory management and should fix the double-free that was occurring.

I've ran your code sample and it works both for export CUDA_VISIBLE_DEVICES=0 and export CUDA_VISIBLE_DEVICES=.

@sirmarcel
Copy link
Author

Hi @nickjbrowning , sorry for the delay -- I've got testing this on the todo list, but so far haven't had any luck getting an interactive node on our GPU cluster. I'll update you as soon as I get it built + tested. Thanks so much!

@sirmarcel
Copy link
Author

Hi -- built it and ran my test suite. Looks good. No crashes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants