
Unified memory support on GH200 Grace Hopper #2306

Closed
4 tasks done
bebora opened this issue Oct 21, 2024 · 6 comments
Labels: bug (Something isn't working), verify and close

Comments

@bebora
Contributor

bebora commented Oct 21, 2024

Required prerequisites

  • Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

The NVIDIA GH200 Grace Hopper Superchip is promoted as being capable of utilizing the entire system memory for GPU tasks (NVIDIA blog). However, CUDA-Q does not use the full system memory when specifying the nvidia target.

Steps to reproduce the bug

Create the following source file ghz.cpp:

#include <cudaq.h>

// Define a quantum kernel with a runtime parameter
struct ghz {
    auto operator()(const int N) __qpu__ {

        // Dynamically sized vector of qubits
        cudaq::qvector q(N);
        h(q[0]);
        for (int i = 0; i < N - 1; i++) {
            x<cudaq::ctrl>(q[i], q[i + 1]);
        }
        mz(q);
    }
};

int main(int argc, char *argv[]) {
    int qubits_count = 2;
    if (argc > 1) {
        qubits_count = atoi(argv[1]);
    }
    auto counts = cudaq::sample(/*shots=*/1000, ghz{}, qubits_count);

    if (!cudaq::mpi::is_initialized() || cudaq::mpi::rank() == 0) {
        counts.dump();
    }

    return 0;
}

Compile it as follows:
nvq++ ghz.cpp -o ghz.out --target nvidia
And then run it:
33 qubits: ./ghz.out 33 ✅: nvidia-smi reports a VRAM usage of about 66400 MiB
34 qubits: ./ghz.out 34 ❌:

terminate called after throwing an instance of 'ubackend::RuntimeError'
  what():  requested size is too big
Aborted (core dumped)

Expected behavior

I expect the GPU to be able to use system memory when necessary and simulate up to 35/36 qubits. Memory quickly becomes a limit in quantum simulations and a possible way to increase simulated qubits would be appreciated.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA Quantum version: Nightly on Docker container
  • Operating system: Ubuntu 22.04.2 LTS

Suggestions

I was looking at a Grace Hopper presentation from John Linford and noticed two details:

  1. The unified memory reportedly works with CUDA 12, but CUDA-Q still uses cuQuantum libraries built for CUDA 11 as far as I know (in other words, I have not been able to run it without some specific CUDA 11 dependencies installed; see Installer not working as documented with CUDA 12 #1718 and ImportError: Invalid simulator requested: custatevec_fp32 #1096).
  2. Slide 66 reports that the regular cudaMalloc is not enough and suggests using cudaMallocManaged or malloc/mmap. I had a look at the cuQuantum repository and saw some occurrences of cudaMalloc in the code, but none of cudaMallocManaged.

Do you think GH200 systems will ever be able to fully utilize their memory for quantum simulation with CUDA-Q/cuQuantum? Would this hypothetical approach degrade simulation performance too much?

@1tnguyen
Collaborator

Hi @bebora,

To enable host memory usage, we need to set the CUDAQ_MAX_CPU_MEMORY_GB environment variable as described in the docs.

I'd also note a related bug that we found when an unbounded value is set.

@bebora
Contributor Author

bebora commented Oct 22, 2024

Hi @1tnguyen, thank you for the good pointer. I have some new questions:

  1. Am I correct in stating that the simulation happens either fully on system memory or fully on GPU memory?

  2. I tried setting CUDAQ_MAX_CPU_MEMORY_GB and CUDAQ_MAX_GPU_MEMORY_GB together, but the executable crashes.
    $ CUDAQ_MAX_CPU_MEMORY_GB=200 CUDAQ_MAX_GPU_MEMORY_GB=20 ./ghz.out 32 => 'ubackend::RuntimeError' ... cudaErrorInvalidValue.
    The only GPU memory value I found that does not crash the executable in this situation is 1.
    Is this expected?

  3. I tried experimenting with CUDAQ_MAX_CPU_MEMORY_GB using the reported GHZ example, but the results baffle me:

$ CUDAQ_MAX_CPU_MEMORY_GB=128 ./ghz.out 34
{ 0000000000000000000000000000000000:254 0000000000000000000000000000000001:259 1111111111111111111111111111111110:231 1111111111111111111111111111111111:256 }
$ CUDAQ_MAX_CPU_MEMORY_GB=256 ./ghz.out 35
{ 00000000000000000000000000000000000:127 00000000000000000000000000000000001:124 00000000000000000000000000000000010:127 00000000000000000000000000000000011:135 11111111111111111111111111111111100:109 11111111111111111111111111111111101:119 11111111111111111111111111111111110:122 11111111111111111111111111111111111:137 }

I'm expecting to obtain strings of only 0s and 1s from a generalized GHZ state, but it seems that the first qubits are simulated correctly while the last ones are random. Why is this the case?

@schweitzpgi schweitzpgi added the needs triage Marks items that require a follow up for proper processing label Nov 19, 2024
@1tnguyen 1tnguyen added the bug Something isn't working label Nov 20, 2024
@1tnguyen
Collaborator

Hi @bebora,

Here is a quick update on this issue.

In v0.9, we've identified and fixed:

(1) Proper handling of CUDAQ_MAX_GPU_MEMORY_GB.

This is related to point #2 in the comment above.

Could you please let us know whether the issue with CUDAQ_MAX_GPU_MEMORY_GB has been resolved?
For example, by checking CUDAQ_MAX_CPU_MEMORY_GB=200 CUDAQ_MAX_GPU_MEMORY_GB=20 with 32 or 33 qubits.

(2) Proper handling of CUDAQ_MAX_CPU_MEMORY_GB w.r.t. actual available memory.

If CUDAQ_MAX_CPU_MEMORY_GB is set to a value that exceeds the available memory, the backend will cap it at the available memory size.
For example, with CUDAQ_LOG_LEVEL=info, you can now see a log like this

Requested host memory of 1024 GB exceeds the available memory size of 420 GB. Only use available memory.

when CUDAQ_MAX_CPU_MEMORY_GB is set to 1024.

Note: this was the primary case in which we observed corrupted measurement results similar to what you saw in point #3 in the comment above.

Recently, I had access to a system where I could reproduce corrupted output that is not related to free memory capacity.
We have identified the root cause and will put out a fix, which will be available in the nightly release channel before the next release.

In that same system, we've also found another issue, which we haven't observed before. Depending on your specific configuration (driver version, CUDA version, etc.), that issue may or may not happen.

@bebora
Contributor Author

bebora commented Nov 21, 2024

Hi @1tnguyen, thanks for the explanation and your debugging efforts.

I can reply to your point #1:
I have just tried v0.9 on another system at my disposal, using the cu12-0.9.0 image.
I can confirm that setting both CUDAQ_MAX_CPU_MEMORY_GB and CUDAQ_MAX_GPU_MEMORY_GB to custom values does not cause the previously reported cudaErrorInvalidValue anymore.

@1tnguyen
Collaborator

1tnguyen commented Dec 8, 2024

@bebora FYI, we've pushed a fix for the above issue.
The fix is available in CUDA-Q nightly images: nvcr.io/nvidia/nightly/cuda-quantum:cu12-latest and nvcr.io/nvidia/nightly/cuda-quantum:cu11-latest

@1tnguyen 1tnguyen added verify and close and removed needs triage Marks items that require a follow up for proper processing labels Dec 8, 2024
@bebora
Contributor Author

bebora commented Dec 9, 2024

@1tnguyen I can confirm that GHZ does indeed work as intended with 34 and 35 qubits.

@bebora bebora closed this as completed Dec 9, 2024