Limit GPU memory usage? #75

Open
sef43 opened this issue Jan 9, 2024 · 8 comments

@sef43

sef43 commented Jan 9, 2024

Hello,

When running on a GPU that may be shared with other work, I sometimes see out-of-memory errors:
CUDA Error of GINTint2e_jk_kernel: out of memory

Is it possible to specify a hard limit on the amount of memory used by these kernels?

@wxj6000
Collaborator

wxj6000 commented Jan 9, 2024

GPU memory is mostly allocated via CuPy. If you want to leave room on the GPU for other work, you can set a memory limit through CuPy: https://docs.cupy.dev/en/stable/user_guide/memory.html#limiting-gpu-memory-usage

Although the GINT* kernels do not allocate global memory explicitly, they use a lot of local memory for high angular momenta. That local memory is ultimately backed by global memory, so for high angular momenta you will probably still hit the 'out of memory' issue.
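
For reference, the CuPy limit mentioned above could be set roughly as follows. This is only a sketch: the 16 GiB figure and the helper names `gib_to_bytes` and `limit_cupy_pool` are made up for illustration, and the limit applies only to CuPy's memory pool, so it would not cap the local memory that the GINT* kernels use.

```python
GIB = 1024 ** 3


def gib_to_bytes(gib: float) -> int:
    """Convert GiB to a whole number of bytes."""
    return int(gib * GIB)


def limit_cupy_pool(limit_gib: float) -> int:
    """Cap CuPy's default device memory pool on the current GPU.

    Needs CuPy and a visible CUDA device, so it is not called at import time.
    Returns the limit that CuPy reports after the change.
    """
    import cupy  # imported lazily so gib_to_bytes works without a GPU

    pool = cupy.get_default_memory_pool()
    pool.set_limit(size=gib_to_bytes(limit_gib))
    return pool.get_limit()


# Typical use, before building the mean-field object (16 GiB is an arbitrary choice):
#     limit_cupy_pool(16)
#
# Alternative: set the environment variable before CuPy is first imported, e.g.
#     CUPY_GPU_MEMORY_LIMIT=17179869184   (bytes)   or   CUPY_GPU_MEMORY_LIMIT=50%
```

Once the limit is reached, CuPy raises its own out-of-memory error for pool allocations instead of growing further.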

@sef43
Author

sef43 commented Jan 9, 2024

Thank you for the explanation.

@sef43
Author

sef43 commented Oct 2, 2024

Hello, I am reopening this issue.

I have found that if I turn on CUDA MPS and limit the number of active threads with this setting:
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
a calculation that usually fails with "CUDA Error of GINTint2e_jk_kernel: out of memory" succeeds (taking only 1.5x longer, not 2x longer).

My understanding is that this reduces the local/shared memory in use at once, stopping the errors, at the expense of runtime.
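
A minimal sketch of how the setting above could be applied per job from Python. It assumes the MPS control daemon is already running on the node, and it sets the variable in a child process because it has to be in place before that process creates its CUDA context. The helper names and the script name `scan.py` are hypothetical.

```python
import os
import subprocess
import sys


def mps_env(thread_percentage: int, base_env=None) -> dict:
    """Return a copy of the environment with the MPS thread cap set."""
    if not 0 < thread_percentage <= 100:
        raise ValueError("thread_percentage must be in (0, 100]")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(thread_percentage)
    return env


def run_with_thread_cap(script: str, thread_percentage: int = 50) -> int:
    """Run a Python script in a child process with the cap applied."""
    result = subprocess.run(
        [sys.executable, script],
        env=mps_env(thread_percentage),
    )
    return result.returncode


# Example (hypothetical script name):
#     run_with_thread_cap("scan.py", thread_percentage=50)
```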

Is it possible to do a similar modification at runtime, or compile time, in the code?

Maybe these values?

// threads for GPU
#define THREADSX 16
#define THREADSY 16
#define THREADS (THREADSX * THREADSY)
#define MAX_STREAMS 16

@wxj6000
Collaborator

wxj6000 commented Oct 3, 2024

This is a good suggestion. If you turn off some threads, there is no need to allocate local memory for those threads. We can take it as one of the possible solutions.
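
To illustrate why fewer active threads means less local memory, here is a rough back-of-the-envelope model. It assumes local memory is reserved for every thread that can be resident on the active SMs, and that the MPS percentage scales the number of active SMs. The 32 KiB per thread is a hypothetical figure, not a measurement of GINTint2e_jk_kernel; the actual value would have to come from the compiler output (for example `nvcc -Xptxas -v`). The SM count and resident-thread limit are those of an A100.

```python
GIB = 1024 ** 3


def local_memory_bytes(per_thread_bytes: int,
                       threads_per_sm: int,
                       num_sms: int,
                       active_fraction: float = 1.0) -> int:
    """Estimate local memory reserved for all resident threads.

    Rough upper-bound model: every resident thread on every active SM
    gets its own per-thread local memory region.
    """
    active_sms = int(num_sms * active_fraction)
    return per_thread_bytes * threads_per_sm * active_sms


# A100: 108 SMs, up to 2048 resident threads per SM.
# 32 KiB of local memory per thread is an assumed value for illustration only.
full = local_memory_bytes(32 * 1024, 2048, 108)
half = local_memory_bytes(32 * 1024, 2048, 108, active_fraction=0.5)

print(f"all SMs active: {full / GIB:.3f} GiB")
print(f"50% active:     {half / GIB:.3f} GiB")
```

Under these assumptions the reservation halves when half of the SMs are active, which is consistent with the out-of-memory error disappearing at 50%.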

@Tillsten

Hi,

First of all, I’m absolutely blown away by the performance of GPU4PySCF. Thank you for this amazing tool!

I have a beginner question regarding an issue I encountered. I’m running a torsional scan similar to the provided example, and it generally works well for several iterations. However, at some point, I get the following error:

CUDA Error of GINTint2e_jk_kernel: out of memory

This happens on our cluster with an A100 40GB GPU. Since my molecule isn’t very large (24 atoms) and it runs fine for multiple iterations before failing, I’m a bit confused. Is there a way to free up memory between iterations to prevent this issue?

Full code:

import time

import pyscf
from pyscf import lib
from pyscf.geomopt.geometric_solver import optimize

from gpu4pyscf.dft import rks

atom = '''
  C       0.724002      1.135021     -0.907355
  O      -0.356123      0.965447     -0.024473
  C      -0.744599     -0.386333      0.152087
  C       0.396187     -1.157444      0.792032
  O       0.011790     -2.507030      0.899684
  C       1.644622     -1.020028     -0.053519
  C       1.948759      0.441464     -0.321131
  N       3.069963      0.492310     -1.261050
  O       2.695457     -1.654767      0.636744
  C      -1.987816     -0.375699      1.005567
  O      -3.055730      0.286385      0.366128
  O       0.929643      2.485897     -1.104082
  H      -0.977695     -0.828245     -0.823532
  H       0.596763     -0.736811      1.783422
  H       1.468667     -1.522896     -1.009604
  H       2.212953      0.934306      0.618699
  H       0.481425      0.707737     -1.884177
  H       1.156487      2.903388     -0.265770
  H       3.435645     -1.785156      0.038141
  H       0.756639     -3.006876      1.245376
  H      -1.757549      0.093793      1.965767
  H      -2.306214     -1.397909      1.189633
  H      -2.790237      1.193944      0.194295
  N       3.757659      1.504438     -1.211762
  N       4.455331      2.386355     -1.246447
'''

xc = 'B3LYP'
bas = '6-311++G(2d,2p)'

scf_tol = 1e-10
max_scf_cycles = 200
screen_tol = 1e-14
grids_level = 3
mol = pyscf.M(atom=atom, basis=bas, max_memory=120000)

mol.verbose = 1
mf_GPU = rks.RKS(mol, xc=xc).density_fit()
mf_GPU.grids.level = grids_level

mf_GPU.conv_tol = scf_tol
mf_GPU.max_cycle = max_scf_cycles
mf_GPU.screen_tol = screen_tol

gradients = []

start_time = time.time()
# Content of geometric_scan.txt:
# $scan
# dihedral 1 7 8 24 90 -240 20
mol_eq = optimize(
    mf_GPU,
    maxsteps=500000000,
    constraints='geometric_scan.txt',  # atom index is 1-based in this file
)
print("Optimized coordinate:")
print(mol_eq.atom_coords())
print(time.time() - start_time)
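
On the question of freeing memory between iterations, one thing that could be tried is returning CuPy's cached blocks to the driver after each optimization step. This is a sketch, not a confirmed fix: it assumes that `optimize` accepts a `callback` argument that is called once per step with a dictionary of local variables, and the helper names are made up. It only releases memory held by CuPy's pools and would not change the local memory reserved for the GINT* kernels.

```python
def format_gib(n_bytes: int) -> str:
    """Format a byte count as GiB with two decimals."""
    return f"{n_bytes / 1024**3:.2f} GiB"


def release_cupy_pools():
    """Free unused blocks held by CuPy's device and pinned memory pools.

    Returns the device pool size in bytes before and after the release.
    Needs CuPy and a visible CUDA device.
    """
    import cupy  # imported lazily so format_gib works without a GPU

    pool = cupy.get_default_memory_pool()
    pinned = cupy.get_default_pinned_memory_pool()
    before = pool.total_bytes()
    pool.free_all_blocks()
    pinned.free_all_blocks()
    return before, pool.total_bytes()


def scan_callback(envs):
    """Per-step hook: release cached blocks and report the pool size."""
    before, after = release_cupy_pools()
    print(f"CuPy pool: {format_gib(before)} -> {format_gib(after)}")


# Assumed usage with the script above:
#     mol_eq = optimize(mf_GPU, constraints='geometric_scan.txt',
#                       callback=scan_callback)
```

The printed pool sizes also give a per-step record to compare against the GPU memory usage seen from outside the process.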

@Tillsten

This shows the memory usage of a run:
[Image: GPU memory usage during the run]

@wxj6000
Collaborator

wxj6000 commented Jan 31, 2025

@Tillsten Thank you for the feedback!

The geometry optimization converged in 10 iterations on my side and took about 80 seconds on a V100-32GB. I used the constraints given in the comments of your script, assuming you were using the same.

Most GPU memory is released between optimization iterations. As shown in the figure above, the GPU memory usage is almost constant in the first few iterations. However, it blew up at 14:13:30, probably because the optimization failed. Can you share the log of GeomeTRIC?

> === End Optimization Info ===
/usr/local/lib/python3.9/dist-packages/pyscf/dft/libxc.py:512: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
  warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
Step    0 : Gradient = 5.172e-03/1.051e-02 (rms/max) Energy = -775.8066209104
Hessian Eigenvalues: 2.30000e-02 2.30000e-02 2.30000e-02 ... 5.52941e-01 9.33540e-01 1.53696e+00
Step    1 : Displace = 3.914e-02/1.087e-01 (rms/max) Trust = 1.000e-01 (=) Grad = 1.998e-03/3.784e-03 (rms/max) E (change) = -775.8082899754 (-1.669e-03) Quality = 0.908
Hessian Eigenvalues: 2.12970e-02 2.30000e-02 2.30000e-02 ... 5.52878e-01 9.31823e-01 1.52869e+00
Step    2 : Displace = 1.072e-02/2.532e-02 (rms/max) Trust = 1.414e-01 (+) Grad = 9.165e-04/1.778e-03 (rms/max) E (change) = -775.8085107051 (-2.207e-04) Quality = 1.433
Hessian Eigenvalues: 1.05437e-02 2.30000e-02 2.30000e-02 ... 5.53047e-01 9.37131e-01 1.54934e+00
Step    3 : Displace = 1.745e-02/4.571e-02 (rms/max) Trust = 2.000e-01 (+) Grad = 9.438e-04/2.310e-03 (rms/max) E (change) = -775.8086666702 (-1.560e-04) Quality = 1.273
Hessian Eigenvalues: 5.85061e-03 2.29966e-02 2.30000e-02 ... 5.53238e-01 9.35494e-01 1.55559e+00
Step    4 : Displace = 1.261e-02/3.691e-02 (rms/max) Trust = 2.828e-01 (+) Grad = 8.230e-04/1.613e-03 (rms/max) E (change) = -775.8087314393 (-6.477e-05) Quality = 1.324
Hessian Eigenvalues: 4.13613e-03 2.29893e-02 2.30000e-02 ... 5.53154e-01 9.37752e-01 1.53192e+00
Step    5 : Displace = 7.848e-03/2.525e-02 (rms/max) Trust = 3.000e-01 (+) Grad = 4.205e-04/9.524e-04 (rms/max) E (change) = -775.8087625458 (-3.111e-05) Quality = 1.283
Hessian Eigenvalues: 3.88309e-03 2.22879e-02 2.30000e-02 ... 5.53159e-01 9.39416e-01 1.54060e+00
Step    6 : Displace = 3.049e-03/8.701e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 2.010e-04/5.087e-04 (rms/max) E (change) = -775.8087720689 (-9.523e-06) Quality = 1.448
Hessian Eigenvalues: 3.88031e-03 1.49010e-02 2.29995e-02 ... 5.53376e-01 9.34101e-01 1.55097e+00
Step    7 : Displace = 2.711e-03/5.102e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 1.335e-04/2.866e-04 (rms/max) E (change) = -775.8087761481 (-4.079e-06) Quality = 1.594
Hessian Eigenvalues: 3.83858e-03 8.80474e-03 2.29987e-02 ... 5.53288e-01 9.35053e-01 1.53620e+00
Step    8 : Displace = 2.335e-03/5.234e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 1.060e-04/2.560e-04 (rms/max) E (change) = -775.8087779904 (-1.842e-06) Quality = 1.651
Hessian Eigenvalues: 3.72867e-03 6.39808e-03 2.29962e-02 ... 5.53298e-01 9.40494e-01 1.54148e+00
Step    9 : Displace = 1.578e-03/3.654e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 6.898e-05/1.693e-04 (rms/max) E (change) = -775.8087786849 (-6.945e-07) Quality = 1.355
Hessian Eigenvalues: 3.61099e-03 5.70565e-03 2.19036e-02 ... 5.53399e-01 9.36808e-01 1.55393e+00
Step   10 : Displace = 7.862e-04/1.669e-03 (rms/max) Trust = 3.000e-01 (=) Grad = 3.845e-05/9.810e-05 (rms/max) E (change) = -775.8087787514 (-6.648e-08) Quality = 0.291
Hessian Eigenvalues: 3.61099e-03 5.70565e-03 2.19036e-02 ... 5.53399e-01 9.36808e-01 1.55393e+00
Converged! =D

@Tillsten

I attached a log from a run.
slurm-950146.log
