[RFC] Example use of NAMD CudaGlobalMaster interface #783

Draft · wants to merge 14 commits into master
Conversation

HanatoK (Member) commented Mar 20, 2025

Hi @giacomofiorin and @jhenin ! This is an example use of NAMD's CudaGlobalMaster interface with Colvars. The data exchanged with CudaGlobalMaster are supposed to be GPU-resident. Due to Colvars' limitations the implementation still has to copy the data from the GPU to the CPU, but I think it is a first step toward making Colvars GPU-resident (or at least a test bed for the GPU porting).
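For illustration, the staging currently amounts to a plain device-to-host copy before Colvars runs on the CPU. A minimal sketch is below; `d_positions`, `numAtoms` and `stream` are hypothetical placeholders, not the real CudaGlobalMaster API:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Stage GPU-resident coordinates into a host buffer that the CPU-side
// Colvars code can read. Names are illustrative only.
void stage_positions_to_host(const double* d_positions, size_t numAtoms,
                             cudaStream_t stream, std::vector<double>& h_positions) {
  h_positions.resize(3 * numAtoms);
  cudaMemcpyAsync(h_positions.data(), d_positions,
                  3 * numAtoms * sizeof(double),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // Colvars reads h_positions after this point
}
```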

Compilation

To test the new interface, run the following commands to build it together with Colvars as a shared library:

cd namd_cudaglobalmaster/
mkdir build
cd build
cmake -DNAMD_DIR=<YOUR_NAMD_SOURCE_CODE_DIRECTORY> ../
make -j2

Currently the dependencies include NAMD, Colvars itself, and the CUDA runtime. Also, due to recent NAMD changes regarding the unified reductions, CudaGlobalMaster is currently broken; you need to switch to the fix_cudagm_reduction branch to build both the interface and NAMD itself (or wait for https://gitlab.com/tcbgUIUC/namd/-/merge_requests/398 to be merged).

Example usage

The example NAMD input file can be found in namd_cudaglobalmaster/example/alad.namd, which dynamically loads the shared library built above and runs an OPES simulation along the two dihedral angles of the alanine dipeptide.

Limitations

  • Colvars does not provide any interface to notify the MD engine that it has finished changing the atom selection, so I have to reallocate all buffers whenever init_atom or clear_atom is called;
  • Most of the interface code is copied from the existing colvarproxy_namd.*, but I am still not sure why some functions like update_target_temperature(), update_engine_parameters(), setup_input() and setup_output() seem to be called multiple times there;
  • CudaGlobalMaster copies the atoms to its buffers in xxxyyyzzz format, as discussed in GPU preparation work #652. However, Colvars still uses xyzxyzxyz, so I have to transpose the arrays in the interface code (see the sketch after this list);
  • SMP is disabled, as it conflicts with the goal of being GPU-resident;
  • volmap is not available;
  • More tests are needed;
  • The build file namd_cudaglobalmaster/CMakeLists.txt should add Colvars via add_subdirectory instead of listing all source files directly.
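For illustration, here is a minimal sketch of the kind of SoA-to-AoS transpose described in the third bullet above; the kernel and buffer names are hypothetical, not the actual interface code:

```cpp
#include <cuda_runtime.h>

// Convert xxxyyyzzz (structure-of-arrays, as provided by CudaGlobalMaster)
// into xyzxyzxyz (array-of-structures, as expected by Colvars).
// All names here are illustrative placeholders.
__global__ void soa_to_aos(const double* __restrict__ soa,  // x0..xN-1, y0..yN-1, z0..zN-1
                           double* __restrict__ aos,        // x0, y0, z0, x1, y1, z1, ...
                           int numAtoms) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < numAtoms) {
    aos[3 * i + 0] = soa[i];
    aos[3 * i + 1] = soa[numAtoms + i];
    aos[3 * i + 2] = soa[2 * numAtoms + i];
  }
}
```

On platforms with unified virtual addressing, `aos` could even point to pinned host memory, which is essentially the optimization described in the commit further down.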

HanatoK (Member, Author) commented Mar 21, 2025

I did a performance benchmark comparing this PR with the traditional GlobalMaster interface. The test system had 86,550 atoms, with two RMSD CVs (each involving 1,189 atoms) and a harmonic restraint defined. The integration timestep was 4 fs, and the simulations ran for 50,000 steps. Two CPU threads were used. The GPU was an RTX 3060 (laptop).

| Setup | Simulation Speed (ns/day) |
| --- | --- |
| No Colvars | 62.9669 |
| Colvars with this interface | 62.4366 |
| Colvars with GlobalMaster | 57.365 |

HanatoK (Member, Author) commented Mar 22, 2025

Another issue:

It is a bit strange that this CudaGlobalMaster plugin loads the colvarproxy-related symbols from NAMD instead of from the Colvars source compiled with it.
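This looks like a dynamic-linker symbol-resolution issue: if NAMD itself exports colvarproxy symbols and the plugin is loaded with default flags, the plugin's references may bind to NAMD's copies rather than to the Colvars objects compiled into the plugin. Below is a generic sketch of how dlopen flags influence which definition wins; it is illustrative only, with a placeholder library path, and is not NAMD's actual loader code:

```cpp
#include <dlfcn.h>   // dlopen, dlerror, dlclose (POSIX; RTLD_DEEPBIND is glibc-specific)
#include <cstdio>

int main() {
  // RTLD_LOCAL keeps the plugin's symbols out of the global namespace;
  // RTLD_DEEPBIND makes the plugin prefer its own symbol definitions over
  // those already exported by the host executable.
  void* handle = dlopen("./libcolvars_cudaglobalmaster.so",  // placeholder path
                        RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
  if (!handle) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  dlclose(handle);
  return 0;
}
```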

HanatoK added 3 commits March 24, 2025 13:16
This commit uses a custom allocator for the containers of positions,
applied forces, total forces, masses and charges. The custom allocator
ensures that the vectors are allocated in host-pinned memory, so that
the CUDA transpose kernels can directly transpose and copy the data from
GPU to host, which reduces data movement.
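For reference, a minimal sketch of what such a pinned-memory allocator can look like, using only the standard CUDA runtime API (the actual implementation in this PR may differ in its details):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

// Allocator that backs std::vector storage with page-locked (pinned) host
// memory, so that cudaMemcpyAsync or device kernels can write to it directly.
template <typename T>
struct PinnedHostAllocator {
  using value_type = T;
  PinnedHostAllocator() = default;
  template <typename U>
  PinnedHostAllocator(const PinnedHostAllocator<U>&) noexcept {}
  T* allocate(std::size_t n) {
    void* p = nullptr;
    if (cudaHostAlloc(&p, n * sizeof(T), cudaHostAllocDefault) != cudaSuccess) {
      throw std::bad_alloc();
    }
    return static_cast<T*>(p);
  }
  void deallocate(T* p, std::size_t) noexcept { cudaFreeHost(p); }
};

template <typename T, typename U>
bool operator==(const PinnedHostAllocator<T>&, const PinnedHostAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PinnedHostAllocator<T>&, const PinnedHostAllocator<U>&) { return false; }

// Example: positions stored in pinned host memory.
using PinnedVector = std::vector<double, PinnedHostAllocator<double>>;
```

A container such as PinnedVector can then receive data from the GPU without an extra staging copy through pageable memory.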
HanatoK (Member, Author) commented Mar 24, 2025

With the optimization in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/402 and the fix of dlopen in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/400, I have carried out another benchmark using the RMSD test case that @giacomofiorin sent to me. The system has 315,640 atoms, and the RMSD CV involves 41,836 atoms. The simulations were run on an AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000.

| Setup | Simulation Speed (ns/day) |
| --- | --- |
| No Colvars | 32.6442 |
| Colvars with this interface | 27.1574 |
| Colvars with GlobalMaster | 14.2476 |

HanatoK (Member, Author) commented Mar 25, 2025

The benchmark above did not enable PME. I have implemented a CUDA allocator for the vectors in Colvars (along with some optimizations on the NAMD side), and conducted the benchmark again with PME enabled and on more platforms:

AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000

| Setup | Simulation Speed (ns/day) |
| --- | --- |
| No Colvars | 27.8499 |
| Colvars with CudaGlobalMaster | 26.871 |
| Colvars with GlobalMaster | 13.9529 |

Intel(R) Xeon(R) Platinum 8168 CPU + 2xTesla V100-SXM3-32GB

| Setup | Simulation Speed (ns/day) |
| --- | --- |
| No Colvars | 54.1266 |
| Colvars with CudaGlobalMaster | 38.1852 |
| Colvars with GlobalMaster | 7.22716 |

ARM Neoverse-V2 + NVIDIA GH200

| Setup | Simulation Speed (ns/day) |
| --- | --- |
| No Colvars | 78.6648 |
| Colvars with CudaGlobalMaster | 67.9719 |
| Colvars with GlobalMaster | 22.5077 |
