[RFC] Example use of NAMD CudaGlobalMaster interface #783

Draft · wants to merge 14 commits into master
Conversation

HanatoK (Member) commented Mar 20, 2025

Hi @giacomofiorin and @jhenin ! This is an example use of NAMD's CudaGlobalMaster interface with Colvars. The data exchanged with CudaGlobalMaster are supposed to be GPU-resident. Due to Colvars' limitations the implementation still has to copy the data from the GPU to the CPU, but I think it is a first step toward making Colvars GPU-resident (or at least a test bed for the GPU porting).
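For illustration, the staging currently amounts to a plain device-to-host copy before Colvars runs on the CPU. A minimal sketch is below; `d_positions`, `numAtoms` and `stream` are hypothetical placeholders, not the real CudaGlobalMaster API:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Stage GPU-resident coordinates into a host buffer that the CPU-side
// Colvars code can read. Names are illustrative only.
void stage_positions_to_host(const double* d_positions, size_t numAtoms,
                             cudaStream_t stream, std::vector<double>& h_positions) {
  h_positions.resize(3 * numAtoms);
  cudaMemcpyAsync(h_positions.data(), d_positions,
                  3 * numAtoms * sizeof(double),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // Colvars reads h_positions after this point
}
```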

Compilation

To test the new interface, run the following commands to build it together with Colvars as a shared library:

cd namd_cudaglobalmaster/
mkdir build
cd build
cmake -DNAMD_DIR=<YOUR_NAMD_SOURCE_CODE_DIRECTORY> ../
make -j2

Currently the dependencies include NAMD, Colvars itself, and the CUDA runtime. Also, due to recent NAMD changes regarding the unified reductions, CudaGlobalMaster is currently broken; you need to switch to the fix_cudagm_reduction branch to build both the interface and NAMD itself (or wait for https://gitlab.com/tcbgUIUC/namd/-/merge_requests/398 to be merged).

Example usage

The example NAMD input file can be found in namd_cudaglobalmaster/example/alad.namd, which dynamically loads the shared library built above and runs an OPES simulation along the two dihedral angles of the alanine dipeptide.

Limitations

  • Colvars does not provide any interface to notify the MD engine that it has finished changing the atom selection, so I have to reallocate all buffers whenever init_atom or clear_atom is called;
  • Most of the interface code is copied from the existing colvarproxy_namd.*, but I am still not sure why some functions like update_target_temperature(), update_engine_parameters(), setup_input() and setup_output() seem to be called multiple times there;
  • CudaGlobalMaster copies the atoms to its buffers in xxxyyyzzz format, as discussed in GPU preparation work #652. However, Colvars still uses xyzxyzxyz, so I have to transpose the arrays in the interface code (see the sketch after this list);
  • SMP is disabled, as it conflicts with the goal of being GPU-resident;
  • volmap is not available;
  • More tests are needed;
  • The build file namd_cudaglobalmaster/CMakeLists.txt should add Colvars via add_subdirectory instead of listing all source files directly.
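For illustration, here is a minimal sketch of the kind of SoA-to-AoS transpose described in the third bullet above; the kernel and buffer names are hypothetical, not the actual interface code:

```cpp
#include <cuda_runtime.h>

// Convert xxxyyyzzz (structure-of-arrays, as provided by CudaGlobalMaster)
// into xyzxyzxyz (array-of-structures, as expected by Colvars).
// All names here are illustrative placeholders.
__global__ void soa_to_aos(const double* __restrict__ soa,  // x0..xN-1, y0..yN-1, z0..zN-1
                           double* __restrict__ aos,        // x0, y0, z0, x1, y1, z1, ...
                           int numAtoms) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < numAtoms) {
    aos[3 * i + 0] = soa[i];
    aos[3 * i + 1] = soa[numAtoms + i];
    aos[3 * i + 2] = soa[2 * numAtoms + i];
  }
}
```

On platforms with unified virtual addressing, `aos` could even point to pinned host memory, which is essentially the optimization described in the commit further down.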

HanatoK (Member, Author) commented Mar 21, 2025

I did a performance benchmark comparing this PR with the traditional GlobalMaster interface. The test system had 86,550 atoms, with two RMSD CVs (each involving 1,189 atoms) and a harmonic restraint defined. The integration timestep was 4 fs, and the simulations ran for 50,000 steps. Two CPU threads were used. The GPU was an RTX 3060 (laptop).

| Setup | Simulation Speed (ns/day) |
| --- | --- |
| No Colvars | 62.9669 |
| Colvars with this interface | 62.4366 |
| Colvars with GlobalMaster | 57.365 |

HanatoK (Member, Author) commented Mar 22, 2025

Another issue:

It is a bit strange that this CudaGlobalMaster plugin loads the colvarproxy-related symbols from NAMD instead of from the Colvars source compiled with it.
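This looks like a dynamic-linker symbol-resolution issue: if NAMD itself exports colvarproxy symbols and the plugin is loaded with default flags, the plugin's references may bind to NAMD's copies rather than to the Colvars objects compiled into the plugin. Below is a generic sketch of how dlopen flags influence which definition wins; it is illustrative only, with a placeholder library path, and is not NAMD's actual loader code:

```cpp
#include <dlfcn.h>   // dlopen, dlerror, dlclose (POSIX; RTLD_DEEPBIND is glibc-specific)
#include <cstdio>

int main() {
  // RTLD_LOCAL keeps the plugin's symbols out of the global namespace;
  // RTLD_DEEPBIND makes the plugin prefer its own symbol definitions over
  // those already exported by the host executable.
  void* handle = dlopen("./libcolvars_cudaglobalmaster.so",  // placeholder path
                        RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
  if (!handle) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  dlclose(handle);
  return 0;
}
```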

HanatoK added 3 commits March 24, 2025 13:16
This commit uses a custom allocator for the containers of positions,
applied forces, total forces, masses and charges. The custom allocator
ensures that the vectors are allocated in host-pinned memory, so that
the CUDA transpose kernels can directly transpose and copy the data from
GPU to host, which reduces data movement.
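For reference, a minimal sketch of what such a pinned-memory allocator can look like, using only the standard CUDA runtime API (the actual implementation in this PR may differ in its details):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

// Allocator that backs std::vector storage with page-locked (pinned) host
// memory, so that cudaMemcpyAsync or device kernels can write to it directly.
template <typename T>
struct PinnedHostAllocator {
  using value_type = T;
  PinnedHostAllocator() = default;
  template <typename U>
  PinnedHostAllocator(const PinnedHostAllocator<U>&) noexcept {}
  T* allocate(std::size_t n) {
    void* p = nullptr;
    if (cudaHostAlloc(&p, n * sizeof(T), cudaHostAllocDefault) != cudaSuccess) {
      throw std::bad_alloc();
    }
    return static_cast<T*>(p);
  }
  void deallocate(T* p, std::size_t) noexcept { cudaFreeHost(p); }
};

template <typename T, typename U>
bool operator==(const PinnedHostAllocator<T>&, const PinnedHostAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PinnedHostAllocator<T>&, const PinnedHostAllocator<U>&) { return false; }

// Example: positions stored in pinned host memory.
using PinnedVector = std::vector<double, PinnedHostAllocator<double>>;
```

A container such as PinnedVector can then receive data from the GPU without an extra staging copy through pageable memory.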
HanatoK (Member, Author) commented Mar 24, 2025

With the optimization in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/402 and the fix of dlopen in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/400, I have carried out another benchmark using the RMSD test case that @giacomofiorin sent to me. The system has 315,640 atoms, and the RMSD CV involves 41,836 atoms. The simulations were run on an AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000.

| Setup | Simulation Speed (ns/day) |
| --- | --- |
| No Colvars | 32.6442 |
| Colvars with this interface | 27.1574 |
| Colvars with GlobalMaster | 14.2476 |

HanatoK (Member, Author) commented Mar 25, 2025

The benchmark above did not enable PME. I have implemented a CUDA allocator for the vectors in Colvars (along with some optimizations on the NAMD side), and conducted the benchmark again with PME enabled and on more platforms:

AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000

| Setup | Simulation Speed (ns/day) |
| --- | --- |
| No Colvars | 27.8499 |
| Colvars with CudaGlobalMaster | 26.871 |
| Colvars with GlobalMaster | 13.9529 |

Intel(R) Xeon(R) Platinum 8168 CPU + 2xTesla V100-SXM3-32GB

| Setup | Simulation Speed (ns/day) |
| --- | --- |
| No Colvars | 54.1266 |
| Colvars with CudaGlobalMaster | 38.1852 |
| Colvars with GlobalMaster | 7.22716 |

ARM Neoverse-V2 + NVIDIA GH200

| Setup | Simulation Speed (ns/day) |
| --- | --- |
| No Colvars | 78.6648 |
| Colvars with CudaGlobalMaster | 67.9719 |
| Colvars with GlobalMaster | 22.5077 |
