[RFC] Example use of NAMD CudaGlobalMaster interface #783
Conversation
I did a performance benchmark comparing this PR with the traditional GlobalMaster interface. The test system had 86,550 atoms, with two RMSD CVs (each involving 1,189 atoms) and a harmonic restraint defined. The integration timestep was 4 fs, and the simulations ran for 50,000 steps. Two CPU threads were used, and the GPU was a laptop RTX 3060.
Another issue: it is a bit strange that this CudaGlobalMaster plugin loads the …
This commit uses a custom allocator for the containers of positions, applied forces, total forces, masses and charges. The custom allocator ensures that the vectors are allocated in host-pinned memory, so that the CUDA transpose kernels can transpose and copy the data from the GPU directly into them, which reduces data movement.
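For illustration, a minimal sketch of such a host-pinned allocator is below; the type and function names are mine, not the actual Colvars/NAMD code:

```cpp
// Minimal sketch of an allocator backed by page-locked (pinned) host memory.
// Vectors using it can be the direct destination of cudaMemcpyAsync from the
// GPU, avoiding an extra staging copy through pageable memory.
#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

template <typename T>
struct PinnedHostAllocator {
  using value_type = T;

  PinnedHostAllocator() = default;
  template <typename U>
  PinnedHostAllocator(const PinnedHostAllocator<U>&) {}

  T* allocate(std::size_t n) {
    void* ptr = nullptr;
    // cudaMallocHost returns page-locked host memory.
    if (cudaMallocHost(&ptr, n * sizeof(T)) != cudaSuccess) {
      throw std::bad_alloc();
    }
    return static_cast<T*>(ptr);
  }

  void deallocate(T* p, std::size_t) noexcept { cudaFreeHost(p); }
};

template <typename T, typename U>
bool operator==(const PinnedHostAllocator<T>&, const PinnedHostAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PinnedHostAllocator<T>&, const PinnedHostAllocator<U>&) { return false; }

// Example: a positions buffer living in pinned host memory.
using PinnedVector = std::vector<double, PinnedHostAllocator<double>>;
```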
With the optimization in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/402 and the dlopen fix in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/400, I have carried out another benchmark using the RMSD test case that @giacomofiorin sent to me. The system has 315,640 atoms, and the RMSD CV involves 41,836 atoms. The simulations were run on an AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000.
Use CUDA allocator
The benchmark above did not enable PME. I have implemented a CUDA allocator for the vectors in Colvars (along with some optimizations on the NAMD side), and conducted the benchmark again with PME enabled and on more platforms:
- AMD Ryzen Threadripper PRO 3975WX + NVIDIA RTX A6000
- Intel(R) Xeon(R) Platinum 8168 CPU + 2x Tesla V100-SXM3-32GB
- ARM Neoverse-V2 + NVIDIA GH200
Hi @giacomofiorin and @jhenin! This is an example use of NAMD's CudaGlobalMaster interface with Colvars. The data exchanged with CudaGlobalMaster are supposed to be GPU-resident. Due to Colvars' limitations the implementation still has to copy the data from the GPU to the CPU, but I think it is a first step towards making Colvars GPU-resident (or at least a test bed for the GPU porting).
Compilation
To test the new interface, you need to run the following commands to build the interface with Colvars as a shared library:
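The exact invocation depends on your local setup; a hypothetical CMake-based build might look like the sketch below (the variable names and paths are placeholders, not the project's actual options; see namd_cudaglobalmaster/CMakeLists.txt for the real ones):

```sh
# Hypothetical build sketch; adjust paths and options to your NAMD/CUDA setup.
cd namd_cudaglobalmaster
cmake -B build -DCMAKE_BUILD_TYPE=Release \
      -DNAMD_SOURCE_DIR=/path/to/namd \
      -DCOLVARS_SOURCE_DIR=/path/to/colvars
cmake --build build   # produces the shared library to be loaded by NAMD
```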
Currently the dependencies include NAMD, Colvars itself, and the CUDA runtime. Also, due to recent NAMD changes regarding the unified reductions, CudaGlobalMaster is broken; you need to switch to the fix_cudagm_reduction branch to build both the interface and NAMD itself (or wait for https://gitlab.com/tcbgUIUC/namd/-/merge_requests/398 to be merged).
Example usage
The example NAMD input file can be found in `namd_cudaglobalmaster/example/alad.namd`, which dynamically loads the shared library built above and runs an OPES simulation along the two dihedral angles of the alanine dipeptide.
Limitations
- `init_atom` and `clear_atom` are not supported;
- the interface code is mainly adapted from `colvarproxy_namd.*`, but I am still not sure why some functions like `update_target_temperature()`, `update_engine_parameters()`, `setup_input()` and `setup_output()` seem to be called multiple times there;
- CudaGlobalMaster provides the atom data in `xxxyyyzzz` format as discussed in GPU preparation work #652. However, Colvars still uses `xyzxyzxyz`, so I have to transform the arrays in the interface code (see the sketch after this list);
- `namd_cudaglobalmaster/CMakeLists.txt` should add Colvars by `add_subdirectory` instead of finding all source files directly.
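The layout transform mentioned above amounts to a gather/scatter between structure-of-arrays and array-of-structures buffers. A minimal sketch (buffer names and types are illustrative, not the actual interface code):

```cpp
// Convert NAMD's xxxyyyzzz (structure-of-arrays) layout into the
// xyzxyzxyz (array-of-structures) layout that Colvars expects, and back.
#include <cstddef>
#include <vector>

// soa holds all x coordinates, then all y, then all z (size 3 * numAtoms).
// aos is filled as x0, y0, z0, x1, y1, z1, ... (size 3 * numAtoms).
void soa_to_aos(const std::vector<double>& soa,
                std::vector<double>& aos,
                std::size_t numAtoms) {
  aos.resize(3 * numAtoms);
  const double* x = soa.data();
  const double* y = soa.data() + numAtoms;
  const double* z = soa.data() + 2 * numAtoms;
  for (std::size_t i = 0; i < numAtoms; ++i) {
    aos[3 * i + 0] = x[i];
    aos[3 * i + 1] = y[i];
    aos[3 * i + 2] = z[i];
  }
}

// The reverse transform, e.g. when sending applied forces back to NAMD.
void aos_to_soa(const std::vector<double>& aos,
                std::vector<double>& soa,
                std::size_t numAtoms) {
  soa.resize(3 * numAtoms);
  for (std::size_t i = 0; i < numAtoms; ++i) {
    soa[i]                = aos[3 * i + 0];
    soa[numAtoms + i]     = aos[3 * i + 1];
    soa[2 * numAtoms + i] = aos[3 * i + 2];
  }
}
```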