[RFC] Add an option to parallelize the inner loops instead of CVCs #780

Open · wants to merge 7 commits into base: master

Conversation

@HanatoK (Member) commented Mar 11, 2025:

As discussed in #700, this PR aims to provide users with an option (smp inner_loop) to parallelize the inner loops.

Currently this is only at an early stage, so I have marked it as a draft PR.

Comment on lines 393 to 398
if (smp == "cvcs" || smp == "on" || smp == "yes") {
proxy->smp_mode = colvarproxy_smp::smp_mode_t::cvcs;
cvm::log("SMP parallelism will be applied to Colvars components.\n");
cvm::log(" - SMP parallelism: enabled (num. threads = " + to_str(proxy->smp_num_threads()) + ")\n");
} else if (smp == "inner_loop") {
proxy->smp_mode = colvarproxy_smp::smp_mode_t::inner_loop;

Member:

I like the addition of an enum flag, which is certainly more useful than a yes/no flag moving forward.

However, the keywords identifying its values, being user-visible, need some careful choice. Once defined, you can't change them any more. I am specifically worried that "cvcs" is very rarely used in the doc (it's mostly used in the source code), and that the meaning of "inner_loop" could easily be mistaken in the case of O(N×M) variables.

Member:

In the doc we use "components", so that could be used here, or "colvar_components" if we don't mind a slightly long word.

Member Author:

Hi @giacomofiorin and @jhenin! Do you have any suggestions for the keywords?

Member:

This particular PR only adds support for running the inner loop of the OPES bias. All the data of other biases, as well as atoms within the components, are still processed sequentially. So a keyword that is at least specific to the biases (if not OPES itself) would be appropriate. Maybe bias_data or bias_inner?

How about supporting the following?

  • smp on, which keeps the current meaning, but we document more clearly that it applies specifically to the "outer" loop over colvar components and biases
  • smp bias_xxx (new keyword), for which we document that the parallelization currently supports only OPES, and that turning it on will disable all other SMP schemes
  • smp off, where everything runs sequentially on a single thread

Then as we improve support we can add further customizations, but we can also preserve the on keyword value as the option to enable SMP in the largest combination of features. (Ideally, we could even have the code do quick benchmarks and enable the best configuration, but that's clearly outside this PR)
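
For illustration, a minimal sketch of how the parsing shown in the snippet above might accommodate this proposal; the value name "bias_data", the smp_mode_t::none member, and the comments are placeholders, not code from this PR:

// Hypothetical extension of the smp keyword parsing shown above.
if (smp == "cvcs" || smp == "on" || smp == "yes") {
  // current behavior: parallelize the "outer" loop over components and biases
  proxy->smp_mode = colvarproxy_smp::smp_mode_t::cvcs;
} else if (smp == "bias_data") {  // placeholder name, see the discussion above
  // parallelize inside supported biases (currently only OPES);
  // this would disable the other SMP schemes
  proxy->smp_mode = colvarproxy_smp::smp_mode_t::inner_loop;
} else if (smp == "off" || smp == "no") {
  // everything runs sequentially on a single thread
  proxy->smp_mode = colvarproxy_smp::smp_mode_t::none;  // assumed enum member
}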

Member Author:

Although this PR touches the OPES code, it does not make the parallelization in OPES work immediately, because I have found that the macros USE_CKLOOP and CMK_SMP are not defined when compiling colvarbias_opes.C. I will have to solve this issue in later commits or PRs by adding functions to colvarproxy_smp that abstract the parallelization implementations of OpenMP and Charm++.

This PR intends to support the parallelization scheme discussed in #700 (not just OPES specifically). I am going to parallelize the optimal alignment, RMSD, and COM calculations, as well as metadynamics, so I think a keyword like smp bias_xxx may be too specific.

Member Author:

In theory, the two parallelization schemes are not mutually exclusive, but making them mutually exclusive simplifies the implementation a lot and makes a future GPU port easier.

I am not sure how moving the option into colvardeps works. Could you make a PR if you have time?

@giacomofiorin (Member) commented Mar 17, 2025:

Yes, I could do a PR against this branch, but because it lives on your fork the PR would not show up here. How about a simple push since the two are closely related?

Member Author:

OK. You can push your commits here.

Member:

I just pushed a couple of commits. I did not touch smp_mode_t yet, so it remains the way you set it, i.e. with mutually exclusive schemes.

But with a feature flag internal to the bias, the OPES numerical implementation can test only for that flag specifically.

Member Author:

@giacomofiorin As for the option value names of smp, I don't have a better idea than inner_loop. I have tried asking an LLM to see whether there are any better options (https://www.perplexity.ai/search/i-am-revising-the-parallelizat-Unu3PHipRTWaBb0T29P5pA). It looks like "fine_grained" may be slightly better. Maybe I should revise the manual to provide a figure showing the difference between the two parallelization schemes. What do you think?

@HanatoK (Member Author) commented Mar 17, 2025:

Due to the Charm++ issue of building with OpenMP support (charmplusplus/charm#3396), I cannot simply use Charm++ with OpenMP for NAMD, so I need to find another way to abstract the parallel interface for multiple MD engines...

@HanatoK (Member Author) commented Mar 17, 2025:

The OPES parallelization implementation can be found in HanatoK#3. I will wait for @giacomofiorin's commits here before merging that one.

@giacomofiorin (Member) commented:

> Due to the Charm++ issue of building with OpenMP support (charmplusplus/charm#3396), I cannot simply use Charm++ with OpenMP for NAMD, so I need to find another way to abstract the parallel interface for multiple MD engines...

No, indeed. I had originally thought that the two specialized functions smp_colvars_loop() and smp_biases_loop() could both be replaced by a generic function, which takes as its arguments the length of the loop and the "worker" function. Do you see any potential issues with that approach?
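
For illustration only, a rough sketch of what such a generic entry point could look like, shown with an OpenMP backend; the function name and signature are hypothetical, not the actual proxy interface:

// Hypothetical generic SMP loop: the caller passes the loop length and a
// per-index "worker"; the backend decides how to distribute the iterations.
#include <functional>

int smp_generic_loop(int n_items, std::function<int(int)> const &worker)
{
  int error_code = 0;
#if defined(_OPENMP)
#pragma omp parallel for reduction(|:error_code)
#endif
  for (int i = 0; i < n_items; i++) {
    error_code |= worker(i);  // e.g. compute the i-th colvar or bias
  }
  return error_code;
}

The existing loops over colvars and biases would then just pass lambdas that call the corresponding calc or update step on the i-th object.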

@HanatoK (Member Author) commented Mar 17, 2025:

> Due to the Charm++ issue of building with OpenMP support (charmplusplus/charm#3396), I cannot simply use Charm++ with OpenMP for NAMD, so I need to find another way to abstract the parallel interface for multiple MD engines...
>
> No, indeed. I had originally thought that the two specialized functions smp_colvars_loop() and smp_biases_loop() could both be replaced by a generic function, which takes as its arguments the length of the loop and the "worker" function. Do you see any potential issues with that approach?

There should be no potential issues. I think what I am going to do is leave the original parallelization scheme untouched, while adding a new parallelization scheme that uses threads in the inner loops. It is fine if you revise the original parallelization implementation, as long as the tests pass.

@giacomofiorin (Member) commented:

> There should be no potential issues. I think what I am going to do is leave the original parallelization scheme untouched, while adding a new parallelization scheme that uses threads in the inner loops. It is fine if you revise the original parallelization implementation, as long as the tests pass.

Well, my preference would be to use that to support multiple SMP mechanisms in the proxy, as opposed to hard-coding pragmas for both OpenMP and Charm++ separately all over the library.

@@ -257,6 +257,8 @@ class colvardeps {
f_cvb_scale_biasing_force,
/// \brief whether this bias is applied to one or more ext-Lagrangian colvars
f_cvb_extended,
/// Process this bias's data in parallel over multiple CPU threads
f_cvb_smp,

Member Author:

@giacomofiorin Should we further divide the SMP case into flags like f_cvb_smp_omp, f_cvb_smp_ckloop, f_cvb_smp_cuda and f_cvb_smp_sycl? What about the Colvars components? Is it possible to also have f_cvc_smp_*, or reuse the f_cvb_smp for CVCs?

Member:

Hi! Every such feature flag has to be added manually, with its dependencies and conflicts written down explicitly. So it would be better to add a new feature only when it's about to be used.

@HanatoK (Member Author) commented Mar 18, 2025:

> There should be no potential issues. I think what I am going to do is leave the original parallelization scheme untouched, while adding a new parallelization scheme that uses threads in the inner loops. It is fine if you revise the original parallelization implementation, as long as the tests pass.
>
> Well, my preference would be to use that to support multiple SMP mechanisms in the proxy, as opposed to hard-coding pragmas for both OpenMP and Charm++ separately all over the library.

Originally I had the same idea as you, but after trying to write interfaces to abstract the differences between OpenMP and CkLoop in the proxy, I found that it was difficult even for a simple for-loop with a reduction. Considering the porting to GPU in the future, it is even impossible to hide all the details behind the proxy. In other words, the problem is supporting multiple parallelization libraries in Colvars in the case of smp inner_loop, so eventually we will have to conditionally compile different "kernels" for a component or bias.

My suggestion is to use different Colvars macros to indicate the different parallelization libraries: -DCOLVARS_SMP_CKLOOP when building with NAMD (CkLoop), -DCOLVARS_SMP_OPENMP when building with GROMACS and LAMMPS, and -DCOLVARS_SMP_CUDA when building with CUDA, once the CUDA kernels are ready for components and biases. We would then create new classes for components that inherit from the existing serial implementations. For example, we could have colvar::orientation_ckloop and colvar::orientation_omp (or colvar::orientation with template arguments for the SMP libraries) conditionally compiled and called based on the macros -DCOLVARS_SMP_CKLOOP and -DCOLVARS_SMP_OPENMP.

In summary, I think it is important to not only implement smp inner_loop but also to provide a path for future GPU work. @giacomofiorin What do you think?
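
As an illustration of the conditional-compilation idea above; the macro names follow the comment, while the classes are simplified stand-ins rather than the actual colvar::orientation code:

// Simplified stand-ins showing how -DCOLVARS_SMP_OPENMP / -DCOLVARS_SMP_CKLOOP
// could select which "kernel" class gets compiled and used.
struct orientation_serial {            // placeholder for the existing serial CVC
  virtual ~orientation_serial() = default;
  virtual void calc_value() { /* existing serial implementation */ }
};

#if defined(COLVARS_SMP_OPENMP)
struct orientation_omp : orientation_serial {
  void calc_value() override { /* OpenMP kernel for the inner loop */ }
};
using orientation_impl = orientation_omp;
#elif defined(COLVARS_SMP_CKLOOP)
struct orientation_ckloop : orientation_serial {
  void calc_value() override { /* CkLoop kernel for the inner loop */ }
};
using orientation_impl = orientation_ckloop;
#else
using orientation_impl = orientation_serial;   // serial fallback
#endif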

@giacomofiorin (Member) commented Mar 18, 2025:

Hi @HanatoK! Yes, I absolutely agree that we cannot hide everything in the proxy (even if we would like to! 😆). This is particularly true for GPU support, where the movement of data between host and device is a critical step that the proxy can't do automatically. This issue makes GPU support more similar to message-passing parallelism than to SMP, in my opinion.

I also agree that any changes introduced now should aim to simplify future implementation work as much as possible. In general, this goal can be achieved by making code more reusable. For the colvar components, I always thought that the compute functions could be broken into individual steps, each written as a template so that you could include it in the kernels' definitions. Is that similar to what you were thinking as well?

Lastly, for the CPU-based SMP backends (i.e. OpenMP or CkLoop), I don't think that we necessarily need to interleave the computation and reduction at all costs. We could always accumulate partial sums locally for each thread, and then add all the partial sums sequentially. This is already what is currently done in the "outer loop", where multiple components are computed in parallel and then aggregated into the value of each colvar by thread 0. Aside from the GPU support (which we agree is more complex), do you see an issue with this that maybe I'm missing?
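
A minimal OpenMP sketch of the pattern described in the last paragraph; this is generic code, not tied to any specific component or bias:

// Each thread accumulates into its own partial sum; the partial sums are
// then combined sequentially, mirroring the current "outer loop" aggregation.
#include <omp.h>
#include <vector>

double sum_in_parallel(std::vector<double> const &values)
{
  std::vector<double> partial(omp_get_max_threads(), 0.0);
#pragma omp parallel
  {
    int const t = omp_get_thread_num();
#pragma omp for
    for (size_t i = 0; i < values.size(); i++) {
      partial[t] += values[i];            // thread-local accumulation
    }
  }
  double total = 0.0;
  for (double const p : partial) total += p;  // sequential reduction by one thread
  return total;
}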

@giacomofiorin (Member) commented:

> There should be no potential issues. I think what I am going to do is leave the original parallelization scheme untouched, while adding a new parallelization scheme that uses threads in the inner loops. It is fine if you revise the original parallelization implementation, as long as the tests pass.

The two most recent commits at this time achieve this for the loop over colvar components. I plan on converting the other current main SMP loop, the one over the biases, shortly.

Do you know if the tracing calls to traceUserBracketEvent() need to be run within each thread-parallel region, or can they be run right outside of it?

@HanatoK (Member Author) commented Mar 18, 2025:

@giacomofiorin Thank you for your reply! I will not introduce further code changes into this PR. As for enabling the OPES parallelization, I will do that in another PR, and possibly refactor the colvarbias interface with a new virtual update function like virtual int update_smp(). As for smp inner_loop (or maybe I will change it to innerLoop to match the Colvars naming convention), do you have any better suggestion that covers not only a specific bias but also all the other biases and components? If you are fine with smp inner_loop, I can update the documentation to include a figure explaining the differences between the parallelization schemes.

I have no idea about traceUserBracketEvent(). I don't use the tracing at all.

@giacomofiorin (Member) commented Mar 18, 2025:

> @giacomofiorin Thank you for your reply! I will not introduce further code changes into this PR.

Ok, so what should we do with this PR? Move it out of draft status, or work on it more after we get a clearer idea of the constraints coming from parallelizing more features?

In regard to moving forward with enabling the OPES parallelism, which you have largely done already, how about the following simple approach?

  1. In simulations without OPES, we don't change anything, i.e. we keep looping over all colvar components and all biases in parallel.
  2. In simulations with one OPES bias, we keep looping over colvar components in parallel, but run the biases one at a time, and the OPES bias itself in parallel. That is, we automatically adopt the inner_loop scheme for the biases whenever OPES is defined.

Then of course, once more features are parallelized internally (biases or components) we revisit the simple criterion above. What do you think?
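
As a rough sketch of criterion 2, assuming the f_cvb_smp flag from this PR can be queried on each bias; the selection logic and variable names here are hypothetical, not code from the PR:

// Hypothetical selection: keep the original scheme unless at least one bias
// (e.g. OPES) declares that it can parallelize its own inner loop.
bool any_bias_has_internal_smp = false;
for (colvarbias *b : biases) {                  // biases list of colvarmodule
  if (b->is_enabled(colvardeps::f_cvb_smp)) {   // feature flag added in this PR
    any_bias_has_internal_smp = true;
    break;
  }
}
proxy->smp_mode = any_bias_has_internal_smp ?
  colvarproxy_smp::smp_mode_t::inner_loop :     // biases run serially, OPES threaded
  colvarproxy_smp::smp_mode_t::cvcs;            // original outer-loop scheme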

HanatoK marked this pull request as ready for review on March 18, 2025 at 19:46.

@HanatoK (Member Author) commented Mar 18, 2025:

> 1. In simulations without OPES, we don't change anything, i.e. we keep looping over all colvar components and all biases in parallel.
> 2. In simulations with one OPES bias, we keep looping over colvar components in parallel, but run the biases one at a time, and the OPES bias itself in parallel. That is, we automatically adopt the inner_loop scheme for the biases whenever OPES is defined.
>
> Then of course, once more features are parallelized internally (biases or components) we revisit the simple criterion above. What do you think?

This is already very sophisticated. I am inclined to just let the user switch to inner_loop manually, and when f_cvb_smp is not available, we just fall back to serial execution of the biases.

@giacomofiorin (Member) commented:

> This is already very sophisticated. I am inclined to just let the user switch to inner_loop manually, and when f_cvb_smp is not available, we just fall back to serial execution of the biases.

Well, that's actually very close to what I meant. If you are okay with using the f_cvb_smp feature, the SMP mode can default to the "inner loop" scheme when f_cvb_smp is available, or to the original scheme otherwise.

@HanatoK (Member Author) commented Mar 18, 2025:

> This is already very sophisticated. I am inclined to just let the user switch to inner_loop manually, and when f_cvb_smp is not available, we just fall back to serial execution of the biases.
>
> Well, that's actually very close to what I meant. If you are okay with using the f_cvb_smp feature, the SMP mode can default to the "inner loop" scheme when f_cvb_smp is available, or to the original scheme otherwise.

I am OK with it. Actually, I thought that for CUDA we could have a similar flag (#652 (reply in thread)).

As for the name inner_loop for the smp option, are you OK with it? I think we can fill in the documentation in future PRs if we don't want to publicize the new option for now. smp inner_loop should be considered experimental anyway.

@giacomofiorin (Member) commented:

If "experimental" means off by default, how long do you think that would be the case? Once you have the OPES parallelism working, it would be great to enable it by default.

If your idea is to give people finer control over parallel features, maybe we could allow the option to activate SMP at the level of individual objects? For example, the smp flag of the OPES bias (internally represented as f_cvb_smp) could default to off at first, and enabling it would turn on the "inner loop" mode. Then once you are confident that the feature works, it could be on by default.

As far as the current SMP loops are concerned, these could continue to run over objects that have smp off (either by design or user choice). smp off could remain as an optional kill switch, in case of bugs.

How does that sound?

@HanatoK (Member Author) commented Mar 18, 2025:

If "experimental" means off by default, how long do you think that would be the case? Once you have the OPES parallelism working, it would be great to enable it by default.

If your idea is to give people finer control over parallel features, maybe we allow the option to activate SMP at the level of individual objects? For example, thesmp flag of the OPES bias (internally represented as f_cvb_smp) could default to off at first, and enabling turns on the "inner loop" mode. Then once you are confident that the feature works, it could be on by default.

As far as the current SMP loops are concerned, these could continue to run over objects that have smp off (either by design or user choice). smp off could remain as an optional kill switch, in case of bugs.

How does that sound?

I realize that the logic is a bit complicated after introducing the smp option for all biases. In my opinion, the global switch smp inner_loop should turn on smp for all biases once they have parallelization implemented, except those with an explicit smp off, and I think this is how your current code works.

By saying "experimental" I mean that the new feature is not available in the documentation. The feature should be in experimental stage before the following issues are solved:

  • Refactoring the base classes of components and biases by adding virtual functions like calc_value_smp() and update_smp() (see the sketch after this list), or adding new classes like colvarcomp_smp and colvarbias_smp (the former looks better, but I am not quite sure);
  • Implementing SMP parallelization for the calculations in colvarmodule::atom and colvarmodule::rotation;
  • You are satisfied with the name "inner_loop" and agree to keep it in the documentation.
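
For the first item, a self-contained sketch of what the virtual-function route could look like; the *_example class names are placeholders, and the actual colvarbias interface is larger:

// Hypothetical shape of the refactoring in the first bullet above: a serial
// default plus an optional per-bias SMP override.
class colvarbias_example {
public:
  virtual ~colvarbias_example() = default;
  virtual int update() { /* existing serial update */ return 0; }
  // By default the SMP path falls back to the serial code; biases that
  // implement internal parallelism (e.g. OPES) override this.
  virtual int update_smp() { return update(); }
};

class colvarbias_opes_example : public colvarbias_example {
public:
  int update_smp() override {
    // threaded kernel evaluation would go here
    return 0;
  }
};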

@giacomofiorin (Member) commented:

> I realize that the logic is a bit complicated after introducing the smp option for all biases. In my opinion, the global switch smp inner_loop should turn on smp for all biases once they have parallelization implemented, except those with an explicit smp off, and I think this is how your current code works.
>
> By "experimental" I mean that the new feature is not described in the documentation. The feature should stay experimental until the following issues are solved:
>
> * Refactoring the base classes of components and biases by adding virtual functions like `calc_value_smp()` and `update_smp()`, or adding new classes like `colvarcomp_smp` and `colvarbias_smp` (the former looks better, but I am not quite sure);
>
> * Implementing SMP parallelization for the calculations in `colvarmodule::atom` and `colvarmodule::rotation`;
>
> * You are satisfied with the name "inner_loop" and agree to keep it in the documentation.

Overall that looks good to me. Again, sorry if something I wrote wasn't completely clear. For example, I would not want to force people to manually turn on a new flag for every object: that would be so tedious! The smp flag should be on by default for an object that implements it (as you also seem to suggest).

If you want to make the above behavior optional for now via an undocumented flag, that would make sense. But when the code is no longer experimental, you are okay with enabling it by default, correct?

@HanatoK (Member Author) commented Mar 19, 2025:

> Overall that looks good to me. Again, sorry if something I wrote wasn't completely clear. For example, I would not want to force people to manually turn on a new flag for every object: that would be so tedious! The smp flag should be on by default for an object that implements it (as you also seem to suggest).
>
> If you want to make the above behavior optional for now via an undocumented flag, that would make sense. But when the code is no longer experimental, you are okay with enabling it by default, correct?

Yes. I agree with you.

@HanatoK (Member Author) commented Mar 21, 2025:

OK. I have rebased this PR after merging #782, but I will still need to rebase either this one or #775, depending on which one merges first.
