[RFC] Add an option to parallelize the inner loops instead of CVCs #780

Open · wants to merge 7 commits into base: master

Conversation

@HanatoK (Member) commented Mar 11, 2025:

As discussed in #700, this PR aims to provide users with an option (smp inner_loop) to parallelize the inner loops.

Currently this is only at an early stage, so I have marked it as a draft PR.

Comment on lines 393 to 398
if (smp == "cvcs" || smp == "on" || smp == "yes") {
proxy->smp_mode = colvarproxy_smp::smp_mode_t::cvcs;
cvm::log("SMP parallelism will be applied to Colvars components.\n");
cvm::log(" - SMP parallelism: enabled (num. threads = " + to_str(proxy->smp_num_threads()) + ")\n");
} else if (smp == "inner_loop") {
proxy->smp_mode = colvarproxy_smp::smp_mode_t::inner_loop;

Member:

I like the addition of an enum flag, which is certainly more useful than a yes/no flag moving forward.

However, the keywords identifying its values, being user-visible, need some careful choice. Once defined, you can't change them any more. I am specifically worried that "cvcs" is very rarely used in the doc (it's mostly used in the source code), and that the meaning of "inner_loop" could easily be mistaken in the case of O(N×M) variables.

Member:

In the doc we use "components", so that could be used here, or "colvar_components" if we don't mind a slightly long word.

Member Author:

Hi @giacomofiorin and @jhenin! Do you have any suggestions for the keywords?

Member:

This particular PR only adds support for running the inner loop of the OPES bias. All the data of other biases, as well as atoms within the components, are still processed sequentially. So a keyword that is at least specific to the biases (if not OPES itself) would be appropriate. Maybe bias_data or bias_inner?

How about supporting the following?

  • smp on, which keeps the current meaning, but we document more clearly that it applies specifically to the "outer" loop over colvar components and biases
  • smp bias_xxx (new keyword), for which we document that the parallelization currently supports only OPES, and that turning it on will disable all other SMP schemes
  • smp off, where everything runs sequentially on a single thread

Then as we improve support we can add further customizations, but we can also preserve the on keyword value as the option to enable SMP in the largest combination of features. (Ideally, we could even have the code do quick benchmarks and enable the best configuration, but that's clearly outside this PR)
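
For illustration, a minimal sketch of how the parsing shown in the snippet above might accommodate this proposal; the value name "bias_data", the smp_mode_t::none member, and the comments are placeholders, not code from this PR:

// Hypothetical extension of the smp keyword parsing shown above.
if (smp == "cvcs" || smp == "on" || smp == "yes") {
  // current behavior: parallelize the "outer" loop over components and biases
  proxy->smp_mode = colvarproxy_smp::smp_mode_t::cvcs;
} else if (smp == "bias_data") {  // placeholder name, see the discussion above
  // parallelize inside supported biases (currently only OPES);
  // this would disable the other SMP schemes
  proxy->smp_mode = colvarproxy_smp::smp_mode_t::inner_loop;
} else if (smp == "off" || smp == "no") {
  // everything runs sequentially on a single thread
  proxy->smp_mode = colvarproxy_smp::smp_mode_t::none;  // assumed enum member
}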

Member Author:

Although this PR touches the OPES code, it does not make the parallelization in OPES work immediately, because I have found that the macros USE_CKLOOP and CMK_SMP are not defined when compiling colvarbias_opes.C. I will have to solve this issue in later commits or PRs by adding functions to colvarproxy_smp that abstract the parallelization implementations of OpenMP and Charm++.

This PR intends to support the parallelization scheme discussed in #700 (not just OPES specifically). I am going to parallelize the optimal alignment, RMSD, and COM calculations, as well as metadynamics, so I think a keyword like smp bias_xxx may be too specific.

Member Author:

In theory, the two parallelization schemes are not mutually exclusive, but making them mutually exclusive simplifies the implementation a lot and makes a future GPU port easier.

I am not sure how moving the option into colvardeps works. Could you make a PR if you have time?

@giacomofiorin (Member) commented Mar 17, 2025:

Yes, I could do a PR against this branch, but because it lives on your fork the PR would not show up here. How about a simple push since the two are closely related?

Member Author:

OK. You can push your commits here.

Member:

I just pushed a couple of commits. I did not touch smp_mode_t yet, so it remains the way you set it, i.e. with mutually exclusive schemes.

But with a feature flag internal to the bias, the OPES numerical implementation can test only for that flag specifically.

Member Author:

@giacomofiorin As for the option value names of smp, I don't have a better idea than inner_loop. I have tried asking an LLM to see whether there are any better options (https://www.perplexity.ai/search/i-am-revising-the-parallelizat-Unu3PHipRTWaBb0T29P5pA). It looks like "fine_grained" may be slightly better. Maybe I should revise the manual to provide a figure showing the difference between the two parallelization schemes. What do you think?

@HanatoK (Member Author) commented Mar 17, 2025:

Due to the Charm++ issue of building with OpenMP support (charmplusplus/charm#3396), I cannot simply use Charm++ with OpenMP for NAMD, so I need to find another way to abstract the parallel interface for multiple MD engines...

@HanatoK (Member Author) commented Mar 17, 2025:

The OPES parallelization implementation can be found in HanatoK#3. I will wait for @giacomofiorin's commits here before merging that one.

@giacomofiorin (Member) commented:

> Due to the Charm++ issue of building with OpenMP support (charmplusplus/charm#3396), I cannot simply use Charm++ with OpenMP for NAMD, so I need to find another way to abstract the parallel interface for multiple MD engines...

No, indeed. I had originally thought that the two specialized functions smp_colvars_loop() and smp_biases_loop() could both be replaced by a generic function, which takes as its arguments the length of the loop and the "worker" function. Do you see any potential issues with that approach?
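
For illustration only, a rough sketch of what such a generic entry point could look like, shown with an OpenMP backend; the function name and signature are hypothetical, not the actual proxy interface:

// Hypothetical generic SMP loop: the caller passes the loop length and a
// per-index "worker"; the backend decides how to distribute the iterations.
#include <functional>

int smp_generic_loop(int n_items, std::function<int(int)> const &worker)
{
  int error_code = 0;
#if defined(_OPENMP)
#pragma omp parallel for reduction(|:error_code)
#endif
  for (int i = 0; i < n_items; i++) {
    error_code |= worker(i);  // e.g. compute the i-th colvar or bias
  }
  return error_code;
}

The existing loops over colvars and biases would then just pass lambdas that call the corresponding calc or update step on the i-th object.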

@HanatoK (Member Author) commented Mar 17, 2025:

> Due to the Charm++ issue of building with OpenMP support (charmplusplus/charm#3396), I cannot simply use Charm++ with OpenMP for NAMD, so I need to find another way to abstract the parallel interface for multiple MD engines...
>
> No, indeed. I had originally thought that the two specialized functions smp_colvars_loop() and smp_biases_loop() could both be replaced by a generic function, which takes as its arguments the length of the loop and the "worker" function. Do you see any potential issues with that approach?

There should be no potential issues. I think what I am going to do is leave the original parallelization scheme untouched, while adding a new parallelization scheme that uses threads in the inner loops. It is fine if you revise the original parallelization implementation, as long as the tests pass.

@giacomofiorin (Member) commented:

> There should be no potential issues. I think what I am going to do is leave the original parallelization scheme untouched, while adding a new parallelization scheme that uses threads in the inner loops. It is fine if you revise the original parallelization implementation, as long as the tests pass.

Well, my preference would be to use that to support multiple SMP mechanisms in the proxy, as opposed to hard-coding pragmas for both OpenMP and Charm++ separately all over the library.

@@ -257,6 +257,8 @@ class colvardeps {
f_cvb_scale_biasing_force,
/// \brief whether this bias is applied to one or more ext-Lagrangian colvars
f_cvb_extended,
/// Process this bias's data in parallel over multiple CPU threads
f_cvb_smp,

Member Author:

@giacomofiorin Should we further divide the SMP case into flags like f_cvb_smp_omp, f_cvb_smp_ckloop, f_cvb_smp_cuda and f_cvb_smp_sycl? What about the Colvars components? Is it possible to also have f_cvc_smp_*, or reuse the f_cvb_smp for CVCs?

Member:

Hi! Every such feature flag has to be added manually, with its dependencies and conflicts written down explicitly. So it would be better to add a new feature only when it's about to be used.

@HanatoK (Member Author) commented Mar 18, 2025:

> There should be no potential issues. I think what I am going to do is leave the original parallelization scheme untouched, while adding a new parallelization scheme that uses threads in the inner loops. It is fine if you revise the original parallelization implementation, as long as the tests pass.
>
> Well, my preference would be to use that to support multiple SMP mechanisms in the proxy, as opposed to hard-coding pragmas for both OpenMP and Charm++ separately all over the library.

Originally I had the same idea as you, but after trying to write interfaces to abstract the differences between OpenMP and CkLoop in the proxy, I found that it was difficult even for a simple for-loop with a reduction. Considering the porting to GPU in the future, it is even impossible to hide all the details behind the proxy. In other words, the problem is supporting multiple parallelization libraries in Colvars in the case of smp inner_loop, so eventually we will have to conditionally compile different "kernels" for a component or bias.

My suggestion is to use different Colvars macros to indicate the different parallelization libraries: -DCOLVARS_SMP_CKLOOP when building with NAMD (CkLoop), -DCOLVARS_SMP_OPENMP when building with GROMACS and LAMMPS, and -DCOLVARS_SMP_CUDA when building with CUDA, once the CUDA kernels are ready for components and biases. We would then create new classes for components that inherit from the existing serial implementations. For example, we could have colvar::orientation_ckloop and colvar::orientation_omp (or colvar::orientation with template arguments for the SMP libraries) conditionally compiled and called based on the macros -DCOLVARS_SMP_CKLOOP and -DCOLVARS_SMP_OPENMP.

In summary, I think it is important to not only implement smp inner_loop but also to provide a path for future GPU work. @giacomofiorin What do you think?
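
As an illustration of the conditional-compilation idea above; the macro names follow the comment, while the classes are simplified stand-ins rather than the actual colvar::orientation code:

// Simplified stand-ins showing how -DCOLVARS_SMP_OPENMP / -DCOLVARS_SMP_CKLOOP
// could select which "kernel" class gets compiled and used.
struct orientation_serial {            // placeholder for the existing serial CVC
  virtual ~orientation_serial() = default;
  virtual void calc_value() { /* existing serial implementation */ }
};

#if defined(COLVARS_SMP_OPENMP)
struct orientation_omp : orientation_serial {
  void calc_value() override { /* OpenMP kernel for the inner loop */ }
};
using orientation_impl = orientation_omp;
#elif defined(COLVARS_SMP_CKLOOP)
struct orientation_ckloop : orientation_serial {
  void calc_value() override { /* CkLoop kernel for the inner loop */ }
};
using orientation_impl = orientation_ckloop;
#else
using orientation_impl = orientation_serial;   // serial fallback
#endif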

@giacomofiorin (Member) commented Mar 18, 2025:

Hi @HanatoK! Yes, I absolutely agree that we cannot hide everything in the proxy (even if we would like to! 😆). This is particularly true for GPU support, where the movement of data between host and device is a critical step that the proxy can't do automatically. This issue makes GPU support more similar to message-passing parallelism than to SMP, in my opinion.

I also agree that any changes introduced now should aim to simplify future implementation work as much as possible. In general, this goal can be achieved by making code more reusable. For the colvar components, I always thought that the compute functions could be broken into individual steps, each written as a template so that you could include it in the kernels' definitions. Is that similar to what you were thinking as well?

Lastly, for the CPU-based SMP backends (i.e. OpenMP or CkLoop), I don't think that we necessarily need to interleave the computation and reduction at all costs. We could always accumulate partial sums locally for each thread, and then add all the partial sums sequentially. This is already what is currently done in the "outer loop", where multiple components are computed in parallel and then aggregated into the value of each colvar by thread 0. Aside from the GPU support (which we agree is more complex), do you see an issue with this that maybe I'm missing?
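
A minimal OpenMP sketch of the pattern described in the last paragraph; this is generic code, not tied to any specific component or bias:

// Each thread accumulates into its own partial sum; the partial sums are
// then combined sequentially, mirroring the current "outer loop" aggregation.
#include <omp.h>
#include <vector>

double sum_in_parallel(std::vector<double> const &values)
{
  std::vector<double> partial(omp_get_max_threads(), 0.0);
#pragma omp parallel
  {
    int const t = omp_get_thread_num();
#pragma omp for
    for (size_t i = 0; i < values.size(); i++) {
      partial[t] += values[i];            // thread-local accumulation
    }
  }
  double total = 0.0;
  for (double const p : partial) total += p;  // sequential reduction by one thread
  return total;
}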

@giacomofiorin (Member) commented:

> There should be no potential issues. I think what I am going to do is leave the original parallelization scheme untouched, while adding a new parallelization scheme that uses threads in the inner loops. It is fine if you revise the original parallelization implementation, as long as the tests pass.

The two most recent commits at this time achieve this for the loop over colvar components. I plan on converting the other current main SMP loop, the one over the biases, shortly.

Do you know if the tracing calls to traceUserBracketEvent() need to be run within each thread-parallel region, or can they be run right outside of it?

@HanatoK (Member Author) commented Mar 18, 2025:

@giacomofiorin Thank you for your reply! I will not introduce further code changes into this PR. As for enabling the OPES parallelization, I will do that in another PR, and possibly refactor the colvarbias interface with a new virtual update function like virtual int update_smp(). As for smp inner_loop (or maybe I will change it to innerLoop to match the Colvars naming convention), do you have any better suggestion that covers not only a specific bias but also all the other biases and components? If you are fine with smp inner_loop, I can update the documentation to include a figure explaining the differences between the parallelization schemes.

I have no idea about traceUserBracketEvent(). I don't use the tracing at all.

@giacomofiorin (Member) commented Mar 18, 2025:

> @giacomofiorin Thank you for your reply! I will not introduce further code changes into this PR.

Ok, so what should we do with this PR? Move it out of draft status, or work on it more after we get a clearer idea of the constraints coming from parallelizing more features?

In regard to moving forward with enabling the OPES parallelism, which you have largely done already, how about the following simple approach?

  1. In simulations without OPES, we don't change anything, i.e. we keep looping over all colvar components and all biases in parallel.
  2. In simulations with one OPES bias, we keep looping over colvar components in parallel, but run the biases one at a time, and the OPES bias itself in parallel. That is, we automatically adopt the inner_loop scheme for the biases whenever OPES is defined.

Then of course, once more features are parallelized internally (biases or components) we revisit the simple criterion above. What do you think?
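
As a rough sketch of criterion 2, assuming the f_cvb_smp flag from this PR can be queried on each bias; the selection logic and variable names here are hypothetical, not code from the PR:

// Hypothetical selection: keep the original scheme unless at least one bias
// (e.g. OPES) declares that it can parallelize its own inner loop.
bool any_bias_has_internal_smp = false;
for (colvarbias *b : biases) {                  // biases list of colvarmodule
  if (b->is_enabled(colvardeps::f_cvb_smp)) {   // feature flag added in this PR
    any_bias_has_internal_smp = true;
    break;
  }
}
proxy->smp_mode = any_bias_has_internal_smp ?
  colvarproxy_smp::smp_mode_t::inner_loop :     // biases run serially, OPES threaded
  colvarproxy_smp::smp_mode_t::cvcs;            // original outer-loop scheme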

HanatoK marked this pull request as ready for review on March 18, 2025 at 19:46.

@HanatoK (Member Author) commented Mar 18, 2025:

> 1. In simulations without OPES, we don't change anything, i.e. we keep looping over all colvar components and all biases in parallel.
> 2. In simulations with one OPES bias, we keep looping over colvar components in parallel, but run the biases one at a time, and the OPES bias itself in parallel. That is, we automatically adopt the inner_loop scheme for the biases whenever OPES is defined.
>
> Then of course, once more features are parallelized internally (biases or components) we revisit the simple criterion above. What do you think?

This is already very sophisticated. I am inclined to just let the user switch to inner_loop manually, and when f_cvb_smp is not available, we just fall back to serial execution of the biases.

@giacomofiorin (Member) commented:

> This is already very sophisticated. I am inclined to just let the user switch to inner_loop manually, and when f_cvb_smp is not available, we just fall back to serial execution of the biases.

Well, that's actually very close to what I meant. If you are okay with using the f_cvb_smp feature, the SMP mode can default to the "inner loop" scheme when f_cvb_smp is available, or to the original scheme otherwise.

@HanatoK (Member Author) commented Mar 18, 2025:

> This is already very sophisticated. I am inclined to just let the user switch to inner_loop manually, and when f_cvb_smp is not available, we just fall back to serial execution of the biases.
>
> Well, that's actually very close to what I meant. If you are okay with using the f_cvb_smp feature, the SMP mode can default to the "inner loop" scheme when f_cvb_smp is available, or to the original scheme otherwise.

I am OK with it. Actually, I thought that for CUDA we could have a similar flag (#652 (reply in thread)).

As for the name inner_loop for the smp option, are you OK with it? I think we can fill in the documentation in future PRs if we don't want to publicize the new option for now. smp inner_loop should be considered experimental anyway.

@giacomofiorin (Member) commented:

If "experimental" means off by default, how long do you think that would be the case? Once you have the OPES parallelism working, it would be great to enable it by default.

If your idea is to give people finer control over parallel features, maybe we could allow the option to activate SMP at the level of individual objects? For example, the smp flag of the OPES bias (internally represented as f_cvb_smp) could default to off at first, and enabling it would turn on the "inner loop" mode. Then once you are confident that the feature works, it could be on by default.

As far as the current SMP loops are concerned, these could continue to run over objects that have smp off (either by design or user choice). smp off could remain as an optional kill switch, in case of bugs.

How does that sound?

@HanatoK (Member Author) commented Mar 18, 2025:

If "experimental" means off by default, how long do you think that would be the case? Once you have the OPES parallelism working, it would be great to enable it by default.

If your idea is to give people finer control over parallel features, maybe we allow the option to activate SMP at the level of individual objects? For example, thesmp flag of the OPES bias (internally represented as f_cvb_smp) could default to off at first, and enabling turns on the "inner loop" mode. Then once you are confident that the feature works, it could be on by default.

As far as the current SMP loops are concerned, these could continue to run over objects that have smp off (either by design or user choice). smp off could remain as an optional kill switch, in case of bugs.

How does that sound?

I realize that the logic is a bit complicated after introducing the smp option for all biases. In my opinion, the global switch smp inner_loop should turn on smp for all biases once they have parallelization implemented, except those with an explicit smp off, and I think this is how your current code works.

By saying "experimental" I mean that the new feature is not available in the documentation. The feature should be in experimental stage before the following issues are solved:

  • Refactoring the base classes of components and biases by adding virtual functions like calc_value_smp() and update_smp() (see the sketch after this list), or adding new classes like colvarcomp_smp and colvarbias_smp (the former looks better, but I am not quite sure);
  • Implementing SMP parallelization for the calculations in colvarmodule::atom and colvarmodule::rotation;
  • You are satisfied with the name "inner_loop" and agree to keep it in the documentation.
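
For the first item, a self-contained sketch of what the virtual-function route could look like; the *_example class names are placeholders, and the actual colvarbias interface is larger:

// Hypothetical shape of the refactoring in the first bullet above: a serial
// default plus an optional per-bias SMP override.
class colvarbias_example {
public:
  virtual ~colvarbias_example() = default;
  virtual int update() { /* existing serial update */ return 0; }
  // By default the SMP path falls back to the serial code; biases that
  // implement internal parallelism (e.g. OPES) override this.
  virtual int update_smp() { return update(); }
};

class colvarbias_opes_example : public colvarbias_example {
public:
  int update_smp() override {
    // threaded kernel evaluation would go here
    return 0;
  }
};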

@giacomofiorin (Member) commented:

> I realize that the logic is a bit complicated after introducing the smp option for all biases. In my opinion, the global switch smp inner_loop should turn on smp for all biases once they have parallelization implemented, except those with an explicit smp off, and I think this is how your current code works.
>
> By "experimental" I mean that the new feature is not described in the documentation. The feature should stay experimental until the following issues are solved:
>
> * Refactoring the base classes of components and biases by adding virtual functions like `calc_value_smp()` and `update_smp()`, or adding new classes like `colvarcomp_smp` and `colvarbias_smp` (the former looks better, but I am not quite sure);
>
> * Implementing SMP parallelization for the calculations in `colvarmodule::atom` and `colvarmodule::rotation`;
>
> * You are satisfied with the name "inner_loop" and agree to keep it in the documentation.

Overall that looks good to me. Again, sorry if something I wrote wasn't completely clear. For example, I would not want to force people to manually turn on a new flag for every object: that would be so tedious! The smp flag should be on by default for an object that implements it (as you also seem to suggest).

If you want to make the above behavior optional for now via an undocumented flag, that would make sense. But when the code is no longer experimental, you are okay with enabling it by default, correct?

@HanatoK (Member Author) commented Mar 19, 2025:

> Overall that looks good to me. Again, sorry if something I wrote wasn't completely clear. For example, I would not want to force people to manually turn on a new flag for every object: that would be so tedious! The smp flag should be on by default for an object that implements it (as you also seem to suggest).
>
> If you want to make the above behavior optional for now via an undocumented flag, that would make sense. But when the code is no longer experimental, you are okay with enabling it by default, correct?

Yes. I agree with you.

@HanatoK (Member Author) commented Mar 21, 2025:

OK. I have rebased this PR after merging #782, but I will still need to rebase either this one or #775, depending on which one merges first.
