[RFC] Add an option to parallelize the inner loops instead of CVCs #780
base: master
Conversation
src/colvarmodule.cpp (outdated)
if (smp == "cvcs" || smp == "on" || smp == "yes") {
  proxy->smp_mode = colvarproxy_smp::smp_mode_t::cvcs;
  cvm::log("SMP parallelism will be applied to Colvars components.\n");
  cvm::log("  - SMP parallelism: enabled (num. threads = " + to_str(proxy->smp_num_threads()) + ")\n");
} else if (smp == "inner_loop") {
  proxy->smp_mode = colvarproxy_smp::smp_mode_t::inner_loop;
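For context, a minimal sketch of the kind of selector the code above refers to, assuming an enum with the two values used here plus an off state; the actual definition in colvarproxy_smp may differ:

// Sketch only: a possible shape for the SMP-mode selector referenced above.
// The "none" member is an assumption for the "smp off" case discussed later;
// the real enum in the Colvars proxy layer may use different names or values.
class colvarproxy_smp_sketch {
public:
  enum class smp_mode_t {
    none,       // no SMP: everything runs sequentially on a single thread
    cvcs,       // current scheme: parallelize the outer loop over components
    inner_loop  // proposed scheme: parallelize inner loops (e.g. within a bias)
  };
  smp_mode_t smp_mode = smp_mode_t::cvcs;
};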
I like the addition of an enum flag, which is certainly more useful than a yes/no flag moving forward.
However, the keywords identifying its values, being user visible, need some careful choice: once defined, you can't change them any more. I am specifically worried that "cvcs" is very rarely used in the doc (it's mostly used in the source code) and that the meaning of "inner_loop" could easily be mistaken in the case of O(NxM) variables.
In the doc we use "components", so that could be used here, or "colvar_components" if we don't mind a slightly long word.
Hi @giacomofiorin and @jhenin! Do you have any suggestions for the keywords?
This particular PR only adds support for running the inner loop of the OPES bias. All the data of other biases, as well as the atoms within the components, are still processed sequentially. So a keyword that is at least specific to the biases (if not to OPES itself) would be appropriate, maybe "bias_data" or "bias_inner"?

How about supporting the following?
- "smp on", which keeps the current meaning, but we document more clearly that this applies specifically to the "outer" loop over colvar components and biases
- "smp bias_xxx" (new keyword), where you should document that the parallelization currently supports only OPES, and that turning it on will disable all other SMP schemes
- "smp off", where everything runs sequentially on a single thread

Then, as we improve support, we can add further customizations, but we can also preserve the "on" keyword value as the option that enables SMP for the largest combination of features. (Ideally, we could even have the code run quick benchmarks and enable the best configuration, but that's clearly outside this PR.)
Although this PR touches the OPES code, it does not make the parallelization in OPES work immediately, because I have found problems with the macros USE_CKLOOP and CMK_SMP when compiling colvarbias_opes.C. I will have to solve this issue in later commits or PRs by adding functions to colvarproxy_smp that abstract the parallelization implementations of OpenMP and Charm++.
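As an illustration of what such proxy-level functions could look like, here is a hypothetical sketch; the name smp_parallel_for and its signature are assumptions made for this discussion, not existing Colvars API:

// Hypothetical sketch: a proxy-level helper that hides the threading backend
// from the bias code.  smp_parallel_for() does not exist in Colvars today;
// an engine-specific proxy (e.g. for NAMD) could override it with CkLoop.
#include <cstddef>
#include <functional>

class colvarproxy_smp_iface_sketch {
public:
  virtual ~colvarproxy_smp_iface_sketch() {}

  // Run body(i) for i in [0, n), using whichever backend is available.
  virtual int smp_parallel_for(std::size_t n,
                               std::function<void(std::size_t)> const &body)
  {
#if defined(_OPENMP)
#pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); i++) {
      body(static_cast<std::size_t>(i));
    }
#else
    for (std::size_t i = 0; i < n; i++) {  // serial fallback
      body(i);
    }
#endif
    return 0;
  }
};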
This PR intends to support the parallelization scheme discussed in #700 (not only OPES specifically). I am going to parallelize the optimal alignment, RMSD and COM calculations as well as metadynamics, so I think a keyword like "smp bias_xxx" may be too specific.
In theory, the two parallelization schemes are not mutually exclusive, but making them mutually exclusive can simplify the implementation a lot and make a future port to GPU easier.

I am not sure how moving the option into colvardeps would work. Could you make a PR if you have time?
Yes, I could do a PR against this branch, but because it lives on your fork the PR would not show up here. How about a simple push since the two are closely related?
OK. You can push your commits here.
I just pushed a couple of commits. I did not touch smp_mode_t yet, which stays the way you set it, i.e. mutually exclusive schemes. But with a feature flag internal to the bias, the OPES numerical implementation can test specifically for that flag.
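As a purely illustrative example of that last point, the numerical code could branch on the flag roughly as in this sketch; the class and flag names here are stand-ins, not the actual Colvars types:

// Sketch only: a bias-like object choosing between serial and SMP code paths
// based on a feature flag, in the spirit of f_cvb_smp from this PR.
enum { f_cvb_smp_sketch = 0 };  // stand-in for the real feature ID

class bias_sketch {
public:
  bool smp_requested = false;
  bool is_enabled(int /* feature */) const { return smp_requested; }

  int update() {
    if (is_enabled(f_cvb_smp_sketch)) {
      // parallel path: e.g. split the loop over kernels across threads
    } else {
      // serial path: the original sequential implementation
    }
    return 0;
  }
};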
@giacomofiorin As for the option value names of smp, I don't have a better idea than inner_loop. I have tried asking an LLM whether there are better options (https://www.perplexity.ai/search/i-am-revising-the-parallelizat-Unu3PHipRTWaBb0T29P5pA); it seems that "fine_grained" may be a little better. Maybe I should also revise the manual to provide a figure showing the difference between the two parallelization schemes. What do you think?
Due to the Charm++ issue of building with OpenMP support (charmplusplus/charm#3396), I cannot simply use Charm++ with OpenMP for NAMD, so I need to find another way to abstract the parallel interface for multiple MD engines...
The OPES parallelization implementation can be found in HanatoK#3. I will wait for @giacomofiorin's commits here before merging that one.
No, indeed. I had originally thought that the two specialized functions
There should be no issues. I think what I am going to do is leave the original parallelization scheme untouched, while adding a new scheme that uses threads in the inner loops. It is fine for you to revise the original parallelization implementation as long as the tests pass.
Well, my preference would be to use that to support multiple SMP mechanisms in the proxy, as opposed to hard-coding pragmas for both OpenMP and Charm++ separately all over the library.
@@ -257,6 +257,8 @@ class colvardeps {
  f_cvb_scale_biasing_force,
  /// \brief whether this bias is applied to one or more ext-Lagrangian colvars
  f_cvb_extended,
  /// Process this bias's data in parallel over multiple CPU threads
  f_cvb_smp,
@giacomofiorin Should we further divide the SMP case into flags like f_cvb_smp_omp, f_cvb_smp_ckloop, f_cvb_smp_cuda and f_cvb_smp_sycl? What about the Colvars components? Is it possible to also have f_cvc_smp_*, or should we reuse f_cvb_smp for the CVCs?
Hi! Every such feature flag should be added manually, and its dependencies or conflicts written down explicitly. So it would be better to add a new feature only when it is about to be used.
Originally I had the same idea as you, but after trying to write interfaces to abstract the differences between OpenMP and CkLoop in the proxy, I found that this was difficult even for a simple for-loop with a reduction. Considering the porting to GPU in the future, it is not even possible to hide all the details behind the proxy. In other words, the problem is supporting multiple parallelization libraries in Colvars in the case of

My suggestion is to use different Colvars macros to indicate different parallelization libraries. For CkLoop, we can add a macro

In summary, I think it is important to not only implement
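A hypothetical sketch of the macro-based suggestion above; COLVARS_OPENMP is a placeholder name for this discussion, not a macro that the Colvars build system currently defines:

#include <vector>

// Sketch only: using a Colvars-specific macro to select the threading backend
// of an inner loop.  A CkLoop branch for Charm++/NAMD builds could be guarded
// by another such macro in the same way.
double sum_of_squares(std::vector<double> const &x)
{
  double total = 0.0;
#if defined(COLVARS_OPENMP)
#pragma omp parallel for reduction(+:total)
#endif
  for (long i = 0; i < static_cast<long>(x.size()); i++) {
    total += x[i] * x[i];
  }
  return total;
}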
Hi @HanatoK! Yes, I absolutely agree that we cannot hide everything in the proxy (even if we would like to! 😆). This is particularly true for GPU support, where the movement of data between host and device is a critical step that the proxy can't do automatically. This issue makes GPU support more similar to message-passing parallelism than to SMP, in my opinion.

I also agree that any changes introduced now should aim to simplify future implementation work as much as possible. In general, this goal can be achieved by making code more reusable. For the colvar components, I have always thought that the compute functions could be broken into individual steps, each written as a template, so that you could include them in the kernels' definitions. Is that similar to what you were thinking?

Lastly, for the CPU-based SMP backends (i.e. OpenMP or CkLoop), I don't think that we necessarily need to interleave the computation and reduction at all costs. We could always accumulate partial sums locally for each thread, and then add all the partial sums sequentially. This is already what is done in the current "outer loop", where multiple components are computed in parallel and then aggregated into the value of each colvar by thread 0. Aside from GPU support (which we agree is more complex), do you see an issue with this that maybe I'm missing?
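To make the partial-sum idea concrete, here is a minimal OpenMP sketch (illustrative only, not taken from the Colvars sources):

#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Sketch only: each thread accumulates into its own slot, and the per-thread
// partial sums are then added sequentially, mirroring how the current "outer
// loop" aggregates component values on thread 0.
double reduce_partial_sums(std::vector<double> const &terms)
{
#ifdef _OPENMP
  int const num_threads = omp_get_max_threads();
#else
  int const num_threads = 1;
#endif
  std::vector<double> partial(num_threads, 0.0);

#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (long i = 0; i < static_cast<long>(terms.size()); i++) {
#ifdef _OPENMP
    partial[omp_get_thread_num()] += terms[i];
#else
    partial[0] += terms[i];
#endif
  }

  // sequential reduction of the per-thread partial sums (done by one thread)
  double total = 0.0;
  for (int t = 0; t < num_threads; t++) {
    total += partial[t];
  }
  return total;
}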
The two most recent commits at this time achieve this for the loop over colvar components. I plan on converting the other current main SMP loop, the one over the biases, shortly. Do you know if the tracing calls to
@giacomofiorin Thank you for your reply! I will not introduce further code changes into this PR. As for enabling the OPES parallelization, I will do that in another PR, and possibly refactor the

I have no idea about
Ok, so what should we do with this PR? Move it out of draft status, or work on it more after we get a clearer idea of the constraints coming from parallelizing more features? In regard to moving forward with enabling the OPES parallelism, which you have largely done already, how about the following simple approach?
Then of course, once more features are parallelized internally (biases or components), we can revisit the simple criterion above. What do you think?
This is already very sophisticated. I am inclined to just let the user switch to the
Well, that's actually very close to what I meant. If you are okay with using the
I am OK with it. Actually, I thought that for CUDA we could have a similar flag (#652 (reply in thread)). As for the name
If "experimental" means off by default, how long do you think that would be the case? Once you have the OPES parallelism working, it would be great to enable it by default. If your idea is to give people finer control over parallel features, maybe we allow the option to activate SMP at the level of individual objects? For example, the As far as the current SMP loops are concerned, these could continue to run over objects that have How does that sound? |
I realize that the logic is a bit complicated after introducing the

By saying "experimental" I mean that the new feature is not described in the documentation. The feature should stay in the experimental stage until the following issues are solved:
Overall that looks good to me. Again, sorry if something I wrote wasn't completely clear. For example, I would not want to force people to manually enable a new flag for every object: that would be very tedious! The

If you want to make the above behavior optional for now via an undocumented flag, that would make sense. But when the code is no longer experimental, you are okay with enabling it by default, correct?
Yes. I agree with you.
As discussed in #700, this PR aims to provide users with an option ("smp inner_loop") to parallelize the inner loops. Currently this is only at an early stage, so I am marking it as a draft PR.