
Replica propagation times systematically change in cyclic patterns that may slow simulations #613

Closed · jchodera opened this issue Dec 21, 2019 · 133 comments
Labels: optimization (performance optimization), priority: high

Comments

@jchodera
Member

In plotting the total replica propagation time per iteration from one of @hannahbrucemacdonald's experiments (/data/chodera/brucemah/relative_paper/dec/jnk1/lig20to8/complex_0.stderr), it appears that there is a pattern:
[image: replica propagation time per iteration]
Connecting successive iterations with lines reveals a clearer pattern:
[image: successive iterations connected with lines]
Zooming in shows the pattern more clearly:
[image: zoomed view of the cyclic pattern]
This is currently just something intriguing, but something we should investigate in the new year since it may lead to speed improvements if we can understand what is going on here.

@jchodera
Member Author

jchodera commented Feb 3, 2020

@hannahbrucemacdonald : If you have a chance before I do, here’s what I was doing to look into the variable timing issue.
I inserted these lines into my in-place installation of openmmtools. You can find where this is installed with

python -c "from openmmtools import multistate; print(multistate.__file__)"

I edited multistatesampler.py and inserted the following at the end of _propagate_replica
at line 1235:
https://github.com/choderalab/openmmtools/blob/master/openmmtools/multistate/multistatesampler.py#L1235
Here are the lines:

 # DEBUG: Report nvidia-smi output after each replica propagation
 import subprocess
 process = subprocess.run(['nvidia-smi'], stdout=subprocess.PIPE, universal_newlines=True)
 logger.debug(process.stdout)

This should write the output of nvidia-smi at each replica propagation step, reporting the current temperature, power usage, and clock speed:

Tue Jan 21 12:19:31 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:02:00.0 Off |                  N/A |
|  0%   31C    P8     5W / 180W |      1MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The relevant numbers are (a sketch for grabbing just these fields programmatically follows the list):
• the driver version
• the GPU type
• the fan %
• the temperature
• the power usage
• the process table
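
If it's easier to capture just those fields rather than the full table, here is a minimal sketch using nvidia-smi's --query-gpu interface, mirroring the debug snippet above (so logger comes from the same module); the exact field names are assumptions based on recent drivers, so double-check them against nvidia-smi --help-query-gpu:

# Sketch: log only the relevant nvidia-smi fields at each replica propagation step
import subprocess
fields = 'timestamp,name,driver_version,fan.speed,temperature.gpu,power.draw,pstate,clocks.sm'
process = subprocess.run(
    ['nvidia-smi', '--query-gpu=' + fields, '--format=csv,noheader'],
    stdout=subprocess.PIPE, universal_newlines=True)
logger.debug(process.stdout.strip())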

@hannahbrucemacdonald
Contributor

Did you only see this in complex, or solvent and vacuum too?

@hannahbrucemacdonald
Contributor

hannahbrucemacdonald commented Feb 11, 2020

Hey - I've gotten as far as running a short bit of complex with Thrombin, and here are the minimal plots. It looks like we have oscillations that are not correlated with the GPU temperature. It looks like it's stabilising around 80 C, which should be fine. I will have a go with the other parameters that you've suggested too.

System: Thrombin
GPU type: NVIDIA-SMI 418.39
Driver version: 418.39

  • the driver version
  • the GPU type
  • the fan %
  • the temperature
  • the power usage
  • the process table

[image]

[image]

(edited to fix x-axis labels)

@hannahbrucemacdonald
Contributor

[image: fan speed plot]

Fan speed is quite variable, but it also doesn't seem to correlate.

@jchodera
Member Author

Are you plotting this every iteration or every replica propagation? I was hoping to get more fine-grained information by printing the output of nvidia-smi at every replica propagation substep.

@hannahbrucemacdonald
Contributor

Ahhhh I see what you mean, I have that I think - give me a min.

@hannahbrucemacdonald
Contributor

hannahbrucemacdonald commented Feb 11, 2020

Ok so I have the time of each replica iteration from the nvidia-smi output, but the timestamps only have one-second resolution, while each replica propagation takes ~1 second. Because of the rounding, I can't see anything useful about whether individual replicas are taking longer or not.

[image: per-replica times from nvidia-smi timestamps]

Unless I'm getting the wrong end of the stick.

I can increase the amount of MD per step (~factor of 10?) to push this up? We currently do 250 × 4 fs per replica per iteration (1 picosecond).
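
Alternatively, I could time each propagation directly in Python instead of relying on the one-second nvidia-smi timestamps. A rough sketch (the placement inside _propagate_replica and the mcmc_move.apply call are just assumptions, mirroring the debug snippet earlier in the thread):

# Sketch: time each replica propagation with sub-millisecond resolution
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def timed(label):
    start = time.perf_counter()
    yield
    logger.debug("%s took %.3f s", label, time.perf_counter() - start)

# hypothetical usage inside _propagate_replica:
# with timed(f"replica {replica_id} propagation"):
#     mcmc_move.apply(thermodynamic_state, sampler_state)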

@jchodera
Member Author

Did you only see this in complex, or solvent and vacuum too?

@hannahbrucemacdonald : Great question! I used this Jupyter notebook to generate this plot for all phases from /data/chodera/brucemah/relative_paper/dec/jnk1/lig20to8/ --- just change the paths in the notebook to plot a different calculation!

GitHub seems to have trouble with image uploads right now, but check out the notebook:

  • cyclic behavior is only observed with complex
  • the iteration time for solvent keeps growing with iteration number---is this a NetCDF issue?
  • the vacuum time keeps growing, but also switches back and forth between fast and slow in long time blocks (!!!); are we using the CPU or Reference platform for vacuum?

@jaimergp
Member

That temperature plot looks suspicious to me. Temp throttling can definitely happen at intervals.

Some data pulled from this article:

Ideally you want temperatures to be as low as possible, anything below 80 degrees is normal and should keep throttling in check. Nvidia's GTX 1080 Ti, for example, has a throttling point of 84 degrees. If you keep the temperature below 80 degrees you leave yourself with a bit of breathing room, so you can focus on having fun instead of monitoring GPU frequencies.

It's important to remember that every graphics card has a different throttling point. The previous-gen GTX 980 and 970, for example, throttle at 80 degrees, while AMD's Vega series cards can reach a maximum temperature of 85 degrees before they throttle. You will need to find out the throttling point for your specific card in order to set an effective fan curve and voltage.

So we might be hitting 80ºC, decreasing performance to "calm down" (but keeping the temperature just a bit below 80ºC!), then we think we can use more power and hit 80ºC, chill down, etc.

@jaimergp
Member

Power usage might give more details.

@hannahbrucemacdonald
Contributor

hannahbrucemacdonald commented Feb 12, 2020

Sorry, I got a little further --- but not much. I can expand get_details() to wrangle in more details.

The biggest addition is looking at /data/chodera/brucemah/relative_paper/amber_starting/off100/thrombin/test/lig1to10/

h_edits_to_ipynb.zip

The timestep is still too small to really resolve the fine-grained timing... I'll run something slower in the same directory - hopefully it will be there for European morning time!

@hannahbrucemacdonald
Contributor

Hopefully there is some more output in /data/chodera/brucemah/relative_paper/amber_starting/off100/thrombin/test/SLOWER1to10 for you to see in the AM!

@jchodera
Member Author

So we might be hitting 80ºC, decreasing performance to "calm down" (but keeping the temperature just a bit below 80ºC!), then we think we can use more power and hit 80ºC, chill down, etc.

This is my hypothesis exactly!

@jchodera
Member Author

jchodera commented Feb 13, 2020

It really looks like the temperature is causing the driver to throttle the speed here:
[image]
To prove this, we also need you to pull the power usage. That's essentially a direct readout of the clock speed.
[image: nvidia-smi output]
Here, the power usage is the "46W".

Could you also pull the node name(s), the driver version, and GPU type?

Once we have some good data compiled, we can ask the sysadmins to tweak the driver settings a bit to minimize this.

@jchodera
Member Author

There's more info on nvidia-smi and its options here:
https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/

There are some useful options here:
[image: nvidia-smi options]

@jaimergp suggests that we especially want to monitor the Power State to make sure it is always in P0. When it hits a thermal limit, it will switch into lower power states (P1 or P2) until it returns to the thermal envelope in which P0 is allowed.

The sysadmins have control over the power profile to modify the policy used to change power states.
Here's a guide from NVIDIA.
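
If shelling out to nvidia-smi from the driver script gets clunky, the same values can be polled through NVML directly. A minimal sketch, assuming the pynvml bindings (the nvidia-ml-py package) are installed; the logging policy here is just an illustration:

# Sketch: check performance state, temperature, and power draw via NVML
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

pstate = pynvml.nvmlDeviceGetPerformanceState(handle)   # 0 == P0, 2 == P2, ...
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)  # degrees C
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # reported in mW, convert to W

if pstate != 0:
    print(f"GPU not in P0: P{pstate} at {temp} C, drawing {power:.1f} W")

pynvml.nvmlShutdown()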

@hannahbrucemacdonald
Contributor

Cool, thanks - I'll look into:

  • power usage
  • node name
  • driver version
  • GPU type
  • Power State

@hannahbrucemacdonald
Contributor

Back working on this now that Lilac is back --- "Power state" is deprecated and is now called "Performance state". It seems at first glance to always be at P2.

@jchodera
Member Author

@hannahbrucemacdonald :
Can you try sampling with nvidia-smi -q -d POWER,CLOCK,TEMPERATURE,PERFORMANCE?

(base) [chodera@lilac:chodera]$ nvidia-smi -q -d POWER,CLOCK,TEMPERATURE,PERFORMANCE

==============NVSMI LOG==============

Timestamp                           : Thu Feb 20 14:04:14 2020
Driver Version                      : 440.33.01
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:84:00.0
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    Temperature
        GPU Current Temp            : 37 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 8.62 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W
    Power Samples
        Duration                    : 118.14 sec
        Number of Samples           : 119
        Max                         : 9.79 W
        Min                         : 8.52 W
        Avg                         : 8.79 W
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1911 MHz
        SM                          : 1911 MHz
        Memory                      : 5505 MHz
        Video                       : 1620 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    SM Clock Samples
        Duration                    : 257.12 sec
        Number of Samples           : 100
        Max                         : 1822 MHz
        Min                         : 139 MHz
        Avg                         : 400 MHz
    Memory Clock Samples
        Duration                    : 257.12 sec
        Number of Samples           : 100
        Max                         : 5508 MHz
        Min                         : 405 MHz
        Avg                         : 997 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A

and look at these things:

  1. Check the performance state (which isn't always P2, but should always be P0 for maximum performance):
    Performance State               : P8
  2. Compare the current temp vs the slowdown/shutdown temps:
* GPU Current Temp            : 40 C
* GPU Shutdown Temp           : 96 C
* GPU Slowdown Temp           : 93 C
  3. Compare power draw vs power limit:
        Power Draw                  : 9.02 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W
  4. Compare clocks:
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz

There is useful info on ensuring we can get maximum performance here:
https://devtalk.nvidia.com/default/topic/892842/cuda-programming-and-performance/one-weird-trick-to-get-a-maxwell-v2-gpu-to-reach-its-max-memory-clock-/

@jchodera
Member Author

Note that step 1---just looking to make sure we are always in P0---may be sufficient here! If we're not, we should post on the HPC slack and have them fix this.

@hannahbrucemacdonald
Contributor

^ I'm saying that we are always in P2, i.e. from the first step.

@jchodera
Member Author

From my reading, I've determined:

  • P0 is the highest performance state, but GTX 1080 / 1080 Ti cards have drivers that force them into the lower-performance P2 state when they detect compute loads
  • There is a Windows utility that can disable this, but there is no corresponding Linux utility (yet)
  • There may still be additional throttling going on due to temperature, which looks like what is going on in your plots. We may still be able to change a setting that minimizes this throttling behavior. I'm looking into this further.

@jchodera
Member Author

jchodera commented Feb 21, 2020

Here's what I've found so far:

  • This detailed description of inspecting and controlling your GPU settings on Linux is useful, but seems limited in what can be done with GTX cards compared to Tesla cards (e.g. V100)
  • Tesla cards also support fancy datacenter GPU manager software, but GTX cards appear not to
  • There are some driver settings for GPU boosting that may be relevant, but I'm not sure if they work on the GTX cards
  • This post shows some ideas on how to increase the memory clock speed
  • This thread discusses the issues of being locked into P2 in more detail
  • This thread specifically talks about being locked into P2 in compute mode on GTX 1080 Tis
  • This thread presents an intriguing solution that uses a driver setting that appears to be totally undocumented (modprobe nvidia NVreg_RegistryDwords="RMForcePstate=0") to force into P0:
rmmod nvidia; modprobe nvidia NVreg_RegistryDwords="RMForcePstate=0"; nvidia-smi -pm 1; sleep 30; nvidia-smi 

Edit: It appears that this is described as an option for the Windows-only nvidiaInspector, but the above presents a way to do this in Linux.

  • This post talks more about power mizer settings and how to disable it.

@hannahbrucemacdonald
Contributor

Just had a chat with Sam Ellis (MolSSI) about this --- he suggested that we try running with OpenCL to sanity check whether it's a CUDA issue or not. He wasn't sure, but seeing as we've tried it on a few different pieces of hardware, he said it might be worth narrowing down whether it's more on the software side of things.

Also - I haven't run this since upgrading to cuda/10.1 and OpenMM 7.4.1, which I guess you have been running?

@jchodera
Member Author

@peastman : This is fantastic! This makes a great deal of sense, and I think fully explains the odd behavior we've observed.

Is there a way to trigger the reordering via the API? If so, we could call this every time we call context.setPositions(). Or would it make sense to automatically reorder after updating the positions?

If not, we might be able to restructure the calculations to try to use a separate Context for each replica if the GPU has space, to eliminate the need for atom reorderings.

@jchodera
Member Author

jchodera commented Jul 19, 2022

@peastman : It looks like CudaUpdateStateDataKernel::setPositions does call cu.reorderAtoms(), but because reorderAtoms() only actually reorders atoms if stepsSinceReorder < 250, since it doesn't first reset stepsSinceReorder to a high number, this will rarely actually cause the atoms to be reordered. This seems like unintended behavior, and that adding cu.setStepsSinceReorder(99999) right before cu.reorderAtoms() would ensure the atoms are correctly reordered after setPositions() is called.

@zhang-ivy
Contributor

zhang-ivy commented Jul 19, 2022

@peastman : Thanks so much for identifying this!

Is there a way to trigger the reordering via the API? If so, we could call this every time we call context.setPositions(). Or would it make sense to automatically reorder after updating the positions?

Atom reordering seems quite expensive, so although this might remove the cycling problem, the average iteration time will probably be much longer, right?

If not, we might be able to restructure the calculations to try to use a separate Context for each replica if the GPU has space, to eliminate the need for atom reorderings.

This seems like a better solution.

but because reorderAtoms() only actually reorders atoms if stepsSinceReorder < 250, since it doesn't first reset stepsSinceReorder to a high number, this will rarely actually cause the atoms to be reordered. This seems like unintended behavior,

I think it reorders atoms if stepsSinceReorder >= 250, and every time it is < 250, it increments stepsSinceReorder by 1, so it's just reordering every 250 steps:

void ComputeContext::reorderAtoms() {
    atomsWereReordered = false;
    if (numAtoms == 0 || !getNonbondedUtilities().getUseCutoff() || stepsSinceReorder < 250) {
        stepsSinceReorder++;
        return;
    }
    atomsWereReordered = true;
    stepsSinceReorder = 0;
    if (getUseDoublePrecision())
        reorderAtomsImpl<double, mm_double4, double, mm_double4>();
    else if (getUseMixedPrecision())
        reorderAtomsImpl<float, mm_float4, double, mm_double4>();
    else
        reorderAtomsImpl<float, mm_float4, float, mm_float4>();
}

@ijpulidos
Contributor

ijpulidos commented Jul 19, 2022

@peastman These are great points you are making. It makes a lot of sense. Was this sampling algorithm something that has changed in the dev version of OpenMM? Or in other words, why does this issue not show up in the current 7.7 release?

On the other hand, there used to be a time when, if we specified a context_cache for the MCMCMove object, the contexts were deep copied. But then this generated the memory issue in choderalab/openmmtools#521. Since then, many of the things related to the context cache objects have been simplified. That said, I guess we can still try to do this in a more intelligent way, or have another parameter that controls how many replicas get copied with an independent context, or something similar.

@zhang-ivy
Contributor

zhang-ivy commented Jul 19, 2022

Was this sampling algorithm something that has changed in the dev version of OpenMM? Or in other words, why does this issue not show up in the current 7.7 release?

I think by "sampling algorithm", he means the details of how we are running repex, which is outside of openmm (in openmmtools). But Iván does raise a good point -- are there any changes to atom reordering that were added between 7.7 and 7.7.0.dev1? If not, then I wonder if there is a different change that is causing the difference in behaviors between the two openmm versions (when using 2 MPI processes)?

@jchodera
Member Author

I think by "sampling algorithm", he means the details of how we are running repex, which is outside of openmm (in openmmtools).

We have two separate stages:

propagate_replicas() loops through each replica, potentially reusing a single Context:

for replica in replicas:
    set positions to replica
    take n_steps of dynamics

If the atoms are not reordered after the new replica positions are set, some fraction of those steps will be taken with bad atom orderings, resulting in slow dynamics. In the worst case, all steps will be slow. If we're using a separate ContextCache for propagation, the phase shift each iteration would be n_replicas * n_steps % 250, where the higher the phase shift, the slower the calculation. This would cause the periodic behavior we are observing in the total iteration propagation time. Triggering atom reordering after updating positions would eliminate this cycling, as would caching a separate context for each replica.
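
As a toy illustration of that phase-shift formula (the numbers here are hypothetical, chosen so the shift is nonzero; they are not our production settings):

from math import gcd

reorder_interval = 250
n_replicas, n_steps = 24, 260   # hypothetical values
shift = (n_replicas * n_steps) % reorder_interval
period = reorder_interval // gcd(shift, reorder_interval) if shift else 1
print(f"phase shift per iteration: {shift} steps; pattern repeats every {period} iterations")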

For energy computation, we currently use the idiom

for state in states:
    for replica in replicas:
        set positions for replica
        compute energies in state

This means that energy evaluation will also take longer, since we're always using suboptimal atom orderings. Instead, we should invert this order (see the sketch below), or cache a separate context for each replica.
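
A rough sketch of the inverted loop (assuming openmmtools' ThermodynamicState.apply_to_context and that all states are compatible with the single context; this is not the current openmmtools code):

# Sketch: replica-outer / state-inner energy evaluation, one setPositions() per replica
from openmm import unit

def compute_energy_matrix(context, replica_positions, thermodynamic_states):
    energies = {}
    for r, positions in enumerate(replica_positions):
        context.setPositions(positions)          # move the atoms once per replica
        for s, state in enumerate(thermodynamic_states):
            state.apply_to_context(context)      # sweep states without moving atoms
            energy = context.getState(getEnergy=True).getPotentialEnergy()
            energies[(r, s)] = energy.value_in_unit(unit.kilojoule_per_mole)
    return energies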

Atom reordering seems quite expensive, so although this might remove the cycling problem, the average iteration time will probably be much longer, right?

Running up to 250 steps with bad atom orderings might be much more expensive than reordering atoms once at the beginning and using the optimal atom ordering for the replica propagation.

@jchodera
Member Author

@peastman : One other thought: Why fix the atom reordering interval at all? If it can have such a big impact on performance, wouldn't a simple heuristic to auto-tune the atom reordering interval have the potential for significant performance impact?

@ijpulidos
Contributor

Triggering the reordering of atoms does work around the issue. I tried what @jchodera suggested in this comment and the results are as follows. Sorry for the crowded plot, but I think it's readable enough.
[image: iteration timing comparison with and without the reordering fix]
The runs labeled "fix" are the ones with the change; the impact on performance is not too large, and it's also better if we use the UseBlockingSync: false option (maybe we should make this the default?).

The localbuild is openmm at the commit @zhang-ivy pointed to in a previous comment, which corresponds to her environment and to the openmm-7.7.0-dev1 package at conda-forge.

@mikemhenry
Contributor

@peastman : One other thought: Why fix the atom reordering interval at all? If it can have such a big impact on performance, wouldn't a simple heuristic to auto-tune the atom reordering interval have the potential for significant performance impact?

It is a lot of work, but I've seen other MD engines auto-tune the neighbor list parameters. The advantage of auto-tuning is that there are many choices in neighbor list parameters that are very system dependent.

@peastman

are there any changes to atom reordering that were added between 7.7 and 7.7.0.dev1?

Not that I know of. Clearly something changed, since the behavior is slightly different. But it seems to be a change for the better, since average performance is now slightly faster than before.

wouldn't a simple heuristic to auto-tune the atom reordering interval have the potential for significant performance impact?

The challenge is to come up with a heuristic that's efficient in all cases, or at least not too horribly inefficient. And of course, that requires prioritizing which use cases are most important and which are less important.

the advantage of auto-tuning is that there are many choices in neighbor list parameters that are very system dependent.

In this case, the thing we need to adapt to is the pattern of what methods you invoke in what order. When it's just a matter of looking at simulation settings, you can analyze them and try to pick optimal parameters. But each time you call setPositions(), the "optimal" choice depends on what other methods you're going to call later. If you're going to do just one force evaluation then call setPositions() again, it's best to stick with the current non-optimal ordering. It will still be faster than reordering just for one evaluation. If you're going to run a lot of dynamics starting from it, then it's faster to reorder so all the subsequent steps will be faster.

The current behavior compromises between the two. It will do at most 250 inefficient steps before reordering again. Unfortunately, that just happens to be the exact number of steps the above script executes before switching replicas, which is maybe the worst possible choice. If it did 1000 steps then at least 75% of steps would be guaranteed to have an efficient ordering.

The question is whether we can find a different behavior that makes this particular use case faster without making a more important use case slower, or making a less important but still relevant use case dramatically slower.
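
To make the fraction argument concrete (toy arithmetic only, assuming at most one block of stale-ordering steps per setPositions() call):

def efficient_fraction(steps_per_set_positions, interval=250):
    # at most `interval` steps after each setPositions() run with a stale ordering
    stale = min(steps_per_set_positions, interval)
    return 1.0 - stale / steps_per_set_positions

print(efficient_fraction(250))    # 0.0  -> the worst case described above
print(efficient_fraction(1000))   # 0.75 -> "at least 75%" of steps efficient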

@jchodera
Member Author

It sounds like it would be more performant to keep track of the number of steps of dynamics since the last setPositions() or reorderAtoms() call, and reorder atoms in the Integrator.step() call instead.

Even better would be to auto-tune based on the number of steps of dynamics since the last setPositions() or reorderAtoms(), since this could be adaptive to the individual system and integrator used and avoid significant speed degradation.

It could potentially be useful to use total atomic displacements, or pairlist updates, but it might be hard to find a universal heuristic for those. Just auto-tuning based on number of steps since last setPositions() or reorderAtoms() seems like the best choice.
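
Something like this toy sketch of the bookkeeping is what I have in mind (purely illustrative Python; the real logic would live inside OpenMM's ComputeContext, and the cost constant and smoothing rule are made up):

# Toy sketch: auto-tune reordering based on steps since the last setPositions()/reorder
class ReorderHeuristic:
    def __init__(self, reorder_cost_in_steps=50):
        self.reorder_cost_in_steps = reorder_cost_in_steps  # hypothetical amortization cost
        self.typical_run_length = 250.0  # running estimate of steps between setPositions() calls
        self.steps_since_reset = 0

    def on_step(self):
        self.steps_since_reset += 1

    def on_set_positions(self):
        # update the estimate of how long we usually run between position updates
        if self.steps_since_reset > 0:
            self.typical_run_length = 0.9 * self.typical_run_length + 0.1 * self.steps_since_reset
        self.steps_since_reset = 0

    def should_reorder_now(self):
        # reorder immediately only if the expected run length pays back the reordering cost
        return self.typical_run_length > self.reorder_cost_in_steps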

@zhang-ivy
Contributor

Some notes from our perses dev sync this past Monday:

What are we going to do about repex cycling issue?

  • In openmm -- still working with Peter to determine what the appropriate fix in OpenMM would be
  • In openmmtools -- we may want to restructure the MultiStateSampler to have a mode that allows for a separate context cache for each replica, though having a context cache for each replica may cause memory issues and slowdowns in other ways. Would need to profile this more to determine whether this mode would remove the cycling problem without causing memory/slowdown issues.
  • Without openmm or openmmtools changes -- we could force n_mpi_processes == n_replicas. This would basically be the same as creating a new context for each replica.

I tried the last option, and here are the results:
[image: iteration times for different numbers of MPI processes]

In each of the above experiments, I am running t-repex on barnase:barstar using 2 GeForce RTX 2080 Ti and n_replicas = 24 with OpenMM 7.7.0.dev1. Though using 24 MPI processes causes the cycling problem to disappear, it seems like the overhead of using so many processes causes a slowdown that is not worth it.

@peastman

As a short term workaround, is it reasonable to increase the number of steps per iteration? Was 250 chosen for a good reason, or is it arbitrary? A larger number would reduce the impact of reordering and lead to faster overall performance.

@jchodera
Member Author

jchodera commented Jul 28, 2022 via email

@peastman

I have some preliminary ideas about how to address this. I want to take a little more time to let them solidify.

Before implementing anything, we need to identify all relevant use cases and make sure we have good benchmarks for them. We need to be sure we don't accidentally make something slower. I think these are the main cases to consider.

  • You call setPositions(), then run a lot of steps of dynamics. (This is the most important case, since it's what most simulations do.)
  • You call setPositions(), then run just a small amount of dynamics before changing them again. (This is what you do in this case; a rough benchmark sketch for it follows this list.)
  • You repeatedly call setPositions(), always with positions that are very similar to each other, then do only a single force or energy evaluation for each one. (LocalEnergyMinimizer does this, so it's another very important case.)
  • Same as above, but successive positions may be very different from each other. (For example, you might be analyzing conformations generated by some other algorithm or saved from an earlier simulation.)
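
For the second case, a rough benchmark sketch along these lines might be a starting point (the test system, step counts, and warm-up are purely illustrative, not an agreed benchmark):

# Sketch: setPositions() followed by a short burst of dynamics, repeated many times
import time
import openmm
from openmm import unit
from openmmtools import testsystems

test = testsystems.WaterBox(box_edge=30.0 * unit.angstroms)
integrator = openmm.LangevinMiddleIntegrator(300 * unit.kelvin,
                                             1.0 / unit.picosecond,
                                             2.0 * unit.femtoseconds)
context = openmm.Context(test.system, integrator)
context.setPositions(test.positions)
integrator.step(500)   # warm-up, so the first reordering has already happened

start = time.perf_counter()
for _ in range(20):                       # mimic cycling through replicas
    context.setPositions(test.positions)  # "new replica" positions
    integrator.step(250)                  # short burst, as in the repex loop
print(f"{time.perf_counter() - start:.2f} s for 20 x 250 steps")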

@jchodera
Member Author

I think these are the main cases to consider.

This list looks reasonable to me!

I'll tag @mikemhenry @ijpulidos @zhang-ivy @dominicrufa in case they have anything to add.

@zhang-ivy
Contributor

That list covers all the cases I'm interested in -- the last case is something we do in our repex code as well -- for each iteration, we compute the potential energy of each replica at each state. And the first case also covers nonequilibrium switching.

@peastman

Ok, thanks. I think the behavior we want is something like this.

  1. If you call setPositions(), and then immediately follow with a call to step(), we should immediately perform reordering. If you take one step, you're probably going to take lots of steps.
  2. If you call setPositions() and then getState(), we should not do reordering. You may only intend a single evaluation with those coordinates.
  3. But in case 2, we still need to remember that you set the positions with setPositions(). If you follow getState() with step(), at that point we should immediately do reordering.
  4. If the positions get updated through any means other than a call to setPositions(), we should maintain the current behavior. Lots of classes update positions in other ways through internal APIs, but usually only for small changes to the positions.
  5. Preferably this should be done in a way that maintains source compatibility with existing plugins.

@jchodera
Member Author

This sounds eminently reasonable.

Any thoughts on allowing the reordering interval to be set manually as a platform property? That should be relatively straightforward and help us discover if auto-tuning would provide significant advantages.

It would still be nice to have a manual way to force reordering via the API, but given this is a platform implementation detail, it doesn't seem like there is a clean way to introduce this into the (platform-agnostic) OpenMM API.

@peastman

Any thoughts on allowing the reordering interval to be set manually as a platform property?

Let's first see if we can make it work well without the need for that. When you expose an internal detail like that, it becomes harder to change the implementation later. And then people specify a value for it without understanding what it does, just because they saw someone else do it in a script, and they get worse performance because of it.

@peastman

I implemented the logic described above in #613 (comment). It mostly works, but it ends up having unintended consequences. Every time the barostat runs, it triggers a call to setPositions(), which means reordering gets done much more often than before. I could work around that, but I think it would be good to consider other approaches as well. For example, I'm thinking about a method that would monitor the size of the neighbor list and trigger a reordering whenever it sees the neighbor list has grown too much. The goal is to find something robust that will automatically do the right thing in nearly all cases.

@zhang-ivy
Contributor

Closed via openmm/openmm#3721
