
Replica propagation times systematically change in cyclic patterns that may slow simulations #613

Closed · jchodera opened this issue Dec 21, 2019 · 133 comments
Labels: optimization (performance optimization), priority: high

Comments

@jchodera
Member

In plotting the total replica propagation time per iteration from one of @hannahbrucemacdonald's experiments (/data/chodera/brucemah/relative_paper/dec/jnk1/lig20to8/complex_0.stderr), it appears that there is a pattern:
[image: replica propagation time per iteration]
Connecting successive iterations with lines reveals a clearer pattern:
[image: successive iterations connected with lines]
Zooming in shows the pattern more clearly:
[image: zoomed view of the cyclic pattern]
This is currently just something intriguing, but something we should investigate in the new year since it may lead to speed improvements if we can understand what is going on here.

@jchodera
Member Author

jchodera commented Feb 3, 2020

@hannahbrucemacdonald : If you have a chance before I do, here’s what I was doing to look into the variable timing issue.
I inserted these lines into my in-place installation of openmmtools. You can find where this is installed with

python -c "from openmmtools import multistate; print(multistate.__file__)"

I edited multistatesampler.py and inserted the following at the end of _propagate_replica
at line 1235:
https://github.com/choderalab/openmmtools/blob/master/openmmtools/multistate/multistatesampler.py#L1235
Here are the lines:

 # DEBUG: Report nvidia-smi output after each replica propagation
 import subprocess
 process = subprocess.run(['nvidia-smi'], stdout=subprocess.PIPE, universal_newlines=True)
 logger.debug(process.stdout)

This should write the output of nvidia-smi at each replica propagation step, reporting the current temperature, power usage, and clock speed:

Tue Jan 21 12:19:31 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:02:00.0 Off |                  N/A |
|  0%   31C    P8     5W / 180W |      1MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The relevant numbers are (a sketch for grabbing just these fields programmatically follows the list):
• the driver version
• the GPU type
• the fan %
• the temperature
• the power usage
• the process table
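
If it's easier to capture just those fields rather than the full table, here is a minimal sketch using nvidia-smi's --query-gpu interface, mirroring the debug snippet above (so logger comes from the same module); the exact field names are assumptions based on recent drivers, so double-check them against nvidia-smi --help-query-gpu:

# Sketch: log only the relevant nvidia-smi fields at each replica propagation step
import subprocess
fields = 'timestamp,name,driver_version,fan.speed,temperature.gpu,power.draw,pstate,clocks.sm'
process = subprocess.run(
    ['nvidia-smi', '--query-gpu=' + fields, '--format=csv,noheader'],
    stdout=subprocess.PIPE, universal_newlines=True)
logger.debug(process.stdout.strip())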

@hannahbrucemacdonald
Contributor

Did you only see this in complex, or solvent and vacuum too?

@hannahbrucemacdonald
Contributor

hannahbrucemacdonald commented Feb 11, 2020

Hey - I've gotten as far as running a short bit of complex with Thrombin, and here are the minimal plots. It looks like we have oscillations that are not correlated with the GPU temperature. It looks like it's stabilising around 80 C, which should be fine. I will have a go with the other parameters that you've suggested too.

System: Thrombin
GPU type: NVIDIA-SMI 418.39
Driver version: 418.39

  • the driver version
  • the GPU type
  • the fan %
  • the temperature
  • the power usage
  • the process table

[image]

[image]

(edited to fix x-axis labels)

@hannahbrucemacdonald
Contributor

[image: fan speed plot]

Fan speed is quite variable, but it also doesn't seem to correlate.

@jchodera
Member Author

Are you plotting this every iteration or every replica propagation? I was hoping to get more fine-grained information by printing the output of nvidia-smi at every replica propagation substep.

@hannahbrucemacdonald
Contributor

Ahhhh I see what you mean, I have that I think - give me a min.

@hannahbrucemacdonald
Contributor

hannahbrucemacdonald commented Feb 11, 2020

Ok so I have the time of each replica iteration from the nvidia-smi output, but the timestamps only have one-second resolution, while each replica propagation takes ~1 second. Because of the rounding, I can't see anything useful about whether individual replicas are taking longer or not.

[image: per-replica times from nvidia-smi timestamps]

Unless I'm getting the wrong end of the stick.

I can increase the amount of MD per step (~factor of 10?) to push this up? We currently do 250 × 4 fs per replica per iteration (1 picosecond).
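
Alternatively, I could time each propagation directly in Python instead of relying on the one-second nvidia-smi timestamps. A rough sketch (the placement inside _propagate_replica and the mcmc_move.apply call are just assumptions, mirroring the debug snippet earlier in the thread):

# Sketch: time each replica propagation with sub-millisecond resolution
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def timed(label):
    start = time.perf_counter()
    yield
    logger.debug("%s took %.3f s", label, time.perf_counter() - start)

# hypothetical usage inside _propagate_replica:
# with timed(f"replica {replica_id} propagation"):
#     mcmc_move.apply(thermodynamic_state, sampler_state)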

@jchodera
Member Author

Did you only see this in complex, or solvent and vacuum too?

@hannahbrucemacdonald : Great question! I used this Jupyter notebook to generate this plot for all phases from /data/chodera/brucemah/relative_paper/dec/jnk1/lig20to8/ --- just change the paths in the notebook to plot a different calculation!

GitHub seems to have trouble with image uploads right now, but check out the notebook:

  • cyclic behavior is only observed with complex
  • the iteration time for solvent keeps growing with iteration number---is this a NetCDF issue?
  • the vacuum time keeps growing, but also switches back and forth between fast and slow in long time blocks (!!!); are we using the CPU or Reference platform for vacuum?

@jaimergp
Member

That temperature plot looks suspicious to me. Temp throttling can definitely happen at intervals.

Some data pulled from this article:

Ideally you want temperatures to be as low as possible, anything below 80 degrees is normal and should keep throttling in check. Nvidia's GTX 1080 Ti, for example, has a throttling point of 84 degrees. If you keep the temperature below 80 degrees you leave yourself with a bit of breathing room, so you can focus on having fun instead of monitoring GPU frequencies.

It's important to remember that every graphics card has a different throttling point. The previous-gen GTX 980 and 970, for example, throttle at 80 degrees, while AMD's Vega series cards can reach a maximum temperature of 85 degrees before they throttle. You will need to find out the throttling point for your specific card in order to set an effective fan curve and voltage.

So we might be hitting 80ºC, decreasing performance to "calm down" (but keeping the temperature just a bit below 80ºC!), then we think we can use more power and hit 80ºC, chill down, etc.

@jaimergp
Member

Power usage might give more details.

@hannahbrucemacdonald
Contributor

hannahbrucemacdonald commented Feb 12, 2020

Sorry, I got a little further --- but not much. I can expand get_details() to wrangle in more details.

The biggest addition is looking at /data/chodera/brucemah/relative_paper/amber_starting/off100/thrombin/test/lig1to10/

h_edits_to_ipynb.zip

The timestep is still too small to really resolve the fine-grained timing... I'll run something slower in the same directory - hopefully it will be there for European morning time!

@hannahbrucemacdonald
Contributor

Hopefully there is some more output in /data/chodera/brucemah/relative_paper/amber_starting/off100/thrombin/test/SLOWER1to10 for you to see in the AM!

@jchodera
Member Author

So we might be hitting 80ºC, decreasing performance to "calm down" (but keeping the temperature just a bit below 80ºC!), then we think we can use more power and hit 80ºC, chill down, etc.

This is my hypothesis exactly!

@jchodera
Member Author

jchodera commented Feb 13, 2020

It really looks like the temperature is causing the driver to throttle the speed here:
[image]
To prove this, we also need you to pull the power usage. That's essentially a direct readout of the clock speed.
[image: nvidia-smi output]
Here, the power usage is the "46W".

Could you also pull the node name(s), the driver version, and GPU type?

Once we have some good data compiled, we can ask the sysadmins to tweak the driver settings a bit to minimize this.

@jchodera
Member Author

There's more info on nvidia-smi and its options here:
https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/

There are some useful options here:
[image: nvidia-smi options]

@jaimergp suggests that we especially want to monitor the Power State to make sure it is always in P0. When it hits a thermal limit, it will switch into lower power states (P1 or P2) until it returns to the thermal envelope in which P0 is allowed.

The sysadmins have control over the power profile to modify the policy used to change power states.
Here's a guide from NVIDIA.
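
If shelling out to nvidia-smi from the driver script gets clunky, the same values can be polled through NVML directly. A minimal sketch, assuming the pynvml bindings (the nvidia-ml-py package) are installed; the logging policy here is just an illustration:

# Sketch: check performance state, temperature, and power draw via NVML
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

pstate = pynvml.nvmlDeviceGetPerformanceState(handle)   # 0 == P0, 2 == P2, ...
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)  # degrees C
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # reported in mW, convert to W

if pstate != 0:
    print(f"GPU not in P0: P{pstate} at {temp} C, drawing {power:.1f} W")

pynvml.nvmlShutdown()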

@hannahbrucemacdonald
Contributor

Cool, thanks - I'll look into:

  • power usage
  • node name
  • driver version
  • GPU type
  • Power State

@hannahbrucemacdonald
Contributor

Back working on this now that Lilac is back --- "Power state" is deprecated and is now called "Performance state". It seems at first glance to always be at P2.

@jchodera
Member Author

@hannahbrucemacdonald :
Can you try sampling with nvidia-smi -q -d POWER,CLOCK,TEMPERATURE,PERFORMANCE?

(base) [chodera@lilac:chodera]$ nvidia-smi -q -d POWER,CLOCK,TEMPERATURE,PERFORMANCE

==============NVSMI LOG==============

Timestamp                           : Thu Feb 20 14:04:14 2020
Driver Version                      : 440.33.01
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:84:00.0
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    Temperature
        GPU Current Temp            : 37 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 8.62 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W
    Power Samples
        Duration                    : 118.14 sec
        Number of Samples           : 119
        Max                         : 9.79 W
        Min                         : 8.52 W
        Avg                         : 8.79 W
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1911 MHz
        SM                          : 1911 MHz
        Memory                      : 5505 MHz
        Video                       : 1620 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    SM Clock Samples
        Duration                    : 257.12 sec
        Number of Samples           : 100
        Max                         : 1822 MHz
        Min                         : 139 MHz
        Avg                         : 400 MHz
    Memory Clock Samples
        Duration                    : 257.12 sec
        Number of Samples           : 100
        Max                         : 5508 MHz
        Min                         : 405 MHz
        Avg                         : 997 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A

and look at these things:

  1. Check the performance state (which isn't always P2, but should always be P0 for maximum performance):
    Performance State               : P8
  2. Compare the current temp vs the slowdown/shutdown temps:
* GPU Current Temp            : 40 C
* GPU Shutdown Temp           : 96 C
* GPU Slowdown Temp           : 93 C
  3. Compare power draw vs power limit:
        Power Draw                  : 9.02 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W
  4. Compare clocks:
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz

There is useful info on ensuring we can get maximum performance here:
https://devtalk.nvidia.com/default/topic/892842/cuda-programming-and-performance/one-weird-trick-to-get-a-maxwell-v2-gpu-to-reach-its-max-memory-clock-/

@jchodera
Member Author

Note that step 1---just looking to make sure we are always in P0---may be sufficient here! If we're not, we should post on the HPC slack and have them fix this.

@hannahbrucemacdonald
Contributor

^ I'm saying that we are always in P2, i.e. from the first step.

@jchodera
Member Author

From my reading, I've determined:

  • P0 is the highest performance state, but GTX 1080 / 1080 Ti cards have drivers that force them into the lower-performance P2 state when they detect compute loads
  • There is a Windows utility that can disable this, but there is no corresponding Linux utility (yet)
  • There may still be additional throttling going on due to temperature, which looks like what is going on in your plots. We may still be able to change a setting that minimizes this throttling behavior. I'm looking into this further.

@jchodera
Member Author

jchodera commented Feb 21, 2020

Here's what I've found so far:

  • This detailed description of inspecting and controlling your GPU settings on Linux is useful, but seems limited in what can be done with GTX cards compared to Tesla cards (e.g. V100)
  • Tesla cards also support fancy datacenter GPU manager software, but GTX cards appear not to
  • There are some driver settings for GPU boosting that may be relevant, but I'm not sure if they work on the GTX cards
  • This post shows some ideas on how to increase the memory clock speed
  • This thread discusses the issues of being locked into P2 in more detail
  • This thread specifically talks about being locked into P2 in compute mode on GTX 1080 Tis
  • This thread presents an intriguing solution that uses a driver setting that appears to be totally undocumented (modprobe nvidia NVreg_RegistryDwords="RMForcePstate=0") to force into P0:
rmmod nvidia; modprobe nvidia NVreg_RegistryDwords="RMForcePstate=0"; nvidia-smi -pm 1; sleep 30; nvidia-smi 

Edit: It appears that this is described as an option for the Windows-only nvidiaInspector, but the above presents a way to do this in Linux.

  • This post talks more about power mizer settings and how to disable it.

@hannahbrucemacdonald
Contributor

Just had a chat with Sam Ellis (MolSSI) about this --- he suggested that we try running with OpenCL to sanity check whether it's a CUDA issue or not. He wasn't sure, but seeing as we've tried it on a few different pieces of hardware, he said it might be worth narrowing down whether it's more on the software side of things.

Also - I haven't run this since upgrading to cuda/10.1 and OpenMM 7.4.1, which I guess you have been running?

@jchodera
Member Author

@peastman : This is fantastic! This makes a great deal of sense, and I think fully explains the odd behavior we've observed.

Is there a way to trigger the reordering via the API? If so, we could call this every time we call context.setPositions(). Or would it make sense to automatically reorder after updating the positions?

If not, we might be able to restructure the calculations to try to use a separate Context for each replica if the GPU has space, to eliminate the need for atom reorderings.

@jchodera
Member Author

jchodera commented Jul 19, 2022

@peastman : It looks like CudaUpdateStateDataKernel::setPositions does call cu.reorderAtoms(), but because reorderAtoms() only actually reorders atoms if stepsSinceReorder < 250, since it doesn't first reset stepsSinceReorder to a high number, this will rarely actually cause the atoms to be reordered. This seems like unintended behavior, and that adding cu.setStepsSinceReorder(99999) right before cu.reorderAtoms() would ensure the atoms are correctly reordered after setPositions() is called.

@zhang-ivy
Contributor

zhang-ivy commented Jul 19, 2022

@peastman : Thanks so much for identifying this!

Is there a way to trigger the reordering via the API? If so, we could call this every time we call context.setPositions(). Or would it make sense to automatically reorder after updating the positions?

Atom reordering seems quite expensive, so although this might remove the cycling problem, the average iteration time will probably be much longer, right?

If not, we might be able to restructure the calculations to try to use a separate Context for each replica if the GPU has space, to eliminate the need for atom reorderings.

This seems like a better solution.

but because reorderAtoms() only actually reorders atoms if stepsSinceReorder < 250, since it doesn't first reset stepsSinceReorder to a high number, this will rarely actually cause the atoms to be reordered. This seems like unintended behavior,

I think it reorders atoms if stepsSinceReorder >= 250, and every time it is < 250, it increments stepsSinceReorder by 1, so it's just reordering every 250 steps:

void ComputeContext::reorderAtoms() {
    atomsWereReordered = false;
    if (numAtoms == 0 || !getNonbondedUtilities().getUseCutoff() || stepsSinceReorder < 250) {
        stepsSinceReorder++;
        return;
    }
    atomsWereReordered = true;
    stepsSinceReorder = 0;
    if (getUseDoublePrecision())
        reorderAtomsImpl<double, mm_double4, double, mm_double4>();
    else if (getUseMixedPrecision())
        reorderAtomsImpl<float, mm_float4, double, mm_double4>();
    else
        reorderAtomsImpl<float, mm_float4, float, mm_float4>();
}

@ijpulidos
Contributor

ijpulidos commented Jul 19, 2022

@peastman These are great points you are making. It makes a lot of sense. Was this sampling algorithm something that has changed in the dev version of OpenMM? Or in other words, why does this issue not show up in the current 7.7 release?

On the other hand, there used to be a time when, if we specified a context_cache for the MCMCMove object, the contexts were deep copied. But then this generated the memory issue in choderalab/openmmtools#521. Since then, many of the things related to the context cache objects have been simplified. That said, I guess we can still try to do this in a more intelligent way, or have another parameter that controls how many replicas get copied with an independent context, or something similar.

@zhang-ivy
Contributor

zhang-ivy commented Jul 19, 2022

Was this sampling algorithm something that has changed in the dev version of OpenMM? Or in other words, why does this issue not show up in the current 7.7 release?

I think by "sampling algorithm", he means the details of how we are running repex, which is outside of openmm (in openmmtools). But Iván does raise a good point -- are there any changes to atom reordering that were added between 7.7 and 7.7.0.dev1? If not, then I wonder if there is a different change that is causing the difference in behaviors between the two openmm versions (when using 2 MPI processes)?

@jchodera
Member Author

I think by "sampling algorithm", he means the details of how we are running repex, which is outside of openmm (in openmmtools).

We have two separate stages:

propagate_replicas() loops through each replica, potentially reusing a single Context:

for replica in replicas:
    set positions to replica
    take n_steps of dynamics

If the atoms are not reordered after the new replica positions are set, some fraction of those steps will be taken with bad atom orderings, resulting in slow dynamics. In the worst case, all steps will be slow. If we're using a separate ContextCache for propagation, the phase shift each iteration would be n_replicas * n_steps % 250, where the higher the phase shift, the slower the calculation. This would cause the periodic behavior we are observing in the total iteration propagation time. Triggering atom reordering after updating positions would eliminate this cycling, as would caching a separate context for each replica.
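
As a toy illustration of that phase-shift formula (the numbers here are hypothetical, chosen so the shift is nonzero; they are not our production settings):

from math import gcd

reorder_interval = 250
n_replicas, n_steps = 24, 260   # hypothetical values
shift = (n_replicas * n_steps) % reorder_interval
period = reorder_interval // gcd(shift, reorder_interval) if shift else 1
print(f"phase shift per iteration: {shift} steps; pattern repeats every {period} iterations")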

For energy computation, we currently use the idiom

for state in states:
    for replica in replicas:
        set positions for replica
        compute energies in state

This means that energy evaluation will also take longer, since we're always using suboptimal atom orderings. Instead, we should invert this order (see the sketch below), or cache a separate context for each replica.
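
A rough sketch of the inverted loop (assuming openmmtools' ThermodynamicState.apply_to_context and that all states are compatible with the single context; this is not the current openmmtools code):

# Sketch: replica-outer / state-inner energy evaluation, one setPositions() per replica
from openmm import unit

def compute_energy_matrix(context, replica_positions, thermodynamic_states):
    energies = {}
    for r, positions in enumerate(replica_positions):
        context.setPositions(positions)          # move the atoms once per replica
        for s, state in enumerate(thermodynamic_states):
            state.apply_to_context(context)      # sweep states without moving atoms
            energy = context.getState(getEnergy=True).getPotentialEnergy()
            energies[(r, s)] = energy.value_in_unit(unit.kilojoule_per_mole)
    return energies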

Atom reordering seems quite expensive, so although this might remove the cycling problem, the average iteration time will probably be much longer, right?

Running up to 250 steps with bad atom orderings might be much more expensive than reordering atoms once at the beginning and using the optimal atom ordering for the replica propagation.

@jchodera
Member Author

@peastman : One other thought: Why fix the atom reordering interval at all? If it can have such a big impact on performance, wouldn't a simple heuristic to auto-tune the atom reordering interval have the potential for significant performance impact?

@ijpulidos
Contributor

Triggering the reordering of atoms does work around the issue. I tried what @jchodera suggested in this comment and the results are as follows. Sorry for the crowded plot, but I think it's readable enough.
[image: iteration timing comparison with and without the reordering fix]
The runs labeled "fix" are the ones with the change; the impact on performance is not too large, and it's also better if we use the UseBlockingSync: false option (maybe we should make this the default?).

The localbuild is openmm at the commit @zhang-ivy pointed to in a previous comment, which corresponds to her environment and to the openmm-7.7.0-dev1 package at conda-forge.

@mikemhenry
Contributor

@peastman : One other thought: Why fix the atom reordering interval at all? If it can have such a big impact on performance, wouldn't a simple heuristic to auto-tune the atom reordering interval have the potential for significant performance impact?

It is a lot of work, but I've seen other MD engines auto-tune the neighbor list parameters. The advantage of auto-tuning is that there are many choices in neighbor list parameters that are very system dependent.

@peastman

are there any changes to atom reordering that were added between 7.7 and 7.7.0.dev1?

Not that I know of. Clearly something changed, since the behavior is slightly different. But it seems to be a change for the better, since average performance is now slightly faster than before.

wouldn't a simple heuristic to auto-tune the atom reordering interval have the potential for significant performance impact?

The challenge is to come up with a heuristic that's efficient in all cases, or at least not too horribly inefficient. And of course, that requires prioritizing which use cases are most important and which are less important.

the advantage of auto-tuning is that there are many choices in neighbor list parameters that are very system dependent.

In this case, the thing we need to adapt to is the pattern of what methods you invoke in what order. When it's just a matter of looking at simulation settings, you can analyze them and try to pick optimal parameters. But each time you call setPositions(), the "optimal" choice depends on what other methods you're going to call later. If you're going to do just one force evaluation then call setPositions() again, it's best to stick with the current non-optimal ordering. It will still be faster than reordering just for one evaluation. If you're going to run a lot of dynamics starting from it, then it's faster to reorder so all the subsequent steps will be faster.

The current behavior compromises between the two. It will do at most 250 inefficient steps before reordering again. Unfortunately, that just happens to be the exact number of steps the above script executes before switching replicas, which is maybe the worst possible choice. If it did 1000 steps then at least 75% of steps would be guaranteed to have an efficient ordering.

The question is whether we can find a different behavior that makes this particular use case faster without making a more important use case slower, or making a less important but still relevant use case dramatically slower.
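
To make the fraction argument concrete (toy arithmetic only, assuming at most one block of stale-ordering steps per setPositions() call):

def efficient_fraction(steps_per_set_positions, interval=250):
    # at most `interval` steps after each setPositions() run with a stale ordering
    stale = min(steps_per_set_positions, interval)
    return 1.0 - stale / steps_per_set_positions

print(efficient_fraction(250))    # 0.0  -> the worst case described above
print(efficient_fraction(1000))   # 0.75 -> "at least 75%" of steps efficient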

@jchodera
Member Author

It sounds like it would be more performant to keep track of the number of steps of dynamics since the last setPositions() or reorderAtoms() call, and reorder atoms in the Integrator.step() call instead.

Even better would be to auto-tune based on the number of steps of dynamics since the last setPositions() or reorderAtoms(), since this could be adaptive to the individual system and integrator used and avoid significant speed degradation.

It could potentially be useful to use total atomic displacements, or pairlist updates, but it might be hard to find a universal heuristic for those. Just auto-tuning based on number of steps since last setPositions() or reorderAtoms() seems like the best choice.
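
Something like this toy sketch of the bookkeeping is what I have in mind (purely illustrative Python; the real logic would live inside OpenMM's ComputeContext, and the cost constant and smoothing rule are made up):

# Toy sketch: auto-tune reordering based on steps since the last setPositions()/reorder
class ReorderHeuristic:
    def __init__(self, reorder_cost_in_steps=50):
        self.reorder_cost_in_steps = reorder_cost_in_steps  # hypothetical amortization cost
        self.typical_run_length = 250.0  # running estimate of steps between setPositions() calls
        self.steps_since_reset = 0

    def on_step(self):
        self.steps_since_reset += 1

    def on_set_positions(self):
        # update the estimate of how long we usually run between position updates
        if self.steps_since_reset > 0:
            self.typical_run_length = 0.9 * self.typical_run_length + 0.1 * self.steps_since_reset
        self.steps_since_reset = 0

    def should_reorder_now(self):
        # reorder immediately only if the expected run length pays back the reordering cost
        return self.typical_run_length > self.reorder_cost_in_steps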

@zhang-ivy
Contributor

Some notes from our perses dev sync this past Monday:

What are we going to do about repex cycling issue?

  • In openmm -- still working with Peter to determine what the appropriate fix in OpenMM would be
  • In openmmtools -- we may want to restructure the MultiStateSampler to have a mode that allows for a separate context cache for each replica, though having a context cache for each replica may cause memory issues and slowdowns in other ways. Would need to profile this more to determine whether this mode would remove the cycling problem without causing memory/slowdown issues.
  • Without openmm or openmmtools changes -- we could force n_mpi_processes == n_replicas. This would basically be the same as creating a new context for each replica.

I tried the last option, and here are the results:
[image: iteration times for different numbers of MPI processes]

In each of the above experiments, I am running t-repex on barnase:barstar using 2 GeForce RTX 2080 Ti and n_replicas = 24 with OpenMM 7.7.0.dev1. Though using 24 MPI processes causes the cycling problem to disappear, it seems like the overhead of using so many processes causes a slowdown that is not worth it.

@peastman

As a short term workaround, is it reasonable to increase the number of steps per iteration? Was 250 chosen for a good reason, or is it arbitrary? A larger number would reduce the impact of reordering and lead to faster overall performance.

@jchodera
Member Author

jchodera commented Jul 28, 2022 via email

@peastman

I have some preliminary ideas about how to address this. I want to take a little more time to let them solidify.

Before implementing anything, we need to identify all relevant use cases and make sure we have good benchmarks for them. We need to be sure we don't accidentally make something slower. I think these are the main cases to consider.

  • You call setPositions(), then run a lot of steps of dynamics. (This is the most important case, since it's what most simulations do.)
  • You call setPositions(), then run just a small amount of dynamics before changing them again. (This is what you do in this case; a rough benchmark sketch for it follows this list.)
  • You repeatedly call setPositions(), always with positions that are very similar to each other, then do only a single force or energy evaluation for each one. (LocalEnergyMinimizer does this, so it's another very important case.)
  • Same as above, but successive positions may be very different from each other. (For example, you might be analyzing conformations generated by some other algorithm or saved from an earlier simulation.)
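
For the second case, a rough benchmark sketch along these lines might be a starting point (the test system, step counts, and warm-up are purely illustrative, not an agreed benchmark):

# Sketch: setPositions() followed by a short burst of dynamics, repeated many times
import time
import openmm
from openmm import unit
from openmmtools import testsystems

test = testsystems.WaterBox(box_edge=30.0 * unit.angstroms)
integrator = openmm.LangevinMiddleIntegrator(300 * unit.kelvin,
                                             1.0 / unit.picosecond,
                                             2.0 * unit.femtoseconds)
context = openmm.Context(test.system, integrator)
context.setPositions(test.positions)
integrator.step(500)   # warm-up, so the first reordering has already happened

start = time.perf_counter()
for _ in range(20):                       # mimic cycling through replicas
    context.setPositions(test.positions)  # "new replica" positions
    integrator.step(250)                  # short burst, as in the repex loop
print(f"{time.perf_counter() - start:.2f} s for 20 x 250 steps")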

@jchodera
Member Author

I think these are the main cases to consider.

This list looks reasonable to me!

I'll tag @mikemhenry @ijpulidos @zhang-ivy @dominicrufa in case they have anything to add.

@zhang-ivy
Contributor

That list covers all the cases I'm interested in -- the last case is something we do in our repex code as well -- for each iteration, we compute the potential energy of each replica at each state. And the first case also covers nonequilibrium switching.

@peastman

Ok, thanks. I think the behavior we want is something like this.

  1. If you call setPositions(), and then immediately follow with a call to step(), we should immediately perform reordering. If you take one step, you're probably going to take lots of steps.
  2. If you call setPositions() and then getState(), we should not do reordering. You may only intend a single evaluation with those coordinates.
  3. But in case 2, we still need to remember that you set the positions with setPositions(). If you follow getState() with step(), at that point we should immediately do reordering.
  4. If the positions get updated through any means other than a call to setPositions(), we should maintain the current behavior. Lots of classes update positions in other ways through internal APIs, but usually only for small changes to the positions.
  5. Preferably this should be done in a way that maintains source compatibility with existing plugins.

@jchodera
Member Author

This sounds eminently reasonable.

Any thoughts on allowing the reordering interval to be set manually as a platform property? That should be relatively straightforward and help us discover if auto-tuning would provide significant advantages.

It would still be nice to have a manual way to force reordering via the API, but given this is a platform implementation detail, it doesn't seem like there is a clean way to introduce this into the (platform-agnostic) OpenMM API.

@peastman

Any thoughts on allowing the reordering interval to be set manually as a platform property?

Let's first see if we can make it work well without the need for that. When you expose an internal detail like that, it becomes harder to change the implementation later. And then people specify a value for it without understanding what it does, just because they saw someone else do it in a script, and they get worse performance because of it.

@peastman

I implemented the logic described above in #613 (comment). It mostly works, but it ends up having unintended consequences. Every time the barostat runs, it triggers a call to setPositions(), which means reordering gets done much more often than before. I could work around that, but I think it would be good to consider other approaches as well. For example, I'm thinking about a method that would monitor the size of the neighbor list and trigger a reordering whenever it sees the neighbor list has grown too much. The goal is to find something robust that will automatically do the right thing in nearly all cases.

@zhang-ivy
Contributor

Closed via openmm/openmm#3721
