Hi,
Sorry in advance for a long post. The TL;DR is that I'm considering trying to add a threaded version of the `rollout` function available in the Python bindings, to speed up batched closed-loop simulations.

We're looking to use MuJoCo for RL on a single machine/single cluster node (i.e. not distributed), using TorchRL. I naively created a `DMControlEnv("humanoid", "stand")` in TorchRL, but noticed that stepping it was quite slow compared to just stepping the underlying MuJoCo model:

[figure: step-time comparison of raw MuJoCo stepping vs. dm_control vs. the TorchRL-wrapped env]

(Here, the `torchrl.envs.ParallelEnv` wrapping in the bottom bar only uses a single worker to make the overhead apparent, but even scaling it to multiple workers still appears to come with quite a big overhead.)

I haven't dug into exactly what dm_control and TorchRL are doing that takes so much longer than just stepping the physics, but I imagine at least part of the issue is that the Python overhead for stepping the environment, calculating rewards, checking termination criteria, etc. is quite large compared to stepping a single MuJoCo simulation a single step. Does this make sense, or is there something else I'm missing?
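For reference, this is roughly the kind of wall-clock comparison I mean (a minimal sketch; the `_env`/`ptr` attribute access used to reach the raw model is an assumption and may differ between TorchRL/dm_control versions):

```python
import time

import mujoco
from torchrl.envs import DMControlEnv

# TorchRL-wrapped dm_control environment.
env = DMControlEnv("humanoid", "stand")
env.reset()

# Time a random-policy rollout through the full TorchRL stack.
n = 1000
t0 = time.perf_counter()
env.rollout(n, break_when_any_done=False)
torchrl_dt = (time.perf_counter() - t0) / n

# Time the same number of raw physics steps on the underlying model.
physics = env._env.physics          # dm_control Physics (attribute name may vary)
model, data = physics.model.ptr, physics.data.ptr
t0 = time.perf_counter()
for _ in range(n):
    mujoco.mj_step(model, data)
raw_dt = (time.perf_counter() - t0) / n

print(f"TorchRL step: {torchrl_dt * 1e6:.1f} us, raw mj_step: {raw_dt * 1e6:.1f} us")
```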
If that is the case, then a better implementation of the environment would be to somehow make a single batch call to MuJoCo to step multiple environments, and then do all the subsequent calculations with vectorized numpy/torch. I searched and found issues #203 and #897, and it seems that `mujoco.rollout.rollout` would probably be the best way to go, especially since it can even support some basic multithreading using Python's `ThreadPoolExecutor` without the GIL blocking too much. So, I did a naive wall-clock benchmark of this for different numbers of timesteps (the `nstep` argument in `rollout`) and numbers of parallel simulations (the `nroll` argument in `rollout`) and got this result:

[figure: speed-up heatmaps over nstep × nroll; second panel uses a ThreadPoolExecutor]

The second panel is using a `ThreadPoolExecutor` as in issue #897 and `rollout_test.py`. The script I used is here. As I'd expect, the speed-up is greatest at multiples of the number of cores (48; hence the stripes) and is otherwise generally larger with more parallel roll-outs. But it seems that for a small number of time-steps, the overhead remains quite significant even for 1024 parallel roll-outs. Plotting the same data using lines:

[figure: the same data as line plots, speed-up vs. nstep for various nroll]

My conclusion from this is that for open-loop roll-outs, Python threading is indeed quite efficient (as demonstrated in #897), but for closed-loop RL with intermediate batch sizes (say, 256 to 1024 parallel simulations and 1 to 5 physics steps per control step) there is considerable overhead, making this setup around 10x slower than open-loop roll-outs. Does that sound reasonable?
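For concreteness, here is roughly how I'm splitting the batch across a thread pool, along the lines of #897 (a minimal sketch; the model path and thread count are placeholders, and the exact `rollout` signature may differ between MuJoCo versions):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import mujoco
from mujoco import rollout

model = mujoco.MjModel.from_xml_path("humanoid.xml")  # placeholder model path
nroll, nstep, nthread = 1024, 5, 48

# Identical full-physics initial state for every roll-out.
nstate = mujoco.mj_stateSize(model, mujoco.mjtState.mjSTATE_FULLPHYSICS)
initial_state = np.zeros((nroll, nstate))
data0 = mujoco.MjData(model)
mujoco.mj_getState(model, data0, initial_state[0], mujoco.mjtState.mjSTATE_FULLPHYSICS)
initial_state[:] = initial_state[0]

# One MjData per thread; rollout releases the GIL while stepping,
# so a plain ThreadPoolExecutor gives real parallelism.
datas = [mujoco.MjData(model) for _ in range(nthread)]
chunks = np.array_split(np.arange(nroll), nthread)

def run_chunk(thread_id, idx):
    # Each thread rolls out its own slice of the batch.
    return rollout.rollout(model, datas[thread_id], initial_state[idx], nstep=nstep)

with ThreadPoolExecutor(max_workers=nthread) as pool:
    results = list(pool.map(run_chunk, range(nthread), chunks))
```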
If that is so, I'm thinking that for our use case there could still be a 10x speed-up if the threading were done on the C++ side instead of in Python. So I'm thinking of trying to add a C++-threaded version of `rollout`, at least for our own use (and perhaps eventually make a PR if it works well), but wanted to first hear if that would make sense, or if anyone is working on something similar?

Thanks!
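To make the idea concrete, from Python the call I have in mind would look something like this (purely hypothetical; the `nthread` argument does not exist in the current bindings):

```python
# Hypothetical: same call as above, but the batch is split across an
# internal C++ thread pool instead of a Python ThreadPoolExecutor.
state, sensordata = rollout.rollout(
    model,
    mujoco.MjData(model),   # or one MjData per internal thread
    initial_state,          # shape (nroll, nstate)
    nstep=nstep,
    nthread=48,             # hypothetical new argument
)
```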