Hi,
Sorry in advance for a long post. The TL;DR is that I'm considering trying to add a threaded version of the `rollout` function available in the Python bindings, to speed up batched closed-loop simulations.

We're looking to use MuJoCo for RL on a single machine/single cluster node (i.e. not distributed), using TorchRL. I naively created a `DMControlEnv("humanoid", "stand")` in TorchRL, but noticed that stepping it was quite slow compared to just stepping the underlying MuJoCo model:

[figure: step-time comparison of raw MuJoCo stepping vs. dm_control vs. the TorchRL-wrapped env]

(Here, the `torchrl.envs.ParallelEnv` wrapping in the bottom bar only uses a single worker to make the overhead apparent, but even scaling it to multiple workers still appears to come with quite a big overhead.)

I haven't dug into exactly what dm_control and TorchRL are doing that takes so much longer than just stepping the physics, but I imagine at least part of the issue is that the Python overhead for stepping the environment, calculating rewards, checking termination criteria, etc. is quite large compared to stepping a single MuJoCo simulation a single step. Does this make sense, or is there something else I'm missing?
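For reference, this is roughly the kind of wall-clock comparison I mean (a minimal sketch; the `_env`/`ptr` attribute access used to reach the raw model is an assumption and may differ between TorchRL/dm_control versions):

```python
import time

import mujoco
from torchrl.envs import DMControlEnv

# TorchRL-wrapped dm_control environment.
env = DMControlEnv("humanoid", "stand")
env.reset()

# Time a random-policy rollout through the full TorchRL stack.
n = 1000
t0 = time.perf_counter()
env.rollout(n, break_when_any_done=False)
torchrl_dt = (time.perf_counter() - t0) / n

# Time the same number of raw physics steps on the underlying model.
physics = env._env.physics          # dm_control Physics (attribute name may vary)
model, data = physics.model.ptr, physics.data.ptr
t0 = time.perf_counter()
for _ in range(n):
    mujoco.mj_step(model, data)
raw_dt = (time.perf_counter() - t0) / n

print(f"TorchRL step: {torchrl_dt * 1e6:.1f} us, raw mj_step: {raw_dt * 1e6:.1f} us")
```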
If that is the case, then a better implementation of the environment would be to somehow make a single batch call to MuJoCo to step multiple environments, and then do all the subsequent calculations with vectorized numpy/torch. I searched and found issues #203 and #897, and it seems that `mujoco.rollout.rollout` would probably be the best way to go, especially since it can even support some basic multithreading using Python's `ThreadPoolExecutor` without the GIL blocking too much. So, I did a naive wall-clock benchmark of this for different numbers of timesteps (the `nstep` argument in `rollout`) and numbers of parallel simulations (the `nroll` argument in `rollout`) and got this result:

[figure: speed-up heatmaps over nstep × nroll; second panel uses a ThreadPoolExecutor]

The second panel is using a `ThreadPoolExecutor` as in issue #897 and `rollout_test.py`. The script I used is here. As I'd expect, the speed-up is greatest at multiples of the number of cores (48; hence the stripes) and is otherwise generally larger with more parallel roll-outs. But it seems that for a small number of time-steps, the overhead remains quite significant even for 1024 parallel roll-outs. Plotting the same data using lines:

[figure: the same data as line plots, speed-up vs. nstep for various nroll]

My conclusion from this is that for open-loop roll-outs, Python threading is indeed quite efficient (as demonstrated in #897), but for closed-loop RL with intermediate batch sizes (say, 256 to 1024 parallel simulations and 1 to 5 physics steps per control step) there is considerable overhead, making this setup around 10x slower than open-loop roll-outs. Does that sound reasonable?
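For concreteness, here is roughly how I'm splitting the batch across a thread pool, along the lines of #897 (a minimal sketch; the model path and thread count are placeholders, and the exact `rollout` signature may differ between MuJoCo versions):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import mujoco
from mujoco import rollout

model = mujoco.MjModel.from_xml_path("humanoid.xml")  # placeholder model path
nroll, nstep, nthread = 1024, 5, 48

# Identical full-physics initial state for every roll-out.
nstate = mujoco.mj_stateSize(model, mujoco.mjtState.mjSTATE_FULLPHYSICS)
initial_state = np.zeros((nroll, nstate))
data0 = mujoco.MjData(model)
mujoco.mj_getState(model, data0, initial_state[0], mujoco.mjtState.mjSTATE_FULLPHYSICS)
initial_state[:] = initial_state[0]

# One MjData per thread; rollout releases the GIL while stepping,
# so a plain ThreadPoolExecutor gives real parallelism.
datas = [mujoco.MjData(model) for _ in range(nthread)]
chunks = np.array_split(np.arange(nroll), nthread)

def run_chunk(thread_id, idx):
    # Each thread rolls out its own slice of the batch.
    return rollout.rollout(model, datas[thread_id], initial_state[idx], nstep=nstep)

with ThreadPoolExecutor(max_workers=nthread) as pool:
    results = list(pool.map(run_chunk, range(nthread), chunks))
```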
If that is so, I'm thinking that for our use case there could still be a 10x speed-up if the threading were done on the C++ side instead of in Python. So I'm thinking of trying to add a C++-threaded version of `rollout`, at least for our own use (and perhaps eventually make a PR if it works well), but wanted to first hear if that would make sense, or if anyone is working on something similar?

Thanks!
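To make the idea concrete, from Python the call I have in mind would look something like this (purely hypothetical; the `nthread` argument does not exist in the current bindings):

```python
# Hypothetical: same call as above, but the batch is split across an
# internal C++ thread pool instead of a Python ThreadPoolExecutor.
state, sensordata = rollout.rollout(
    model,
    mujoco.MjData(model),   # or one MjData per internal thread
    initial_state,          # shape (nroll, nstate)
    nstep=nstep,
    nthread=48,             # hypothetical new argument
)
```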