@mstoelzle Thank you for raising the idea.

Regarding the optimization problem
Indeed, most of our work around the Elastica project involves parameter optimization. We have some internal experience attempting gradient-based optimization, but we haven't had much success. In more complex scenarios, especially with multiple connections and boundary conditions, the individual dynamic steps are not easy to differentiate, particularly the SO(3) operations used in the steppers. We didn't explore this much, so I would say you are welcome to try it. We typically had a better experience with gradient-free optimization methods, such as CMA.

Regarding parallel simulation
It is up to the user to decide whether they want to run simulations in parallel, and we don't have any internals that prevent them from doing so. It should be easy to run MPI in a Python context with Numba. It is true that Python's GIL forces users to rely on multiprocessing, but internally the simulation uses a block implementation to ensure contiguous data storage. If a user wants to use SMP and multi-threading for vector operations, that can be done within a single simulator. SMP could probably be improved by using a package other than …

For large-scale problems, the bottleneck is typically not the Cosserat rod simulation itself but how we resolve collisions. We are developing a more advanced, scalable collision and interaction algorithm, which is mainly being worked on in the C++ version. The nature of our stepper typically requires many steps to resolve the spatial/temporal resolution, while the memory footprint is comparatively small. This puts us in a situation where the integration process is intrinsically serialized, while each individual step is fast enough with serial code.

Regarding GPU/TPU support
We definitely considered a GPU extension before, and you are welcome to try it out, but it is not our immediate priority for a couple of reasons.
Regarding C++ dev
We do expect our … Again, we haven't tested this code much on GPU, so there might be room for improvement and parallelization. If you can show some results using …
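To make the gradient-free workflow mentioned above more concrete, here is a minimal, runnable sketch of fitting two simulation parameters with CMA-ES while evaluating the population in parallel with multiprocessing. Everything in it (`run_rod_simulation`, the nominal parameter values, the synthetic "experimental" data) is a made-up placeholder, not PyElastica code; it only illustrates the shape of the loop.

```python
# Sketch: gradient-free parameter fitting with CMA-ES, population evaluated in parallel.
import multiprocessing as mp

import cma          # pip install cma
import numpy as np

NOMINAL_E, NOMINAL_K = 1.0e6, 10.0   # made-up nominal Young's modulus / joint stiffness


def run_rod_simulation(youngs_modulus, joint_stiffness):
    # Surrogate stand-in for an actual PyElastica run; returns a fake "tip trajectory"
    # so the sketch executes end to end.
    t = np.linspace(0.0, 1.0, 200)
    return np.exp(-joint_stiffness * t / 10.0) * np.sin(youngs_modulus * 1.0e-5 * t)


TARGET = run_rod_simulation(1.3 * NOMINAL_E, 0.7 * NOMINAL_K)  # pretend "experimental" data


def objective(scales):
    # CMA-ES searches over dimensionless multipliers of the nominal parameters.
    e_scale, k_scale = scales
    sim = run_rod_simulation(e_scale * NOMINAL_E, k_scale * NOMINAL_K)
    return float(np.mean((sim - TARGET) ** 2))


if __name__ == "__main__":
    es = cma.CMAEvolutionStrategy([1.0, 1.0], 0.3)
    with mp.Pool() as pool:
        while not es.stop():
            candidates = es.ask()                     # sample a population of parameter sets
            losses = pool.map(objective, candidates)  # one independent simulation per worker
            es.tell(candidates, losses)               # update the search distribution
    print("recovered scales:", es.result.xbest)
```

Because each candidate evaluation is an independent simulation, this pattern parallelizes trivially across processes (or MPI ranks), which is one reason gradient-free methods have worked well for us in practice.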
Is your feature request related to a problem? Please describe.
There are many applications where we would like to track gradients through the simulator. One example could be to optimize some of the simulation parameters, such as the elastic modulus, the joint stiffness, etc., to fit the recorded experimental data as closely as possible.
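As a toy illustration of what "tracking gradients through the simulator" could look like in JAX: roll out a tiny dynamical system with `jax.lax.scan` and differentiate a data-fitting loss with respect to the physical parameters via `jax.grad`. The damped-oscillator dynamics and all numbers below are invented for the example and have nothing to do with PyElastica's actual stepping code.

```python
# Sketch: differentiate a data-fitting loss through a time-stepped (toy) simulation.
import jax
import jax.numpy as jnp

DT, N_STEPS = 1e-3, 2000


def step(state, _, stiffness, damping):
    x, v = state
    a = -stiffness * x - damping * v          # toy "constitutive law"
    new_state = (x + DT * v, v + DT * a)      # explicit Euler update
    return new_state, new_state[0]            # carry state, record position


def rollout(params):
    stiffness, damping = params
    init = (jnp.array(1.0), jnp.array(0.0))
    _, xs = jax.lax.scan(
        lambda s, t: step(s, t, stiffness, damping), init, jnp.arange(N_STEPS)
    )
    return xs


true_traj = rollout(jnp.array([30.0, 0.5]))   # pretend this is the recorded experiment


def loss(params):
    return jnp.mean((rollout(params) - true_traj) ** 2)


grads = jax.grad(loss)(jnp.array([20.0, 1.0]))
print("d(loss)/d(stiffness, damping):", grads)  # feed into any gradient-based optimizer
```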
Another limitation of the current simulator implementation is that it cannot be run massively in parallel. Because of Python's multiprocessing limitations, it is restricted to the number of CPU cores (to the best of my knowledge). This constrains the performance of Reinforcement Learning (RL) algorithms, for example, as the number of parallel simulations is limited to just a few.
In contrast, there are now examples of simulators running on the GPU, such as Nvidia Isaac or Brax, which enable massive parallelism of simulations.
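To sketch the kind of parallelism I mean, the snippet below batches thousands of independent rollouts of a toy system into one vectorized call with `jax.vmap`, which JAX can then run on a GPU/TPU. The dynamics are again a made-up placeholder, not a Cosserat rod.

```python
# Sketch: many independent simulations as one batched, GPU-friendly computation.
import jax
import jax.numpy as jnp

DT, N_STEPS = 1e-3, 2000


def rollout(stiffness):
    def step(state, _):
        x, v = state
        return (x + DT * v, v + DT * (-stiffness * x)), x

    _, xs = jax.lax.scan(step, (jnp.array(1.0), jnp.array(0.0)), None, length=N_STEPS)
    return xs


# 4096 simulations with different stiffness values, evaluated as one batched call.
stiffnesses = jnp.linspace(10.0, 100.0, 4096)
batched_rollout = jax.jit(jax.vmap(rollout))
trajectories = batched_rollout(stiffnesses)   # shape: (4096, N_STEPS)
print(trajectories.shape)
```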
Describe the solution you'd like
While I am not an expert in JAX, an implementation of the simulator in JAX could be a good way to achieve multiple improvements in one go: automatic differentiation through the simulation, massively parallel (vectorized) rollouts on GPU/TPU, and JIT compilation for speed.
Describe alternatives you've considered
I saw in your documentation that you are currently working on a C++ implementation of the simulator. I wonder if the majority of the expected speed-ups compared to the current Numba / Python implementation could also be achieved with JAX's JIT compilation? Alternatively, the gradients could also be tracked manually in C++, but that would be quite a massive undertaking, I guess...
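For context on the JIT point, here is a minimal sketch of the pattern I have in mind: a (toy) step function is traced once by `jax.jit`, compiled with XLA, and then reused across many time steps without Python-level overhead. The update rule is a placeholder, not a Cosserat-rod step.

```python
# Sketch: compile a toy time-stepping function once with jax.jit and reuse it.
import jax
import jax.numpy as jnp


@jax.jit
def advance(positions, velocities, dt):
    # Placeholder update rule; a real Cosserat-rod step would be far more involved.
    forces = -10.0 * positions
    velocities = velocities + dt * forces
    positions = positions + dt * velocities
    return positions, velocities


x = jnp.full(1000, 0.1)
v = jnp.zeros(1000)
for _ in range(10_000):
    x, v = advance(x, v, 1e-4)   # first call triggers XLA compilation, later calls reuse it
x.block_until_ready()            # force execution before, e.g., timing the loop
print(float(x[0]))
```

Whether this matches a hand-tuned C++ implementation is an open question to me; it would at least make a quick benchmark cheap to set up.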