# refactor: Parallelize critical engine functions #10217
base: master
## Conversation
I have been noticing a significant impact on speed when in battles with the Pug, because of all their Seekers. If this helps, great!

That might help with the code style?

Yeah, it should. I'm fighting with the containers' provided libraries for now.

Co-authored-by: Loymdayddaud <[email protected]>

Waiting on #10218. I could add library support here, but it would be messy, as tbb would have to be pinned to a lower version, or a newer GCC would have to be imported from a third-party repo on Ubuntu 20. The compiled executable should still work on Ubuntu 20. (tbb got a refactor a while ago, but Ubuntu is shipping the GCC built before the refactor, so it doesn't work too well.)

Conflicts:
- .github/workflows/cd.yaml
- .github/workflows/cd_release.yaml
## Refactor
This PR refactors various engine functions for better performance.
### Summary
In C++17, execution policies were introduced for many standard functions like `std::sort`, `std::for_each`, etc. This provides native multithreaded (or even vectorized) execution support on every platform with minimal front-end complexity. On systems that don't fully support these features, fallback algorithms were implemented using `TaskQueue`. This is significantly slower than the standard version, but still provides a good speedup over the single-threaded version.
Using these policies, `AI::Step`, `Engine::MoveShip`, `Engine::DoCollisions`, and some other functions were optimized, resulting in an overall 3-4x speedup in the release build (tested on my PC, without native optimizations) for larger fleets.

### Details
Collision handling had to be refactored to not make use of the global result lists. I'm not entirely sure why they existed in the first place, to be honest, since the code didn't support querying the last collision list anyway.
`AI::firingCommands` has been replaced with a local variable. (Btw, I really hope that the current lockless `AI::Step` is actually reliable on all platforms; I'm not entirely sure about that. I think we are good.)

`Engine::visuals` and `Engine::newVisuals` were migrated to use `std::list` instead of `std::vector` to avoid unnecessary copies when parallelized. This changes the signature of a lot of other functions dealing with these variables.

Some other `Engine` variables needed temporary locking capabilities for concurrent modification; a helper class was created to handle these.

The `tbb` library (Intel Threading Building Blocks) was pulled in, because GCC uses it to implement the new execution policies.

### Detailed details
There are several challenges that come up in multi-threaded systems that we now have to deal with.

#### Parallel functions
Unfortunately, the support for parallel STL functions is not universal (looking at you, Apple). Also, currently, GCC's required `tbb` cannot be used reliably in vcpkg (though that might be fixed at some point). This means that some systems can't compile the fancy C++17 functions we are using.

For these cases, I added another backend in `Parallel.h`. If the environment supports the parallel STL, it pulls in those functions; otherwise it uses the ES `TaskQueue` to parallelize tasks. While our implementation is slower, it still provides better performance than sequential operations. (This backend can also be enabled via the `ES_PARALLEL_USE_TASK_QUEUE` CMake option.)

`Parallel.h` is split into two parts: defining the `parallel` namespace, and defining the parallel STL functions. The namespace is handled separately because Apple can't be arsed to implement the standards correctly.
#### `seq`, `par`, and `par_unseq`

These are the three execution policies supported by C++17. (We also get `unseq` in C++20.)

- `seq` is sequential execution, with the same semantics as the normal STL functions. It's really useful for debugging.
- `par` is for parallel execution. It's the most common option used in our case, and should work out of the box with everything that doesn't do concurrent edits or read from mutables.
- `par_unseq` is for parallel execution that can also be vectorized. This is only usable for really small functions in most cases. Do not use it if there are locks in the code. Also, please note that `Dictionary` uses locks.

#### Lists, vectors, and partial guarding
For containers that are iterated over and can be edited concurrently, but are rarely edited, I added `PartiallyGuarded` variants. These variants guarantee safe access to `emplace_back`, but make no such guarantees for any other function. While this is a rather niche use case, it matches exactly how we use certain `Engine` members, for near-zero overhead.

If concurrent edits are more common, lock contention would become a serious issue in the above implementation. In these cases, I created a different container for each thread, and only synchronized the additions at the end of the operation; this way, no thread has to wait for another to finish writing. This is extremely efficient with list splicing, but gets expensive with vectors due to the extra copy introduced at the end, so some fields had to be changed to lists.
There are also collections that are safe for concurrent edits as long as we don't edit the same item twice, such as `std::map`. In these cases, it is often worth locking on the key we are editing instead of on the map, as this can greatly reduce lock contention. Long story short, `Body` got a mutex for every object. (Since mutexes cannot be copied, I had to add a helper class with custom copy functions.)

We also need to edit ships, minables and other bodies concurrently in some cases (when they are taking damage, for example). The same per-body locks are also essential here.
#### Mutable members and collisions
Many classes use mutable members to cache values. One example is `AI::firingCommands`; another one is the results in collision sets. While these are nice for single-threaded performance, they can't scale to concurrent workloads, so they have to be removed. This generally means one of two things: either the caller passes in the container that receives the results, or the function builds and returns a fresh value each call.

The first one requires more front-end changes, but avoids copies if said value is returned by the function. This was used in `CollisionSet` and `AsteroidField`. (It actually reduced the number of copies in the code, since it can now append to an existing vector of collisions.)
#### ResourceProvider

Another new class introduced is `ResourceProvider`. This is a wrapper around various per-thread resources, providing efficient resource acquirement and automatic synchronization with the non-per-thread master resources. (A helper introduced for this is `ResourceGuard`, which guards both a lock and the associated resources. It also provides STL-style accessors to the per-thread resources.)

### Testing Done
I tested that the game doesn't crash during normal gameplay, and that the mechanics seem to work.
Some sub-tick behaviour might change or become non-deterministic, but no major changes should happen.
### Performance Impact

A whole lot faster: roughly a 3-4x overall speedup in release builds for larger fleets, as noted in the summary.