Optimise Jitify Preprocessor. #602
Ok, so having looked into this further, my initial hypothesis was incorrect. The time is being taken up by NVRTC, however this is because Jitify is hammering NVRTC with many repeated calls.

This is the offending block (line numbers might be off):
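As a hedged illustration of why this pattern is expensive (this is not Jitify's actual source; `parseMissingHeader` and the log format match are assumptions): error-driven header discovery re-invokes NVRTC once per missing header, so a deep include tree like GLM's means hundreds of full compile attempts.

```cpp
// Hedged sketch (not Jitify's actual code) of error-driven header discovery:
// compile, parse the "cannot open source file" error, load that header from
// disk, and compile again. One full NVRTC compile attempt per header.
#include <nvrtc.h>
#include <fstream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Pull the missing header name out of an NVRTC error log (hypothetical format match).
static std::string parseMissingHeader(const std::string& log) {
    const std::string key = "cannot open source file \"";
    size_t start = log.find(key);
    if (start == std::string::npos) return "";
    start += key.size();
    return log.substr(start, log.find('"', start) - start);
}

static bool compileWithHeaderDiscovery(const std::string& src,
                                       std::map<std::string, std::string>& headers) {
    for (;;) {
        std::vector<const char*> names, sources;
        for (const auto& h : headers) {
            names.push_back(h.first.c_str());
            sources.push_back(h.second.c_str());
        }
        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, src.c_str(), "agent_fn.cu",
                           static_cast<int>(headers.size()),
                           sources.data(), names.data());
        const nvrtcResult res = nvrtcCompileProgram(prog, 0, nullptr);
        size_t logSize = 0;
        nvrtcGetProgramLogSize(prog, &logSize);
        std::string log(logSize, '\0');
        nvrtcGetProgramLog(prog, &log[0]);
        nvrtcDestroyProgram(&prog);
        if (res == NVRTC_SUCCESS) return true;

        const std::string missing = parseMissingHeader(log);
        std::ifstream f(missing);
        if (missing.empty() || !f) return false;  // a real error, not a missing header
        headers[missing] = std::string((std::istreambuf_iterator<char>(f)),
                                       std::istreambuf_iterator<char>());
    }
}
```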
So I've run a test file through the Python preprocessor lib pcpp. In short:
Created a Jitify issue to see their thoughts here: NVIDIA/jitify#90

From our perspective, we should consider whether we want to go as far as including a complex library build step which generates flattened header(s) for including within RTC (using something like pcpp).
GLM is MIT-licensed, so as long as we include the GLM licence file with it then it's not an issue.
So, I've managed to get RTC to build with partially flattened headers.

Create a file containing the following:

```cpp
#define SEATBELTS 1
#define USE_GLM
#define NDEBUG
#define __CUDACC_RTC__
#define __CUDACC__
#define __CUDA_ARCH__ 50
#define __CUDACC_VER_MAJOR__ 11
#define __CUDACC_VER_MINOR__ 1
#define __CUDACC_VER_BUILD__
#define NULL nullptr
#define __cplusplus
#define _WIN64 1
#define __cdecl
#define __ptr64
#define INCLUDE_FLAMEGPU_RUNTIME_UTILITY_DEVICEENVIRONMENT_CUH_
#include "flamegpu/exception/FLAMEGPUDeviceException.cuh"
#include "flamegpu/runtime/DeviceAPI.cuh"
#include "flamegpu/runtime/messaging/None/NoneDevice.cuh"
#include "flamegpu/runtime/messaging/Bucket/BucketDevice.cuh"
#include "flamegpu/runtime/messaging/BruteForce/BruteForceDevice.cuh"
#include "flamegpu/runtime/messaging/Array/ArrayDevice.cuh"
#include "flamegpu/runtime/messaging/Array2D/Array2DDevice.cuh"
#include "flamegpu/runtime/messaging/Array3D/Array3DDevice.cuh"
#include "flamegpu/runtime/messaging/Spatial2D/Spatial2DDevice.cuh"
#include "flamegpu/runtime/messaging/Spatial3D/Spatial3DDevice.cuh"
```

Run it through pcpp (you can install pcpp via pip), edit the output file, move the edited result into place, and update the build to use it.

Now RTC can build agent functions with GLM in ~8 seconds, rather than 60+ seconds. This could be further improved by flattening the remaining system/cuda/curand headers, and by flattening the device environment header into the dynamic curve header (because it has to be included late).
Managed to improve RTC compile times (albeit not extended to GLM) with some test code. Here are the tests and their improvements:
The hacky fix I used was to add this long block of code:

```cpp
// Add known headers from hierarchy
headers.push_back("algorithm");
headers.push_back("assert.h");
headers.push_back("cassert");
headers.push_back("cfloat");
headers.push_back("climits");
headers.push_back("cmath");
headers.push_back("cstddef");
headers.push_back("cstdint");
headers.push_back("cstring");
headers.push_back("cuda_runtime.h");
headers.push_back("curand.h");
headers.push_back("curand_discrete.h");
headers.push_back("curand_discrete2.h");
headers.push_back("curand_globals.h");
headers.push_back("curand_kernel.h");
headers.push_back("curand_lognormal.h");
headers.push_back("curand_mrg32k3a.h");
headers.push_back("curand_mtgp32.h");
headers.push_back("curand_mtgp32_kernel.h");
headers.push_back("curand_normal.h");
headers.push_back("curand_normal_static.h");
headers.push_back("curand_philox4x32_x.h");
headers.push_back("curand_poisson.h");
headers.push_back("curand_precalc.h");
headers.push_back("curand_uniform.h");
headers.push_back("device_launch_parameters.h");
//headers.push_back("dynamic/curve_rtc_dynamic.h"); // This is included proper below, having this makes a vague compile err
headers.push_back("flamegpu/defines.h");
headers.push_back("flamegpu/exception/FLAMEGPUDeviceException.cuh");
headers.push_back("flamegpu/exception/FLAMEGPUDeviceException_device.cuh");
headers.push_back("flamegpu/gpu/CUDAScanCompaction.h");
headers.push_back("flamegpu/runtime/AgentFunction.cuh");
headers.push_back("flamegpu/runtime/AgentFunctionCondition.cuh");
headers.push_back("flamegpu/runtime/AgentFunctionCondition_shim.cuh");
headers.push_back("flamegpu/runtime/AgentFunction_shim.cuh");
headers.push_back("flamegpu/runtime/DeviceAPI.cuh");
headers.push_back("flamegpu/runtime/messaging/MessageArray.h");
headers.push_back("flamegpu/runtime/messaging/MessageArray/MessageArrayDevice.cuh");
headers.push_back("flamegpu/runtime/messaging/MessageArray2D.h");
headers.push_back("flamegpu/runtime/messaging/MessageArray2D/MessageArray2DDevice.cuh");
headers.push_back("flamegpu/runtime/messaging/MessageArray3D.h");
headers.push_back("flamegpu/runtime/messaging/MessageArray3D/MessageArray3DDevice.cuh");
headers.push_back("flamegpu/runtime/messaging/MessageBruteForce.h");
headers.push_back("flamegpu/runtime/messaging/MessageBruteForce/MessageBruteForceDevice.cuh");
headers.push_back("flamegpu/runtime/messaging/MessageBucket.h");
headers.push_back("flamegpu/runtime/messaging/MessageBucket/MessageBucketDevice.cuh");
headers.push_back("flamegpu/runtime/messaging/MessageSpatial2D.h");
headers.push_back("flamegpu/runtime/messaging/MessageSpatial2D/MessageSpatial2DDevice.cuh");
headers.push_back("flamegpu/runtime/messaging/MessageSpatial3D.h");
headers.push_back("flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh");
headers.push_back("flamegpu/runtime/messaging/MessageNone.h");
headers.push_back("flamegpu/runtime/utility/AgentRandom.cuh");
headers.push_back("flamegpu/runtime/utility/DeviceEnvironment.cuh");
headers.push_back("flamegpu/runtime/utility/DeviceMacroProperty.cuh");
headers.push_back("flamegpu/util/detail/StaticAssert.h");
//headers.push_back("jitify_preinclude.h"); // I think Jitify adds this itself
headers.push_back("limits");
headers.push_back("limits.h");
headers.push_back("math.h");
headers.push_back("memory.h");
headers.push_back("stddef.h");
headers.push_back("stdint.h");
headers.push_back("stdio.h");
headers.push_back("stdlib.h");
headers.push_back("string");
headers.push_back("string.h");
headers.push_back("time.h");
headers.push_back("type_traits"); These are all the headers reported by the keys in The issue with adding GLM to this, is that internally GLM has many relative path includes, many which map to duplicate absolute paths. It might be possible to address that by giving lots of bad include paths, but this seems grim. I think the optimal solution to GLM would be to feed glm through pcpp, as done in the above comment, to flatten glm. This could presumably be automated at cmake time. Although this wouldn't solve the issue where users wanted tertiary glm includes, which will include back in core glm headers. As Pete has pointed out on slack, we probably want to automate detection of the fgpu/curand include hierarchies, so they are stable with library changes. Best method for that requires discussion. |
Compilation with the main GLM include leads to a 63 second call to the `jitify::Program` constructor, of which it appears only 600 milliseconds is spent by NVRTC (`createProgram`, `compileProgram`, ..., `destroyProgram`). This means the Jitify preprocessor is likely to blame. We either need to profile and optimise it heavily, or add aggressive caching of processed headers.

Might be worth raising an issue on https://github.com/NVIDIA/jitify, to see if they have any thoughts on the matter. But it appears most of their attention has moved to Jitify2, so it is unlikely they would do any work directly on optimising the preprocessor. If taking the aggressive caching approach, it might be worth getting Ben to agree whether it's something they'd be interested in merging, so we can decide whether to make our header cache internal or external to Jitify.
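If the caching route is taken, the rough shape could be a memoised map from header path to processed source, consulted before any reprocessing. A hedged sketch, with `processHeader` standing in for whatever expensive work the Jitify preprocessor actually does (here it just reads the file):

```cpp
// Hedged sketch of "aggressive caching": memoise processed headers so
// repeated jitify::Program constructions skip re-processing. This is an
// illustration, not Jitify's actual design; processHeader is a stand-in.
#include <fstream>
#include <iterator>
#include <map>
#include <mutex>
#include <string>

class ProcessedHeaderCache {
 public:
    // Return the cached processed source, processing and storing it on a miss.
    const std::string& get(const std::string& path) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(path);
        if (it == cache_.end())
            it = cache_.emplace(path, processHeader(path)).first;
        return it->second;
    }

 private:
    // Stand-in for the real preprocessing; here it just reads the file.
    static std::string processHeader(const std::string& path) {
        std::ifstream f(path);
        return std::string((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());
    }
    std::map<std::string, std::string> cache_;
    std::mutex mutex_;
};
```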