`load_program()` performance (with large include hierarchies) #90
Thanks for this feedback, it's an important issue. The situation is a bit tricky. At its heart is the fact that having NVRTC load headers implicitly from the filesystem (which it does support) is inflexible: e.g., it gives us no control over how headers are patched before compilation, or over how missing headers are handled.
For these reasons, our current approach is to identify and pre-load all headers and pass them to NVRTC explicitly. This gives us full control over how headers are handled, but requires a means of identifying the required headers. There are two general ways we could go about solving this: a) implement a more efficient method for identifying header dependencies, or b) lean on NVRTC's implicit header loading despite the above limitations. I've been thinking about (a) for a little while and there might be some things we can do, but it might take some time. On the other hand, it may be time to reconsider (b), and there are a few things we could try there.
We'll think about these ideas. Let us know if you have any additional feedback. Another thing to consider is that we've been moving toward a new workflow where all sources and headers are pre-processed offline and shipped with the application. I'd be interested to hear if such a workflow would work well for your application or not. (It does require that all the headers have redistributable licenses.)
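For illustration, a minimal sketch of the explicit pre-loading approach against the plain NVRTC API (not Jitify's actual internals; the source and header contents below are placeholders):

```cpp
// Sketch: pre-loaded, in-memory headers passed explicitly to NVRTC via
// nvrtcCreateProgram, so no filesystem lookups happen during compilation.
#include <nvrtc.h>
#include <cstdio>

int main() {
  const char* source =
      "#include \"my_header.cuh\"\n"
      "__global__ void kernel(float* x) { x[0] = MY_CONSTANT; }\n";

  // Parallel arrays: header contents and the include names they resolve.
  const char* header_contents[] = {"#define MY_CONSTANT 42.0f\n"};
  const char* header_names[]    = {"my_header.cuh"};

  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, source, "prog.cu",
                     1, header_contents, header_names);
  nvrtcResult res = nvrtcCompileProgram(prog, 0, nullptr);

  // On failure, the program log names the first unresolved include;
  // this is the message Jitify's search loop parses.
  if (res != NVRTC_SUCCESS) {
    size_t log_size;
    nvrtcGetProgramLogSize(prog, &log_size);
    std::printf("compilation failed; log is %zu bytes\n", log_size);
  }
  nvrtcDestroyProgram(&prog);
  return 0;
}
```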
Thanks, I'm glad this is something you've thought about.
This was my initial thought, but as I thought about it more it simply seems impractical. To correctly identify all header dependencies, you would need a full C pre-processor. Anything short of that creates edge cases where it's either too aggressive or too lax (finding unwanted headers, or missing wanted ones), which is bound to break some of the header libraries people may want to (try and) use (e.g. supporting GLM took a few changes to both libraries). Writing your own C pre-processor seems like a bad idea: full support would include parsing and handling things like arithmetic and bitwise operators, and it just creates a new complex thing to support and maintain. Similarly, leaning on a 3rd-party pre-processor would probably break from your current ethos of 'Jitify is a single-file header library'. Which is what led to my thought of leaning on NVRTC (your Option B), which presumably uses a fork of Clang's pre-processor, similar to NVCC.
We currently generate one dynamic header, which is not on the filesystem, so this would create a problem for us. However, I expect we could find a way to prepend that header to the user's provided source (similar to how we already prepend the required includes).
Yes, I hope this is possible, even if it means storing them on disk and specifying the directory as an include dir. Worst case, you could potentially raise it with the NVRTC team to find a compromise, e.g. load from both in-memory and disk, giving in-memory higher priority.
I'm not familiar with this feature, so we've probably not had reason to use it in our use-case. But if it's what I'm assuming, then dropping support would move usage closer to a traditional compiler, which seems reasonable.
Are these not the same thing? E.g. Jitify sees a missing header, comments it out, and tries again? Warnings/errors about missing headers then aren't visible (by default?) unless Jitify's compilation loop reaches a fail state. This is possibly the difficult point, but I presume existing behaviour could be maintained by instead providing NVRTC an empty string/file for the missing header, and repeating compilation similar to the existing loop. It could then also be useful to clearly document a method for a user to specify/dummy missing headers ahead of time, to allow users with a fixed header stack to streamline compilation. These dummied headers could also optionally be cached, so subsequent compilations during runtime attempt to use them. (E.g. in our use-case, we explicitly keep everything as separate compilation units, as it allows us to dynamically generate one header, which greatly improves performance, even over non-RTC builds.)
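A minimal sketch of that dummy-header idea, assuming the plain NVRTC API (the include name here is hypothetical):

```cpp
// Sketch: pre-register an empty stub for a header we know will be
// missing, so compilation can succeed first time rather than relying
// on a comment-out-and-retry pass. "unneeded_platform.h" is a
// hypothetical include name.
#include <nvrtc.h>

nvrtcProgram make_program(const char* source) {
  const char* header_contents[] = {""};                     // empty stub
  const char* header_names[]    = {"unneeded_platform.h"};  // dummied include
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, source, "prog.cu",
                     1, header_contents, header_names);
  return prog;
}
```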
Pre-processing source offline seems antithetical to the purpose of runtime compilation. Surely, if you're shipping a compiled binary, you might as well use NVCC, the only benefit being that the user doesn't additionally require gcc/visual-studio? In our main use case for RTC, users will be implementing their own model via our Python interface by writing agent functions in CUDA. So they would be writing/developing/running the CUDA source from within Python, rather than running any CUDA we explicitly provide to them; they simply use the API we provide, which is what's within the header hierarchy. So a pre-process stage seems redundant, because they would need to run it directly before every run. We do already cache the RTC ptx(?) to disk, so repeated runs with unchanged source can bypass RTC, but in development recompilation is regularly required. As I said in my original post, it's possible we could pre-process (or rather flatten) most of our includes. The only external (non-NV) headers we use are GLM's, which are under an MIT license. If pre-processing were to include compilation though, we'd probably avoid that, as our headers are heavily templated and ideally fully inlined at compile time. Not sure this is relevant, but it's possible we will have users who wish to include additional header libraries, for some advanced math functionality or similar. But that's probably an extreme edge case at this point (and they'd probably run into difficulties similar to those I had getting GLM to work).
This will still be possible by providing the header explicitly.
NVRTC already works like this actually. There are some potential issues with name collisions but apart from that it should work.
It doesn't ship a compiled binary, it just ships all of the source and headers along with (e.g., embedded in) the application. The offline preprocessing step just does what preprocessing currently does: identifies and loads (and patches) all of the required headers and packages them together (similar to flattening) for shipping with the application. You can still add extra headers and instantiate templates at runtime.
We can probably support this, but I'll have to test it to make sure.
Hi @Robadob, I also identified this performance issue in CuPy a while ago. If you're still on Jitify v1, and if the slowdown you're seeing happens every time a new kernel is compiled but most of the headers it sees are identical (i.e. your headers can be cached and reused), then I figured out a workaround that @benbarsdell merged for me: #82. Understanding our workaround requires a bit of context: CuPy already heavily used NVRTC, probably before Jitify started, so all we need from Jitify is its compile-fail-search loop and its (amazing) patches to all the C++ std headers; we don't need the rest of its functionality, such as compilation and kernel caching, as we have all of that already. With this in mind, our workaround uses Jitify's internal function directly, and caches all the headers that Jitify found, so that we only pay the search cost once. Future compilations just reuse the searched headers. In fact, based on your description, I think you could just preload all of your headers, which would allow you to avoid even the first-time cost. If you're interested, the relevant code in CuPy is here: I've been wondering if the same strategy also applies to Jitify v2, but unfortunately I likely won't have time to look into this for now.
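A rough sketch of that caching strategy (not CuPy's actual code; `discover_headers` stands in for Jitify's search loop and is given a trivial placeholder body here):

```cpp
// Sketch: pay the header-search cost once, then reuse the discovered
// headers for every subsequent NVRTC compilation.
#include <nvrtc.h>
#include <map>
#include <string>
#include <vector>

using HeaderMap = std::map<std::string, std::string>;  // name -> contents

// Placeholder for Jitify's compile-fail-search loop; real code would
// run the expensive discovery here and return everything it found.
HeaderMap discover_headers(const std::string& /*source*/) {
  return {{"my_header.cuh", "#define MY_CONSTANT 42.0f\n"}};
}

nvrtcResult compile_with_cached_headers(const std::string& source) {
  static HeaderMap cache;            // filled on the first compilation only
  if (cache.empty()) cache = discover_headers(source);

  std::vector<const char*> names, contents;
  for (const auto& kv : cache) {
    names.push_back(kv.first.c_str());
    contents.push_back(kv.second.c_str());
  }
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, source.c_str(), "prog.cu",
                     static_cast<int>(names.size()),
                     contents.data(), names.data());
  nvrtcResult res = nvrtcCompileProgram(prog, 0, nullptr);
  nvrtcDestroyProgram(&prog);
  return res;
}
```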
This sounds like it could work for us then, just a case of working out how we'd fit it into our build system. (Not likely to be a problem, would just mean extending the CMake scripts)
As I said, it's a possibility, but far from a priority for us. Your workaround makes sense, and it would probably be more straightforward for us to apply. However, as you identify, we would still have the ~60s compilation of the first unit. So this would be great for our test suite, and for large models with many units to be compiled, but still somewhat prohibitive for smaller models with only a small number of units to compile. We're still using Jitify 1, and interface with it via C++ (our Python interface is mostly generated by SWIG). But the code you've provided will be useful if we decide to go this route. Thanks both of you, I'll discuss these options with my team in the coming weeks.
So I started looking at this again today. I can't make much sense of #82, but from leo's other sources it appears they're using Jitify differently to us (we construct a …). However, I did find that if we specify expected header file names (based on the keys from …)
I've not looked at the tests in detail, but I would assume the first compiles a single kernel, and the second perhaps four. The 3rd test, however, uses glm, the initial reason I identified this problem. Internally, GLM uses relative paths to its includes. This causes the above-mentioned ….

Not sure I'll continue playing with this next week; it was just a quick test of something that came up in discussion with someone from my team.
When I include the main glm header within an RTC source, the runtime of `load_program()` increases to over 60 seconds (from what was already ~10 seconds), so I started investigating the problem.

I discovered that, to find include files, Jitify calls `nvrtcCompileProgram()`, catches the failure, parses the missing include from the error log, and then repeats the process until compilation succeeds or fails for a different reason. GLM's include hierarchy adds ~109 separate glm header files, and we have ~20 internal library headers which are included too (although we recently dynamically pruned those 20 down to ~14 for a small improvement). (I calculated these values from `pcpp` output, so they might be a little off, as I haven't tested the flattened file it produced.)

The problem is that each call to `nvrtcCompileProgram()` causes it to reparse the include hierarchy, so the cost grows from an initial ~100ms to ~600ms as each header is added. I logged it making 198 failed calls to `nvrtcCompileProgram()`. This is highly inefficient, leading to the `load_program()` function taking 60+ seconds, with the final successful NVRTC call taking around 1 second. In comparison, `pcpp` was able to pre-process the full include hierarchy in 3 seconds, so it's fair to assume the source could theoretically be compiled in ~4 seconds with appropriate include pre-processing: 15x faster.

In our use-case, we perform multiple individual RTC compilations with the same include hierarchy, so we pay this unnecessary cost repeatedly. The worst case is our test suite: 85 minutes total with `glm` included in all agent functions, 25 minutes total with `glm` only included where used, and 11 minutes with `glm` tests and the include disabled. But even in that last case, probably 10 minutes are spent doing RTC for the small number of RTC tests (most of our RTC tests are in our Python test suite).

I'd be interested to know your thoughts on whether Jitify can address this, or even whether NVRTC could be extended by the team who develops it to actually load include files from disk (given implementing a full pre-processor within Jitify is impractical), and perhaps also mimic Jitify's missing-include-ignoring behaviour.
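For reference, a simplified sketch of the kind of compile-fail-parse-retry loop described above (not Jitify's actual implementation; it assumes NVRTC's "cannot open source file" log message and uses simple stand-in helpers):

```cpp
// Simplified sketch of an NVRTC header-discovery loop: compile, parse
// the missing include's name from the error log, load that header, and
// retry. Every retry re-creates the program and reparses all headers
// collected so far, which is why each iteration gets more expensive.
#include <nvrtc.h>
#include <fstream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Stand-in: read a header's contents from disk by include name.
std::string load_header_from_disk(const std::string& name) {
  std::ifstream f(name);
  return std::string(std::istreambuf_iterator<char>(f),
                     std::istreambuf_iterator<char>());
}

// Stand-in: extract the filename from NVRTC's
//   catastrophic error: cannot open source file "foo.h"
// message; returns an empty string if no missing include is reported.
std::string parse_missing_include(const std::string& log) {
  const std::string key = "cannot open source file \"";
  size_t start = log.find(key);
  if (start == std::string::npos) return "";
  start += key.size();
  size_t end = log.find('"', start);
  return log.substr(start, end - start);
}

nvrtcResult compile_with_discovery(const std::string& source) {
  std::map<std::string, std::string> headers;  // name -> contents
  while (true) {
    std::vector<const char*> names, contents;
    for (const auto& kv : headers) {
      names.push_back(kv.first.c_str());
      contents.push_back(kv.second.c_str());
    }
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source.c_str(), "prog.cu",
                       static_cast<int>(names.size()),
                       contents.data(), names.data());
    nvrtcResult res = nvrtcCompileProgram(prog, 0, nullptr);

    size_t log_size = 0;
    nvrtcGetProgramLogSize(prog, &log_size);
    std::string log(log_size, '\0');
    nvrtcGetProgramLog(prog, &log[0]);
    nvrtcDestroyProgram(&prog);

    if (res == NVRTC_SUCCESS) return res;
    std::string missing = parse_missing_include(log);
    if (missing.empty()) return res;  // failed for some other reason
    headers[missing] = load_header_from_disk(missing);  // add and retry
  }
}
```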
The optimal performance in our case (FLAMEGPU2) would probably be achieved by pre-processing and flattening the header hierarchy that we pass to all RTC sources (to also avoid the ~3s of header parsing/loading each time). But the high compile costs due to include processing might be scaring other users away from RTC if they find it via Jitify and build things with larger include hierarchies. (In our case, we've always had the ~20 library headers, and thought the ~10s expense was just an unfortunate byproduct of RTC for our non-simple use-case.)
Irrespective of NVRTC, Jitify might be able to improve the cost of multiple compiles by caching loaded headers and always passing them to NVRTC, regardless of whether they're required? This would still leave the first compilation costly, but would make any subsequent compilations with the same headers much cheaper.
As a quick test, I was able to flatten the majority of our includes using `pcpp`, and this reduces the number of failed calls to `nvrtcCompileProgram()` from 198 down to 32. I believe these remaining 32 are due to our dynamic header, which has a delayed include that can't be flattened the same way, and the various system includes which I haven't flattened yet (there are around 60, but I think they're mostly repeats). But already the RTC time was down to ~8 seconds.