Implemented support for pre-generated LibDevice PTX modules. #1148
base: master
force-pushed from `b049269` to `5b588bf`: Added minimum required CUDA architecture and ISA to the pre-generated XML. This is then used in `PTXBackend` to report an error if LibDevice is called and the embedded PTX is not compatible.
force-pushed from `38a2f6d` to `45d76d1`: Added the Tools solution to the CI pipeline to ensure that breaking changes are caught.
force-pushed from `45d76d1` to `ac36bf2`: Downgraded the pre-generated PTX to use CUDA SDK v8, which introduced SM_60. This makes it compatible with all SM_60 architectures, along with the oldest drivers that support them. This was necessary since the CUDA GitHub runners are on SDK v11, and we originally pre-generated using SDK v12.
force-pushed from `b46856d` to `55d36d4`: Converted to draft so that I can refactor the code into smaller PRs that are useful to ILGPU, independent of the pre-generated PTX.
force-pushed from `55d36d4` to `e6cd7bd`: Refactoring complete.
Depends on #1147.
Currently, ILGPU supports LibDevice at runtime by using NVVM to generate PTX and then merging the generated PTX into the kernel's PTX code. This requires the CUDA SDK to be installed.
This PR pre-generates the PTX code and embeds it into ILGPU, removing the requirement to have the CUDA SDK installed for LibDevice support.
The first step is a tool that is run manually on a machine with the CUDA SDK installed. The tool uses NVVM to generate the PTX and saves it to an XML file for later consumption.
When ILGPU is compiled, this file is read and C# code is generated from it using T4 templates.
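As a rough illustration, the XML produced by the tool might look something like the following. The element and attribute names here are hypothetical and not the actual schema used by the tool:

```xml
<!-- Hypothetical layout of the pre-generated LibDevice file.
     Element/attribute names are illustrative only. -->
<LibDevicePtx MinArchitecture="SM_60" MinIsa="5.0">
  <Function Name="__nv_sin">
    <Ptx><![CDATA[
      .visible .func (.param .b64 ret) __nv_sin(.param .b64 x) { ... }
    ]]></Ptx>
  </Function>
</LibDevicePtx>
```

Storing the minimum architecture and ISA alongside the PTX is what lets the backend reject incompatible targets at compile time rather than failing at runtime.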
ILGPU has also been updated to support combining multiple PTX modules. This is necessary to avoid clashes in the pre-generated PTX code, e.g. `Sin(double)` and `Cos(double)` both use the same helper functions in PTX.
Finally, `Context.LibDevice()` has been marked as `[Obsolete]` and is a no-op. It has been renamed to `LibDeviceOverride()` for users who want to force a different version of LibDevice at runtime.
QUESTION: The PTX was generated using CUDA SDK v12 and targets SM_60. I'm assuming this means that all users will have to use at least v12 or newer? And devices that are SM_60 or newer?
> Downgraded the pre-generated PTX to use CUDA SDK v8, which introduced SM_60. This makes it compatible with all SM_60 architectures, along with the oldest drivers that support them. This was necessary since the CUDA GitHub runners are on SDK v11, and we originally pre-generated using SDK v12.
Restructured ILGPU and ILGPU.Algorithms, moving the `IntrinsicMath` implementations into ILGPU itself. `XMath` has other functions that are not part of `IntrinsicMath`, so it will stay as-is for now.
`CLMath` in ILGPU.Algorithms only needed to support `Rcp` and `Log(x, y)`. These have been moved into ILGPU, and `CLMath` has been removed.
`PTXMath` in ILGPU.Algorithms provided a number of math functions using Cordic implementations. Now that pre-generated LibDevice is available in ILGPU, all the `IntrinsicMath` functions have been switched to call LibDevice for Cuda GPUs. The pre-generated LibDevice PTX code only works on >= SM_60, so the Cordic functions in ILGPU.Algorithms have been modified to only register for < SM_60. Otherwise, they are no longer used.
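The registration rule described above amounts to a simple architecture gate. A minimal Python sketch of the dispatch idea (the function name and string return values are hypothetical, not ILGPU's actual registration API):

```python
def select_sin_implementation(sm_version):
    """Pick the math implementation for a given Cuda SM version.

    Hypothetical sketch of the dispatch rule: the pre-generated
    LibDevice PTX requires >= SM_60, so older devices fall back to
    the Cordic implementations in ILGPU.Algorithms.
    """
    SM_60 = 60
    if sm_version >= SM_60:
        return "libdevice"  # embedded pre-generated PTX
    return "cordic"         # software fallback for older architectures
```

In ILGPU itself this gate is applied at intrinsic-registration time, so devices below SM_60 never see the embedded PTX at all.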
Unit tests for `IntrinsicMath` have not been implemented. These currently run via the ILGPU.Algorithms unit tests.
Added a workaround for `XMath.Pow(double, double)`: the Cuda Test Runner produces a different result for some combinations of inputs.
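A common way to express such a workaround in a test suite is to compare within a small relative tolerance rather than for bit-exact equality, so platform-specific last-bit differences do not fail the test. This is a generic sketch of that pattern, not the actual change made to the ILGPU tests:

```python
import math

def pow_results_match(expected, actual, rel_tol=1e-15):
    """Compare two Pow(double, double) results with a small relative
    tolerance. Generic sketch of a tolerance-based comparison, not
    the actual ILGPU workaround."""
    # Treat NaN == NaN as a match, since exact comparison never would.
    if math.isnan(expected) and math.isnan(actual):
        return True
    return math.isclose(expected, actual, rel_tol=rel_tol)
```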