Implemented support for pre-generated LibDevice PTX modules. #1148

MoFtZ · 2024-01-12T04:40:48Z

Depends on #1147.

Currently, ILGPU supports LibDevice at runtime by using NVVM to generate PTX and then merging the generated PTX into the kernel PTX code. This requires the CUDA SDK to be installed.

This PR pre-generates the PTX code, and then embeds that into ILGPU, so that we can remove the requirement for the CUDA SDK to be installed for LibDevice support.

The first step is a tool that is run manually on a machine with the CUDA SDK installed. The tool uses NVVM to generate the PTX and saves it to an XML file for later consumption.

Compiling ILGPU will read this file and generate C# code using T4 templates.
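As a rough illustration of this "XML in, generated C# out" step, here is a minimal Python sketch. It is not the T4 template from this PR. The XML layout (`LibDevicePtx`, `Function`, the `name` attribute) and the helper `emit_constants` are hypothetical, chosen only to show the shape of the transformation.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the file produced by the real tool may differ.
SAMPLE_XML = """
<LibDevicePtx>
  <Function name="__nv_sin"><![CDATA[.func sin_body]]></Function>
  <Function name="__nv_cos"><![CDATA[.func cos_body]]></Function>
</LibDevicePtx>
"""


def emit_constants(xml_text):
    """Turn each <Function> entry into one C# string constant declaration."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for fn in root.findall("Function"):
        name = fn.get("name")
        # Double any quotes so the text is valid inside a C# verbatim string.
        ptx = (fn.text or "").replace('"', '""')
        lines.append(f'public const string {name} = @"{ptx}";')
    return lines


for line in emit_constants(SAMPLE_XML):
    print(line)
```

The point of the sketch is only that the PTX is produced once, offline, and then becomes plain string data at build time, so nothing at runtime needs NVVM.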

ILGPU has also been updated to support combining multiple PTX modules. This is necessary to avoid clashes in the pre-generated PTX code. For example, Sin(double) and Cos(double) both use the same helper functions in PTX.
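To make the clash concrete, here is a minimal Python sketch of one way to combine modules so that a shared helper is emitted only once. It is an assumption about the general technique, not the implementation in this PR. The function names (`helper_reduce`, `sin`, `cos`) and the dict-based module representation are hypothetical.

```python
def merge_modules(modules):
    """Merge modules, keeping a single definition of each function.

    Each module is a dict that maps a function name to its PTX body.
    """
    merged = {}
    for module in modules:
        for name, body in module.items():
            if name in merged:
                # The helper was already emitted by an earlier module.
                # A differing body would be a genuine conflict.
                if merged[name] != body:
                    raise ValueError(f"conflicting definitions of {name}")
                continue
            merged[name] = body
    return merged


# Hypothetical modules: both depend on the same helper function.
sin_module = {"helper_reduce": "<helper ptx>", "sin": "<sin ptx>"}
cos_module = {"helper_reduce": "<helper ptx>", "cos": "<cos ptx>"}

combined = merge_modules([sin_module, cos_module])
print(list(combined))
```

Naively concatenating the two modules would define `helper_reduce` twice, which is the kind of clash described above.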

Finally, Context.LibDevice() has been marked as [Obsolete] and is now a no-op. Its previous functionality has been renamed to LibDeviceOverride(), for users who want to force a different version of LibDevice at runtime.

~~QUESTION: The PTX was generated using Cuda SDK v12 and targets SM_60. I'm assuming this means that all users will have to use at least v12 or newer? And devices that are SM_60 or newer?~~
Downgraded the pre-generated PTX to use CUDA SDK v8, the release that introduced SM_60. This makes it compatible with all architectures from SM_60 onwards, along with the oldest drivers that support them. This was necessary because the CUDA GitHub runners are on SDK v11, and we originally pre-generated using SDK v12.

Restructured ILGPU and ILGPU.Algorithms, moving the IntrinsicMath implementations into ILGPU itself. XMath has other functions that are not part of IntrinsicMath, so it will stay as-is for now.

CLMath in ILGPU.Algorithms only needed to support Rcp and Log(x,y). These have been moved into ILGPU, and CLMath has been removed.

PTXMath in ILGPU.Algorithms provided a number of math functions using Cordic implementations. Now that pre-generated LibDevice is available in ILGPU, all the IntrinsicMath functions have been switched to call LibDevice for CUDA GPUs. The pre-generated LibDevice PTX code only works on >= SM_60, so the Cordic functions in ILGPU.Algorithms have been modified to only register for < SM_60. Otherwise, they are no longer used.
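The selection rule described above can be sketched as follows. This is a Python illustration of the rule only, not ILGPU's registration code. The name `select_math_provider` and the tuple representation of an architecture are hypothetical.

```python
# SM_60 expressed as (major, minor), per the description above.
MIN_LIBDEVICE_ARCH = (6, 0)


def select_math_provider(arch):
    """Pick LibDevice for SM_60 and newer, the Cordic fallback otherwise."""
    return "libdevice" if arch >= MIN_LIBDEVICE_ARCH else "cordic"


print(select_math_provider((5, 2)))
print(select_math_provider((6, 0)))
```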

Unit Tests for IntrinsicMath have not been implemented. These are currently running via the ILGPU.Algorithms unit tests.

Added a workaround for XMath.Pow(double, double): the CUDA test runner produces a different result for some combinations of inputs.

MoFtZ · 2024-01-14T12:28:17Z

Added the minimum required CUDA architecture and ISA to the pre-generated XML. This is then used in PTXBackend to report an error if LibDevice is called and the embedded PTX is not compatible.
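A minimal Python sketch of this kind of check is shown below. It is not the PTXBackend code. The exception type, the function name, and the version numbers used in the example call are hypothetical and serve only to show how both minimums could be compared and reported together.

```python
class IncompatiblePtxError(Exception):
    """Raised when the embedded PTX cannot be used on the target device."""


def check_embedded_ptx(device_arch, device_isa, min_arch, min_isa):
    """Compare the device's (major, minor) versions against the minimums."""
    problems = []
    if device_arch < min_arch:
        problems.append(f"architecture {device_arch} is below minimum {min_arch}")
    if device_isa < min_isa:
        problems.append(f"ISA {device_isa} is below minimum {min_isa}")
    if problems:
        raise IncompatiblePtxError("; ".join(problems))


# Illustrative values only: a device older than the required architecture.
try:
    check_embedded_ptx((5, 2), (6, 0), min_arch=(6, 0), min_isa=(5, 0))
except IncompatiblePtxError as error:
    print(error)
```

Failing with an explicit message at this point is friendlier than letting the driver reject the PTX later with a less specific error.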

MoFtZ · 2024-01-14T13:08:13Z

Added Tools solution to CI pipeline to ensure that breaking changes are caught.

MoFtZ · 2024-01-22T09:54:22Z

Downgraded the pre-generated PTX to use CUDA SDK v8, the release that introduced SM_60. This makes it compatible with all architectures from SM_60 onwards, along with the oldest drivers that support them. This was necessary because the CUDA GitHub runners are on SDK v11, and we originally pre-generated using SDK v12.

MoFtZ · 2024-04-05T00:16:40Z

Converted to draft, so that I can refactor the code into smaller PRs that are useful to ILGPU, independent of the pre-generated PTX.

MoFtZ · 2024-04-10T08:47:09Z

Refactoring complete.

MoFtZ force-pushed the feature/libdevice branch from b049269 to 5b588bf on January 14, 2024 12:26

MoFtZ force-pushed the feature/libdevice branch 5 times, most recently from 38a2f6d to 45d76d1 on January 14, 2024 12:55

MoFtZ mentioned this pull request Jan 17, 2024

Optimized PTX IntrinsicMath implementation to use LibDevice. #1151

Closed

MoFtZ force-pushed the feature/libdevice branch from 45d76d1 to ac36bf2 on January 19, 2024 08:56

MoFtZ marked this pull request as draft January 23, 2024 10:35

MoFtZ force-pushed the feature/libdevice branch from b46856d to 55d36d4 on January 31, 2024 23:06

MoFtZ marked this pull request as ready for review January 31, 2024 23:54

This was referenced Apr 4, 2024

Moved OpenCL IntrinsicMath implementations. #1185

Merged

Added Tools to CI pipeline. #1186

Merged

Updated NVVM to support Cuda SDK v8. #1187

Merged

MoFtZ marked this pull request as draft April 5, 2024 00:15

MoFtZ mentioned this pull request Apr 8, 2024

Optimized PTX IntrinsicMath implementation to use LibDevice. #1189

Merged

MoFtZ added 4 commits April 10, 2024 17:50

Moved LibDevice helper functions into separate class.

5ee7bc8

Added tool to generate LibDevice PTX.

f10fa2a

Implemented support for pre-generated LibDevice PTX modules.

3a26493

Added workaround for Cuda Test Runner.

e6cd7bd

MoFtZ force-pushed the feature/libdevice branch from 55d36d4 to e6cd7bd on April 10, 2024 07:50

MoFtZ marked this pull request as ready for review April 10, 2024 08:47
