Implemented support for pre-generated LibDevice PTX modules. #1148

Open
wants to merge 4 commits into
base: master
from

MoFtZ
Collaborator

@MoFtZ MoFtZ commented Jan 12, 2024

Depends on #1147.

Currently, ILGPU supports LibDevice at runtime by using NVVM to generate PTX and then merging the generated PTX into the kernel PTX code. This requires the CUDA SDK to be installed.

This PR pre-generates the PTX code, and then embeds that into ILGPU, so that we can remove the requirement for the CUDA SDK to be installed for LibDevice support.

The first step is a tool that is manually run, on a machine with the CUDA SDK installed. The tool uses NVVM to generate the PTX, and saves it to an XML file for later consumption.

Compiling ILGPU will read this file and generate C# code using T4 templates.
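
As a rough illustration of this generation step (not the T4 template used in this PR), the following Python sketch reads a small XML document and emits C# string constants. The XML layout, the element and attribute names, and the `PregeneratedPtx` class name are all assumptions made for the example; the real file produced by the tool may look quite different.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented for this sketch.
SAMPLE_XML = """
<LibDevicePtx>
  <Function name="Sin">.func sin_stub() { ret; }</Function>
  <Function name="Cos">.func cos_stub() { ret; }</Function>
</LibDevicePtx>
"""


def generate_csharp(xml_text: str) -> str:
    """Turn each <Function> element into a C# string constant."""
    root = ET.fromstring(xml_text.strip())
    lines = ["static class PregeneratedPtx", "{"]
    for fn in root.findall("Function"):
        name = fn.get("name")
        ptx = (fn.text or "").strip()
        # C# verbatim strings escape a double quote by doubling it.
        escaped = ptx.replace('"', '""')
        lines.append(f'    public const string {name} = @"{escaped}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(generate_csharp(SAMPLE_XML))
```

The point of the sketch is only that the PTX becomes a compile-time constant, so nothing from the CUDA SDK is needed when the library is built or used.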

ILGPU has also been updated to support combining multiple PTX modules. This is necessary to avoid clashes in the pre-generated PTX code. For example, Sin(double) and Cos(double) both use the same helper functions in PTX.
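
To make the clash concrete, here is a minimal Python sketch that finds function names declared with `.func` in more than one PTX module. The two toy modules and the helper name `helper_reduce` are made up for illustration, and the regular expression is a simplification of real PTX syntax; it does not show how ILGPU itself combines modules.

```python
import re

# Matches a PTX ".func" directive and captures the function name, skipping an
# optional return-parameter list such as "(.param .f64 r)".
FUNC_DIRECTIVE = re.compile(r"\.func\s+(?:\([^)]*\)\s*)?([\w$]+)")


def declared_functions(ptx: str) -> set:
    """Return the names of all functions declared with .func in a module."""
    return set(FUNC_DIRECTIVE.findall(ptx))


def clashing_functions(module_a: str, module_b: str) -> set:
    """Return names that would be declared twice if the modules were concatenated."""
    return declared_functions(module_a) & declared_functions(module_b)


# Toy modules (hypothetical): both carry their own copy of the same helper.
SIN_PTX = """
.func (.param .f64 r) helper_reduce(.param .f64 x) { ret; }
.func (.param .f64 r) sin_impl(.param .f64 x) { ret; }
"""
COS_PTX = """
.func (.param .f64 r) helper_reduce(.param .f64 x) { ret; }
.func (.param .f64 r) cos_impl(.param .f64 x) { ret; }
"""

if __name__ == "__main__":
    print(clashing_functions(SIN_PTX, COS_PTX))
```

Concatenating the two toy modules as text would declare `helper_reduce` twice, which is the kind of clash that keeping the modules separate avoids.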

Finally, Context.LibDevice() has been marked as [Obsolete] and is now a no-op. Its previous functionality has been renamed to LibDeviceOverride(), for users who want to force a different version of LibDevice at runtime.

QUESTION: The PTX was generated using CUDA SDK v12 and targets SM_60. I'm assuming this means that all users will have to use v12 or newer, and devices that are SM_60 or newer?
Downgraded the pre-generated PTX to use CUDA SDK v8, the release that introduced SM_60. This makes it compatible with all SM_60 architectures, along with the oldest drivers that support them. This was necessary because the CUDA GitHub runners are on SDK v11, and we originally pre-generated using SDK v12.

Restructured ILGPU and ILGPU.Algorithms, moving the IntrinsicMath implementations into ILGPU itself. XMath has other functions that are not part of IntrinsicMath, so it will stay as-is for now.

CLMath in ILGPU.Algorithms only needed to support Rcp and Log(x,y). These have been moved into ILGPU, and CLMath has been removed.

PTXMath in ILGPU.Algorithms provided a number of math functions using CORDIC implementations. Now that pre-generated LibDevice is available in ILGPU, all the IntrinsicMath functions have been switched to call LibDevice for CUDA GPUs. The pre-generated LibDevice PTX code only works on >= SM_60, so the CORDIC functions in ILGPU.Algorithms have been modified to register only for < SM_60. Otherwise, they are no longer used.
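
The selection rule described above can be sketched in a few lines of Python. The function name, the string labels, and the integer encoding of the SM version are assumptions for the example and do not correspond to ILGPU's registration API.

```python
# From the description above: the pre-generated PTX needs SM_60 or newer.
MIN_LIBDEVICE_SM = 60


def select_math_backend(sm_version: int) -> str:
    """Pick the math implementation for a CUDA device (hypothetical helper).

    sm_version encodes the architecture as an integer, e.g. 60 for SM_60.
    """
    if sm_version >= MIN_LIBDEVICE_SM:
        return "libdevice"
    # Older devices fall back to the CORDIC implementations.
    return "cordic"


if __name__ == "__main__":
    for sm in (52, 60, 75):
        print(sm, select_math_backend(sm))
```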

Unit tests for IntrinsicMath have not been implemented. The functions are currently exercised via the ILGPU.Algorithms unit tests.

Added a workaround for XMath.Pow(double, double): the CUDA test runner produces a different result for some combinations of inputs.

@MoFtZ MoFtZ force-pushed the feature/libdevice branch from b049269 to 5b588bf Compare January 14, 2024 12:26
@MoFtZ
Collaborator Author

MoFtZ commented Jan 14, 2024

Added the minimum required CUDA architecture and ISA to the pre-generated XML. PTXBackend then uses these values to report an error if LibDevice is called and the embedded PTX is not compatible.
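
A minimal sketch of this kind of compatibility check, written in Python for brevity. The type and function names, and the concrete version numbers in the usage lines, are illustrative assumptions rather than the values or API used in PTXBackend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PtxRequirements:
    """Minimum requirements recorded next to the embedded PTX (hypothetical type)."""
    min_sm: int        # e.g. 60 for SM_60
    min_isa: tuple     # PTX ISA version as (major, minor)


class IncompatiblePtxError(Exception):
    """Raised when the embedded PTX cannot be used on the target device."""


def ensure_compatible(req: PtxRequirements, device_sm: int, device_isa: tuple) -> None:
    """Raise a descriptive error if the device does not meet the requirements."""
    if device_sm < req.min_sm:
        raise IncompatiblePtxError(
            f"Embedded LibDevice PTX requires SM_{req.min_sm}, device is SM_{device_sm}"
        )
    # Tuples compare element by element, so (4, 3) < (5, 0).
    if device_isa < req.min_isa:
        raise IncompatiblePtxError(
            f"Embedded LibDevice PTX requires ISA {req.min_isa}, got {device_isa}"
        )


if __name__ == "__main__":
    requirements = PtxRequirements(min_sm=60, min_isa=(5, 0))  # illustrative values
    ensure_compatible(requirements, device_sm=75, device_isa=(6, 3))  # passes silently
```

Failing early with a clear message is preferable to letting the driver reject the PTX later with a less specific error.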

@MoFtZ MoFtZ force-pushed the feature/libdevice branch 5 times, most recently from 38a2f6d to 45d76d1 Compare January 14, 2024 12:55
@MoFtZ
Collaborator Author

MoFtZ commented Jan 14, 2024

Added the Tools solution to the CI pipeline to ensure that breaking changes are caught.

@MoFtZ
Collaborator Author

MoFtZ commented Jan 22, 2024

Downgraded the pre-generated PTX to use CUDA SDK v8, the release that introduced SM_60. This makes it compatible with all SM_60 architectures, along with the oldest drivers that support them. This was necessary because the CUDA GitHub runners are on SDK v11, and we originally pre-generated using SDK v12.

@MoFtZ MoFtZ marked this pull request as draft January 23, 2024 10:35
@MoFtZ MoFtZ force-pushed the feature/libdevice branch from b46856d to 55d36d4 Compare January 31, 2024 23:06
@MoFtZ MoFtZ marked this pull request as ready for review January 31, 2024 23:54
@MoFtZ MoFtZ marked this pull request as draft April 5, 2024 00:15
@MoFtZ
Collaborator Author

MoFtZ commented Apr 5, 2024

Converted to draft, so that I can refactor the code into smaller PRs that are useful to ILGPU, independent of the pre-generated PTX.

@MoFtZ MoFtZ force-pushed the feature/libdevice branch from 55d36d4 to e6cd7bd Compare April 10, 2024 07:50
@MoFtZ
Collaborator Author

MoFtZ commented Apr 10, 2024

Refactoring complete.

@MoFtZ MoFtZ marked this pull request as ready for review April 10, 2024 08:47