Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of std::sin and std::cos in device code generates unwanted FP64 instructions #337

Open
stephenswat opened this issue Mar 9, 2023 · 5 comments
Labels
cuda Changes related to CUDA performance Performance-relevant changes

Comments

@stephenswat
Copy link
Member

stephenswat commented Mar 9, 2023

@krasznaa has recently been on a crusade to make traccc work with his non-FP64-compatible GPU (see e.g. #333 and #335). Instead of hunting these errors down manually, we can do this automatically (see #336). However, the way we have decided to program traccc and its dependencies (in particular detray) will make it difficult to completely eliminate the slow 64-bit instructions. Consider the following source code that is generated in fitting_algorithm.ptx:

.func  (.param .b32 func_retval0) cosf(
    .param .b32 cosf_param_0
)
{
    .local .align 4 .b8     __local_depot1092[28];
    .reg .b64   %SP;
    .reg .b64   %SPL;
    .reg .pred  %p<22>;
    .reg .f32   %f<44>;
    .reg .b32   %r<69>;
    .reg .f64   %fd<3>;
    .reg .b64   %rd<28>;


    mov.u64     %SPL, __local_depot1092;
    cvta.local.u64  %SP, %SPL;
    ld.param.f32    %f16, [cosf_param_0];
    bra.uni     $L__BB1092_1;

   ...

$L__BB1092_13:
    mov.u32     %r28, %r66;
    mov.u32     %r27, %r65;
    mov.u32     %r26, %r64;
    cvt.u64.u32     %rd15, %r27;
    shl.b64     %rd16, %rd15, 32;
    cvt.u64.u32     %rd17, %r28;
    or.b64      %rd18, %rd16, %rd17;
    cvt.rn.f64.s64  %fd1, %rd18;
    mul.f64     %fd2, %fd1, 0d3BF921FB54442D19;
    cvt.rn.f32.f64  %f3, %fd2;
    setp.ne.s32     %p13, %r26, 0;
    not.pred    %p14, %p13;
    mov.f32     %f38, %f3;
    @%p14 bra   $L__BB1092_15;
    bra.uni     $L__BB1092_14;

It is not hard to identify that the 64-bit floating point instructions are being generated as a result of the use of std::sin. There is a similar case with the use of std::cos. The canonical way of implementing this in CUDA, if single-precision does indeed provide sufficient precision, is to use the __sinf compiler intrinsic. Currently, we don't really have a way of controlling the implementation that is used, as this is abstracted away behind detray and algebra-plugins.

@stephenswat stephenswat added cuda Changes related to CUDA bug Something isn't working labels Mar 9, 2023
@krasznaa
Copy link
Member

krasznaa commented Mar 9, 2023

Bull#$^... 😕 Double- and triple-check that we are not mistakenly providing double inputs to those trigonometric functions. I very much suspect that we are.

If anything, we may want to switch to using std::sinf and friends.

But in the end we shouldn't be using any of those. We'll need to make all of them use the trigonometric functions from:

https://github.com/acts-project/algebra-plugins/blob/main/math/common/include/algebra/math/common.hpp

sin/cos is not there yet, but there are for instance a number of places in our code where std::sqrt is used instead of algebra::math::sqrt.

@stephenswat
Copy link
Member Author

stephenswat commented Mar 9, 2023

I invite you to compile the following extremely trivial CUDA code and inspect the PTX:

#include <cmath>

__global__ void sins(float * f) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    f[tid] = std::sin(f[tid]);
}

nvcc -c --keep test.cu && cat test.ptx and you should find those 64-bit floating point instructions.

Then please consider the following lines:

https://github.com/acts-project/detray/blob/main/core/include/detray/tracks/detail/track_helper.hpp#L89
https://github.com/acts-project/detray/blob/main/core/include/detray/definitions/math.hpp#L24

That should clear up what's happening.

Looks like it's detray and not algebra plugins, but potato/potato.

@stephenswat
Copy link
Member Author

Pinging @niermann999 @beomki-yeo.

@stephenswat
Copy link
Member Author

Something else to mention: this effect goes away when using --use_fast_math. We'll need to check whether using pure 32-bit trigonometry provides us with the precision we need. If it does, we should consider whether we want to switch to using intrinsic trig functions whether we want to enable non-compliant math.

@stephenswat
Copy link
Member Author

stephenswat commented Mar 9, 2023

Okay, so after looking into this a bit more, the use of double-precision in the single-precision trigonometry functions is a relatively uncommon branch to cover subnormal floating point numbers. The difference in performance between std::sin and __sinf is small but certainly present; taking the sine of one billion floating point numbers on an A6000 is approximately 3% faster using the intrinsic. However, this is in a very simple kernel. Interestingly, the implementation using the standard library may have significantly higher register pressure. I cannot currently say anything about the liveness of those registers, but this effect could potentially impact the performance of trigonometry functions in code that is already bound by register pressure. According to the PTX standard, the use of the sine and cosine have a maximum absolute error of 2-20.9 or 5.1 × 10-7.

The way in which we proceed here should depend on our attitude towards the use of double-precision floating point numbers and our willingness to sacrifice performance for convenience. If we wish to completely eliminate double-precision floating point numbers, we should exclusively use the approximation intrinsics. For performance, this would also be preferable, but this would involve some work to incorporate it into detray and algebra plugins. The alternative would be to enable fast math, but this may have other unintended side-effects.

@stephenswat stephenswat added performance Performance-relevant changes and removed bug Something isn't working labels Mar 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda Changes related to CUDA performance Performance-relevant changes
Projects
None yet
Development

No branches or pull requests

2 participants