device-libs: Split powF into separate fast entry points #1265
arsenm wants to merge 1 commit into amd-staging
The compiler needs to make the contextual decision to switch
a particular call to the fast version based on the fast math flags.
The global option is inflexible: it requires the whole translation unit
to use the same version and requires duplicating the function into every
translation unit. The compiler needs a separate entry point to do this.
Give pow, powr, pown, and rootn _fast-suffixed variants to call. This
defines __ocml_pow_fast_f32, __ocml_powr_fast_f32, __ocml_pown_fast_f32,
and __ocml_rootn_fast_f32 as the implementation's fast entry points.
Additionally, the OpenCL library now defines __pow_fast, __powr_fast,
__pown_fast, and __rootn_fast overloads as the public entry points.
For now, leave the UNSAFE_MATH_OPT check in place and redirect from the
base function to the fast version. This stages the change and avoids a
commit-order dependence between the library and the compiler.
Document the worst-case ULP values. I extracted these by hacking up
the conformance test to report better information in the fast cases. This was
more painful than I expected because:
- test_bruteforce only tests pow with relaxed math and doesn't verify the ULP,
  so I had to force it to report values and also handle powr/pown/rootn.
- Relaxed testing is done with -cl-fast-relaxed-math instead of
  -cl-unsafe-math-optimizations, so NaNs were breaking even though these
  implementations do not assume finite-only math.
This looks OK, but I'm wondering about how we're going to ensure the published ULP limits continue to be met as the compiler evolves and as we add new HW?
The compiler isn't really making precision decisions, it's following what the library code does. But this isn't any different from the other documented bounds here which ideally would be updated as appropriate
I'm concerned about regression detection. If somehow the accuracy drops by a thousand ulp is that going to be detected quickly? And is our answer going to be to drop the guaranteed accuracy? I don't think so. Once we publish that limit, we had better not ever raise it.