86 commits
e7a922f
Fast vectorizable atan and atan2 functions.
mcourteaux Aug 10, 2024
ceb65f1
Default to not using fast atan versions if on CUDA.
mcourteaux Aug 10, 2024
11add5b
Finished fast atan/atan2 functions and tests.
mcourteaux Aug 10, 2024
a81fc66
Correct attribution.
mcourteaux Aug 10, 2024
60c4d68
Clang-format
mcourteaux Aug 10, 2024
9ff6314
Weird WebAssembly limits...
mcourteaux Aug 11, 2024
3522538
Small improvements to the optimization script.
mcourteaux Aug 11, 2024
b0937e0
Polynomial optimization for log, exp, sin, cos with correct ranges.
mcourteaux Aug 11, 2024
f3c962c
Improve fast atan performance tests for GPU.
mcourteaux Aug 12, 2024
da50b61
Bugfix fast_atan approximation. Fix correctness test to exceed the ra…
mcourteaux Aug 12, 2024
c030de4
Cleanup
mcourteaux Aug 12, 2024
d07275e
Enum class instead of enum for ApproximationPrecision.
mcourteaux Aug 12, 2024
dffb7c8
Weird Metal limits. There should be a better way...
mcourteaux Aug 12, 2024
22fff6f
Skip test for WebGPU.
mcourteaux Aug 12, 2024
04e583a
Fast atan/atan2 polynomials reoptimized. New optimization strategy: ULP.
mcourteaux Aug 13, 2024
eaa5a9a
Feedback Steven.
mcourteaux Aug 13, 2024
4cd9bdc
More comments and test mantissa error.
mcourteaux Aug 14, 2024
25cb178
Do not error when testing arctan performance on Metal / WebGPU.
mcourteaux Aug 14, 2024
9b2ae31
Rework precision specification. Generalize towards using this for oth…
mcourteaux Nov 11, 2024
4c5f7d9
Clang-format.
mcourteaux Nov 11, 2024
0dfe005
Fix makefile and clang-tidy.
mcourteaux Nov 11, 2024
0d33ba2
Fix incorrect approximation selection when required precision is not …
mcourteaux Nov 12, 2024
3f4e410
Feedback from Steven.
mcourteaux Dec 3, 2024
933d2cb
Implemented approximation tables for sin, cos, exp, log fast variants…
mcourteaux Feb 4, 2025
5991920
Clang-format.
mcourteaux Feb 4, 2025
9d3bb75
Move Polynomial Optimizer Python script to tools/ directory.
mcourteaux Feb 4, 2025
cf7f8a1
Enable performance test for fast_atan and fast_atan2.
mcourteaux Feb 4, 2025
12d873c
LLVM upper-limit 99 (CMake needs an upper limit).
mcourteaux Feb 4, 2025
966c0d1
Add LLVM IR for PTX sin.approx, cos.approx, tanh.approx
mcourteaux Feb 4, 2025
4781620
Implemented tan. Improved polynomial optimizer performance for MULPE …
mcourteaux Feb 5, 2025
242e519
Implemented tanh, tan. Many improvements to accuracy test and perform…
mcourteaux Feb 5, 2025
546cd7d
Clang-format.
mcourteaux Feb 5, 2025
4183dbb
WIP: Fiddle with strict_float behavior in CSE. Fix fast math precisio…
mcourteaux Feb 7, 2025
3eca092
Nuke MAE_MULPE. Separate optimized MULPE-corrected sin and cos.
mcourteaux Feb 8, 2025
3f1aa42
Clang-format
mcourteaux Feb 8, 2025
b08134b
Some cleanup.
mcourteaux Feb 8, 2025
e27503f
Fix sine.
mcourteaux Feb 8, 2025
f97a0ad
Fix clang-tidy. Mark OpenCL exp() as fast.
mcourteaux Feb 8, 2025
b91c6b5
Clang format is annoying me.
mcourteaux Feb 8, 2025
826b53f
Remove my experimental CSE step.
mcourteaux Feb 9, 2025
3b4c28d
OpenCL performance of fast_exp forced poly is expected to be worse.
mcourteaux Feb 9, 2025
2c0ff67
OpenCL fast functions selected for fast transcendentals.
mcourteaux Feb 9, 2025
b61928b
Lower fast intrinsics on metal to the fast:: namespace versions.
mcourteaux Feb 9, 2025
a05a40c
Split tables for sin and cos, as metal has odd precision for sin. Add…
mcourteaux Feb 9, 2025
d9d831d
Move range_reduce_log to a header. Drive-by fix listing libOpenCL.so.…
mcourteaux Feb 10, 2025
59ecb6b
Fix API documentation. Improve measuring accuracy. Fix vector_math te…
mcourteaux Feb 10, 2025
08bcff6
Also vectorize on GPU to make sure we test that.
mcourteaux Feb 11, 2025
5858e2f
Add FastMathFunctions.cpp to Makefile
mcourteaux Feb 11, 2025
54d884d
Add support for derivatives for the fast_ intrinsics.
mcourteaux Feb 11, 2025
f5a6a10
Remove unused helper function.
mcourteaux Feb 11, 2025
a8767bd
Add in a grace factor for precision when the system does not support FMA.
mcourteaux Feb 11, 2025
443a7c3
Clang Format.
mcourteaux Feb 11, 2025
a3bf33c
Windows doesn't print thousand separators with printf. :(
mcourteaux Feb 11, 2025
227dd99
Remove grace factor, and use safety factor of 5% when selecting a pol…
mcourteaux Feb 16, 2025
5d25eb8
Use 50% tighter constraints when no FMA is available to compensate fo…
mcourteaux Feb 17, 2025
95ca768
Clang-format.
mcourteaux Feb 17, 2025
c8871a8
Working on better optimizations. Improving PR and code.
mcourteaux Mar 12, 2025
68aa6bf
Implemented fast_asin() fast_acos(). Slowly redoing coefficients.
mcourteaux Mar 12, 2025
9c5e3d1
WIP: determine precision of the polynomials.
mcourteaux Mar 13, 2025
366abe2
Revived all tests.
mcourteaux Mar 14, 2025
186c8bb
Clang format
mcourteaux Mar 14, 2025
157d59e
Implement expm1. Fix accuracy of tanh. Fix lowering of tanh on CUDA. …
mcourteaux Mar 15, 2025
87fb2ca
Clang-format
mcourteaux Mar 15, 2025
6ca6457
Feedback, and remove expm1 test.
mcourteaux Mar 15, 2025
84408af
Fix compilation issues.
mcourteaux Mar 15, 2025
229c2da
One more compilation issue.
mcourteaux Mar 15, 2025
eac79b5
Fixed a bracket.
mcourteaux Mar 15, 2025
efabc05
Update some precision info on math intrinsics for Vulkan and Metal.
mcourteaux Mar 17, 2025
38606bc
Fix makefile after I accidentally broke it by sorting files alphabeti…
mcourteaux Apr 9, 2025
daf492a
Add fast math calls to new extern_function_name_map for OpenCL.
mcourteaux Jun 1, 2025
a695340
Move fast function calls to extern table for Metal.
mcourteaux Jun 1, 2025
733514d
Try to fix compile/test issues.
mcourteaux Jun 1, 2025
8947659
Fix Makefile and symbol visibility issue.
mcourteaux Jun 1, 2025
19d31db
Clang-format
mcourteaux Jun 1, 2025
7f4b655
Make use of the new strict_float intrinsics for the fast math functions.
mcourteaux Jun 14, 2025
1b77e28
Relax performance tests for GPUs.
mcourteaux Jun 14, 2025
225b8e9
Clang-format
mcourteaux Jun 14, 2025
f24228e
Fix incorrect forward declaration.
mcourteaux Jun 14, 2025
23c9251
Fix acos on Metal. Relax perf-test for tanh on OpenCL.
mcourteaux Jun 16, 2025
ee33b9b
Fix strict float behavior for the fast_tan function. Implemented spli…
mcourteaux Jul 3, 2025
aee1786
Enable fp16 fast_math functions without promises.
mcourteaux Jul 3, 2025
3183778
Clear internal assert, as it assumed SSE floating point behavior, whi…
mcourteaux Jul 3, 2025
7f3f77e
Let CodeGen_C handle all float-literal printing (also for Float(16) i…
mcourteaux Jul 4, 2025
3cdcc70
Fix internal test for CodeGen_C given the scientific way of printing …
mcourteaux Jul 4, 2025
7702d13
Merge branch 'main' into fast-math-lowering
mcourteaux Feb 1, 2026
6c82133
Update.
mcourteaux Feb 1, 2026
3 changes: 3 additions & 0 deletions .gitignore
@@ -240,6 +240,9 @@ xcuserdata
# NeoVim + clangd
.cache

# CCLS
.ccls-cache

# Emacs
tags
TAGS
88 changes: 46 additions & 42 deletions Makefile
@@ -450,21 +450,24 @@ SOURCE_FILES = \
AlignLoads.cpp \
AllocationBoundsInference.cpp \
ApplySplit.cpp \
ApproximationTables.cpp \
Argument.cpp \
AssociativeOpsTable.cpp \
Associativity.cpp \
AsyncProducers.cpp \
AutoScheduleUtils.cpp \
BoundConstantExtentLoops.cpp \
BoundSmallAllocations.cpp \
BoundaryConditions.cpp \
Bounds.cpp \
BoundsInference.cpp \
BoundConstantExtentLoops.cpp \
BoundSmallAllocations.cpp \
Buffer.cpp \
CPlusPlusMangle.cpp \
CSE.cpp \
Callable.cpp \
CanonicalizeGPUVars.cpp \
Closure.cpp \
ClampUnsafeAccesses.cpp \
Closure.cpp \
CodeGen_ARM.cpp \
CodeGen_C.cpp \
CodeGen_D3D12Compute_Dev.cpp \
@@ -474,20 +477,18 @@ SOURCE_FILES = \
CodeGen_LLVM.cpp \
CodeGen_Metal_Dev.cpp \
CodeGen_OpenCL_Dev.cpp \
CodeGen_Vulkan_Dev.cpp \
CodeGen_PTX_Dev.cpp \
CodeGen_Posix.cpp \
CodeGen_PowerPC.cpp \
CodeGen_PTX_Dev.cpp \
CodeGen_PyTorch.cpp \
CodeGen_RISCV.cpp \
CodeGen_Vulkan_Dev.cpp \
CodeGen_WebAssembly.cpp \
CodeGen_WebGPU_Dev.cpp \
CodeGen_X86.cpp \
CompilerLogger.cpp \
ConstantBounds.cpp \
ConstantInterval.cpp \
CPlusPlusMangle.cpp \
CSE.cpp \
Debug.cpp \
DebugArguments.cpp \
DebugToFile.cpp \
@@ -508,6 +509,7 @@ SOURCE_FILES = \
Expr.cpp \
ExtractTileOperations.cpp \
FastIntegerDivide.cpp \
FastMathFunctions.cpp \
FindCalls.cpp \
FindIntrinsics.cpp \
FlattenNestedRamps.cpp \
@@ -519,26 +521,26 @@ SOURCE_FILES = \
Generator.cpp \
HexagonOffload.cpp \
HexagonOptimize.cpp \
ImageParam.cpp \
InferArguments.cpp \
InjectHostDevBufferCopies.cpp \
Inline.cpp \
InlineReductions.cpp \
IntegerDivisionTable.cpp \
Interval.cpp \
IR.cpp \
IREquality.cpp \
IRMatch.cpp \
IRMutator.cpp \
IROperator.cpp \
IRPrinter.cpp \
IRVisitor.cpp \
ImageParam.cpp \
InferArguments.cpp \
InjectHostDevBufferCopies.cpp \
Inline.cpp \
InlineReductions.cpp \
IntegerDivisionTable.cpp \
Interval.cpp \
JITModule.cpp \
Lambda.cpp \
Lerp.cpp \
LICM.cpp \
LLVM_Output.cpp \
LLVM_Runtime_Linker.cpp \
Lambda.cpp \
Lerp.cpp \
LoopCarry.cpp \
Lower.cpp \
LowerParallelTasks.cpp \
@@ -561,8 +563,8 @@ SOURCE_FILES = \
PurifyIndexMath.cpp \
PythonExtensionGen.cpp \
Qualify.cpp \
Random.cpp \
RDom.cpp \
Random.cpp \
Realization.cpp \
RealizationOrder.cpp \
RebaseLoopsToZero.cpp \
@@ -576,28 +578,28 @@ SOURCE_FILES = \
SelectGPUAPI.cpp \
Serialization.cpp \
Simplify.cpp \
SimplifyCorrelatedDifferences.cpp \
SimplifySpecializations.cpp \
Simplify_Add.cpp \
Simplify_And.cpp \
Simplify_Call.cpp \
Simplify_Cast.cpp \
Simplify_Reinterpret.cpp \
Simplify_Div.cpp \
Simplify_EQ.cpp \
Simplify_Exprs.cpp \
Simplify_Let.cpp \
Simplify_LT.cpp \
Simplify_Let.cpp \
Simplify_Max.cpp \
Simplify_Min.cpp \
Simplify_Mod.cpp \
Simplify_Mul.cpp \
Simplify_Not.cpp \
Simplify_Or.cpp \
Simplify_Reinterpret.cpp \
Simplify_Select.cpp \
Simplify_Shuffle.cpp \
Simplify_Stmts.cpp \
Simplify_Sub.cpp \
SimplifyCorrelatedDifferences.cpp \
SimplifySpecializations.cpp \
SkipStages.cpp \
SlidingWindow.cpp \
Solve.cpp \
@@ -649,17 +651,20 @@ HEADER_FILES = \
AlignLoads.h \
AllocationBoundsInference.h \
ApplySplit.h \
ApproximationTables.h \
Argument.h \
AssociativeOpsTable.h \
Associativity.h \
AsyncProducers.h \
AutoScheduleUtils.h \
BoundConstantExtentLoops.h \
BoundSmallAllocations.h \
BoundaryConditions.h \
Bounds.h \
BoundsInference.h \
BoundConstantExtentLoops.h \
BoundSmallAllocations.h \
Buffer.h \
CPlusPlusMangle.h \
CSE.h \
Callable.h \
CanonicalizeGPUVars.h \
ClampUnsafeAccesses.h \
@@ -671,18 +676,16 @@ HEADER_FILES = \
CodeGen_LLVM.h \
CodeGen_Metal_Dev.h \
CodeGen_OpenCL_Dev.h \
CodeGen_Vulkan_Dev.h \
CodeGen_Posix.h \
CodeGen_PTX_Dev.h \
CodeGen_Posix.h \
CodeGen_PyTorch.h \
CodeGen_Targets.h \
CodeGen_Vulkan_Dev.h \
CodeGen_WebGPU_Dev.h \
CompilerLogger.h \
ConciseCasts.h \
CPlusPlusMangle.h \
ConstantBounds.h \
ConstantInterval.h \
CSE.h \
Debug.h \
DebugArguments.h \
DebugToFile.h \
@@ -707,6 +710,7 @@ HEADER_FILES = \
ExternFuncArgument.h \
ExtractTileOperations.h \
FastIntegerDivide.h \
FastMathFunctions.h \
FindCalls.h \
FindIntrinsics.h \
FlattenNestedRamps.h \
@@ -719,6 +723,13 @@ HEADER_FILES = \
Generator.h \
HexagonOffload.h \
HexagonOptimize.h \
IR.h \
IREquality.h \
IRMatch.h \
IRMutator.h \
IROperator.h \
IRPrinter.h \
IRVisitor.h \
ImageParam.h \
InferArguments.h \
InjectHostDevBufferCopies.h \
@@ -727,20 +738,12 @@ HEADER_FILES = \
IntegerDivisionTable.h \
Interval.h \
IntrusivePtr.h \
IR.h \
IREquality.h \
IRMatch.h \
IRMutator.h \
IROperator.h \
IRPrinter.h \
IRVisitor.h \
WasmExecutor.h \
JITModule.h \
Lambda.h \
Lerp.h \
LICM.h \
LLVM_Output.h \
LLVM_Runtime_Linker.h \
Lambda.h \
Lerp.h \
LoopCarry.h \
LoopPartitioningDirective.h \
Lower.h \
@@ -766,18 +769,16 @@ HEADER_FILES = \
PurifyIndexMath.h \
PythonExtensionGen.h \
Qualify.h \
RDom.h \
Random.h \
Realization.h \
RDom.h \
RealizationOrder.h \
RebaseLoopsToZero.h \
Reduction.h \
RegionCosts.h \
RemoveDeadAllocations.h \
RemoveExternLoops.h \
RemoveUndef.h \
runtime/HalideBuffer.h \
runtime/HalideRuntime.h \
Schedule.h \
ScheduleFunctions.h \
Scope.h \
@@ -811,7 +812,10 @@ HEADER_FILES = \
Util.h \
Var.h \
VectorizeLoops.h \
WrapCalls.h
WasmExecutor.h \
WrapCalls.h \
runtime/HalideBuffer.h \
runtime/HalideRuntime.h

OBJECTS = $(SOURCE_FILES:%.cpp=$(BUILD_DIR)/%.o)
HEADERS = $(HEADER_FILES:%.h=$(SRC_DIR)/%.h)
@@ -913,7 +917,7 @@ RUNTIME_CPP_COMPONENTS = \
windows_yield \
write_debug_image \
vulkan \
x86_cpu_features \
x86_cpu_features

RUNTIME_LL_COMPONENTS = \
aarch64 \