Begin transition to making a release version #325

gwoltman · 2025-03-20T16:13:19Z

Made some existing #define options available as -use options instead. Feel free to come up with better names for these options.
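For readers who have not used `-use` before, here is a rough sketch of the idea behind moving a `#define` to a `-use` option. It assumes each `KEY=VALUE` setting ends up as a `-DKEY=VALUE` flag for the OpenCL compiler, which this PR does not spell out. The helper name `use_to_defines` is hypothetical, and the option values are arbitrary illustrations rather than recommended settings.

```python
def use_to_defines(use_options):
    """Turn -use style settings such as "PAD=256" into -D compiler flags.

    Illustrative only: this is not the project's parser, and the mapping
    to -D flags is an assumption made for this sketch.
    """
    defines = []
    for opt in use_options:
        key, sep, value = opt.partition("=")
        if not key:
            raise ValueError(f"empty option name in {opt!r}")
        # A bare name (no "=") is treated as a flag set to 1.
        defines.append(f"-D{key}={value if sep else '1'}")
    return defines


# Option names come from this PR; the values are arbitrary examples.
flags = use_to_defines(["TAIL_KERNELS=2", "TAIL_TRIGS=1", "PAD=256", "ZEROHACK_W=1"])
print(" ".join(flags))
```

The point of the change is that a setting like this can be tried per run without editing the source and rebuilding.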

Wrote alternate chainmul8 that uses fewer F64 ops (faster on low DP GPUs) but has worse roundoff error. We will need data from some of these GPUs to decide if this chainMul8 version should be made an official FFT spec option. Cleaned up terminology in math.cl csq and ccube macros. Eliminated FancyUpdate macros. There may be a slight improvement in Z values.

gwoltman added 14 commits March 18, 2025 01:05

SINGLE_WIDE and SINGLE_KERNEL #defines replaced by -use TAIL_KERNELS=0,1,2,3

660e630

Replaced PREFER_DP_TO_MEM with -use TAIL_TRIGS=0,1,2

7cf400e

Changes -use PAD=n from an on/off value to number of bytes to pad. Exposed MIDDLE_LDS_TRANSPOSE setting with -use. Intel Battlemage reportedly prefers LDS transpose off.

8de8f0d

Corrected default PAD value to match middle.cl default PAD value

f85f7c3

Fixed bug in single kernel tailSquare with TAIL_TRIGS=0 or 1.

a2eda75

Added -log argument for my personal use. I've always preferred logging every 100K or 1M iterations. I did not add this to the --help output in case you want to keep the option hidden.

24344f5

Added -use ZEROHACK_W=1 to replace klunky -UNROLL_W=2 or 3.

2b00d47

Added -use ZEROHACK_H=1 in tailSquare to mirror the code in carryFused.

Slightly faster MM2_CHAIN. Improves speed of infrequently used :1 and :3 fft specs

481a3a6

Changed default TAIL_KERNELS to double-wide, single kernel.

4dc981f

This is the best setting on Intel, AMD, and nVidia (at least until the next rocm optimizer change :)

Rewrote MM2_CHAIN = 0 case. Saved a fairly insignificant 2 F64 ops.

947c51b

More importantly, improved Z value

Coded up a new MM_CHAIN=1 version for TRIG_HI (fft spec :1 and :3).

4dd475f

This version uses fewer F64 ops, but is slower on Radeon 7 -- probably the rocm optimizer acting up. New version is disabled. I'll ask some users to see if it will be beneficial on other GPUs.

Saved a couple of F64 ops along with a tiny Z improvement by tweaking MM_CHAIN=0 case

a55a071

Dialed back one of the MM2_CHAIN=0 changes to keep rocm optimizer happy

4d2cad5

preda merged commit d12753e into preda:master Mar 24, 2025
7 checks passed
