std.math.big: Implementation of assembly version of some functions and division improvement #23848

samy-00007 · 2025-05-10T16:56:41Z

A bigger pull request this time.

What this PR does:

add x86_64 assembly implementations for llaccum and llmulLimb (for both .add and .sub), with a special case for ReleaseSmall. There are also versions of llmulLimb for the .add operation if the adx cpu feature is supported (needed for adox and adcx instructions)
add generic versions of llaccum and llmulLimb for other architectures, with slight edits from the previous function to have a better codegen (for llmulLimb)
add tests for llaccum and llmulLimb
change the algorithm (overall very similar) for the division, and prepare the function for future implementations of better algorithms for larger numbers (coming soonish, hopefully). There is also a special case when the divisor fits in one Limb

I hope the addition of assembly version of these functions will be accepted, as I am not sure about the stance on assembly implementations in the std for functions that can be written without.
The current way I handle the assembly is suboptimal, and adding assembly for other architectures or other cpu features is cumbersome. It would probably be better to have a separate file for each architecture whenever these versions are added.
Also, my implementations (at least for llmulLimb) can still be improved.

Speed comparison on my machine:

Based on some quick tests for various size of randomly generated arrays of Limb (from 500 to 10000):

the new llaccum is ~5.5 times faster than the previous one for addition and ~4.3 times faster for subtraction
the new llmulLimb is ~2.5 times faster than the previous one (with adx) while it is ~2.2 times faster for both addition (without adx) and subtraction.

The new generic version of llmulLimb is faster than the previous one but slower than the assembly one (between 30 to 60% slower depending on the operation and whether adx is supported)
I did not manage to get better codegen for llaccum, which is partly why I resorted to assembly.

The division part

note: a "2n by n division" mean a division of a bigint with 2n limbs by a bigint of n limbs. Likewise, a "n by 1 division" mean a division of a bigint with n limbs by a bigint of 1 limb.
Also, "gmp (basecase division only)" is a gmp version where I only enabled the base algorithm. Once numbers get large enough, gmp switches to better division algorithms, so it is not a fair comparison, though I left default gmp in the graphs as a reference.

Division performance for 2n by n division is only improved by the new llmulLimb (leading to the same curve on the graph). The new division is ~2.2 times faster than the previous one, while being about 34% slower than gmp (basecase division only). Based on a few tests, this is due to gmp's better implementation of llmulLimb (which makes sense, since the algorithm used is otherwise near identical).

For n by 1 division, however, the new implementation is vastly superior, as it is ~6 times faster than the previous one, and about 32% slower than gmp. I got very similar graphs for n by 2 and n by 3 divisions (and haven't benched beyond). I don't know why the previous algorithm underperforms that much in these cases. Also, I haven't benched the now removed lldiv0p5, but the new implementation now doesn't care about half limbs and is probably faster.

Benching script here.

alexrp · 2025-05-11T18:55:30Z

lib/std/testing.zig

@@ -34,7 +34,7 @@ pub var log_level = std.log.Level.warn;
 // Disable printing in tests for simple backends.
 pub const backend_can_print = !(builtin.zig_backend == .stage2_spirv64 or builtin.zig_backend == .stage2_riscv64);

-fn print(comptime fmt: []const u8, args: anytype) void {
+pub fn print(comptime fmt: []const u8, args: anytype) void {


Was this intended?

Yes, it is used in the test with errdefer to print the arguments passed to testAccum and testMulLimb, for a better error trace when calling them from a loop, like in llaccum y full of 0. I can remove it, but for some tests it would result in a lot more lines (182 in the case of llaccum y full of 0)

samy-00007 · 2025-05-11T19:19:25Z

Should I start pinging jacobly each time I do a PR related to bigints ? I have at least 2 of them in preparation.

samy-00007 · 2025-05-13T21:49:42Z

I have a better division algorithm ready. I initially planned to put it in a new PR, but I can also add it in this one. What should I do ?

samy-00007 · 2025-05-16T19:30:15Z

I'm actually adding this new algorithm to the PR, since it is actually quite small.

The new algorithm uses a divide and conquer method, allowing it to leverage faster multiplication algorithms, and therefore being faster for larger numbers.

In the graph, the new algorithm is default_div (new), the one from this PR before the previous commit is default_div (prev) and the one currently in the std is default_div (old). The graph on the right is made with the same data, but only with the new algorithm and gmp.

Here is also an unbalanced division (5n by n instead of 2n by n):

On a side note, I don't think any test can trigger the new code since it requires a divisor of at least recursive_division_threshold, which is currently 100.

samy-00007 added 5 commits May 2, 2025 23:30

math.big.int: change llcmp to use math.Order

90234c0

fix: incorrect field access

1485f6a

math.big.int: added x86 assembly version of llaccum and llmulLimb

ce42591

math.big.int: improvement to the division algorithm

a1d85a1

fix: messed up the order of the result from div3by2

1cab18f

alexrp requested a review from jacobly0 May 11, 2025 18:55

alexrp reviewed May 11, 2025

View reviewed changes

samy-00007 force-pushed the bigint-basecase-division branch 2 times, most recently from 04c53e7 to e28ad90 Compare May 16, 2025 20:22

samy-00007 added 2 commits May 16, 2025 22:24

math.big: mul functions now indicate overflow

2a048f3

math.big: add asymptotically faster division algorithm

d15ccdb

samy-00007 force-pushed the bigint-basecase-division branch from e28ad90 to d15ccdb Compare May 16, 2025 20:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

std.math.big: Implementation of assembly version of some functions and division improvement #23848

std.math.big: Implementation of assembly version of some functions and division improvement #23848

samy-00007 commented May 10, 2025

alexrp May 11, 2025

samy-00007 May 11, 2025

samy-00007 commented May 11, 2025

samy-00007 commented May 13, 2025

samy-00007 commented May 16, 2025

std.math.big: Implementation of assembly version of some functions and division improvement #23848

Are you sure you want to change the base?

std.math.big: Implementation of assembly version of some functions and division improvement #23848

Conversation

samy-00007 commented May 10, 2025

Speed comparison on my machine:

The division part

alexrp May 11, 2025

Choose a reason for hiding this comment

samy-00007 May 11, 2025

Choose a reason for hiding this comment

samy-00007 commented May 11, 2025

samy-00007 commented May 13, 2025

samy-00007 commented May 16, 2025