Skip to content

std.math.big: Implementation of assembly version of some functions and division improvement #23848

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

samy-00007
Copy link
Contributor

A bigger pull request this time.

What this PR does:

  • add x86_64 assembly implementations for llaccum and llmulLimb (for both .add and .sub), with a special case for ReleaseSmall. There are also versions of llmulLimb for the .add operation if the adx cpu feature is supported (needed for adox and adcx instructions)
  • add generic versions of llaccum and llmulLimb for other architectures, with slight edits from the previous function to have a better codegen (for llmulLimb)
  • add tests for llaccum and llmulLimb
  • change the algorithm (overall very similar) for the division, and prepare the function for future implementations of better algorithms for larger numbers (coming soonish, hopefully). There is also a special case when the divisor fits in one Limb

I hope the addition of assembly version of these functions will be accepted, as I am not sure about the stance on assembly implementations in the std for functions that can be written without.
The current way I handle the assembly is suboptimal, and adding assembly for other architectures or other cpu features is cumbersome. It would probably be better to have a separate file for each architecture whenever these versions are added.
Also, my implementations (at least for llmulLimb) can still be improved.

Speed comparison on my machine:

Based on some quick tests for various size of randomly generated arrays of Limb (from 500 to 10000):

  • the new llaccum is ~5.5 times faster than the previous one for addition and ~4.3 times faster for subtraction
  • the new llmulLimb is ~2.5 times faster than the previous one (with adx) while it is ~2.2 times faster for both addition (without adx) and subtraction.

The new generic version of llmulLimb is faster than the previous one but slower than the assembly one (between 30 to 60% slower depending on the operation and whether adx is supported)
I did not manage to get better codegen for llaccum, which is partly why I resorted to assembly.

The division part

note: a "2n by n division" mean a division of a bigint with 2n limbs by a bigint of n limbs. Likewise, a "n by 1 division" mean a division of a bigint with n limbs by a bigint of 1 limb.
Also, "gmp (basecase division only)" is a gmp version where I only enabled the base algorithm. Once numbers get large enough, gmp switches to better division algorithms, so it is not a fair comparison, though I left default gmp in the graphs as a reference.

Division performance for 2n by n division is only improved by the new llmulLimb (leading to the same curve on the graph). The new division is ~2.2 times faster than the previous one, while being about 34% slower than gmp (basecase division only). Based on a few tests, this is due to gmp's better implementation of llmulLimb (which makes sense, since the algorithm used is otherwise near identical).

For n by 1 division, however, the new implementation is vastly superior, as it is ~6 times faster than the previous one, and about 32% slower than gmp. I got very similar graphs for n by 2 and n by 3 divisions (and haven't benched beyond). I don't know why the previous algorithm underperforms that much in these cases. Also, I haven't benched the now removed lldiv0p5, but the new implementation now doesn't care about half limbs and is probably faster.

graph

Benching script here.

@alexrp alexrp requested a review from jacobly0 May 11, 2025 18:55
@@ -34,7 +34,7 @@ pub var log_level = std.log.Level.warn;
// Disable printing in tests for simple backends.
pub const backend_can_print = !(builtin.zig_backend == .stage2_spirv64 or builtin.zig_backend == .stage2_riscv64);

fn print(comptime fmt: []const u8, args: anytype) void {
pub fn print(comptime fmt: []const u8, args: anytype) void {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is used in the test with errdefer to print the arguments passed to testAccum and testMulLimb, for a better error trace when calling them from a loop, like in llaccum y full of 0. I can remove it, but for some tests it would result in a lot more lines (182 in the case of llaccum y full of 0)

@samy-00007
Copy link
Contributor Author

Should I start pinging jacobly each time I do a PR related to bigints ? I have at least 2 of them in preparation.

@samy-00007
Copy link
Contributor Author

I have a better division algorithm ready. I initially planned to put it in a new PR, but I can also add it in this one. What should I do ?

@samy-00007
Copy link
Contributor Author

I'm actually adding this new algorithm to the PR, since it is actually quite small.

The new algorithm uses a divide and conquer method, allowing it to leverage faster multiplication algorithms, and therefore being faster for larger numbers.

In the graph, the new algorithm is default_div (new), the one from this PR before the previous commit is default_div (prev) and the one currently in the std is default_div (old). The graph on the right is made with the same data, but only with the new algorithm and gmp.
2n by n division

Here is also an unbalanced division (5n by n instead of 2n by n):
5n by n division

On a side note, I don't think any test can trigger the new code since it requires a divisor of at least recursive_division_threshold, which is currently 100.

@samy-00007 samy-00007 force-pushed the bigint-basecase-division branch 2 times, most recently from 04c53e7 to e28ad90 Compare May 16, 2025 20:22
@samy-00007 samy-00007 force-pushed the bigint-basecase-division branch from e28ad90 to d15ccdb Compare May 16, 2025 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants