-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
std.math.big: Implementation of assembly version of some functions and division improvement #23848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@@ -34,7 +34,7 @@ pub var log_level = std.log.Level.warn; | |||
// Disable printing in tests for simple backends. | |||
pub const backend_can_print = !(builtin.zig_backend == .stage2_spirv64 or builtin.zig_backend == .stage2_riscv64); | |||
|
|||
fn print(comptime fmt: []const u8, args: anytype) void { | |||
pub fn print(comptime fmt: []const u8, args: anytype) void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is used in the test with errdefer
to print the arguments passed to testAccum
and testMulLimb
, for a better error trace when calling them from a loop, like in llaccum y full of 0
. I can remove it, but for some tests it would result in a lot more lines (182 in the case of llaccum y full of 0
)
Should I start pinging jacobly each time I do a PR related to bigints ? I have at least 2 of them in preparation. |
I have a better division algorithm ready. I initially planned to put it in a new PR, but I can also add it in this one. What should I do ? |
04c53e7
to
e28ad90
Compare
e28ad90
to
d15ccdb
Compare
A bigger pull request this time.
What this PR does:
llaccum
andllmulLimb
(for both.add
and.sub
), with a special case forReleaseSmall
. There are also versions ofllmulLimb
for the.add
operation if theadx
cpu feature is supported (needed foradox
andadcx
instructions)llaccum
andllmulLimb
for other architectures, with slight edits from the previous function to have a better codegen (forllmulLimb
)llaccum
andllmulLimb
Limb
I hope the addition of assembly version of these functions will be accepted, as I am not sure about the stance on assembly implementations in the std for functions that can be written without.
The current way I handle the assembly is suboptimal, and adding assembly for other architectures or other cpu features is cumbersome. It would probably be better to have a separate file for each architecture whenever these versions are added.
Also, my implementations (at least for
llmulLimb
) can still be improved.Speed comparison on my machine:
Based on some quick tests for various size of randomly generated arrays of
Limb
(from 500 to 10000):llaccum
is ~5.5 times faster than the previous one for addition and ~4.3 times faster for subtractionllmulLimb
is ~2.5 times faster than the previous one (with adx) while it is ~2.2 times faster for both addition (without adx) and subtraction.The new generic version of
llmulLimb
is faster than the previous one but slower than the assembly one (between 30 to 60% slower depending on the operation and whetheradx
is supported)I did not manage to get better codegen for
llaccum
, which is partly why I resorted to assembly.The division part
note: a "2n by n division" mean a division of a bigint with 2n limbs by a bigint of n limbs. Likewise, a "n by 1 division" mean a division of a bigint with n limbs by a bigint of 1 limb.
Also, "gmp (basecase division only)" is a gmp version where I only enabled the base algorithm. Once numbers get large enough, gmp switches to better division algorithms, so it is not a fair comparison, though I left default gmp in the graphs as a reference.
Division performance for 2n by n division is only improved by the new
llmulLimb
(leading to the same curve on the graph). The new division is ~2.2 times faster than the previous one, while being about 34% slower than gmp (basecase division only). Based on a few tests, this is due to gmp's better implementation ofllmulLimb
(which makes sense, since the algorithm used is otherwise near identical).For n by 1 division, however, the new implementation is vastly superior, as it is ~6 times faster than the previous one, and about 32% slower than gmp. I got very similar graphs for n by 2 and n by 3 divisions (and haven't benched beyond). I don't know why the previous algorithm underperforms that much in these cases. Also, I haven't benched the now removed
lldiv0p5
, but the new implementation now doesn't care about half limbs and is probably faster.Benching script here.