-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace the internal usage of Cstruct.t by the bytes type #146
Conversation
I went through the code, this looks mostly good. Some questions remain, though:
|
Any update on this? Any performance metrics supporting this change? (I'm all in for such a change). |
Hello, thanks for this PR. I currently don't really know how to stresstest it but I found this PR interesting (untested yet, planed if the next weeks). TLDR: I came to the same conclusions, So far I'm on the following journey:
If interested, one should be able to reproduce with: opam pin ocaml-solo5 https://github.com/palainp/ocaml-solo5.git#no-dlmalloc -y
mirage configure -t spt && make depend && dune build
solo5-spt --net:service=service --mem=16 dist/resolver.spt > log.txt
grep requested log.txt | sort | uniq -c # plus maybe some shell magic to unify the allocation of the same size |
7f3fc49
to
f9938f2
Compare
f9938f2
to
bf89829
Compare
I reverted the API changes in 65dbfae -- so the _bytes functions aren't exposed. The reasoning behind this is that I do not like the new names, and would like to avoid introducing them. If necessary, I'm fine to just move everything to string/bytes (with a breaking API change), but introducing new things in the interface makes more work (at some point we have to deprecate and switch the existing ones to bytes)... |
FWIW I'm ok with the deletion of new |
38199d9
to
f6b3613
Compare
Some time later, I force-pushed the commits again - sorry about that. I also removed the Also, since #192 got merged, I managed to gather some benchmarks on my laptop (Lenovo X250, Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz) using main branch:
this PR (8bdd655):
Tl;DR: around a factor of 2. I'm fine to merge this in the current state (again cc @Firobe @dinosaure @palainp) -- there could be further improvements:
I'm fine to leave those as separate PR(s). |
…s type Co-Authored-by: Pierre Alain <[email protected]>
f6b3613
to
8bdd655
Compare
I can confirm on AMD Ryzen 9 7500X a factor of 2 as you on the benchmark. |
Thanks for your comments, @reynir. I guess it makes sense to include the "use string instead of bytes" in here!? (And convert the curve values to strings directly as well.) Especially since we're not in a rush. |
Using string in 25519 (everything only for 25519, numbers in op/s) baseline (OpenSSL): ecdsa-sign 21942.7 ecdsa-verify 7080.6 ecdh: 26200.7 (
|
I also moved NIST to use string -- likely there's still some fine tuning to do (to minimize the calls to out_fe_to_fe) - a full review is likely a good idea. As example, Params.a and Params.b could directly be called with Fe.from_be_bytes Statistics f4991c2:
|
My gut feeling is that our let f = Fe.create () in
Fe.mul f (out_fe_to_fe f) (out_fe_to_fe f); Where let mul a b =
let res = create () in
mul res a b;
out_fe_to_fe res WDYT? |
now there are more sane interfaces... benchmark for 1d81a2a I've no clue about the variation / improvement in performance, maybe we should investigate?
and with 28805b7
|
…ult, use octets instead
Whoah thank you @hannesm for that huge amount of work here. I'm a bit late for the benchmark task, here are my results (current head / latest commit on this pr):
I'm less than 2x performance increase (1.5-2) with P* and almost no gain with 25519 ec (my CPU is a recent i7, but that was done in a Qubes AppVM, maybe I'm lacking the super asm instructions :) ) |
I'm done here. Feedback welcome. Plan to squash & merge on Tuesday Feb 13th. Huge thanks to @dinosaure for initial work, @palainp for rebase & bugfixes, @reynir for review -- and everyone for offline discussions and comments that motivated me to put some more effort into this PR. |
Thanks a lot for putting the work into this PR! (I do also get a factor 2 improvement on the HEAD commit compared to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for the huge amount of work you've done here! The quest for more performance will be for another PR here x2 is super already!
I've just noticed the miss of one (const WORD*)String_val
.
So LGTM!
CHANGES: * mirage-crypto, mirage-crypto-rng{,lwt,mirage}: support CL.EXE compiler (mirage/mirage-crypto#137 @jonahbeckford) - mirage-crypto-pk not yet due to gmp dependency, mirage-crypto-ec doesn't pass testsuite * mirage-crypto-ec: use simpler square root for ed25519 - saving 3 multiplications and 2 squarings, details https://mailarchive.ietf.org/arch/msg/cfrg/qlKpMBqxXZYmDpXXIx6LO3Oznv4/ (mirage/mirage-crypto#196 @hannesm) * mirage-crypto-ec: use sliding window method with pre-computed calues of multiples of the generator point for NIST curves, speedup around 4x for P-256 sign (mirage/mirage-crypto#191 @Firobe, review @palainp @hannesm) * mirage-crypto-ec: documentation: warn about power timing analysis on `k` in Dsa.sign (mirage/mirage-crypto#195 @hannesm, as proposed by @edwintorok) * mirage-crypto-ec: replace internal Cstruct.t by string (speedup up to 2.5x) (mirage/mirage-crypto#146 @dinosaure @hannesm @reynir, review @Firobe @palainp @hannesm @reynir) * bench/speed: add EC (ECDSA & EdDSA generate/sign/verify, ECDH secret/share) operations (mirage/mirage-crypto#192 @hannesm) * mirage-crypto-rng: use rdtime instead of rdcycle on RISC-V (rdcycle is privileged since Linux kernel 6.6) (mirage/mirage-crypto#194 @AdrianBunk, review by @edwintorok) * mirage-crypto-rng: support Loongarch (mirage/mirage-crypto#190 @fangyaling, review @loongson-zn) * mirage-crypto-rng: support NetBSD (mirage/mirage-crypto#189 @drchrispinnock) * mirage-crypto-rng: allocate less in Fortuna when feeding (mirage/mirage-crypto#188 @hannesm, reported by @palainp) * mirage-crypto-ec: avoid mirage-crypto-pk and asn1-combinators test dependency (instead, craft our own asn.1 decoder -- mirage/mirage-crypto#200 @hannesm) ### Performance differences between v0.11.2 and v0.11.3 and OpenSSL The overall result is promising: P-256 sign operation improved 9.4 times, but is still a 4.9 times slower than OpenSSL. Numbers in operations per second (apart from speedup, which is a factor v0.11.3 / v0.11.2), gathered on a Intel i7-5600U CPU 2.60GHz using FreeBSD 14.0, OCaml 4.14.1, and OpenSSL 3.0.12. #### P224 | op | v0.11.2 | v0.11.3 | speedup | OpenSSL | |--------|---------|---------|---------|---------| | gen | 1160 | 20609 | 17.8 | | | sign | 931 | 8169 | 8.8 | 21319 | | verify | 328 | 1606 | 4.9 | 10719 | | dh-sec | 1011 | 12595 | 12.5 | | | dh-kex | 992 | 2021 | 2.0 | 16691 | #### P256 | op | v0.11.2 | v0.11.3 | speedup | OpenSSL | |--------|---------|---------|---------|---------| | gen | 990 | 19365 | 19.6 | | | sign | 792 | 7436 | 9.4 | 36182 | | verify | 303 | 1488 | 4.9 | 13383 | | dh-sec | 875 | 11508 | 13.2 | | | dh-kex | 895 | 1861 | 2.1 | 17742 | #### P384 | op | v0.11.2 | v0.11.3 | speedup | OpenSSL | |--------|---------|---------|---------|---------| | gen | 474 | 6703 | 14.1 | | | sign | 349 | 3061 | 8.8 | 900 | | verify | 147 | 544 | 3.7 | 1062 | | dh-sec | 378 | 4405 | 11.7 | | | dh-kex | 433 | 673 | 1.6 | 973 | #### P521 | op | v0.11.2 | v0.11.3 | speedup | OpenSSL | |--------|---------|---------|---------|---------| | gen | 185 | 1996 | 10.8 | | | sign | 137 | 438 | 3.2 | 2737 | | verify | 66 | 211 | 3.2 | 1354 | | dh-sec | 180 | 1535 | 8.5 | | | dh-kex | 201 | 268 | 1.3 | 2207 | #### 25519 | op | v0.11.2 | v0.11.3 | speedup | OpenSSL | |--------|---------|---------|---------|---------| | gen | 23271 | 22345 | 1.0 | | | sign | 11228 | 10985 | 1.0 | 21794 | | verify | 8149 | 8029 | 1.0 | 7729 | | dh-sec | 14075 | 13968 | 1.0 | | | dh-kex | 13487 | 14079 | 1.0 | 24824 |
Originally, we used Cstruct.t (bigarray) for interfacing. Instead, we use string now. The benefit is that allocating a string is cheap, and in line with OCaml's GC. After some years of stalling, we included benchmarks in bench/speed.ml fot the EC operations in mirage#192 (sign, verify, generate for EC/EdDSA; and ECDH). The result for thi change is a factor between 2 and 2.5. The external API (mirage_crypto_ec.mli) does not change at all. There are various other cleanups in the code, such as providing a layer to isolate the C calls (which receive a bytes buffer for the result value, and thus mutate this buffer) to be immutable. Co-authored-by: Pierre Alain <[email protected]> Co-authored-by: Hannes Mehnert <[email protected]> Co-authored-by: Reynir Björnsson <[email protected]> Reviewed-by: Virgile Robles <[email protected]> Reviewed-by: Pierre Alain <[email protected]>
This PR wants to delete more and more an usage of
Cstruct.t
. The goal of it is to give an opportunity to usebytes
instead ofCstruct.t
. Indeed, in a certain usage,mirage-crypto-ec
allocates too manyCstruct.t
which put a huge pressure on the GC. In some of my stress test, I finally get anOut_of_memory
exception and an introspection of the memory consumption shows me thatmirage-crypto-ec
allocates 26 % of what I used and most of these objects:This PR is a draft to internally uses
bytes
and keep the same interface with a systematic derivation of functions withbytes
. Then, C calls use onlybytes
instead ofCstruct.t
.I will come back with some statistics to see if this PR change something about allocations or not.