
Slow commitment to polynomial with blst_from_scratch #202

Open

nazar-pc opened this issue Mar 20, 2023 · 10 comments

@nazar-pc (Contributor)

I have migrated code from https://github.com/dusk-network/plonk to blst_from_scratch and noticed improved performance everywhere except when committing to a polynomial.

Before:

create-polynomial       time:   [78.910 µs 80.642 µs 82.387 µs]
commit                  time:   [375.98 µs 382.69 µs 389.47 µs]
prove                   time:   [406.50 µs 412.22 µs 417.96 µs]
verify                  time:   [2.6014 ms 2.7458 ms 2.8826 ms]

After:

create-polynomial       time:   [1.7445 µs 1.7448 µs 1.7451 µs]
commit                  time:   [779.59 µs 779.96 µs 780.34 µs]
prove                   time:   [386.50 µs 387.20 µs 387.99 µs]
verify                  time:   [1.7031 ms 1.7032 ms 1.7034 ms]

It is ~2x slower with blst_from_scratch, though I'm not yet sure where the slowness comes from.

I'm committing to a polynomial with 8 coefficients, while FsKZGSettings was initialized with a scale of 10.

@belijzajac (Collaborator)

The function of interest is g1_linear_combination, where Pippenger's algorithm is used for larger polynomials and naive multiplication for polynomials of degree less than 8.

https://github.com/sifraitech/rust-kzg/blob/b09120d67150af474d7407092691f866a874bac1/blst-from-scratch/src/types/kzg_settings.rs#L56-L65
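For anyone unfamiliar with the dispatch, here is a rough sketch of the idea. This is illustrative only: u128 integers stand in for blst G1 points so the bucket bookkeeping is easy to check, and the function names (`naive_msm`, `pippenger_msm`, `g1_linear_combination_sketch`) are invented; the real g1_linear_combination works on elliptic-curve points, where "scalar * point" is scalar multiplication.

```rust
/// Naive multi-scalar multiplication: sum(scalar_i * point_i).
fn naive_msm(points: &[u128], scalars: &[u64]) -> u128 {
    points.iter().zip(scalars).map(|(p, s)| p * (*s as u128)).sum()
}

/// Pippenger's bucket method with window width `w`: split each scalar into
/// w-bit digits, add each point into one of 2^w - 1 buckets per window, then
/// combine the buckets with a suffix running sum.
fn pippenger_msm(points: &[u128], scalars: &[u64], w: u32) -> u128 {
    let mask = (1u64 << w) - 1;
    let windows = 64 / w; // assumes w divides 64 in this sketch
    let mut result = 0u128;
    for win in (0..windows).rev() {
        result <<= w; // w "doublings" of the running result
        let mut buckets = vec![0u128; mask as usize];
        for (p, s) in points.iter().zip(scalars) {
            let digit = ((s >> (win * w)) & mask) as usize;
            if digit > 0 {
                buckets[digit - 1] += p;
            }
        }
        // sum_{d=1..2^w-1} d * buckets[d] via a running suffix sum
        let (mut running, mut window_sum) = (0u128, 0u128);
        for b in buckets.iter().rev() {
            running += b;
            window_sum += running;
        }
        result += window_sum;
    }
    result
}

/// Mirrors the dispatch in kzg_settings.rs: naive below 8 points,
/// Pippenger otherwise.
fn g1_linear_combination_sketch(points: &[u128], scalars: &[u64]) -> u128 {
    if points.len() < 8 {
        naive_msm(points, scalars)
    } else {
        pippenger_msm(points, scalars, 4)
    }
}

fn main() {
    let points: Vec<u128> = (1..=32u128).map(|i| i * 7).collect();
    let scalars: Vec<u64> = (1..=32u64).map(|i| i * 3 + 1).collect();
    assert_eq!(
        g1_linear_combination_sketch(&points, &scalars),
        naive_msm(&points, &scalars)
    );
    println!("MSM = {}", naive_msm(&points, &scalars));
}
```

Pippenger wins for large inputs because the bucket accumulation amortizes point additions across all terms, while the naive loop pays a full scalar multiplication per term.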

We discovered some time ago that, when committing to polynomials of degree 2^15, a parallelized variable-base multi-scalar multiplication (MSM) approach is much faster than Pippenger's algorithm; due to a deadline, we only implemented it for the arkworks and zkcrypto backends.

[chart: commit to poly benchmark]

What we can try is to port this algorithm, originally written by the ZEXE (now arkworks?) authors, from zkcrypto to blst-from-scratch and mcl (we'll need it for EIP-4844 anyway):

https://github.com/sifraitech/rust-kzg/blob/b09120d67150af474d7407092691f866a874bac1/zkcrypto/src/curve/multiscalar_mul.rs#L183-L268

Though I doubt we'll see any improvement for polynomials of degree up to 8.

@sauliusgrigaitis (Member)

We should try to port MSM to blst_from_scratch. I think @belijzajac and I discussed this before: the Go KZG implementation that uses gnark, which supports MSM, performs much faster. We should check what other optimizations they use and benchmark against them.

@sauliusgrigaitis (Member)

> I'm committing to a polynomial with 8 coefficients, while FsKZGSettings was initialized with a scale of 10.

@nazar-pc just to be sure: do your polynomials have only 2^3 coefficients? These optimizations usually start to show significant gains at much higher scales. At scale 2^3 the timings are already in microseconds (except verification, which is slightly slower), so you likely won't notice a difference unless you have a special case.

@nazar-pc (Contributor, Author)

The benchmark used 2^3 coefficients, but we have various use cases; some go up to 2^15.

@sauliusgrigaitis (Member)

Worth checking this paper. They claim a 78% speedup over arkworks, and source code is available (gnark appears to have this implementation).

@nazar-pc (Contributor, Author)

@dariolina, you can take a look at ^ when you have time

@dariolina

Thank you! I have also found this paper, which claims a 3-20% speedup over blst using precomputation. Maybe I could combine the two.
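The precomputation idea can be sketched roughly like this (a toy model, not the paper's actual algorithm: u128 integers stand in for curve points, and `W`, `precompute`, and `mul_with_table` are invented names). Because the SRS points used as commitment bases are fixed, per-window multiples of each point can be computed once up front, turning later scalar multiplications into table lookups and additions with no runtime doublings:

```rust
const W: u32 = 8;            // window width in bits
const WINDOWS: u32 = 64 / W; // scalars are u64 in this toy model

/// One table per window position j: table[j][d] = d * 2^(j*W) * p.
/// In the additive toy group, 2^(j*W) * p is just a left shift.
fn precompute(p: u128) -> Vec<Vec<u128>> {
    (0..WINDOWS)
        .map(|j| {
            let base = p << (j * W);
            (0..(1u128 << W)).map(|d| d * base).collect()
        })
        .collect()
}

/// Scalar multiplication using only table lookups and additions:
/// decompose the scalar into W-bit digits and sum the precomputed entries.
fn mul_with_table(table: &[Vec<u128>], scalar: u64) -> u128 {
    (0..WINDOWS)
        .map(|j| {
            let digit = ((scalar >> (j * W)) & ((1 << W) - 1)) as usize;
            table[j as usize][digit]
        })
        .sum()
}

fn main() {
    let p: u128 = 9;
    let table = precompute(p);
    let s: u64 = 123_456_789;
    // The table-driven product matches the direct product.
    assert_eq!(mul_with_table(&table, s), p * s as u128);
    println!("{} * {} = {}", p, s, mul_with_table(&table, s));
}
```

The trade-off is memory: each base point costs WINDOWS * 2^W stored multiples, which is why this pays off mainly for fixed bases such as a KZG trusted setup.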

@belijzajac (Collaborator)

I found that there already exists a Pippenger implementation for the supranational/blst Rust bindings, which appears to use buckets (a large grid for storing intermediate results) and a thread pool for parallelization.

I implemented it in d365801, which achieved the following results:

| Benchmark               | Old       | New       | Speedup |
|-------------------------|-----------|-----------|---------|
| blob_to_kzg_commitment  | 54.145 ms | 17.132 ms | 3.16x   |
| compute_kzg_proof       | 65.029 ms | 26.509 ms | 2.45x   |
| compute_blob_kzg_proof  | 65.680 ms | 27.342 ms | 2.40x   |
| bench_commit_to_poly    | 348.61 ms | 97.402 ms | 3.58x   |

However, our test_fk_multi_chunk_len_1_512 test fails (only 1 out of 80), the Java binding also fails 1 out of 222 test vectors, and I'm not entirely certain why.


@nazar-pc (Contributor, Author)

nazar-pc commented Apr 5, 2023

For the use case I have in mind, single-core performance is more important.
We are already parallelizing across multiple cores, so we need fewer CPU cycles per call (if possible), not necessarily less wall-clock time per call using any number of cores.

@sauliusgrigaitis (Member)

@belijzajac great work! This seems related to small sample sizes; both tests that failed for us fail because of this. Could you try to reproduce the issue in blst's own tests, so we can submit it to the blst folks?

@nazar-pc there is an interesting situation with parallel vs. non-parallel. We should probably stick to this concept:

Non-parallel means strictly single-threaded: no parallelism in the cryptographic algorithms, nor in blob batching.

Parallel means maximum parallelism: parallelism is used both within the cryptographic algorithms and to parallelize blob batching. This approach is simple to understand, but because of parallelization overhead it may not give the best results on some setups.
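A minimal sketch of the batch-level half of that convention, using std::thread scoped threads as a stand-in (the actual crates gate this behind a cargo feature, and `commit_one` with its checksum body is an invented placeholder for a real per-blob commitment):

```rust
use std::thread;

// Placeholder for a per-blob commitment; a real implementation would run
// a single-threaded blob_to_kzg_commitment here.
fn commit_one(blob: &[u64]) -> u64 {
    blob.iter().fold(0u64, |acc, x| acc.wrapping_add(*x))
}

/// Non-parallel: strictly single-threaded, blobs handled one after another.
fn commit_all_serial(blobs: &[Vec<u64>]) -> Vec<u64> {
    blobs.iter().map(|b| commit_one(b)).collect()
}

/// Parallel: batch-level parallelism across blobs, while each inner call
/// stays single-threaded so the two levels don't compete for cores.
fn commit_all_parallel(blobs: &[Vec<u64>]) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = blobs
            .iter()
            .map(|b| s.spawn(move || commit_one(b)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let blobs = vec![vec![1, 2, 3], vec![10, 20], vec![5]];
    // Both strategies must produce identical results.
    assert_eq!(commit_all_serial(&blobs), commit_all_parallel(&blobs));
    println!("{:?}", commit_all_serial(&blobs)); // [6, 30, 5]
}
```

The point of the convention is that a caller who already parallelizes across blobs (as @nazar-pc described) picks the non-parallel build and gets predictable per-call CPU cost, while a caller with idle cores picks the parallel build and lets the library saturate them.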
