Potential sigverify latency optimization #81

sakridge · 2020-09-15T17:49:25Z

The ed25519 sigverify check computes a*A + b*B in a single thread. That is reasonably efficient on a CPU: the combined operation saves instructions, and spilling the stack to L1 is comparatively cheap there. On a GPU, since there are so many threads, one could instead do a*A with one kernel launch and b*B in parallel with it, then do the addition at the end, which is pretty cheap. Those launches would then use a larger portion of the GPU, but in low-batch situations I think this is preferable to letting a large part of the GPU go to waste. Each scalar multiply would also have much less register pressure, since it only has half the temporaries to deal with.

It might even be worth keeping both options available, in case one is more efficient than the other depending on whether the GPU sees a large or a small batch.
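
To make the two strategies concrete, here is a rough CPU-side sketch in Python, not the actual GPU code. It assumes plain affine Edwards arithmetic with the RFC 8032 curve constants, a textbook interleaved double-and-add for the single-thread path, and a thread pool that only stands in for two concurrent kernel launches (Python threads give no real speedup here). All function names, and the `split_below` threshold in particular, are hypothetical and would need to be measured on real hardware.

```python
from concurrent.futures import ThreadPoolExecutor

# ed25519 curve (RFC 8032): -x^2 + y^2 = 1 + D*x^2*y^2 over GF(P).
P = 2**255 - 19
D = -121665 * pow(121666, -1, P) % P
IDENTITY = (0, 1)


def recover_x(y):
    """Return the even x such that (x, y) is on the curve (assumes one exists)."""
    x2 = (y * y - 1) * pow(D * y * y + 1, -1, P) % P
    x = pow(x2, (P + 3) // 8, P)
    if (x * x - x2) % P != 0:
        x = x * pow(2, (P - 1) // 4, P) % P  # multiply by sqrt(-1)
    return x if x % 2 == 0 else P - x


BASE_Y = 4 * pow(5, -1, P) % P
BASE = (recover_x(BASE_Y), BASE_Y)  # the standard base point B


def on_curve(pt):
    x, y = pt
    return (-x * x + y * y - 1 - D * x * x * y * y) % P == 0


def point_add(p1, p2):
    """Unified affine Edwards addition (also handles doubling and identity)."""
    (x1, y1), (x2, y2) = p1, p2
    t = D * x1 * x2 * y1 * y2 % P
    x3 = (x1 * y2 + x2 * y1) * pow((1 + t) % P, -1, P) % P
    y3 = (y1 * y2 + x1 * x2) * pow((1 - t) % P, -1, P) % P
    return (x3, y3)


def scalar_mul(k, pt):
    """Plain double-and-add: one of the two independent halves."""
    acc = IDENTITY
    for i in reversed(range(k.bit_length())):
        acc = point_add(acc, acc)
        if (k >> i) & 1:
            acc = point_add(acc, pt)
    return acc


def fused_double_mul(a, A, b, B):
    """a*A + b*B in one loop: shared doublings, more live temporaries."""
    AB = point_add(A, B)
    acc = IDENTITY
    for i in reversed(range(max(a.bit_length(), b.bit_length()))):
        acc = point_add(acc, acc)
        bit_a, bit_b = (a >> i) & 1, (b >> i) & 1
        if bit_a and bit_b:
            acc = point_add(acc, AB)
        elif bit_a:
            acc = point_add(acc, A)
        elif bit_b:
            acc = point_add(acc, B)
    return acc


def split_double_mul(a, A, b, B):
    """a*A and b*B as two independent tasks, then one cheap addition."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(scalar_mul, a, A)  # stands in for launch 1
        fut_b = pool.submit(scalar_mul, b, B)  # stands in for launch 2
        return point_add(fut_a.result(), fut_b.result())


def double_mul(a, A, b, B, batch_size, split_below=1024):
    """Pick a path by batch size; split_below is a made-up placeholder."""
    if batch_size < split_below:
        return split_double_mul(a, A, b, B)
    return fused_double_mul(a, A, b, B)
```

Both paths should return the same point for any inputs. The trade-off is that the fused loop shares its doublings between the two scalars, while the split version does roughly twice as many doublings in total but gives each task only one accumulator and one input point to keep live.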
