
Faster VSHA256? #3787

Closed
nigoroll opened this issue Mar 8, 2022 · 16 comments

@nigoroll
Member

nigoroll commented Mar 8, 2022

ZeSecretProject (slash/fellow storage engine) is going to use SHA256 a lot to ensure on-disk data integrity.
Benchmarking the current code with perf on cache-misses only, VSHA256_Transform is the top CPU consumer:

Samples: 993K of event 'cycles', 99 Hz, Event count (approx.): 54471260991 lost: 0/0 drop: 0/0
Overhead  Shared Object      Symbol
  16.46%  varnishd           [.] VSHA256_Transform
   4.80%  libc-2.31.so       [.] __vfprintf_internal
   2.93%  varnishd           [.] VSLbv
   2.65%  libc-2.31.so       [.] _IO_default_xsputn
   2.62%  varnishd           [.] WS_Assert
   1.97%  varnishd           [.] hcb_insert
   1.65%  libc-2.31.so       [.] __GI___printf_fp_l
   1.61%  varnishd           [.] http_findhdr
   1.55%  varnishd           [.] vbe32dec
   1.36%  varnishd           [.] HSH_Lookup
   1.22%  varnishd           [.] CNT_Request

So I looked for alternative, faster implementations and this post led me to https://github.com/intel/isa-l_crypto

To get a rough idea of the relevance, I hacked their benchmark code to include vsha256 ...

diff --git a/sha256_mb/sha256_mb_vs_ossl_perf.c b/sha256_mb/sha256_mb_vs_ossl_perf.c
index 51759d7..44af5fd 100644
--- a/sha256_mb/sha256_mb_vs_ossl_perf.c
+++ b/sha256_mb/sha256_mb_vs_ossl_perf.c
@@ -30,9 +30,22 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <openssl/sha.h>
+// XXX
+#undef SHA256_DIGEST_LENGTH
+#include <vsha256.h>
 #include "sha256_mb.h"
 #include "test.h"
 
+static void
+vsha256(const void *p, size_t l, unsigned char hash[VSHA256_LEN])
+{
+       VSHA256_CTX ctx;
+
+       VSHA256_Init(&ctx);
+       VSHA256_Update(&ctx, p, l);
+       VSHA256_Final(hash, &ctx);
+}
+
 // Set number of outstanding jobs
 #define TEST_BUFS 32
 
@@ -80,6 +93,17 @@ int main(void)
        }
        sha256_ctx_mgr_init(mgr);
 
+       // Start Varnish VSHA256 tests
+       perf_start(&start);
+       for (t = 0; t < TEST_LOOPS; t++) {
+               for (i = 0; i < TEST_BUFS; i++)
+                       vsha256(bufs[i], TEST_LEN, digest_ssl[i]);
+       }
+       perf_stop(&stop);
+
+       printf("vsha256_varnish" TEST_TYPE_STR ": ");
+       perf_print(stop, start, (long long)TEST_LEN * i * t);
+
        // Start OpenSSL tests
        perf_start(&start);
        for (t = 0; t < TEST_LOOPS; t++) {

... and got this result (varnish compiled with clang -g -O2, clang 11.0.1-2):

slink@haggis21:~/Devel/isa-l_crypto/sha256_mb (master)$ gcc sha256_mb_vs_ossl_perf.c -march=native -O3 -Wall -I../include ../.libs/libisal_crypto.a -lcrypto -I/tmp/include/varnish /home/slink/Devel/varnish-git/varnish-cache/lib/libvarnish/.libs/libvarnish.a
slink@haggis21:~/Devel/isa-l_crypto/sha256_mb (master)$ ./a.out 
vsha256_varnish_cold: runtime =    6826217 usecs, bandwidth 640 MB in 6.8262 sec = 98.31 MB/s
sha256_openssl_cold: runtime =    1208562 usecs, bandwidth 640 MB in 1.2086 sec = 555.28 MB/s
multibinary_sha256_cold: runtime =     380778 usecs, bandwidth 640 MB in 0.3808 sec = 1762.41 MB/s
Multi-buffer sha256 test complete 32 buffers of 1048576 B with 20 iterations
 multibinary_sha256_ossl_perf: Pass

So I wonder whether I should pull custom code into ZeSecretProject, or whether there would be any interest in adding a faster implementation with platform-specific assembly code. For the particular project mentioned, the license is quite permissive, but would require including it.
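For a rough feel of what the benchmark above measures, the same single-stream throughput calculation can be sketched with Python's standard hashlib. This is an illustration only; the buffer sizes mirror the C test, but absolute Python figures are not comparable to the C numbers.

```python
import hashlib
import time

BUF_LEN = 1024 * 1024          # 1 MiB per buffer, as in the C test
N_BUFS = 32                    # TEST_BUFS in the C benchmark

data = bytes(BUF_LEN)          # zero-filled test input

start = time.perf_counter()
for _ in range(N_BUFS):
    digest = hashlib.sha256(data).digest()
elapsed = time.perf_counter() - start

mb = BUF_LEN * N_BUFS / 1e6
print(f"sha256: {mb:.0f} MB in {elapsed:.4f} s = {mb / elapsed:.2f} MB/s")
```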

@dridi
Member

dridi commented Mar 9, 2022

Assuming we'd eventually use openssl for TLS support, would the 5x throughput increase be enough to marginalize the SHA256 CPU footprint in your workload?

@nigoroll
Member Author

nigoroll commented Mar 9, 2022

@dridi I think so, yes.

@dridi
Member

dridi commented Mar 9, 2022

Since we already have a breakage planned for VUT, we could also retire vsha256 in favor of libcrypto as part of the next soname bump for libvarnishapi.

@nigoroll
Member Author

nigoroll commented Mar 9, 2022

correction with apologies:

Though -O2 was in fact in the clang calls to compile libvarnish, there was a later -O0 from CFLAGS overriding it, which I had not noticed.

Compiling again with -O3, the relative speed of vsha256 is not as bad as initially reported, but there is still quite a difference:

  • compiling libvarnish:
$ make V=1 libvarnish.la 
....
/bin/bash ../../libtool  --tag=CC   --mode=compile clang -DHAVE_CONFIG_H -I. -I../..  -I../../include -I../../include   -DVARNISH_STATE_DIR='"/var/run"'  -g -O3 -MT libvarnish_la-vsha256.lo -MD -MP -MF .deps/libvarnish_la-vsha256.Tpo -c -o libvarnish_la-vsha256.lo `test -f 'vsha256.c' || echo './'`vsha256.c
libtool: compile:  clang -DHAVE_CONFIG_H -I. -I../.. -I../../include -I../../include -DVARNISH_STATE_DIR=\"/var/run\" -g -O3 -MT libvarnish_la-vsha256.lo -MD -MP -MF .deps/libvarnish_la-vsha256.Tpo -c vsha256.c  -fPIC -DPIC -o .libs/libvarnish_la-vsha256.o
mv -f .deps/libvarnish_la-vsha256.Tpo .deps/libvarnish_la-vsha256.Plo
  • benchmark result:
slink@haggis21:~/Devel/isa-l_crypto/sha256_mb (master)$ gcc sha256_mb_vs_ossl_perf.c -march=native -O3 -Wall -I../include ../.libs/libisal_crypto.a -lcrypto -I/tmp/include/varnish /home/slink/Devel/varnish-git/varnish-cache/lib/libvarnish/.libs/libvarnish.a
slink@haggis21:~/Devel/isa-l_crypto/sha256_mb (master)$ ./a.out 
vsha256_varnish_cold: runtime =    2178673 usecs, bandwidth 640 MB in 2.1787 sec = 308.03 MB/s
sha256_openssl_cold: runtime =    1203413 usecs, bandwidth 640 MB in 1.2034 sec = 557.65 MB/s
multibinary_sha256_cold: runtime =     384072 usecs, bandwidth 640 MB in 0.3841 sec = 1747.30 MB/s
Multi-buffer sha256 test complete 32 buffers of 1048576 B with 20 iterations
 multibinary_sha256_ossl_perf: Pass

@mbgrydeland
Contributor

I was intrigued by the performance numbers shown for that SHA implementation.

If I understand it correctly, this library achieves its impressive numbers only when it can handle multiple separate hashes in parallel. That is, many distinct SHA256 digests need to be produced simultaneously on the same core. That doesn't match well with the way hashes are used in Varnish, where a single hash is produced in one thread and its result is needed to move forward.

I did a naive hack to the perf example to make it non-parallel, to get a feeling for how it compares to libcrypto in something more similar to our use cases. To achieve this I made it flush the mgr immediately after each data set is added.

diff --git a/sha256_mb/sha256_mb_vs_ossl_perf.c b/sha256_mb/sha256_mb_vs_ossl_perf.c
index 51759d7..58530ac 100644
--- a/sha256_mb/sha256_mb_vs_ossl_perf.c
+++ b/sha256_mb/sha256_mb_vs_ossl_perf.c
@@ -94,11 +94,12 @@ int main(void)
 	// Start mb tests
 	perf_start(&start);
 	for (t = 0; t < TEST_LOOPS; t++) {
-		for (i = 0; i < TEST_BUFS; i++)
-			sha256_ctx_mgr_submit(mgr,
-					      &ctxpool[i], bufs[i], TEST_LEN, HASH_ENTIRE);
-
-		while (sha256_ctx_mgr_flush(mgr)) ;
+		for (i = 0; i < TEST_BUFS; i++) {
+			sha256_ctx_mgr_submit(mgr, &ctxpool[i],
+			    bufs[i], TEST_LEN, HASH_ENTIRE);
+			while (sha256_ctx_mgr_flush(mgr)) ;
+		}
+		/* while (sha256_ctx_mgr_flush(mgr)) ; */
 	}
 	perf_stop(&stop);
sha256_openssl_cold: runtime =    1522753 usecs, bandwidth 640 MB in 1.5228 sec = 440.71 MB/s
multibinary_sha256_cold: runtime =    2021062 usecs, bandwidth 640 MB in 2.0211 sec = 332.05 MB/s
Multi-buffer sha256 test complete 32 buffers of 1048576 B with 20 iterations
 multibinary_sha256_ossl_perf: Pass

So unless I did something very wrong, I think it performs worse than libcrypto in the non-parallel use case, and that libcrypto is still the better drop-in replacement for VSHA256.

@nigoroll
Member Author

@mbgrydeland good point, see also intel/isa-l_crypto#45

To begin with, my primary question really was whether there would be any interest in adding a faster implementation, and whether platform-specific assembly code was something we would consider pulling in at all. If the answer is no, we would not even need to look into any more details of isa-l.

Regarding how we could best integrate this specific implementation, I have not looked at the details yet. For my use case, hashing multiple objects in parallel would actually be an option. In fact, I had already considered parallelizing VSHA256 over multiple threads, and with the mb implementation it seems possible to avoid multiple threads and still get a speed-up.

Overall, a faster VSHA256 would be good, and while isa-l just appeared particularly attractive at first sight, I have no stakes in any particular implementation.

@mbgrydeland
Contributor

mbgrydeland commented Mar 10, 2022

I second using a faster SHA256, and I believe libcrypto's implementation already does all the hard parts, such as figuring out the level of processor support at runtime. So I would love to see a configure check for libcrypto, using it if available. In my opinion, bringing in libcrypto is justifiable regardless of any feelings about OpenSSL in general.

For my use case, hashing multiple objects in parallel would actually be an option.

I assume that for your use case the cryptographic qualities of SHA256 are not important. If it is only a data-consistency check you are after, there are much faster alternatives available. Maybe have a look at https://github.com/Cyan4973/xxHash
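The trade-off being suggested can be sketched as follows. Since xxHash itself is a third-party dependency, `zlib.crc32` stands in here as the non-cryptographic checksum; the point is only the speed/strength trade-off, not the specific algorithm.

```python
import hashlib
import zlib

payload = b"some on-disk object body" * 1024

# Cryptographic: 32-byte digest, collision-resistant, comparatively slow.
strong = hashlib.sha256(payload).digest()

# Non-cryptographic: 4-byte checksum, detects accidental corruption
# only -- fine for an integrity check, useless against an adversary.
weak = zlib.crc32(payload)

print(f"sha256: {strong.hex()}")
print(f"crc32 : {weak:08x}")
```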

@nigoroll
Member Author

Maybe have a look at https://github.com/Cyan4973/xxHash

thank you for the pointer :) The current choice of SHA256 was primarily for being very well known and accepted, to reduce the FUD potential. But you are right that I should consider other options :)

@dridi
Member

dridi commented Mar 10, 2022

So I would love to see a configure check for libcrypto, using it if available. In my opinion, bringing in libcrypto is justifiable regardless of any feelings about OpenSSL in general.

See #3787 (comment) and #3787 (comment); I would advise against adding ifdefery and would simply migrate our SHA256 usage to libcrypto.

We already agreed that regardless of how we think or feel about openssl, we could use it in varnishtest, with varnishd backends, and even for varnishd client connections to the extent that we wouldn't touch the certificates.

@nigoroll
Member Author

FTR: When considering alternatives, we should also look at a realistic benchmark comparison for the primary varnish-cache use case of creating the hash for cache lookup. The results could differ significantly from the benchmarks on bulk data.
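The effect being anticipated here is that with lookup-sized inputs, per-call setup and finalization dominate, while bulk benchmarks mostly measure the compress function. A rough sketch of the shape of this effect, using Python's hashlib (illustrative only; not the C numbers):

```python
import hashlib
import time

def throughput(msg_len, total_bytes):
    """Return MB/s achieved when hashing total_bytes in msg_len-sized messages."""
    data = bytes(msg_len)
    n = total_bytes // msg_len
    start = time.perf_counter()
    for _ in range(n):
        hashlib.sha256(data).digest()
    elapsed = time.perf_counter() - start
    return (n * msg_len / 1e6) / elapsed

small = throughput(50, 8 * 1024 * 1024)           # lookup-key sized inputs
bulk = throughput(1024 * 1024, 64 * 1024 * 1024)  # bulk data
print(f"50 B messages : {small:8.2f} MB/s")
print(f"1 MiB messages: {bulk:8.2f} MB/s")
```

Per-message overhead makes the small-input figure far lower than the bulk figure, which is why implementations can rank differently on the two workloads.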

@nigoroll
Member Author

bugwash: @mbgrydeland volunteered to look after benchmarks, thank you

@mbgrydeland
Contributor

I've run some benchmarks of VSHA256 and OpenSSL's SHA256 (and the one from isa-l_crypto, though that one is likely a bit unfair, see below). The program I've used tries to mimic the way we use digests for hash lookups. That means I've selected two smallish sizes for a URL and a Host name, and hash these together with a '\0' byte after each to produce one digest.
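The digest construction described above can be sketched as follows: hash the URL and Host with a '\0' terminator after each. Incremental update calls and a one-shot hash of the concatenation yield the same digest, which is what the benchmark's per-field update calls rely on (the URL and Host values below are made up for illustration):

```python
import hashlib

url = b"/some/path/to/an/object?with=a&query=1"  # example values only
host = b"example.com"

# Incremental, mirroring the SHA256_Update/VSHA256_Update sequence:
ctx = hashlib.sha256()
ctx.update(url)
ctx.update(b"\0")
ctx.update(host)
ctx.update(b"\0")
incremental = ctx.digest()

# One-shot over the concatenated input:
oneshot = hashlib.sha256(url + b"\0" + host + b"\0").digest()

assert incremental == oneshot
print(incremental.hex())
```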

The program was compiled with:

$ cd isa-l_crypto/sha256_mb
$ gcc sha_perf_vsha.c -march=native -O3 -I../include ../.libs/libisal_crypto.a -lcrypto -I/path/to/varnishbuild/include /path/to/varnishbuild/lib/libvarnish/.libs/libvarnish.a

The libvarnish.a was compiled using the standard options (i.e. not a debug build, and using -O2).

When using TEST_LEN_URL==39 and TEST_LEN_HOST==11:

$ ./a.out 
vsha256_cold: runtime =    4516057 usecs, bandwidth 639 MB in 4.5161 sec = 148.60 MB/s
sha256_openssl_cold: runtime =    3117478 usecs, bandwidth 639 MB in 3.1175 sec = 215.27 MB/s
multibinary_sha256_cold: runtime =    4205865 usecs, bandwidth 639 MB in 4.2059 sec = 159.56 MB/s
 multibinary_sha256_ossl_perf: Pass

When using TEST_LEN_URL==89 and TEST_LEN_HOST==67:

$ ./a.out 
vsha256_cold: runtime =    3885209 usecs, bandwidth 639 MB in 3.8852 sec = 172.73 MB/s
sha256_openssl_cold: runtime =    2470226 usecs, bandwidth 639 MB in 2.4702 sec = 271.67 MB/s
multibinary_sha256_cold: runtime =    3457578 usecs, bandwidth 639 MB in 3.4576 sec = 194.09 MB/s
 multibinary_sha256_ossl_perf: Pass

It looks like libcrypto's SHA256 is ~50% faster than VSHA on my laptop's i7-7500U.

The numbers for isa-l_crypto are a bit unfair, as the test includes an endianness byte swap after each run. That byte swap is likely naive and not optimised. It is needed to get the same results as the other implementations. Maybe I'm using the library wrong and there's a smarter way to get the same results. Without it, the numbers were on par with libcrypto.
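The byte swap in question: isa-l_crypto exposes the digest as eight native-endian uint32 words, while VSHA256/OpenSSL emit big-endian bytes, so each word goes through to_be32() before comparing results. The same conversion sketched with Python's struct module, on a made-up word array (not real isa-l output):

```python
import struct

# Hypothetical result_digest words as a little-endian machine holds them:
words = [0xAABBCCDD, 0x00112233, 0x44556677, 0x8899AABB,
         0xCCDDEEFF, 0x01234567, 0x89ABCDEF, 0xFEDCBA98]

# Equivalent of the C loop: hash[i] = to_be32(ctx.job.result_digest[i])
digest = struct.pack(">8I", *words)

print(digest.hex())  # first word serialized big-endian: aabbccdd...
```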

The program used as a diff against sha256_mb_vs_ossl_perf.c in the isa-l_crypto repo:

--- sha256_mb_vs_ossl_perf.c	2022-03-22 14:16:29.796055575 +0100
+++ sha_perf_vsha.c	2022-03-22 15:15:08.874388744 +0100
@@ -29,35 +29,84 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#include <string.h>
 #include <openssl/sha.h>
 #include "sha256_mb.h"
 #include "test.h"
 
-// Set number of outstanding jobs
-#define TEST_BUFS 32
-
-#ifdef CACHED_TEST
-// Loop many times over same data
-#  define TEST_LEN     4*1024
-#  define TEST_LOOPS   4000
-#  define TEST_TYPE_STR "_warm"
-#else
 // Uncached test.  Pull from large mem base.
-#  define GT_L3_CACHE  32*1024*1024	/* some number > last level cache */
-#  define TEST_LEN     (GT_L3_CACHE / TEST_BUFS)
-#  define TEST_LOOPS   20
-#  define TEST_TYPE_STR "_cold"
-#endif
 
-#define TEST_MEM TEST_LEN * TEST_BUFS * TEST_LOOPS
+#define TEST_LEN_URL	39
+#define TEST_LEN_HOST	11
+/* #define TEST_LEN_URL	89 */
+/* #define TEST_LEN_HOST	67 */
+#define TEST_LEN	(TEST_LEN_URL + TEST_LEN_HOST)
+#define GT_L3_CACHE	(32*1024*1024)	/* some number > last level cache */
+#define TEST_BUFS	(GT_L3_CACHE / TEST_LEN)
+#define TEST_LOOPS	20
+#define TEST_TYPE_STR	"_cold"
 
 /* Reference digest global to reduce stack usage */
+static uint8_t digest_vsha[TEST_BUFS][4 * SHA256_DIGEST_NWORDS];
 static uint8_t digest_ssl[TEST_BUFS][4 * SHA256_DIGEST_NWORDS];
+static uint8_t digest_mb[TEST_BUFS][4 * SHA256_DIGEST_NWORDS];
+
+static uint8_t nullbyte = '\0';
+
+static void
+test_osha256(const char *p, size_t l, uint8_t hash[SHA256_DIGEST_LENGTH])
+{
+	SHA256_CTX ctx;
+
+	SHA256_Init(&ctx);
+	SHA256_Update(&ctx, p, TEST_LEN_URL);
+	SHA256_Update(&ctx, &nullbyte, 1);
+	SHA256_Update(&ctx, p + TEST_LEN_URL, TEST_LEN_HOST);
+	SHA256_Update(&ctx, &nullbyte, 1);
+	SHA256_Final(hash, &ctx);
+}
+
+#undef SHA256_DIGEST_LENGTH
+#include <vsha256.h>
+
+static void
+test_vsha256(const char *p, size_t l, uint8_t hash[SHA256_DIGEST_LENGTH])
+{
+	VSHA256_CTX ctx;
+
+	VSHA256_Init(&ctx);
+	VSHA256_Update(&ctx, p, TEST_LEN_URL);
+	VSHA256_Update(&ctx, &nullbyte, 1);
+	VSHA256_Update(&ctx, p + TEST_LEN_URL, TEST_LEN_HOST);
+	VSHA256_Update(&ctx, &nullbyte, 1);
+	VSHA256_Final(hash, &ctx);
+}
+
+static void
+test_mb(SHA256_HASH_CTX_MGR *mgr, const char *p, size_t l,
+    uint8_t hash[SHA256_DIGEST_LENGTH])
+{
+	SHA256_HASH_CTX ctx;
+	int i;
+
+	hash_ctx_init(&ctx);
+	sha256_ctx_mgr_submit(mgr, &ctx, p, TEST_LEN_URL, HASH_FIRST);
+	while (sha256_ctx_mgr_flush(mgr)) ;
+	sha256_ctx_mgr_submit(mgr, &ctx, &nullbyte, 1, HASH_UPDATE);
+	while (sha256_ctx_mgr_flush(mgr)) ;
+	sha256_ctx_mgr_submit(mgr, &ctx, p + TEST_LEN_URL, TEST_LEN_HOST,
+	    HASH_UPDATE);
+	while (sha256_ctx_mgr_flush(mgr)) ;
+	sha256_ctx_mgr_submit(mgr, &ctx, &nullbyte, 1, HASH_LAST);
+	while (sha256_ctx_mgr_flush(mgr)) ;
+	for (i = 0; i < SHA256_DIGEST_NWORDS; i++) {
+		((uint32_t *)hash)[i] = to_be32(ctx.job.result_digest[i]);
+	}
+}
 
 int main(void)
 {
 	SHA256_HASH_CTX_MGR *mgr = NULL;
-	SHA256_HASH_CTX ctxpool[TEST_BUFS];
 	unsigned char *bufs[TEST_BUFS];
 	uint32_t i, j, t, fail = 0;
 	struct perf start, stop;
@@ -68,9 +117,6 @@
 			printf("calloc failed test aborted\n");
 			return 1;
 		}
-		// Init ctx contents
-		hash_ctx_init(&ctxpool[i]);
-		ctxpool[i].user_data = (void *)((uint64_t) i);
 	}
 
 	int ret = posix_memalign((void *)&mgr, 16, sizeof(SHA256_HASH_CTX_MGR));
@@ -80,11 +126,22 @@
 	}
 	sha256_ctx_mgr_init(mgr);
 
+	// Start VSHA tests
+	perf_start(&start);
+	for (t = 0; t < TEST_LOOPS; t++) {
+		for (i = 0; i < TEST_BUFS; i++)
+			test_vsha256(bufs[i], TEST_LEN, digest_vsha[i]);
+	}
+	perf_stop(&stop);
+
+	printf("vsha256" TEST_TYPE_STR ": ");
+	perf_print(stop, start, (long long)TEST_LEN * i * t);
+
 	// Start OpenSSL tests
 	perf_start(&start);
 	for (t = 0; t < TEST_LOOPS; t++) {
 		for (i = 0; i < TEST_BUFS; i++)
-			SHA256(bufs[i], TEST_LEN, digest_ssl[i]);
+			test_osha256(bufs[i], TEST_LEN, digest_ssl[i]);
 	}
 	perf_stop(&stop);
 
@@ -95,10 +152,7 @@
 	perf_start(&start);
 	for (t = 0; t < TEST_LOOPS; t++) {
 		for (i = 0; i < TEST_BUFS; i++)
-			sha256_ctx_mgr_submit(mgr,
-					      &ctxpool[i], bufs[i], TEST_LEN, HASH_ENTIRE);
-
-		while (sha256_ctx_mgr_flush(mgr)) ;
+			test_mb(mgr, bufs[i], TEST_LEN, digest_mb[i]);
 	}
 	perf_stop(&stop);
 
@@ -107,19 +161,12 @@
 
 	for (i = 0; i < TEST_BUFS; i++) {
 		for (j = 0; j < SHA256_DIGEST_NWORDS; j++) {
-			if (ctxpool[i].job.result_digest[j] !=
-			    to_be32(((uint32_t *) digest_ssl[i])[j])) {
+			if (memcmp(digest_mb[i], digest_ssl[i],
+				SHA256_DIGEST_LENGTH))
 				fail++;
-				printf("Test%d, digest%d fail %08X <=> %08X\n",
-				       i, j, ctxpool[i].job.result_digest[j],
-				       to_be32(((uint32_t *) digest_ssl[i])[j]));
-			}
 		}
 	}
 
-	printf("Multi-buffer sha256 test complete %d buffers of %d B with "
-	       "%d iterations\n", TEST_BUFS, TEST_LEN, TEST_LOOPS);
-
 	if (fail)
 		printf("Test failed function check %d\n", fail);
 	else

@dridi
Member

dridi commented Mar 22, 2022

FYI if someone opens a pull request, this comment from @simonvik on IRC:

if someone ends up writing a patch to replace the VSHA256 with libcrypto then feel free to poke me, i would be happy to test it in production :)

@simonvik
Contributor

I patched varnish to use libcrypto instead of VSHA256 with the following patch: https://allg.one/2dZm
I don't see any performance gain or loss, and CPU and memory usage stayed the same on the server.

The server I tested on handles about 2k req/s; I've not done any more performance tests than this.

It's of course possible that I failed with the patch in some way!

@bsdphk
Contributor

bsdphk commented Mar 28, 2022

I'm not thrilled about adding a dependency, but there are compelling arguments for it.

Let's see an actual patch, including live testing.

@nigoroll
Member Author

IIRC @mbgrydeland wanted to suggest using libcrypto, but we do not need the ticket for that.
