# bigcode-inference-benchmark

## A100 80GB

### BLOOM

- `hidden_size` = 2048
- `n_head` = 16
- `n_layer` = 24
- `total_params` = 1311535104

#### Throughput (tokens/sec | msec/token)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) | DS-inference (fp16) |
|---:|---:|---:|---:|---:|
| 1 | 77.94 \| 12.83 | 72.50 \| 13.79 | 20.94 \| 47.75 | 104.00 \| 9.62 |
| 2 | 155.77 \| 6.42 | 143.44 \| 6.97 | 41.44 \| 24.13 | 206.33 \| 4.85 |
| 4 | 319.15 \| 3.13 | 293.06 \| 3.41 | 83.02 \| 12.04 | 418.28 \| 2.39 |
| 8 | 596.68 \| 1.68 | 581.10 \| 1.72 | 167.03 \| 5.99 | 828.67 \| 1.21 |
| 16 | 1146.25 \| 0.87 | 1147.91 \| 0.87 | 330.12 \| 3.03 | 1652.51 \| 0.61 |
| 32 | 2177.47 \| 0.46 | 2356.71 \| 0.42 | 673.33 \| 1.49 | 3280.17 \| 0.30 |
| 64 | 2776.93 \| 0.36 | 4784.46 \| 0.21 | 1329.42 \| 0.75 | 6717.77 \| 0.15 |
| 128 | 3007.26 \| 0.33 | 8056.59 \| 0.12 | 2491.86 \| 0.40 | 10410.82 \| 0.10 |
| 256 | 3758.11 \| 0.27 | 10339.00 \| 0.10 | 4325.98 \| 0.23 | 12707.62 \| 0.08 |
| 384 | 3658.51 \| 0.27 | 11091.67 \| 0.09 | 5628.15 \| 0.18 | 13483.54 \| 0.07 |
| 512 | 3775.92 \| 0.26 | 11332.58 \| 0.09 | 6675.52 \| 0.15 | 13930.89 \| 0.07 |
| 640 | 3938.85 \| 0.25 | 11534.74 \| 0.09 | 7472.39 \| 0.13 | 14399.86 \| 0.07 |
| 768 | 3886.59 \| 0.26 | 11354.37 \| 0.09 | 8220.54 \| 0.12 | 14656.84 \| 0.07 |
| 896 | 3728.33 \| 0.27 | 11286.69 \| 0.09 | 8686.16 \| 0.12 | 14540.19 \| 0.07 |
| 1024 | oom | 11692.32 \| 0.09 | 9012.79 \| 0.11 | 14390.77 \| 0.07 |
| 1152 | oom | 11894.50 \| 0.08 | 9147.50 \| 0.11 | oom |
| 1280 | oom | 11731.85 \| 0.09 | 9507.04 \| 0.11 | oom |
| 1408 | oom | 11802.63 \| 0.08 | 9711.69 \| 0.10 | oom |
| 1536 | oom | 11857.12 \| 0.08 | 9873.34 \| 0.10 | oom |
| 1664 | oom | 11932.68 \| 0.08 | 9756.13 \| 0.10 | oom |
| 1792 | oom | 11653.63 \| 0.09 | 9814.68 \| 0.10 | oom |
| 1920 | oom | oom | oom | oom |
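The two numbers in each throughput cell are redundant by construction: the second is just the reciprocal of the first, with tokens counted across the whole batch. A quick sketch of the conversion (the helper name is ours, not from the repo):

```python
# Convert aggregate throughput (tokens/sec, summed over the batch) into the
# per-token latency (msec/token) shown as the second number in each cell.
def msec_per_token(tokens_per_sec: float) -> float:
    return 1000.0 / tokens_per_sec

# Spot-check against the HF (fp32) column of the table above.
assert round(msec_per_token(77.94), 2) == 12.83    # batch_size = 1
assert round(msec_per_token(1146.25), 2) == 0.87   # batch_size = 16
```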

#### Latency (sec)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) | DS-inference (fp16) |
|---:|---:|---:|---:|---:|
| 1 | 1.28 | 1.38 | 4.77 | 0.96 |
| 2 | 1.28 | 1.39 | 4.83 | 0.97 |
| 4 | 1.25 | 1.36 | 4.82 | 0.96 |
| 8 | 1.34 | 1.38 | 4.79 | 0.97 |
| 16 | 1.40 | 1.39 | 4.85 | 0.97 |
| 32 | 1.47 | 1.36 | 4.75 | 0.98 |
| 64 | 2.30 | 1.34 | 4.81 | 0.95 |
| 128 | 4.26 | 1.59 | 5.14 | 1.23 |
| 256 | 6.81 | 2.48 | 5.92 | 2.01 |
| 384 | 10.50 | 3.46 | 6.82 | 2.85 |
| 512 | 13.56 | 4.52 | 7.67 | 3.68 |
| 640 | 16.25 | 5.55 | 8.56 | 4.44 |
| 768 | 19.76 | 6.76 | 9.34 | 5.24 |
| 896 | 24.03 | 7.94 | 10.32 | 6.16 |
| 1024 | oom | 8.76 | 11.36 | 7.12 |
| 1152 | oom | 9.69 | 12.59 | oom |
| 1280 | oom | 10.91 | 13.46 | oom |
| 1408 | oom | 11.93 | 14.50 | oom |
| 1536 | oom | 12.95 | 15.56 | oom |
| 1664 | oom | 13.94 | 17.06 | oom |
| 1792 | oom | 15.38 | 18.26 | oom |
| 1920 | oom | oom | oom | oom |
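Latency and throughput are two views of the same runs. The numbers are consistent with each run generating roughly 100 tokens per sequence — that figure is inferred from the tables, not stated anywhere in this README:

```python
# throughput ≈ batch_size * new_tokens / latency
# Solving for new_tokens using one HF (bf16) row from the tables above
# suggests ~100 generated tokens per sequence (an inference, not a
# documented setting of this benchmark).
batch_size = 256
latency = 2.48          # sec, HF (bf16) latency row
throughput = 10339.00   # tokens/sec, HF (bf16) throughput row

new_tokens = throughput * latency / batch_size
assert abs(new_tokens - 100) < 2
```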

### GPT2 Multi-Head Attention

- `hidden_size` = 2048
- `n_head` = 16
- `n_layer` = 24
- `total_params` = 1315725312

#### Throughput (tokens/sec | msec/token)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) | DS-inference (fp16) |
|---:|---:|---:|---:|---:|
| 1 | 63.55 \| 15.73 | 61.24 \| 16.33 | 47.77 \| 20.93 | 196.14 \| 5.10 |
| 2 | 124.17 \| 8.05 | 121.47 \| 8.23 | 95.23 \| 10.50 | 399.42 \| 2.50 |
| 4 | 248.62 \| 4.02 | 243.92 \| 4.10 | 186.14 \| 5.37 | 809.35 \| 1.24 |
| 8 | 481.43 \| 2.08 | 496.29 \| 2.01 | 374.49 \| 2.67 | 1651.31 \| 0.61 |
| 16 | 907.02 \| 1.10 | 973.43 \| 1.03 | 742.21 \| 1.35 | 3234.25 \| 0.31 |
| 32 | 1706.28 \| 0.59 | 1900.97 \| 0.53 | 1454.42 \| 0.69 | 6360.31 \| 0.16 |
| 64 | 2433.37 \| 0.41 | 3489.45 \| 0.29 | 2707.92 \| 0.37 | 12591.66 \| 0.08 |
| 128 | 2930.07 \| 0.34 | 5709.92 \| 0.18 | 4732.49 \| 0.21 | 19875.11 \| 0.05 |
| 256 | 3584.40 \| 0.28 | 8668.65 \| 0.12 | 7462.20 \| 0.13 | 24630.32 \| 0.04 |
| 384 | 3888.22 \| 0.26 | 10376.45 \| 0.10 | 8898.32 \| 0.11 | 27435.64 \| 0.04 |
| 512 | 3778.97 \| 0.26 | 10988.53 \| 0.09 | 10325.84 \| 0.10 | 29318.43 \| 0.03 |
| 640 | 4124.22 \| 0.24 | 11454.54 \| 0.09 | 10937.53 \| 0.09 | oom |
| 768 | 3986.02 \| 0.25 | 11427.95 \| 0.09 | 11552.58 \| 0.09 | oom |
| 896 | 3990.40 \| 0.25 | 11360.73 \| 0.09 | 11842.71 \| 0.08 | oom |
| 1024 | oom | 11837.35 \| 0.09 | 12085.76 \| 0.08 | oom |
| 1152 | oom | 11926.65 \| 0.08 | 12101.75 \| 0.08 | oom |
| 1280 | oom | 12149.19 \| 0.08 | 12282.53 \| 0.08 | oom |
| 1408 | oom | 12220.05 \| 0.08 | 12294.24 \| 0.08 | oom |
| 1536 | oom | 12255.80 \| 0.08 | 12331.86 \| 0.08 | oom |
| 1664 | oom | 12369.72 \| 0.08 | 12456.47 \| 0.08 | oom |
| 1792 | oom | 12234.69 \| 0.08 | 12063.65 \| 0.08 | oom |
| 1920 | oom | oom | oom | oom |

#### Latency (sec)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) | DS-inference (fp16) |
|---:|---:|---:|---:|---:|
| 1 | 1.57 | 1.63 | 2.09 | 0.51 |
| 2 | 1.61 | 1.65 | 2.10 | 0.50 |
| 4 | 1.61 | 1.64 | 2.15 | 0.49 |
| 8 | 1.66 | 1.61 | 2.14 | 0.48 |
| 16 | 1.76 | 1.64 | 2.16 | 0.49 |
| 32 | 1.88 | 1.68 | 2.10 | 0.50 |
| 64 | 2.63 | 1.83 | 2.36 | 0.51 |
| 128 | 4.37 | 2.24 | 2.70 | 0.64 |
| 256 | 7.14 | 2.95 | 3.43 | 1.04 |
| 384 | 9.88 | 3.70 | 4.32 | 1.40 |
| 512 | 13.55 | 4.66 | 4.96 | 1.75 |
| 640 | 15.52 | 5.59 | 5.85 | oom |
| 768 | 19.27 | 6.72 | 6.65 | oom |
| 896 | 22.45 | 7.89 | 7.57 | oom |
| 1024 | oom | 8.65 | 8.47 | oom |
| 1152 | oom | 9.66 | 9.52 | oom |
| 1280 | oom | 10.54 | 10.42 | oom |
| 1408 | oom | 11.52 | 11.45 | oom |
| 1536 | oom | 12.53 | 12.46 | oom |
| 1664 | oom | 13.45 | 13.36 | oom |
| 1792 | oom | 14.65 | 14.85 | oom |
| 1920 | oom | oom | oom | oom |

### GPT2 Multi-Query Attention

- `hidden_size` = 2048
- `n_head` = 16
- `n_layer` = 24
- `total_params` = 1126889472
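The multi-query model is smaller than its multi-head counterpart because all 16 query heads share a single key/value head, shrinking the K and V projections in every layer from `hidden_size × hidden_size` to `hidden_size × head_dim`. A sketch of the accounting (our decomposition, not from the repo; it reproduces the gap between the two `total_params` values exactly):

```python
# Parameter savings of multi-query attention (MQA) vs multi-head attention (MHA).
hidden_size = 2048
n_head = 16
n_layer = 24
head_dim = hidden_size // n_head  # 128

# Per layer, the K and V projections each drop from (hidden_size x hidden_size)
# weights + hidden_size biases to (hidden_size x head_dim) weights + head_dim
# biases, i.e. each loses (hidden_size + 1) * (hidden_size - head_dim) params.
saved_per_layer = 2 * (hidden_size + 1) * (hidden_size - head_dim)
total_saved = n_layer * saved_per_layer

mha_params = 1315725312  # GPT2 Multi-Head Attention, from above
mqa_params = 1126889472  # GPT2 Multi-Query Attention, from above
assert mha_params - mqa_params == total_saved  # 188,835,840
```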

#### Throughput (tokens/sec | msec/token)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) |
|---:|---:|---:|---:|
| 1 | 72.61 \| 13.77 | 68.89 \| 14.52 | 54.68 \| 18.29 |
| 2 | 139.03 \| 7.19 | 133.32 \| 7.50 | 106.70 \| 9.37 |
| 4 | 275.54 \| 3.63 | 273.12 \| 3.66 | 213.83 \| 4.68 |
| 8 | 538.85 \| 1.86 | 556.67 \| 1.80 | 432.10 \| 2.31 |
| 16 | 1015.47 \| 0.98 | 1096.44 \| 0.91 | 846.28 \| 1.18 |
| 32 | 1863.15 \| 0.54 | 2194.91 \| 0.46 | 1663.86 \| 0.60 |
| 64 | 3009.88 \| 0.33 | 4167.02 \| 0.24 | 3192.54 \| 0.31 |
| 128 | 3399.45 \| 0.29 | 6856.43 \| 0.15 | 5928.43 \| 0.17 |
| 256 | 4208.59 \| 0.24 | 11002.50 \| 0.09 | 9938.01 \| 0.10 |
| 512 | 4559.72 \| 0.22 | 13727.93 \| 0.07 | 13850.24 \| 0.07 |
| 1024 | 4969.87 \| 0.20 | 15122.67 \| 0.07 | 15604.99 \| 0.06 |
| 2048 | 5090.85 \| 0.20 | 16014.17 \| 0.06 | 16298.18 \| 0.06 |
| 4096 | 5212.22 \| 0.19 | 16570.20 \| 0.06 | 16884.37 \| 0.06 |
| 8192 | 5268.96 \| 0.19 | 16781.00 \| 0.06 | 17088.02 \| 0.06 |
| 16384 | oom | 16874.13 \| 0.06 | 17159.74 \| 0.06 |
| 32768 | oom | oom | oom |

#### Latency (sec)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) |
|---:|---:|---:|---:|
| 1 | 1.38 | 1.45 | 1.83 |
| 2 | 1.44 | 1.50 | 1.87 |
| 4 | 1.45 | 1.46 | 1.87 |
| 8 | 1.48 | 1.44 | 1.85 |
| 16 | 1.58 | 1.46 | 1.89 |
| 32 | 1.72 | 1.46 | 1.92 |
| 64 | 2.13 | 1.54 | 2.00 |
| 128 | 3.77 | 1.87 | 2.16 |
| 256 | 6.08 | 2.33 | 2.58 |
| 512 | 11.23 | 3.73 | 3.70 |
| 1024 | 20.60 | 6.77 | 6.56 |
| 2048 | 40.23 | 12.79 | 12.57 |
| 4096 | 78.58 | 24.72 | 24.26 |
| 8192 | 155.48 | 48.82 | 47.94 |
| 16384 | oom | 97.10 | 95.48 |
| 32768 | oom | oom | oom |
