# bigcode-inference-benchmark

## A100 80GB

### BLOOM

- `hidden_size` = 2048
- `n_head` = 16
- `n_layer` = 24
- `total_params` = 1311535104

#### Throughput (tokens/sec | msec/token)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) | DS-inference (fp16) |
|---:|---:|---:|---:|---:|
| 1 | 77.94 \| 12.83 | 72.50 \| 13.79 | 20.94 \| 47.75 | 104.00 \| 9.62 |
| 2 | 155.77 \| 6.42 | 143.44 \| 6.97 | 41.44 \| 24.13 | 206.33 \| 4.85 |
| 4 | 319.15 \| 3.13 | 293.06 \| 3.41 | 83.02 \| 12.04 | 418.28 \| 2.39 |
| 8 | 596.68 \| 1.68 | 581.10 \| 1.72 | 167.03 \| 5.99 | 828.67 \| 1.21 |
| 16 | 1146.25 \| 0.87 | 1147.91 \| 0.87 | 330.12 \| 3.03 | 1652.51 \| 0.61 |
| 32 | 2177.47 \| 0.46 | 2356.71 \| 0.42 | 673.33 \| 1.49 | 3280.17 \| 0.30 |
| 64 | 2776.93 \| 0.36 | 4784.46 \| 0.21 | 1329.42 \| 0.75 | 6717.77 \| 0.15 |
| 128 | 3007.26 \| 0.33 | 8056.59 \| 0.12 | 2491.86 \| 0.40 | 10410.82 \| 0.10 |
| 256 | 3758.11 \| 0.27 | 10339.00 \| 0.10 | 4325.98 \| 0.23 | 12707.62 \| 0.08 |
| 384 | 3658.51 \| 0.27 | 11091.67 \| 0.09 | 5628.15 \| 0.18 | 13483.54 \| 0.07 |
| 512 | 3775.92 \| 0.26 | 11332.58 \| 0.09 | 6675.52 \| 0.15 | 13930.89 \| 0.07 |
| 640 | 3938.85 \| 0.25 | 11534.74 \| 0.09 | 7472.39 \| 0.13 | 14399.86 \| 0.07 |
| 768 | 3886.59 \| 0.26 | 11354.37 \| 0.09 | 8220.54 \| 0.12 | 14656.84 \| 0.07 |
| 896 | 3728.33 \| 0.27 | 11286.69 \| 0.09 | 8686.16 \| 0.12 | 14540.19 \| 0.07 |
| 1024 | oom | 11692.32 \| 0.09 | 9012.79 \| 0.11 | 14390.77 \| 0.07 |
| 1152 | oom | 11894.50 \| 0.08 | 9147.50 \| 0.11 | oom |
| 1280 | oom | 11731.85 \| 0.09 | 9507.04 \| 0.11 | oom |
| 1408 | oom | 11802.63 \| 0.08 | 9711.69 \| 0.10 | oom |
| 1536 | oom | 11857.12 \| 0.08 | 9873.34 \| 0.10 | oom |
| 1664 | oom | 11932.68 \| 0.08 | 9756.13 \| 0.10 | oom |
| 1792 | oom | 11653.63 \| 0.09 | 9814.68 \| 0.10 | oom |
| 1920 | oom | oom | oom | oom |
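The two numbers in each throughput cell are redundant by construction: the second is just the reciprocal of the first, with tokens counted across the whole batch. A quick sketch of the conversion (the helper name is ours, not from the repo):

```python
# Convert aggregate throughput (tokens/sec, summed over the batch) into the
# per-token latency (msec/token) shown as the second number in each cell.
def msec_per_token(tokens_per_sec: float) -> float:
    return 1000.0 / tokens_per_sec

# Spot-check against the HF (fp32) column of the table above.
assert round(msec_per_token(77.94), 2) == 12.83    # batch_size = 1
assert round(msec_per_token(1146.25), 2) == 0.87   # batch_size = 16
```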

#### Latency (sec)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) | DS-inference (fp16) |
|---:|---:|---:|---:|---:|
| 1 | 1.28 | 1.38 | 4.77 | 0.96 |
| 2 | 1.28 | 1.39 | 4.83 | 0.97 |
| 4 | 1.25 | 1.36 | 4.82 | 0.96 |
| 8 | 1.34 | 1.38 | 4.79 | 0.97 |
| 16 | 1.40 | 1.39 | 4.85 | 0.97 |
| 32 | 1.47 | 1.36 | 4.75 | 0.98 |
| 64 | 2.30 | 1.34 | 4.81 | 0.95 |
| 128 | 4.26 | 1.59 | 5.14 | 1.23 |
| 256 | 6.81 | 2.48 | 5.92 | 2.01 |
| 384 | 10.50 | 3.46 | 6.82 | 2.85 |
| 512 | 13.56 | 4.52 | 7.67 | 3.68 |
| 640 | 16.25 | 5.55 | 8.56 | 4.44 |
| 768 | 19.76 | 6.76 | 9.34 | 5.24 |
| 896 | 24.03 | 7.94 | 10.32 | 6.16 |
| 1024 | oom | 8.76 | 11.36 | 7.12 |
| 1152 | oom | 9.69 | 12.59 | oom |
| 1280 | oom | 10.91 | 13.46 | oom |
| 1408 | oom | 11.93 | 14.50 | oom |
| 1536 | oom | 12.95 | 15.56 | oom |
| 1664 | oom | 13.94 | 17.06 | oom |
| 1792 | oom | 15.38 | 18.26 | oom |
| 1920 | oom | oom | oom | oom |
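Latency and throughput are two views of the same runs. The numbers are consistent with each run generating roughly 100 tokens per sequence — that figure is inferred from the tables, not stated anywhere in this README:

```python
# throughput ≈ batch_size * new_tokens / latency
# Solving for new_tokens using one HF (bf16) row from the tables above
# suggests ~100 generated tokens per sequence (an inference, not a
# documented setting of this benchmark).
batch_size = 256
latency = 2.48          # sec, HF (bf16) latency row
throughput = 10339.00   # tokens/sec, HF (bf16) throughput row

new_tokens = throughput * latency / batch_size
assert abs(new_tokens - 100) < 2
```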

### GPT2 Multi-Head Attention

- `hidden_size` = 2048
- `n_head` = 16
- `n_layer` = 24
- `total_params` = 1315725312

#### Throughput (tokens/sec | msec/token)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) | DS-inference (fp16) |
|---:|---:|---:|---:|---:|
| 1 | 63.55 \| 15.73 | 61.24 \| 16.33 | 47.77 \| 20.93 | 196.14 \| 5.10 |
| 2 | 124.17 \| 8.05 | 121.47 \| 8.23 | 95.23 \| 10.50 | 399.42 \| 2.50 |
| 4 | 248.62 \| 4.02 | 243.92 \| 4.10 | 186.14 \| 5.37 | 809.35 \| 1.24 |
| 8 | 481.43 \| 2.08 | 496.29 \| 2.01 | 374.49 \| 2.67 | 1651.31 \| 0.61 |
| 16 | 907.02 \| 1.10 | 973.43 \| 1.03 | 742.21 \| 1.35 | 3234.25 \| 0.31 |
| 32 | 1706.28 \| 0.59 | 1900.97 \| 0.53 | 1454.42 \| 0.69 | 6360.31 \| 0.16 |
| 64 | 2433.37 \| 0.41 | 3489.45 \| 0.29 | 2707.92 \| 0.37 | 12591.66 \| 0.08 |
| 128 | 2930.07 \| 0.34 | 5709.92 \| 0.18 | 4732.49 \| 0.21 | 19875.11 \| 0.05 |
| 256 | 3584.40 \| 0.28 | 8668.65 \| 0.12 | 7462.20 \| 0.13 | 24630.32 \| 0.04 |
| 384 | 3888.22 \| 0.26 | 10376.45 \| 0.10 | 8898.32 \| 0.11 | 27435.64 \| 0.04 |
| 512 | 3778.97 \| 0.26 | 10988.53 \| 0.09 | 10325.84 \| 0.10 | 29318.43 \| 0.03 |
| 640 | 4124.22 \| 0.24 | 11454.54 \| 0.09 | 10937.53 \| 0.09 | oom |
| 768 | 3986.02 \| 0.25 | 11427.95 \| 0.09 | 11552.58 \| 0.09 | oom |
| 896 | 3990.40 \| 0.25 | 11360.73 \| 0.09 | 11842.71 \| 0.08 | oom |
| 1024 | oom | 11837.35 \| 0.09 | 12085.76 \| 0.08 | oom |
| 1152 | oom | 11926.65 \| 0.08 | 12101.75 \| 0.08 | oom |
| 1280 | oom | 12149.19 \| 0.08 | 12282.53 \| 0.08 | oom |
| 1408 | oom | 12220.05 \| 0.08 | 12294.24 \| 0.08 | oom |
| 1536 | oom | 12255.80 \| 0.08 | 12331.86 \| 0.08 | oom |
| 1664 | oom | 12369.72 \| 0.08 | 12456.47 \| 0.08 | oom |
| 1792 | oom | 12234.69 \| 0.08 | 12063.65 \| 0.08 | oom |
| 1920 | oom | oom | oom | oom |

#### Latency (sec)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) | DS-inference (fp16) |
|---:|---:|---:|---:|---:|
| 1 | 1.57 | 1.63 | 2.09 | 0.51 |
| 2 | 1.61 | 1.65 | 2.10 | 0.50 |
| 4 | 1.61 | 1.64 | 2.15 | 0.49 |
| 8 | 1.66 | 1.61 | 2.14 | 0.48 |
| 16 | 1.76 | 1.64 | 2.16 | 0.49 |
| 32 | 1.88 | 1.68 | 2.10 | 0.50 |
| 64 | 2.63 | 1.83 | 2.36 | 0.51 |
| 128 | 4.37 | 2.24 | 2.70 | 0.64 |
| 256 | 7.14 | 2.95 | 3.43 | 1.04 |
| 384 | 9.88 | 3.70 | 4.32 | 1.40 |
| 512 | 13.55 | 4.66 | 4.96 | 1.75 |
| 640 | 15.52 | 5.59 | 5.85 | oom |
| 768 | 19.27 | 6.72 | 6.65 | oom |
| 896 | 22.45 | 7.89 | 7.57 | oom |
| 1024 | oom | 8.65 | 8.47 | oom |
| 1152 | oom | 9.66 | 9.52 | oom |
| 1280 | oom | 10.54 | 10.42 | oom |
| 1408 | oom | 11.52 | 11.45 | oom |
| 1536 | oom | 12.53 | 12.46 | oom |
| 1664 | oom | 13.45 | 13.36 | oom |
| 1792 | oom | 14.65 | 14.85 | oom |
| 1920 | oom | oom | oom | oom |

### GPT2 Multi-Query Attention

- `hidden_size` = 2048
- `n_head` = 16
- `n_layer` = 24
- `total_params` = 1126889472
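The multi-query model is smaller than its multi-head counterpart because all 16 query heads share a single key/value head, shrinking the K and V projections in every layer from `hidden_size × hidden_size` to `hidden_size × head_dim`. A sketch of the accounting (our decomposition, not from the repo; it reproduces the gap between the two `total_params` values exactly):

```python
# Parameter savings of multi-query attention (MQA) vs multi-head attention (MHA).
hidden_size = 2048
n_head = 16
n_layer = 24
head_dim = hidden_size // n_head  # 128

# Per layer, the K and V projections each drop from (hidden_size x hidden_size)
# weights + hidden_size biases to (hidden_size x head_dim) weights + head_dim
# biases, i.e. each loses (hidden_size + 1) * (hidden_size - head_dim) params.
saved_per_layer = 2 * (hidden_size + 1) * (hidden_size - head_dim)
total_saved = n_layer * saved_per_layer

mha_params = 1315725312  # GPT2 Multi-Head Attention, from above
mqa_params = 1126889472  # GPT2 Multi-Query Attention, from above
assert mha_params - mqa_params == total_saved  # 188,835,840
```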

#### Throughput (tokens/sec | msec/token)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) |
|---:|---:|---:|---:|
| 1 | 72.61 \| 13.77 | 68.89 \| 14.52 | 54.68 \| 18.29 |
| 2 | 139.03 \| 7.19 | 133.32 \| 7.50 | 106.70 \| 9.37 |
| 4 | 275.54 \| 3.63 | 273.12 \| 3.66 | 213.83 \| 4.68 |
| 8 | 538.85 \| 1.86 | 556.67 \| 1.80 | 432.10 \| 2.31 |
| 16 | 1015.47 \| 0.98 | 1096.44 \| 0.91 | 846.28 \| 1.18 |
| 32 | 1863.15 \| 0.54 | 2194.91 \| 0.46 | 1663.86 \| 0.60 |
| 64 | 3009.88 \| 0.33 | 4167.02 \| 0.24 | 3192.54 \| 0.31 |
| 128 | 3399.45 \| 0.29 | 6856.43 \| 0.15 | 5928.43 \| 0.17 |
| 256 | 4208.59 \| 0.24 | 11002.50 \| 0.09 | 9938.01 \| 0.10 |
| 512 | 4559.72 \| 0.22 | 13727.93 \| 0.07 | 13850.24 \| 0.07 |
| 1024 | 4969.87 \| 0.20 | 15122.67 \| 0.07 | 15604.99 \| 0.06 |
| 2048 | 5090.85 \| 0.20 | 16014.17 \| 0.06 | 16298.18 \| 0.06 |
| 4096 | 5212.22 \| 0.19 | 16570.20 \| 0.06 | 16884.37 \| 0.06 |
| 8192 | 5268.96 \| 0.19 | 16781.00 \| 0.06 | 17088.02 \| 0.06 |
| 16384 | oom | 16874.13 \| 0.06 | 17159.74 \| 0.06 |
| 32768 | oom | oom | oom |

#### Latency (sec)

| batch_size | HF (fp32) | HF (bf16) | HF (int8) |
|---:|---:|---:|---:|
| 1 | 1.38 | 1.45 | 1.83 |
| 2 | 1.44 | 1.50 | 1.87 |
| 4 | 1.45 | 1.46 | 1.87 |
| 8 | 1.48 | 1.44 | 1.85 |
| 16 | 1.58 | 1.46 | 1.89 |
| 32 | 1.72 | 1.46 | 1.92 |
| 64 | 2.13 | 1.54 | 2.00 |
| 128 | 3.77 | 1.87 | 2.16 |
| 256 | 6.08 | 2.33 | 2.58 |
| 512 | 11.23 | 3.73 | 3.70 |
| 1024 | 20.60 | 6.77 | 6.56 |
| 2048 | 40.23 | 12.79 | 12.57 |
| 4096 | 78.58 | 24.72 | 24.26 |
| 8192 | 155.48 | 48.82 | 47.94 |
| 16384 | oom | 97.10 | 95.48 |
| 32768 | oom | oom | oom |
