
Model support status #6

Open
hipudding opened this issue Jul 18, 2024 · 5 comments

hipudding (Owner)

No description provided.

xuedinge233 commented Jul 25, 2024

The results below are for the fp16 models.

AquilaChat2-7B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   17617.83 ms
llama_print_timings:      sample time =      42.23 ms /   171 runs   (    0.25 ms per token,  4048.87 tokens per second)
llama_print_timings: prompt eval time =     256.89 ms /    12 tokens (   21.41 ms per token,    46.71 tokens per second)
llama_print_timings:        eval time =   18107.33 ms /   170 runs   (  106.51 ms per token,     9.39 tokens per second)
llama_print_timings:       total time =   18571.04 ms /   182 tokens

Baichuan-7b

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    4524.57 ms
llama_print_timings:      sample time =      33.82 ms /   230 runs   (    0.15 ms per token,  6801.31 tokens per second)
llama_print_timings: prompt eval time =      99.14 ms /     3 tokens (   33.05 ms per token,    30.26 tokens per second)
llama_print_timings:        eval time =   21482.68 ms /   229 runs   (   93.81 ms per token,    10.66 tokens per second)
llama_print_timings:       total time =   21779.25 ms /   232 tokens

Baichuan2-7B-Chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    5655.99 ms
llama_print_timings:      sample time =      40.30 ms /   169 runs   (    0.24 ms per token,  4193.55 tokens per second)
llama_print_timings: prompt eval time =     109.55 ms /     3 tokens (   36.52 ms per token,    27.39 tokens per second)
llama_print_timings:        eval time =   18734.22 ms /   168 runs   (  111.51 ms per token,     8.97 tokens per second)
llama_print_timings:       total time =   19093.11 ms /   171 tokens

bitnet_b1_58-large

llama_new_context_with_model: graph nodes  = 1038
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    3811.16 ms
llama_print_timings:      sample time =      37.51 ms /   411 runs   (    0.09 ms per token, 10957.95 tokens per second)
llama_print_timings: prompt eval time =     120.95 ms /    14 tokens (    8.64 ms per token,   115.75 tokens per second)
llama_print_timings:        eval time =   14369.24 ms /   410 runs   (   35.05 ms per token,    28.53 tokens per second)
llama_print_timings:       total time =   14638.89 ms /   424 tokens

bloom-560m

llama_new_context_with_model: graph nodes  = 898
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    2148.48 ms
llama_print_timings:      sample time =     153.77 ms /   327 runs   (    0.47 ms per token,  2126.55 tokens per second)
llama_print_timings: prompt eval time =      95.86 ms /     3 tokens (   31.95 ms per token,    31.29 tokens per second)
llama_print_timings:        eval time =    6109.00 ms /   326 runs   (   18.74 ms per token,    53.36 tokens per second)
llama_print_timings:       total time =    6683.08 ms /   329 tokens

bloomz-alpaca-560m

llama_new_context_with_model: graph nodes  = 898
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    2102.40 ms
llama_print_timings:      sample time =     124.05 ms /   140 runs   (    0.89 ms per token,  1128.58 tokens per second)
llama_print_timings: prompt eval time =      92.97 ms /     3 tokens (   30.99 ms per token,    32.27 tokens per second)
llama_print_timings:        eval time =    3810.32 ms /   139 runs   (   27.41 ms per token,    36.48 tokens per second)
llama_print_timings:       total time =    4354.99 ms /   142 tokens

c4ai-command-r-35B-v01

ggml_backend_cann_buffer_type_alloc_buffer: allocating 131072.00 MiB on device 0: aclrtMalloc failed: EL0004: 2024-08-07-03:20:37.605.085 Failed to allocate memory.

chatglm3-6B

llama_new_context_with_model: graph nodes  = 1126
llama_new_context_with_model: graph splits = 2
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

chinese-alpaca-2-1.3b

llama_new_context_with_model: graph nodes  = 134
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    2646.61 ms
llama_print_timings:      sample time =       5.90 ms /    52 runs   (    0.11 ms per token,  8819.54 tokens per second)
llama_print_timings: prompt eval time =      13.57 ms /     4 tokens (    3.39 ms per token,   294.85 tokens per second)
llama_print_timings:        eval time =     587.68 ms /    51 runs   (   11.52 ms per token,    86.78 tokens per second)
llama_print_timings:       total time =     636.00 ms /    55 tokens

CodeShell-7B

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'codeshell'
llama_load_model_from_file: failed to load model

main: error: unable to load model

deepseek-ai_deepseek-coder-1.3B-base

llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 2
GGML_ASSERT: llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2738: freq_scale == 1

deepseek-ai_DeepSeek-V2-Lite

llama_load_model_from_file: failed to load model

deepseek-coder-6.7B-instruct

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2874: GGML_ASSERT(freq_scale == 1) failed

DeepSeek-V2-Lite-64x1.5B

ggml_backend_cann_buffer_type_alloc_buffer: allocating 43200.00 MiB on device 0: aclrtMalloc failed: EL0004: 2024-08-07-03:32:11.250.365 Failed to allocate memory.

falcon-7b-instruct

llama_new_context_with_model: graph nodes  = 1064
llama_new_context_with_model: graph splits = 5

llama_print_timings:        load time =    3579.89 ms
llama_print_timings:      sample time =      54.87 ms /   426 runs   (    0.13 ms per token,  7764.23 tokens per second)
llama_print_timings: prompt eval time =     103.95 ms /     3 tokens (   34.65 ms per token,    28.86 tokens per second)
llama_print_timings:        eval time =   42128.11 ms /   425 runs   (   99.12 ms per token,    10.09 tokens per second)
llama_print_timings:       total time =   42540.90 ms /   428 tokens

flan-t5-large

llama_new_context_with_model: graph nodes  = 1350
llama_new_context_with_model: graph splits = 50

llama_print_timings:        load time =    4267.80 ms
llama_print_timings:      sample time =       1.02 ms /    11 runs   (    0.09 ms per token, 10805.50 tokens per second)
llama_print_timings: prompt eval time =     236.70 ms /    14 tokens (   16.91 ms per token,    59.15 tokens per second)
llama_print_timings:        eval time =    1541.36 ms /    10 runs   (  154.14 ms per token,     6.49 tokens per second)
llama_print_timings:       total time =    1831.53 ms /    24 tokens

gemma-2-9b-it

llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 134

llama_print_timings:        load time =    6625.28 ms
llama_print_timings:      sample time =     103.01 ms /   215 runs   (    0.48 ms per token,  2087.12 tokens per second)
llama_print_timings: prompt eval time =     653.53 ms /    12 tokens (   54.46 ms per token,    18.36 tokens per second)
llama_print_timings:        eval time =  116924.79 ms /   214 runs   (  546.38 ms per token,     1.83 tokens per second)
llama_print_timings:       total time =  118032.51 ms /   226 tokens

glm-4-9B

llama_new_context_with_model: graph nodes  = 1606
llama_new_context_with_model: graph splits = 76
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

gpt2

llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    1527.76 ms
llama_print_timings:      sample time =      41.72 ms /   464 runs   (    0.09 ms per token, 11121.76 tokens per second)
llama_print_timings: prompt eval time =       8.11 ms /    10 tokens (    0.81 ms per token,  1233.20 tokens per second)
llama_print_timings:        eval time =    3041.72 ms /   463 runs   (    6.57 ms per token,   152.22 tokens per second)
llama_print_timings:       total time =    3210.82 ms /   473 tokens

Gpt2-163M

llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    1547.14 ms
llama_print_timings:      sample time =     128.09 ms /   902 runs   (    0.14 ms per token,  7041.98 tokens per second)
llama_print_timings: prompt eval time =       7.98 ms /     3 tokens (    2.66 ms per token,   375.80 tokens per second)
llama_print_timings:        eval time =    6048.59 ms /   901 runs   (    6.71 ms per token,   148.96 tokens per second)
llama_print_timings:       total time =    6536.07 ms /   904 tokens

granite-3B-code-instruct

llama_new_context_with_model: graph nodes  = 1254
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    9268.55 ms
llama_print_timings:      sample time =      40.72 ms /   419 runs   (    0.10 ms per token, 10290.54 tokens per second)
llama_print_timings: prompt eval time =     191.60 ms /    13 tokens (   14.74 ms per token,    67.85 tokens per second)
llama_print_timings:        eval time =   23063.06 ms /   418 runs   (   55.17 ms per token,    18.12 tokens per second)
llama_print_timings:       total time =   23480.14 ms /   431 tokens

GritLM-7B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   14846.17 ms
llama_print_timings:      sample time =      14.95 ms /   203 runs   (    0.07 ms per token, 13577.69 tokens per second)
llama_print_timings: prompt eval time =     408.45 ms /    14 tokens (   29.17 ms per token,    34.28 tokens per second)
llama_print_timings:        eval time =   23418.09 ms /   202 runs   (  115.93 ms per token,     8.63 tokens per second)
llama_print_timings:       total time =   23967.31 ms /   216 tokens

internlm2_5-7b-chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    4672.50 ms
llama_print_timings:      sample time =     107.71 ms /   619 runs   (    0.17 ms per token,  5747.02 tokens per second)
llama_print_timings: prompt eval time =     107.34 ms /     4 tokens (   26.83 ms per token,    37.27 tokens per second)
llama_print_timings:        eval time =   74359.65 ms /   618 runs   (  120.32 ms per token,     8.31 tokens per second)
llama_print_timings:       total time =   75065.44 ms /   622 tokens

koala-7B-HF

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    5712.55 ms
llama_print_timings:      sample time =      22.68 ms /   285 runs   (    0.08 ms per token, 12566.14 tokens per second)
llama_print_timings: prompt eval time =     224.33 ms /    14 tokens (   16.02 ms per token,    62.41 tokens per second)
llama_print_timings:        eval time =   25667.68 ms /   284 runs   (   90.38 ms per token,    11.06 tokens per second)
llama_print_timings:       total time =   26090.58 ms /   298 tokens

Llama-2-7b-chat-hf

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    5047.41 ms
llama_print_timings:      sample time =      12.59 ms /   203 runs   (    0.06 ms per token, 16120.07 tokens per second)
llama_print_timings: prompt eval time =      88.86 ms /     4 tokens (   22.21 ms per token,    45.02 tokens per second)
llama_print_timings:        eval time =   17060.30 ms /   202 runs   (   84.46 ms per token,    11.84 tokens per second)
llama_print_timings:       total time =   17246.09 ms /   206 tokens

Llama-3-Smaug-8B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    6697.66 ms
llama_print_timings:      sample time =      47.47 ms /   211 runs   (    0.22 ms per token,  4445.10 tokens per second)
llama_print_timings: prompt eval time =     239.75 ms /    12 tokens (   19.98 ms per token,    50.05 tokens per second)
llama_print_timings:        eval time =   23086.56 ms /   210 runs   (  109.94 ms per token,     9.10 tokens per second)
llama_print_timings:       total time =   23610.61 ms /   222 tokens

Llama2-Chinese-7b-Chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    4783.39 ms
llama_print_timings:      sample time =       1.33 ms /    20 runs   (    0.07 ms per token, 15015.02 tokens per second)
llama_print_timings: prompt eval time =      86.96 ms /     4 tokens (   21.74 ms per token,    46.00 tokens per second)
llama_print_timings:        eval time =    1479.93 ms /    19 runs   (   77.89 ms per token,    12.84 tokens per second)
llama_print_timings:       total time =    1575.68 ms /    23 tokens

Llama3-8B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    5345.92 ms
llama_print_timings:      sample time =     164.94 ms /   757 runs   (    0.22 ms per token,  4589.49 tokens per second)
llama_print_timings: prompt eval time =     107.62 ms /     3 tokens (   35.87 ms per token,    27.87 tokens per second)
llama_print_timings:        eval time =   74757.08 ms /   756 runs   (   98.89 ms per token,    10.11 tokens per second)
llama_print_timings:       total time =   75755.83 ms /   759 tokens

Llama3-8b-chinese

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    4416.09 ms
llama_print_timings:      sample time =     191.02 ms /   815 runs   (    0.23 ms per token,  4266.64 tokens per second)
llama_print_timings: prompt eval time =     113.99 ms /    12 tokens (    9.50 ms per token,   105.28 tokens per second)
llama_print_timings:        eval time =   94514.93 ms /   814 runs   (  116.11 ms per token,     8.61 tokens per second)
llama_print_timings:       total time =   95669.04 ms /   826 tokens

mamba-130m-hf

llama_new_context_with_model: graph nodes  = 896
llama_new_context_with_model: graph splits = 98

llama_print_timings:        load time =    2164.20 ms
llama_print_timings:      sample time =       9.54 ms /    94 runs   (    0.10 ms per token,  9858.42 tokens per second)
llama_print_timings: prompt eval time =     171.81 ms /     3 tokens (   57.27 ms per token,    17.46 tokens per second)
llama_print_timings:        eval time =   15082.74 ms /    93 runs   (  162.18 ms per token,     6.17 tokens per second)
llama_print_timings:       total time =   15394.83 ms /    96 tokens

Mistral-7B-Instruct-v0.2

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    5425.28 ms
llama_print_timings:      sample time =      28.65 ms /   452 runs   (    0.06 ms per token, 15778.82 tokens per second)
llama_print_timings: prompt eval time =      99.93 ms /     4 tokens (   24.98 ms per token,    40.03 tokens per second)
llama_print_timings:        eval time =   42817.31 ms /   451 runs   (   94.94 ms per token,    10.53 tokens per second)
llama_print_timings:       total time =   43174.05 ms /   455 tokens

Mixtral-8x7B-Instruct-v0.1

llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model

MPT-7B

llama_new_context_with_model: graph nodes  = 998
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    6165.89 ms
llama_print_timings:      sample time =      18.44 ms /   115 runs   (    0.16 ms per token,  6236.10 tokens per second)
llama_print_timings: prompt eval time =     598.30 ms /     3 tokens (  199.43 ms per token,     5.01 tokens per second)
llama_print_timings:        eval time =   39220.02 ms /   114 runs   (  344.04 ms per token,     2.91 tokens per second)
llama_print_timings:       total time =   39927.86 ms /   117 tokens

OLMo-1B-hf

llama_new_context_with_model: graph nodes  = 485
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    2089.80 ms
llama_print_timings:      sample time =      80.37 ms /   888 runs   (    0.09 ms per token, 11048.49 tokens per second)
llama_print_timings: prompt eval time =      23.13 ms /     3 tokens (    7.71 ms per token,   129.71 tokens per second)
llama_print_timings:        eval time =   17637.38 ms /   887 runs   (   19.88 ms per token,    50.29 tokens per second)
llama_print_timings:       total time =   18359.15 ms /   890 tokens

OpenELM-3B-Instruct

llama_new_context_with_model: graph nodes  = 1446
llama_new_context_with_model: graph splits = 40

llama_print_timings:        load time =    7971.17 ms
llama_print_timings:      sample time =      13.79 ms /   204 runs   (    0.07 ms per token, 14797.62 tokens per second)
llama_print_timings: prompt eval time =     248.84 ms /    14 tokens (   17.77 ms per token,    56.26 tokens per second)
llama_print_timings:        eval time =   38242.92 ms /   203 runs   (  188.39 ms per token,     5.31 tokens per second)
llama_print_timings:       total time =   38701.01 ms /   217 tokens

Orion-14b-base

llama_new_context_with_model: graph nodes  = 1367
llama_new_context_with_model: graph splits = 109

llama_print_timings:        load time =    6934.82 ms
llama_print_timings:      sample time =       2.72 ms /    16 runs   (    0.17 ms per token,  5878.03 tokens per second)
llama_print_timings: prompt eval time =     479.38 ms /     3 tokens (  159.79 ms per token,     6.26 tokens per second)
llama_print_timings:        eval time =    8197.59 ms /    15 runs   (  546.51 ms per token,     1.83 tokens per second)
llama_print_timings:       total time =    8762.51 ms /    18 tokens

phi: 1.6 GB

llama_new_context_with_model: graph nodes  = 1225
llama_new_context_with_model: graph splits = 262
GGML_ASSERT: /home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2869: n_dims == src0->ne[0]

phi3

llama_print_timings:        load time =   10651.15 ms
llama_print_timings:      sample time =      16.56 ms /   267 runs   (    0.06 ms per token, 16122.21 tokens per second)
llama_print_timings: prompt eval time =     271.12 ms /    13 tokens (   20.86 ms per token,    47.95 tokens per second)
llama_print_timings:        eval time =   17371.79 ms /   266 runs   (   65.31 ms per token,    15.31 tokens per second)
llama_print_timings:       total time =   17783.67 ms /   279 tokens

Phi-3-mini-4k-instruct

llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    3435.29 ms
llama_print_timings:      sample time =      65.95 ms /  1036 runs   (    0.06 ms per token, 15709.11 tokens per second)
llama_print_timings: prompt eval time =      57.82 ms /     3 tokens (   19.27 ms per token,    51.89 tokens per second)
llama_print_timings:        eval time =   57456.74 ms /  1035 runs   (   55.51 ms per token,    18.01 tokens per second)
llama_print_timings:       total time =   57883.99 ms /  1038 tokens

plamo-13b

llama_new_context_with_model: graph nodes  = 1207
llama_new_context_with_model: graph splits = 84

llama_print_timings:        load time =    6040.96 ms
llama_print_timings:      sample time =      10.74 ms /    87 runs   (    0.12 ms per token,  8100.56 tokens per second)
llama_print_timings: prompt eval time =     465.06 ms /     3 tokens (  155.02 ms per token,     6.45 tokens per second)
llama_print_timings:        eval time =   40436.08 ms /    86 runs   (  470.19 ms per token,     2.13 tokens per second)
llama_print_timings:       total time =   41418.68 ms /    89 tokens

pythia-70M

llama_new_context_with_model: graph nodes  = 247
llama_new_context_with_model: graph splits = 2
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

Qwen-7B

(No conversation output.)

llama_new_context_with_model: graph nodes  = 1190
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    6440.68 ms
llama_print_timings:      sample time =      22.51 ms /    82 runs   (    0.27 ms per token,  3642.18 tokens per second)
llama_print_timings: prompt eval time =     101.38 ms /     3 tokens (   33.79 ms per token,    29.59 tokens per second)
llama_print_timings:        eval time =    7525.73 ms /    81 runs   (   92.91 ms per token,    10.76 tokens per second)
llama_print_timings:       total time =    7830.80 ms /    84 tokens

Qwen2-1.5B-Instruct

llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    3015.90 ms
llama_print_timings:      sample time =     118.16 ms /   413 runs   (    0.29 ms per token,  3495.32 tokens per second)
llama_print_timings: prompt eval time =      33.73 ms /     3 tokens (   11.24 ms per token,    88.95 tokens per second)
llama_print_timings:        eval time =   13045.02 ms /   412 runs   (   31.66 ms per token,    31.58 tokens per second)
llama_print_timings:       total time =   14089.94 ms /   415 tokens

Refact-1_6B-fim

llama_new_context_with_model: graph nodes  = 966
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    2437.26 ms
llama_print_timings:      sample time =     136.40 ms /  1474 runs   (    0.09 ms per token, 10806.77 tokens per second)
llama_print_timings: prompt eval time =      30.96 ms /     3 tokens (   10.32 ms per token,    96.91 tokens per second)
llama_print_timings:        eval time =   44879.15 ms /  1473 runs   (   30.47 ms per token,    32.82 tokens per second)
llama_print_timings:       total time =   45544.14 ms /  1476 tokens

SmolLM-135M

llama_new_context_with_model: graph nodes  = 966
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    2484.58 ms
llama_print_timings:      sample time =      23.64 ms /   260 runs   (    0.09 ms per token, 10997.84 tokens per second)
llama_print_timings: prompt eval time =     111.69 ms /    13 tokens (    8.59 ms per token,   116.39 tokens per second)
llama_print_timings:        eval time =   10351.73 ms /   259 runs   (   39.97 ms per token,    25.02 tokens per second)
llama_print_timings:       total time =   10578.13 ms /   272 tokens

stablelm-zephyr

llama_new_context_with_model: graph nodes  = 1095
llama_new_context_with_model: graph splits = 453
GGML_ASSERT: /home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2869: n_dims == src0->ne[0]

stablelm-2-zephyr-1_6b

llama_new_context_with_model: graph nodes  = 895
llama_new_context_with_model: graph splits = 2
GGML_ASSERT: /home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2869: n_dims == src0->ne[0]

starcoderbase-1b

llama_new_context_with_model: graph nodes  = 897
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    2280.41 ms
llama_print_timings:      sample time =      16.73 ms /   188 runs   (    0.09 ms per token, 11239.99 tokens per second)
llama_print_timings: prompt eval time =      17.98 ms /     3 tokens (    5.99 ms per token,   166.85 tokens per second)
llama_print_timings:        eval time =    2705.69 ms /   187 runs   (   14.47 ms per token,    69.11 tokens per second)
llama_print_timings:       total time =    2844.02 ms /   190 tokens

starcoder2-3b

llama_new_context_with_model: graph nodes  = 1147
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    3105.13 ms
llama_print_timings:      sample time =      63.02 ms /   703 runs   (    0.09 ms per token, 11154.48 tokens per second)
llama_print_timings: prompt eval time =      49.07 ms /     3 tokens (   16.36 ms per token,    61.13 tokens per second)
llama_print_timings:        eval time =   29535.23 ms /   702 runs   (   42.07 ms per token,    23.77 tokens per second)
llama_print_timings:       total time =   30016.59 ms /   705 tokens

vigogne-7b-chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    5525.54 ms
llama_print_timings:      sample time =      16.34 ms /   241 runs   (    0.07 ms per token, 14752.69 tokens per second)
llama_print_timings: prompt eval time =      86.63 ms /     4 tokens (   21.66 ms per token,    46.17 tokens per second)
llama_print_timings:        eval time =   19694.88 ms /   240 runs   (   82.06 ms per token,    12.19 tokens per second)
llama_print_timings:       total time =   19921.14 ms /   244 tokens

xverse-7b-chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   11282.73 ms
llama_print_timings:      sample time =      21.92 ms /    36 runs   (    0.61 ms per token,  1642.41 tokens per second)
llama_print_timings: prompt eval time =     200.30 ms /     4 tokens (   50.07 ms per token,    19.97 tokens per second)
llama_print_timings:        eval time =    7218.53 ms /    35 runs   (  206.24 ms per token,     4.85 tokens per second)
llama_print_timings:       total time =    7531.61 ms /    39 tokens

Yi-6b

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    4214.99 ms
llama_print_timings:      sample time =      26.62 ms /   215 runs   (    0.12 ms per token,  8075.72 tokens per second)
llama_print_timings: prompt eval time =      91.59 ms /     3 tokens (   30.53 ms per token,    32.75 tokens per second)
llama_print_timings:        eval time =   18935.03 ms /   214 runs   (   88.48 ms per token,    11.30 tokens per second)
llama_print_timings:       total time =   19259.19 ms /   217 tokens

xuedinge233 commented Aug 2, 2024

The results below are for the q8_0 models.

AquilaChat2-7B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   14172.64 ms
llama_print_timings:      sample time =      15.77 ms /    86 runs   (    0.18 ms per token,  5453.05 tokens per second)
llama_print_timings: prompt eval time =     226.01 ms /    12 tokens (   18.83 ms per token,    53.09 tokens per second)
llama_print_timings:        eval time =    8768.69 ms /    85 runs   (  103.16 ms per token,     9.69 tokens per second)
llama_print_timings:       total time =    9072.91 ms /    97 tokens

Baichuan-7b

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   15636.78 ms
llama_print_timings:      sample time =     260.13 ms /  2037 runs   (    0.13 ms per token,  7830.79 tokens per second)
llama_print_timings: prompt eval time =     134.31 ms /     3 tokens (   44.77 ms per token,    22.34 tokens per second)
llama_print_timings:        eval time =  139540.69 ms /  2036 runs   (   68.54 ms per token,    14.59 tokens per second)
llama_print_timings:       total time =  141840.10 ms /  2039 tokens

Baichuan2-7B-Chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    9200.75 ms
llama_print_timings:      sample time =      24.53 ms /    87 runs   (    0.28 ms per token,  3546.68 tokens per second)
llama_print_timings: prompt eval time =      69.46 ms /     3 tokens (   23.15 ms per token,    43.19 tokens per second)
llama_print_timings:        eval time =    5561.69 ms /    86 runs   (   64.67 ms per token,    15.46 tokens per second)
llama_print_timings:       total time =    5756.60 ms /    89 tokens

bitnet_b1_58-large

llama_new_context_with_model: graph nodes  = 1038
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    4676.28 ms
llama_print_timings:      sample time =      20.44 ms /   274 runs   (    0.07 ms per token, 13406.40 tokens per second)
llama_print_timings: prompt eval time =     484.67 ms /    14 tokens (   34.62 ms per token,    28.89 tokens per second)
llama_print_timings:        eval time =   24936.31 ms /   273 runs   (   91.34 ms per token,    10.95 tokens per second)
llama_print_timings:       total time =   25530.79 ms /   287 tokens

bloom-560m

llama_new_context_with_model: graph nodes  = 898
llama_new_context_with_model: graph splits = 2

CANN error: EZ1001: 2024-08-05-08:55:26.176.494 k,n shouldn't be larger than 65535, actual k is 1024, n is 250880. When x is transposed, m shouldn't be larger than 65535, actual m is 1.

bloomz-alpaca-560m

llama_new_context_with_model: graph nodes  = 898
llama_new_context_with_model: graph splits = 2

CANN error: EZ1001: 2024-08-05-08:57:28.332.043 k,n shouldn't be larger than 65535, actual k is 1024, n is 250681. When x is transposed, m shouldn't be larger than 65535, actual m is 1.

c4ai-command-r-35B-v01

llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache

chatglm3-6B

llama_new_context_with_model: graph nodes  = 1126
llama_new_context_with_model: graph splits = 2
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

chinese-alpaca-2-1.3b

llama_new_context_with_model: graph nodes  = 134
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    3347.20 ms
llama_print_timings:      sample time =       2.58 ms /    24 runs   (    0.11 ms per token,  9295.12 tokens per second)
llama_print_timings: prompt eval time =       8.44 ms /     4 tokens (    2.11 ms per token,   473.71 tokens per second)
llama_print_timings:        eval time =     152.94 ms /    23 runs   (    6.65 ms per token,   150.39 tokens per second)
llama_print_timings:       total time =     170.69 ms /    27 tokens

CodeShell-7B

llama_new_context_with_model: graph nodes  = 1687
llama_new_context_with_model: graph splits = 145

llama_print_timings:        load time =    7334.21 ms
llama_print_timings:      sample time =      20.82 ms /   147 runs   (    0.14 ms per token,  7060.52 tokens per second)
llama_print_timings: prompt eval time =     343.63 ms /     3 tokens (  114.54 ms per token,     8.73 tokens per second)
llama_print_timings:        eval time =   61675.44 ms /   146 runs   (  422.43 ms per token,     2.37 tokens per second)
llama_print_timings:       total time =   62207.93 ms /   149 tokens

deepseek-ai_DeepSeek-V2-Lite

llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './q8model/Deepseek-ai_DeepSeek-V2-Lite-Q8_0.gguf'
main: error: unable to load model

deepseek-ai_deepseek-coder-1.3B-base

llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 2
llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2738: GGML_ASSERT(freq_scale == 1) failed

deepseek-coder-6.7B-instruct

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2874: GGML_ASSERT(freq_scale == 1) failed

DeepSeek-V2-Lite-64x1.5B

ggml_gallocr_reserve_n: failed to allocate CANN buffer of size 5742268416
llama_new_context_with_model: failed to allocate compute buffers

falcon-7b-instruct

llama_new_context_with_model: graph nodes  = 1064
llama_new_context_with_model: graph splits = 5

llama_print_timings:        load time =    8929.19 ms
llama_print_timings:      sample time =      21.21 ms /   163 runs   (    0.13 ms per token,  7686.87 tokens per second)
llama_print_timings: prompt eval time =      60.40 ms /     3 tokens (   20.13 ms per token,    49.67 tokens per second)
llama_print_timings:        eval time =    8580.85 ms /   162 runs   (   52.97 ms per token,    18.88 tokens per second)
llama_print_timings:       total time =    8784.50 ms /   165 tokens

flan-t5-large

llama_new_context_with_model: graph nodes  = 1350
llama_new_context_with_model: graph splits = 50

llama_print_timings:        load time =    3555.25 ms
llama_print_timings:      sample time =       1.70 ms /    30 runs   (    0.06 ms per token, 17605.63 tokens per second)
llama_print_timings: prompt eval time =     352.01 ms /    14 tokens (   25.14 ms per token,    39.77 tokens per second)
llama_print_timings:        eval time =    6103.26 ms /    29 runs   (  210.46 ms per token,     4.75 tokens per second)
llama_print_timings:       total time =    6532.52 ms /    43 tokens

gemma-2-9b-it

llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 134

llama_print_timings:        load time =   11715.28 ms
llama_print_timings:      sample time =      35.36 ms /    61 runs   (    0.58 ms per token,  1725.31 tokens per second)
llama_print_timings: prompt eval time =     469.56 ms /     4 tokens (  117.39 ms per token,     8.52 tokens per second)
llama_print_timings:        eval time =   33161.32 ms /    60 runs   (  552.69 ms per token,     1.81 tokens per second)
llama_print_timings:       total time =   33768.47 ms /    64 tokens

glm-4-9B

llama_new_context_with_model: graph nodes  = 1606
llama_new_context_with_model: graph splits = 76
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

gpt2 (gpt2-163M-F16)

llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    1969.92 ms
llama_print_timings:      sample time =      73.50 ms /   714 runs   (    0.10 ms per token,  9714.42 tokens per second)
llama_print_timings: prompt eval time =       8.69 ms /     3 tokens (    2.90 ms per token,   345.30 tokens per second)
llama_print_timings:        eval time =    4634.70 ms /   713 runs   (    6.50 ms per token,   153.84 tokens per second)
llama_print_timings:       total time =    4942.59 ms /   716 tokens

granite-3B-code-instruct

llama_new_context_with_model: graph nodes  = 1254
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    8560.39 ms
llama_print_timings:      sample time =      21.43 ms /   180 runs   (    0.12 ms per token,  8401.40 tokens per second)
llama_print_timings: prompt eval time =     244.86 ms /    13 tokens (   18.84 ms per token,    53.09 tokens per second)
llama_print_timings:        eval time =   11879.84 ms /   179 runs   (   66.37 ms per token,    15.07 tokens per second)
llama_print_timings:       total time =   12231.84 ms /   192 tokens

GritLM-7B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   17855.17 ms
llama_print_timings:      sample time =       1.40 ms /    18 runs   (    0.08 ms per token, 12838.80 tokens per second)
llama_print_timings: prompt eval time =     667.30 ms /    14 tokens (   47.66 ms per token,    20.98 tokens per second)
llama_print_timings:        eval time =   10147.88 ms /    17 runs   (  596.93 ms per token,     1.68 tokens per second)
llama_print_timings:       total time =   10861.42 ms /    31 tokens

internlm2_5-7b-chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   15020.52 ms
llama_print_timings:      sample time =      11.76 ms /    57 runs   (    0.21 ms per token,  4845.70 tokens per second)
llama_print_timings: prompt eval time =      63.61 ms /     4 tokens (   15.90 ms per token,    62.88 tokens per second)
llama_print_timings:        eval time =    3007.26 ms /    56 runs   (   53.70 ms per token,    18.62 tokens per second)
llama_print_timings:       total time =    3130.63 ms /    60 tokens

koala-7B-HF

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   10576.71 ms
llama_print_timings:      sample time =      13.65 ms /   208 runs   (    0.07 ms per token, 15234.75 tokens per second)
llama_print_timings: prompt eval time =     203.52 ms /    14 tokens (   14.54 ms per token,    68.79 tokens per second)
llama_print_timings:        eval time =   12080.65 ms /   207 runs   (   58.36 ms per token,    17.13 tokens per second)
llama_print_timings:       total time =   12411.86 ms /   221 tokens

Llama-2-7b-chat-hf

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   12910.04 ms
llama_print_timings:      sample time =       0.62 ms /     9 runs   (    0.07 ms per token, 14539.58 tokens per second)
llama_print_timings: prompt eval time =      59.79 ms /     4 tokens (   14.95 ms per token,    66.90 tokens per second)
llama_print_timings:        eval time =     429.67 ms /     8 runs   (   53.71 ms per token,    18.62 tokens per second)
llama_print_timings:       total time =     495.17 ms /    12 tokens

Llama-3-Smaug-8B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   12665.45 ms
llama_print_timings:      sample time =      40.00 ms /   178 runs   (    0.22 ms per token,  4450.11 tokens per second)
llama_print_timings: prompt eval time =     182.78 ms /    12 tokens (   15.23 ms per token,    65.65 tokens per second)
llama_print_timings:        eval time =   12725.04 ms /   177 runs   (   71.89 ms per token,    13.91 tokens per second)
llama_print_timings:       total time =   13167.69 ms /   189 tokens

Llama2-Chinese-7b-Chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   12405.32 ms
llama_print_timings:      sample time =      34.21 ms /   518 runs   (    0.07 ms per token, 15143.54 tokens per second)
llama_print_timings: prompt eval time =      58.14 ms /     4 tokens (   14.53 ms per token,    68.80 tokens per second)
llama_print_timings:        eval time =   26767.44 ms /   517 runs   (   51.77 ms per token,    19.31 tokens per second)
llama_print_timings:       total time =   27201.44 ms /   521 tokens

Llama3-8B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   14319.22 ms
llama_print_timings:      sample time =      60.35 ms /   233 runs   (    0.26 ms per token,  3861.07 tokens per second)
llama_print_timings: prompt eval time =      73.87 ms /     4 tokens (   18.47 ms per token,    54.15 tokens per second)
llama_print_timings:        eval time =   16422.73 ms /   232 runs   (   70.79 ms per token,    14.13 tokens per second)
llama_print_timings:       total time =   16881.61 ms /   236 tokens

Llama3-8b-chinese

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   11867.70 ms
llama_print_timings:      sample time =      45.59 ms /   157 runs   (    0.29 ms per token,  3443.81 tokens per second)
llama_print_timings: prompt eval time =      98.65 ms /     3 tokens (   32.88 ms per token,    30.41 tokens per second)
llama_print_timings:        eval time =   10395.93 ms /   156 runs   (   66.64 ms per token,    15.01 tokens per second)
llama_print_timings:       total time =   10768.09 ms /   159 tokens

mamba-130m-hf

llama_new_context_with_model: graph nodes  = 896
llama_new_context_with_model: graph splits = 98

llama_print_timings:        load time =    2595.99 ms
llama_print_timings:      sample time =       8.85 ms /    86 runs   (    0.10 ms per token,  9723.01 tokens per second)
llama_print_timings: prompt eval time =     207.21 ms /     3 tokens (   69.07 ms per token,    14.48 tokens per second)
llama_print_timings:        eval time =   16949.16 ms /    85 runs   (  199.40 ms per token,     5.01 tokens per second)
llama_print_timings:       total time =   17226.91 ms /    88 tokens

Mistral-7B-Instruct-v0.2

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   11694.23 ms
llama_print_timings:      sample time =      30.56 ms /   464 runs   (    0.07 ms per token, 15183.25 tokens per second)
llama_print_timings: prompt eval time =      55.37 ms /     4 tokens (   13.84 ms per token,    72.24 tokens per second)
llama_print_timings:        eval time =   22024.21 ms /   463 runs   (   47.57 ms per token,    21.02 tokens per second)
llama_print_timings:       total time =   22366.38 ms /   467 tokens

Mixtral-8x7B-Instruct-v0.1

llama_new_context_with_model: graph nodes  = 1510
llama_new_context_with_model: graph splits = 164

llama_print_timings:        load time =  120167.02 ms
llama_print_timings:      sample time =       0.07 ms /     1 runs   (    0.07 ms per token, 13513.51 tokens per second)
llama_print_timings: prompt eval time =   29756.15 ms /    15 tokens ( 1983.74 ms per token,     0.50 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   31644.99 ms /    16 tokens

MPT-7B

llama_new_context_with_model: graph nodes  = 998
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   14209.05 ms
llama_print_timings:      sample time =      49.03 ms /   255 runs   (    0.19 ms per token,  5200.79 tokens per second)
llama_print_timings: prompt eval time =    1070.51 ms /     3 tokens (  356.84 ms per token,     2.80 tokens per second)
llama_print_timings:        eval time =   29660.83 ms /   254 runs   (  116.77 ms per token,     8.56 tokens per second)
llama_print_timings:       total time =   31087.82 ms /   257 tokens

OLMo-1B-hf

llama_new_context_with_model: graph nodes  = 485
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    3048.12 ms
llama_print_timings:      sample time =      75.91 ms /   839 runs   (    0.09 ms per token, 11052.42 tokens per second)
llama_print_timings: prompt eval time =      18.50 ms /     3 tokens (    6.17 ms per token,   162.21 tokens per second)
llama_print_timings:        eval time =   13343.07 ms /   838 runs   (   15.92 ms per token,    62.80 tokens per second)
llama_print_timings:       total time =   13793.51 ms /   841 tokens

OpenELM-3B-Instruct

llama_new_context_with_model: graph nodes  = 1446
llama_new_context_with_model: graph splits = 40

llama_print_timings:        load time =    9403.88 ms
llama_print_timings:      sample time =      13.47 ms /   143 runs   (    0.09 ms per token, 10612.24 tokens per second)
llama_print_timings: prompt eval time =     275.87 ms /    14 tokens (   19.70 ms per token,    50.75 tokens per second)
llama_print_timings:        eval time =   24572.25 ms /   142 runs   (  173.04 ms per token,     5.78 tokens per second)
llama_print_timings:       total time =   25130.06 ms /   156 tokens

Orion-14b-base

llama_new_context_with_model: graph nodes  = 1367
llama_new_context_with_model: graph splits = 109

llama_print_timings:        load time =   10774.68 ms
llama_print_timings:      sample time =       3.92 ms /    24 runs   (    0.16 ms per token,  6128.70 tokens per second)
llama_print_timings: prompt eval time =     380.54 ms /     3 tokens (  126.85 ms per token,     7.88 tokens per second)
llama_print_timings:        eval time =   11337.53 ms /    23 runs   (  492.94 ms per token,     2.03 tokens per second)
llama_print_timings:       total time =   11843.56 ms /    26 tokens

phi1

llama_new_context_with_model: graph nodes  = 873
llama_new_context_with_model: graph splits = 2
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

phi2

llama_new_context_with_model: graph nodes  = 1225
llama_new_context_with_model: graph splits = 6
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

Phi-3-mini-4k-instruct

llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    7077.89 ms
llama_print_timings:      sample time =       1.37 ms /    21 runs   (    0.07 ms per token, 15283.84 tokens per second)
llama_print_timings: prompt eval time =      43.97 ms /     3 tokens (   14.66 ms per token,    68.23 tokens per second)
llama_print_timings:        eval time =     822.16 ms /    20 runs   (   41.11 ms per token,    24.33 tokens per second)
llama_print_timings:       total time =     874.17 ms /    23 tokens

plamo-13b

llama_new_context_with_model: graph nodes  = 1207
llama_new_context_with_model: graph splits = 84

llama_print_timings:        load time =   22352.16 ms
llama_print_timings:      sample time =       5.01 ms /    47 runs   (    0.11 ms per token,  9388.73 tokens per second)
llama_print_timings: prompt eval time =     335.02 ms /     3 tokens (  111.67 ms per token,     8.95 tokens per second)
llama_print_timings:        eval time =   14119.54 ms /    46 runs   (  306.95 ms per token,     3.26 tokens per second)
llama_print_timings:       total time =   14729.79 ms /    49 tokens

pythia-70M

llama_new_context_with_model: graph nodes  = 247
llama_new_context_with_model: graph splits = 2
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

Qwen-7B

llama_new_context_with_model: graph nodes  = 1190
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   12961.72 ms
llama_print_timings:      sample time =      69.43 ms /   264 runs   (    0.26 ms per token,  3802.56 tokens per second)
llama_print_timings: prompt eval time =      72.98 ms /     3 tokens (   24.33 ms per token,    41.11 tokens per second)
llama_print_timings:        eval time =   18388.66 ms /   263 runs   (   69.92 ms per token,    14.30 tokens per second)
llama_print_timings:       total time =   19140.20 ms /   266 tokens

Qwen2-1.5B-Instruct

llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 2

CANN error: EZ1001: 2024-08-02-07:14:39.296.547 k,n shouldn't be larger than 65535, actual k is 1536, n is 151936. When x is transposed, m shouldn't be larger than 65535, actual m is 1.

Refact-1_6B-fim

llama_new_context_with_model: graph nodes  = 966
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    3176.92 ms
llama_print_timings:      sample time =      83.78 ms /   850 runs   (    0.10 ms per token, 10145.74 tokens per second)
llama_print_timings: prompt eval time =      29.70 ms /     3 tokens (    9.90 ms per token,   101.01 tokens per second)
llama_print_timings:        eval time =   20837.65 ms /   849 runs   (   24.54 ms per token,    40.74 tokens per second)
llama_print_timings:       total time =   21263.47 ms /   852 tokens

SmolLM-135M

llama_new_context_with_model: graph nodes  = 966
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    3401.73 ms
llama_print_timings:      sample time =       2.54 ms /    21 runs   (    0.12 ms per token,  8284.02 tokens per second)
llama_print_timings: prompt eval time =    1167.64 ms /    13 tokens (   89.82 ms per token,    11.13 tokens per second)
llama_print_timings:        eval time =   24642.06 ms /    20 runs   ( 1232.10 ms per token,     0.81 tokens per second)
llama_print_timings:       total time =   25938.58 ms /    33 tokens

stablelm-zephyr

llama_new_context_with_model: graph nodes  = 1095
llama_new_context_with_model: graph splits = 5
/home/jiahao/llamacpp/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2869: GGML_ASSERT(n_dims == src0->ne[0]) failed

stablelm-2-zephyr-1_6b

llama_new_context_with_model: graph nodes  = 895
llama_new_context_with_model: graph splits = 2
/home/jiahao/llamacpp/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2869: GGML_ASSERT(n_dims == src0->ne[0]) failed

starcoderbase-1b

llama_new_context_with_model: graph nodes  = 897
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    2867.14 ms
llama_print_timings:      sample time =      20.64 ms /   255 runs   (    0.08 ms per token, 12357.05 tokens per second)
llama_print_timings: prompt eval time =      14.12 ms /     3 tokens (    4.71 ms per token,   212.45 tokens per second)
llama_print_timings:        eval time =    2725.53 ms /   254 runs   (   10.73 ms per token,    93.19 tokens per second)
llama_print_timings:       total time =    2893.66 ms /   257 tokens

starcoder2-3b

llama_new_context_with_model: graph nodes  = 1147
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    4331.99 ms
llama_print_timings:      sample time =      62.02 ms /   708 runs   (    0.09 ms per token, 11415.49 tokens per second)
llama_print_timings: prompt eval time =      31.79 ms /     3 tokens (   10.60 ms per token,    94.37 tokens per second)
llama_print_timings:        eval time =   18816.59 ms /   707 runs   (   26.61 ms per token,    37.57 tokens per second)
llama_print_timings:       total time =   19195.20 ms /   710 tokens

sea-lion

llama_new_context_with_model: graph nodes  = 1417
llama_new_context_with_model: graph splits = 5

llama_print_timings:        load time =   10368.88 ms
llama_print_timings:      sample time =      84.45 ms /   175 runs   (    0.48 ms per token,  2072.23 tokens per second)
llama_print_timings: prompt eval time =     295.65 ms /     3 tokens (   98.55 ms per token,    10.15 tokens per second)
llama_print_timings:        eval time =   19139.75 ms /   174 runs   (  110.00 ms per token,     9.09 tokens per second)
llama_print_timings:       total time =   19775.16 ms /   177 tokens

vigogne-7b-chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    8919.50 ms
llama_print_timings:      sample time =      17.43 ms /   246 runs   (    0.07 ms per token, 14115.22 tokens per second)
llama_print_timings: prompt eval time =      56.49 ms /     4 tokens (   14.12 ms per token,    70.81 tokens per second)
llama_print_timings:        eval time =   11742.77 ms /   245 runs   (   47.93 ms per token,    20.86 tokens per second)
llama_print_timings:       total time =   11957.39 ms /   249 tokens

xverse-7b-chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   13259.15 ms
llama_print_timings:      sample time =      43.25 ms /   162 runs   (    0.27 ms per token,  3745.84 tokens per second)
llama_print_timings: prompt eval time =     234.54 ms /     4 tokens (   58.63 ms per token,    17.05 tokens per second)
llama_print_timings:        eval time =   31053.22 ms /   161 runs   (  192.88 ms per token,     5.18 tokens per second)
llama_print_timings:       total time =   31753.42 ms /   165 tokens

Yi-6b-Chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   11920.54 ms
llama_print_timings:      sample time =      61.45 ms /   496 runs   (    0.12 ms per token,  8071.21 tokens per second)
llama_print_timings: prompt eval time =      57.99 ms /     3 tokens (   19.33 ms per token,    51.74 tokens per second)
llama_print_timings:        eval time =   26753.68 ms /   495 runs   (   54.05 ms per token,    18.50 tokens per second)
llama_print_timings:       total time =   27241.22 ms /   498 tokens

xuedinge233 commented Aug 2, 2024

Summary table (x = fails; blank = runs; the column for rows with a single x follows the logs in this thread):

Model                                  FP16  Q8_0  Q4_0
AquilaChat2-7B
Baichuan-7b
Baichuan2-7B-Chat
bitnet_b1_58-large
bloom-560m                                   x
bloomz-alpaca-560m                           x
c4ai-command-r-35B-v01                 x     x     x
chatglm3-6B                            x     x     x
chinese-alpaca-2-1.3b
CodeShell-7B
deepseek-ai_deepseek-coder-1.3B-base   x     x     x
deepseek-ai_DeepSeek-V2-Lite           x     x     x
deepseek-coder-6.7B-instruct           x     x     x
DeepSeek-V2-Lite-64x1.5B               x     x     x
falcon-7b-instruct
flan-t5-large
gemma-2-9b-it
glm-4-9B                               x     x     x
gpt2
Gpt2-163M
granite-3B-code-instruct
GritLM-7B
internlm2_5-7b-chat
koala-7B-HF
Llama-2-7b-chat-hf
Llama-3-Smaug-8B
Llama2-Chinese-7b-Chat
Llama3-8B
Llama3-8b-chinese
mamba-130m-hf
Mistral-7B-Instruct-v0.2
Mixtral-8x7B-Instruct-v0.1             x
mpt-7B
OLMo-1B-hf
OpenELM-3B-Instruct
Orion-14b-base
phi1                                   x     x     x
phi2                                   x     x     x
Phi-3-mini-4k-instruct
plamo-13b
pythia-70M                             x     x     x
Qwen-7B
Qwen2-1.5B-Instruct                          x
Refact-1_6B-fim
SmolLM-135M
stablelm-zephyr                        x     x     x
stablelm-2-zephyr-1_6b                 x     x     x
starcoderbase-1b
starcoder2-3b
vigogne-7b-chat
xverse-7b-chat
Yi-6b-Chat

hipudding (Owner, Author) commented

@wangshuai09 Please add this table to the README. Note that while some of these models do run, their graphs are split into many subgraphs, which suggests some operators are not implemented yet; those models run very slowly. Please mention this in the README.

xuedinge233 commented

The results below are for the q4_0 models.

AquilaChat2-7B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   16274.75 ms
llama_print_timings:      sample time =       7.50 ms /    20 runs   (    0.38 ms per token,  2666.67 tokens per second)
llama_print_timings: prompt eval time =     217.93 ms /     3 tokens (   72.64 ms per token,    13.77 tokens per second)
llama_print_timings:        eval time =    3825.74 ms /    19 runs   (  201.35 ms per token,     4.97 tokens per second)
llama_print_timings:       total time =    4073.54 ms /    22 tokens

Baichuan-7B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   17085.35 ms
llama_print_timings:      sample time =      18.37 ms /   101 runs   (    0.18 ms per token,  5497.20 tokens per second)
llama_print_timings: prompt eval time =     298.66 ms /     3 tokens (   99.55 ms per token,    10.04 tokens per second)
llama_print_timings:        eval time =   14805.01 ms /   100 runs   (  148.05 ms per token,     6.75 tokens per second)
llama_print_timings:       total time =   15220.63 ms /   103 tokens

Baichuan2-7B-Chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   15985.99 ms
llama_print_timings:      sample time =      21.51 ms /    96 runs   (    0.22 ms per token,  4462.83 tokens per second)
llama_print_timings: prompt eval time =     236.15 ms /     3 tokens (   78.72 ms per token,    12.70 tokens per second)
llama_print_timings:        eval time =   10596.45 ms /    95 runs   (  111.54 ms per token,     8.97 tokens per second)
llama_print_timings:       total time =   10982.98 ms /    98 tokens

bitnet_b1_58-large

llama_new_context_with_model: graph nodes  = 1038
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    3258.79 ms
llama_print_timings:      sample time =      22.37 ms /   342 runs   (    0.07 ms per token, 15290.38 tokens per second)
llama_print_timings: prompt eval time =     163.88 ms /    14 tokens (   11.71 ms per token,    85.43 tokens per second)
llama_print_timings:        eval time =   14534.57 ms /   341 runs   (   42.62 ms per token,    23.46 tokens per second)
llama_print_timings:       total time =   14825.76 ms /   355 tokens

bloom-560m

llama_new_context_with_model: graph nodes  = 898
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    2768.41 ms
llama_print_timings:      sample time =     165.08 ms /   266 runs   (    0.62 ms per token,  1611.31 tokens per second)
llama_print_timings: prompt eval time =     148.30 ms /    12 tokens (   12.36 ms per token,    80.92 tokens per second)
llama_print_timings:        eval time =   21961.19 ms /   265 runs   (   82.87 ms per token,    12.07 tokens per second)
llama_print_timings:       total time =   22708.33 ms /   277 tokens

bloomz-alpaca-560m

llama_new_context_with_model: graph nodes  = 898
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    2643.41 ms
llama_print_timings:      sample time =      63.90 ms /   116 runs   (    0.55 ms per token,  1815.31 tokens per second)
llama_print_timings: prompt eval time =     194.42 ms /    12 tokens (   16.20 ms per token,    61.72 tokens per second)
llama_print_timings:        eval time =    7159.36 ms /   115 runs   (   62.26 ms per token,    16.06 tokens per second)
llama_print_timings:       total time =    7599.72 ms /   127 tokens

bitnet_b1_58-large

llama_new_context_with_model: graph nodes  = 1038
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    3587.51 ms
llama_print_timings:      sample time =      18.61 ms /   184 runs   (    0.10 ms per token,  9887.16 tokens per second)
llama_print_timings: prompt eval time =     135.37 ms /    14 tokens (    9.67 ms per token,   103.42 tokens per second)
llama_print_timings:        eval time =    8296.59 ms /   183 runs   (   45.34 ms per token,    22.06 tokens per second)
llama_print_timings:       total time =    8535.29 ms /   197 tokens

c4ai-command-r-35B-v01

llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache

chatglm3-6B

llama_new_context_with_model: graph nodes  = 1126
llama_new_context_with_model: graph splits = 3
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

chinese-alpaca-2-1.3b

llama_new_context_with_model: graph nodes  = 134
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    3694.26 ms
llama_print_timings:      sample time =      16.25 ms /    72 runs   (    0.23 ms per token,  4431.86 tokens per second)
llama_print_timings: prompt eval time =      87.40 ms /    14 tokens (    6.24 ms per token,   160.19 tokens per second)
llama_print_timings:        eval time =    3872.17 ms /    71 runs   (   54.54 ms per token,    18.34 tokens per second)
llama_print_timings:       total time =    4027.03 ms /    85 tokens

CodeShell-7B

llama_new_context_with_model: graph nodes  = 1687
llama_new_context_with_model: graph splits = 145

llama_print_timings:        load time =   12289.16 ms
llama_print_timings:      sample time =       7.43 ms /    64 runs   (    0.12 ms per token,  8609.09 tokens per second)
llama_print_timings: prompt eval time =     446.18 ms /     3 tokens (  148.73 ms per token,     6.72 tokens per second)
llama_print_timings:        eval time =   30866.02 ms /    63 runs   (  489.94 ms per token,     2.04 tokens per second)
llama_print_timings:       total time =   31365.54 ms /    66 tokens

deepseek-ai_deepseek-coder-1.3B-base

llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 3
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2874: GGML_ASSERT(freq_scale == 1) failed

Deepseek-ai_DeepSeek-V2-Lite

llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' not found
llama_load_model_from_file: failed to load model

deepseek-coder-6.7B-instruct

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2874: GGML_ASSERT(freq_scale == 1) failed

DeepSeek-V2-Lite-64x1.5B

llama_new_context_with_model: graph nodes  = 1924
llama_new_context_with_model: graph splits = 133

/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:2872: GGML_ASSERT(ext_factor == 0) failed

falcon-7b-instruct

llama_new_context_with_model: graph nodes  = 1064
llama_new_context_with_model: graph splits = 5

llama_print_timings:        load time =   16301.73 ms
llama_print_timings:      sample time =      14.80 ms /   123 runs   (    0.12 ms per token,  8312.50 tokens per second)
llama_print_timings: prompt eval time =     222.47 ms /    12 tokens (   18.54 ms per token,    53.94 tokens per second)
llama_print_timings:        eval time =    9187.39 ms /   122 runs   (   75.31 ms per token,    13.28 tokens per second)
llama_print_timings:       total time =    9480.80 ms /   134 tokens

flan-t5-large

llama_new_context_with_model: graph nodes  = 1350
llama_new_context_with_model: graph splits = 51

llama_print_timings:        load time =    4382.19 ms
llama_print_timings:      sample time =       2.25 ms /    44 runs   (    0.05 ms per token, 19520.85 tokens per second)
llama_print_timings: prompt eval time =     424.16 ms /    14 tokens (   30.30 ms per token,    33.01 tokens per second)
llama_print_timings:        eval time =   10177.07 ms /    43 runs   (  236.68 ms per token,     4.23 tokens per second)
llama_print_timings:       total time =   10652.54 ms /    57 tokens

gemma-2-9b-it

llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 134

llama_print_timings:        load time =   18691.81 ms
llama_print_timings:      sample time =      20.58 ms /    33 runs   (    0.62 ms per token,  1603.50 tokens per second)
llama_print_timings: prompt eval time =     978.13 ms /    14 tokens (   69.87 ms per token,    14.31 tokens per second)
llama_print_timings:        eval time =   28235.10 ms /    32 runs   (  882.35 ms per token,     1.13 tokens per second)
llama_print_timings:       total time =   29540.44 ms /    46 tokens

gpt2

llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    1735.80 ms
llama_print_timings:      sample time =      52.06 ms /   465 runs   (    0.11 ms per token,  8931.14 tokens per second)
llama_print_timings: prompt eval time =      80.42 ms /    11 tokens (    7.31 ms per token,   136.79 tokens per second)
llama_print_timings:        eval time =   11035.44 ms /   464 runs   (   23.78 ms per token,    42.05 tokens per second)
llama_print_timings:       total time =   11380.61 ms /   475 tokens

Gpt2-163M

llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    2142.71 ms
llama_print_timings:      sample time =     120.96 ms /  1222 runs   (    0.10 ms per token, 10102.26 tokens per second)
llama_print_timings: prompt eval time =      79.87 ms /     3 tokens (   26.62 ms per token,    37.56 tokens per second)
llama_print_timings:        eval time =   21541.44 ms /  1221 runs   (   17.64 ms per token,    56.68 tokens per second)
llama_print_timings:       total time =   22157.39 ms /  1224 tokens

granite-3B-code-instruct

llama_new_context_with_model: graph nodes  = 1254
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    8526.98 ms
llama_print_timings:      sample time =      31.13 ms /   172 runs   (    0.18 ms per token,  5525.39 tokens per second)
llama_print_timings: prompt eval time =     182.75 ms /    13 tokens (   14.06 ms per token,    71.14 tokens per second)
llama_print_timings:        eval time =   10242.17 ms /   171 runs   (   59.90 ms per token,    16.70 tokens per second)
llama_print_timings:       total time =   19445.59 ms /   184 tokens

GritLM-7B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   16952.39 ms
llama_print_timings:      sample time =       6.85 ms /   104 runs   (    0.07 ms per token, 15186.92 tokens per second)
llama_print_timings: prompt eval time =     528.53 ms /    14 tokens (   37.75 ms per token,    26.49 tokens per second)
llama_print_timings:        eval time =   16883.37 ms /   103 runs   (  163.92 ms per token,     6.10 tokens per second)
llama_print_timings:       total time =   17496.54 ms /   117 tokens

internlm2_5-7b-chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   18677.69 ms
llama_print_timings:      sample time =      16.52 ms /    95 runs   (    0.17 ms per token,  5749.91 tokens per second)
llama_print_timings: prompt eval time =     318.82 ms /    13 tokens (   24.52 ms per token,    40.78 tokens per second)
llama_print_timings:        eval time =   10308.95 ms /    94 runs   (  109.67 ms per token,     9.12 tokens per second)
llama_print_timings:       total time =   10704.64 ms /   107 tokens

koala-7B-HF

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   16436.32 ms
llama_print_timings:      sample time =      20.60 ms /   303 runs   (    0.07 ms per token, 14709.45 tokens per second)
llama_print_timings: prompt eval time =     202.27 ms /    14 tokens (   14.45 ms per token,    69.22 tokens per second)
llama_print_timings:        eval time =   16560.74 ms /   302 runs   (   54.84 ms per token,    18.24 tokens per second)
llama_print_timings:       total time =   16918.99 ms /   316 tokens

Llama-2-7b-chat-hf

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   16671.70 ms
llama_print_timings:      sample time =      48.10 ms /   479 runs   (    0.10 ms per token,  9957.80 tokens per second)
llama_print_timings: prompt eval time =     412.34 ms /     4 tokens (  103.08 ms per token,     9.70 tokens per second)
llama_print_timings:        eval time =   30794.44 ms /   478 runs   (   64.42 ms per token,    15.52 tokens per second)
llama_print_timings:       total time =   31691.99 ms /   482 tokens

Llama-3-Smaug-8B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   18414.26 ms
llama_print_timings:      sample time =      67.14 ms /   285 runs   (    0.24 ms per token,  4245.11 tokens per second)
llama_print_timings: prompt eval time =     182.75 ms /    12 tokens (   15.23 ms per token,    65.66 tokens per second)
llama_print_timings:        eval time =   18267.38 ms /   284 runs   (   64.32 ms per token,    15.55 tokens per second)
llama_print_timings:       total time =   18826.90 ms /   296 tokens

Llama2-Chinese-7b-Chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   16393.97 ms
llama_print_timings:      sample time =      27.95 ms /   405 runs   (    0.07 ms per token, 14489.64 tokens per second)
llama_print_timings: prompt eval time =     250.62 ms /    14 tokens (   17.90 ms per token,    55.86 tokens per second)
llama_print_timings:        eval time =   24537.26 ms /   404 runs   (   60.74 ms per token,    16.46 tokens per second)
llama_print_timings:       total time =   25049.45 ms /   418 tokens

Llama-3-8B

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   17270.63 ms
llama_print_timings:      sample time =      78.22 ms /   356 runs   (    0.22 ms per token,  4551.38 tokens per second)
llama_print_timings: prompt eval time =     205.24 ms /    13 tokens (   15.79 ms per token,    63.34 tokens per second)
llama_print_timings:        eval time =   24473.91 ms /   355 runs   (   68.94 ms per token,    14.51 tokens per second)
llama_print_timings:       total time =   25150.09 ms /   368 tokens

Llama3-Chinese_v2

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   14141.75 ms
llama_print_timings:      sample time =      66.80 ms /   293 runs   (    0.23 ms per token,  4386.16 tokens per second)
llama_print_timings: prompt eval time =     203.97 ms /    12 tokens (   17.00 ms per token,    58.83 tokens per second)
llama_print_timings:        eval time =   20357.02 ms /   292 runs   (   69.72 ms per token,    14.34 tokens per second)
llama_print_timings:       total time =   20912.93 ms /   304 tokens

mamba-130m-hf

llama_new_context_with_model: graph nodes  = 896
llama_new_context_with_model: graph splits = 99

llama_print_timings:        load time =    2507.99 ms
llama_print_timings:      sample time =      12.72 ms /   119 runs   (    0.11 ms per token,  9352.40 tokens per second)
llama_print_timings: prompt eval time =     280.64 ms /    11 tokens (   25.51 ms per token,    39.20 tokens per second)
llama_print_timings:        eval time =   33439.40 ms /   118 runs   (  283.38 ms per token,     3.53 tokens per second)
llama_print_timings:       total time =   33808.38 ms /   129 tokens

Mistral-7B-Instruct-v0.2

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   18450.83 ms
llama_print_timings:      sample time =      12.65 ms /   185 runs   (    0.07 ms per token, 14624.51 tokens per second)
llama_print_timings: prompt eval time =     277.82 ms /    15 tokens (   18.52 ms per token,    53.99 tokens per second)
llama_print_timings:        eval time =   12724.38 ms /   184 runs   (   69.15 ms per token,    14.46 tokens per second)
llama_print_timings:       total time =   13102.38 ms /   199 tokens

Mixtral-8x7B-Instruct-v0.1

llama_new_context_with_model: graph nodes  = 1510
llama_new_context_with_model: graph splits = 164

llama_print_timings:        load time =  166175.02 ms
llama_print_timings:      sample time =       0.22 ms /     2 runs   (    0.11 ms per token,  9049.77 tokens per second)
llama_print_timings: prompt eval time =   61904.79 ms /    15 tokens ( 4126.99 ms per token,     0.24 tokens per second)
llama_print_timings:        eval time =   62650.79 ms /     1 runs   (62650.79 ms per token,     0.02 tokens per second)
llama_print_timings:       total time =  141257.50 ms /    16 tokens

mpt-7B

llama_new_context_with_model: graph nodes  = 998
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   17294.85 ms
llama_print_timings:      sample time =      37.13 ms /   359 runs   (    0.10 ms per token,  9669.51 tokens per second)
llama_print_timings: prompt eval time =     202.64 ms /    11 tokens (   18.42 ms per token,    54.28 tokens per second)
llama_print_timings:        eval time =   16751.97 ms /   358 runs   (   46.79 ms per token,    21.37 tokens per second)
llama_print_timings:       total time =   17154.56 ms /   369 tokens

OLMo-1B-hf

llama_new_context_with_model: graph nodes  = 485
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    3999.43 ms
llama_print_timings:      sample time =      80.68 ms /   805 runs   (    0.10 ms per token,  9977.44 tokens per second)
llama_print_timings: prompt eval time =      93.01 ms /    11 tokens (    8.46 ms per token,   118.27 tokens per second)
llama_print_timings:        eval time =   26080.54 ms /   804 runs   (   32.44 ms per token,    30.83 tokens per second)
llama_print_timings:       total time =   26793.27 ms /   815 tokens

OpenELM-3B-Instruct

llama_new_context_with_model: graph nodes  = 1446
llama_new_context_with_model: graph splits = 40

llama_print_timings:        load time =    7830.17 ms
llama_print_timings:      sample time =       9.56 ms /    86 runs   (    0.11 ms per token,  8993.93 tokens per second)
llama_print_timings: prompt eval time =     278.46 ms /    14 tokens (   19.89 ms per token,    50.28 tokens per second)
llama_print_timings:        eval time =   14962.20 ms /    85 runs   (  176.03 ms per token,     5.68 tokens per second)
llama_print_timings:       total time =   15369.48 ms /    99 tokens

Orion-14b-base

llama_new_context_with_model: graph nodes  = 1367
llama_new_context_with_model: graph splits = 109

llama_print_timings:        load time =   27184.31 ms
llama_print_timings:      sample time =       9.23 ms /    54 runs   (    0.17 ms per token,  5850.49 tokens per second)
llama_print_timings: prompt eval time =     596.10 ms /    11 tokens (   54.19 ms per token,    18.45 tokens per second)
llama_print_timings:        eval time =   22107.06 ms /    53 runs   (  417.11 ms per token,     2.40 tokens per second)
llama_print_timings:       total time =   22984.52 ms /    64 tokens

phi1

llama_new_context_with_model: graph nodes  = 873
llama_new_context_with_model: graph splits = 4
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

phi2

llama_new_context_with_model: graph nodes  = 1225
llama_new_context_with_model: graph splits = 6
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

Phi-3-mini-4k-instruct

llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    9979.17 ms
llama_print_timings:      sample time =      20.34 ms /   245 runs   (    0.08 ms per token, 12045.82 tokens per second)
llama_print_timings: prompt eval time =     237.03 ms /    13 tokens (   18.23 ms per token,    54.85 tokens per second)
llama_print_timings:        eval time =   14434.19 ms /   244 runs   (   59.16 ms per token,    16.90 tokens per second)
llama_print_timings:       total time =   14815.99 ms /   257 tokens

plamo-13b

llama_new_context_with_model: graph nodes  = 1207
llama_new_context_with_model: graph splits = 84

llama_print_timings:        load time =   20305.56 ms
llama_print_timings:      sample time =       3.23 ms /    25 runs   (    0.13 ms per token,  7744.73 tokens per second)
llama_print_timings: prompt eval time =     543.69 ms /    23 tokens (   23.64 ms per token,    42.30 tokens per second)
llama_print_timings:        eval time =    9473.01 ms /    24 runs   (  394.71 ms per token,     2.53 tokens per second)
llama_print_timings:       total time =   10343.37 ms /    47 tokens

pythia-70M

llama_new_context_with_model: graph nodes  = 247
llama_new_context_with_model: graph splits = 3
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

Qwen-7B

llama_new_context_with_model: graph nodes  = 1190
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   14989.41 ms
llama_print_timings:      sample time =      55.35 ms /   178 runs   (    0.31 ms per token,  3215.67 tokens per second)
llama_print_timings: prompt eval time =    1428.14 ms /    13 tokens (  109.86 ms per token,     9.10 tokens per second)
llama_print_timings:        eval time =   24520.83 ms /   177 runs   (  138.54 ms per token,     7.22 tokens per second)
llama_print_timings:       total time =   26401.70 ms /   190 tokens

Qwen_Qwen2-1.5B-Instruct

llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    4261.71 ms
llama_print_timings:      sample time =     291.35 ms /   991 runs   (    0.29 ms per token,  3401.43 tokens per second)
llama_print_timings: prompt eval time =     170.72 ms /    13 tokens (   13.13 ms per token,    76.15 tokens per second)
llama_print_timings:        eval time =   56886.72 ms /   990 runs   (   57.46 ms per token,    17.40 tokens per second)
llama_print_timings:       total time =   59458.22 ms /  1003 tokens

Refact-1_6B-fim

llama_new_context_with_model: graph nodes  = 966
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =    4808.68 ms
llama_print_timings:      sample time =     114.62 ms /  1097 runs   (    0.10 ms per token,  9570.51 tokens per second)
llama_print_timings: prompt eval time =     134.46 ms /    13 tokens (   10.34 ms per token,    96.68 tokens per second)
llama_print_timings:        eval time =   31846.51 ms /  1096 runs   (   29.06 ms per token,    34.42 tokens per second)
llama_print_timings:       total time =   32504.55 ms /  1109 tokens

SmolLM-135M

llama_new_context_with_model: graph nodes  = 966
llama_new_context_with_model: graph splits = 2

llama_print_timings:        load time =    2426.14 ms
llama_print_timings:      sample time =      32.30 ms /   357 runs   (    0.09 ms per token, 11050.92 tokens per second)
llama_print_timings: prompt eval time =     116.60 ms /    13 tokens (    8.97 ms per token,   111.50 tokens per second)
llama_print_timings:        eval time =   14398.36 ms /   356 runs   (   40.44 ms per token,    24.73 tokens per second)
llama_print_timings:       total time =   14673.29 ms /   369 tokens

stablelm-2-zephyr-1.6b

llama_new_context_with_model: graph nodes  = 895
llama_new_context_with_model: graph splits = 3
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

stablelm-zephyr-3B

llama_new_context_with_model: graph nodes  = 1095
llama_new_context_with_model: graph splits = 5
/home/jiahao/llamacpp/code/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3005: GGML_ASSERT(n_dims == src0->ne[0]) failed

starcoderbase-1b

llama_new_context_with_model: graph nodes  = 897
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    4196.61 ms
llama_print_timings:      sample time =      45.20 ms /   399 runs   (    0.11 ms per token,  8827.43 tokens per second)
llama_print_timings: prompt eval time =     130.77 ms /    13 tokens (   10.06 ms per token,    99.41 tokens per second)
llama_print_timings:        eval time =   13006.63 ms /   398 runs   (   32.68 ms per token,    30.60 tokens per second)
llama_print_timings:       total time =   13360.53 ms /   411 tokens

starcoder2-3b

llama_new_context_with_model: graph nodes  = 1147
llama_new_context_with_model: graph splits = 3

llama_print_timings:        load time =    7979.76 ms
llama_print_timings:      sample time =      31.10 ms /   300 runs   (    0.10 ms per token,  9646.61 tokens per second)
llama_print_timings: prompt eval time =     186.53 ms /    13 tokens (   14.35 ms per token,    69.69 tokens per second)
llama_print_timings:        eval time =   17188.02 ms /   299 runs   (   57.49 ms per token,    17.40 tokens per second)
llama_print_timings:       total time =   17535.90 ms /   312 tokens

vigogne-7b-chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   16423.06 ms
llama_print_timings:      sample time =      20.46 ms /   216 runs   (    0.09 ms per token, 10555.12 tokens per second)
llama_print_timings: prompt eval time =     258.78 ms /    14 tokens (   18.48 ms per token,    54.10 tokens per second)
llama_print_timings:        eval time =   13427.53 ms /   215 runs   (   62.45 ms per token,    16.01 tokens per second)
llama_print_timings:       total time =   13859.67 ms /   229 tokens

xverse-7b-chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   17147.96 ms
llama_print_timings:      sample time =      69.67 ms /   332 runs   (    0.21 ms per token,  4765.39 tokens per second)
llama_print_timings: prompt eval time =     470.28 ms /    18 tokens (   26.13 ms per token,    38.28 tokens per second)
llama_print_timings:        eval time =   23424.93 ms /   331 runs   (   70.77 ms per token,    14.13 tokens per second)
llama_print_timings:       total time =   24618.99 ms /   349 tokens

Yi-6B-Chat

llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

llama_print_timings:        load time =   12033.35 ms
llama_print_timings:      sample time =      89.25 ms /   500 runs   (    0.18 ms per token,  5602.55 tokens per second)
llama_print_timings: prompt eval time =     231.26 ms /    13 tokens (   17.79 ms per token,    56.21 tokens per second)
llama_print_timings:        eval time =   27076.75 ms /   499 runs   (   54.26 ms per token,    18.43 tokens per second)
llama_print_timings:       total time =   27880.02 ms /   512 tokens
