
Bug: With --parallel 4 enabled, inference results are nonsensical, but with --parallel 4 disabled everything is OK #8935

Closed
hzgdeerHo opened this issue Aug 8, 2024 · 5 comments
Labels: bug-unconfirmed · high severity · stale

Comments

@hzgdeerHo

What happened?

##### CMD that works normally:
CUDA_VISIBLE_DEVICES=0 ./llama-server -m /home/ubuntu/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/1f301d86d760b435a11a56de3863bc0121bfb98f/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf --gpu-layers 33 -cb --ctx-size 16128 --flash-attn --batch-size 512 --chat-template llama3 --port 8866 --host 0.0.0.0

##### CMD that does NOT work normally:

CUDA_VISIBLE_DEVICES=0 ./llama-server -m /home/ubuntu/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/1f301d86d760b435a11a56de3863bc0121bfb98f/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf --gpu-layers 33 -cb --parallel 4 --ctx-size 16128 --flash-attn --batch-size 512 --chat-template llama3 --port 8866 --host 0.0.0.0
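
The logs below show the requests arriving as POST /v1/chat/completions, so a representative client call against this server would look like the following (the message contents are placeholders; the actual prompts are not included in this report):

curl http://localhost:8866/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant"},
          {"role": "user",   "content": "<your prompt here>"}
        ]
      }'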

ubuntu@VM-0-16-ubuntu:~$ nvidia-smi
Thu Aug 8 21:22:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-32GB Off | 00000000:00:08.0 Off | 0 |
| N/A 34C P0 39W / 300W | 10194MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 35134 C ./llama-server 10192MiB |
+---------------------------------------------------------------------------------------+

Name and Version

ubuntu@VM-0-16-ubuntu:~/llama.cpp$ ./llama-cli --version
version: 3549 (afd27f0)
built with cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

##### CMD that works normally:
CUDA_VISIBLE_DEVICES=0 ./llama-server -m /home/ubuntu/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/1f301d86d760b435a11a56de3863bc0121bfb98f/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf  --gpu-layers 33 -cb --ctx-size 16128    --flash-attn  --batch-size 512 --chat-template llama3  --port 8866 --host 0.0.0.0     
INFO [                    main] build info | tid="140562966491136" timestamp=1723124595 build=3549 commit="afd27f01"
INFO [                    main] system info | tid="140562966491136" timestamp=1723124595 n_threads=10 n_threads_batch=-1 total_threads=10 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 33 key-value pairs and 291 tensors from /home/ubuntu/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/1f301d86d760b435a11a56de3863bc0121bfb98f/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models Meta Llama Meta Llama 3.1 8B I...
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = models-meta-llama-Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = smaug-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = ./Meta-Llama-3.1-8B-Instruct-GGUF_ima...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = group_40.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 68
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Models Meta Llama Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.33 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 16128
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  2016.00 MiB
llama_new_context_with_model: KV self size  = 2016.00 MiB, K (f16): 1008.00 MiB, V (f16): 1008.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    39.51 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 2
INFO [                    init] initializing slots | tid="140562966491136" timestamp=1723124598 n_slots=1
INFO [                    init] new slot | tid="140562966491136" timestamp=1723124598 id_slot=0 n_ctx_slot=16128
INFO [                    main] model loaded | tid="140562966491136" timestamp=1723124598
INFO [                    main] chat template | tid="140562966491136" timestamp=1723124598 chat_example="<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" built_in=false
INFO [                    main] HTTP server listening | tid="140562966491136" timestamp=1723124598 n_threads_http="9" port="8866" hostname="0.0.0.0"
INFO [            update_slots] all slots are idle | tid="140562966491136" timestamp=1723124598
INFO [   launch_slot_with_task] slot is processing task | tid="140562966491136" timestamp=1723124772 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124772 id_slot=0 id_task=0 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124772 id_slot=0 id_task=0 p0=512
INFO [           print_timings] prompt eval time     =     391.14 ms /   907 tokens (    0.43 ms per token,  2318.87 tokens per second) | tid="140562966491136" timestamp=1723124773 id_slot=0 id_task=0 t_prompt_processing=391.138 n_prompt_tokens_processed=907 t_token=0.4312436604189636 n_tokens_second=2318.874668275647
INFO [           print_timings] generation eval time =    1003.99 ms /    74 runs   (   13.57 ms per token,    73.71 tokens per second) | tid="140562966491136" timestamp=1723124773 id_slot=0 id_task=0 t_token_generation=1003.985 n_decoded=74 t_token=13.567364864864865 n_tokens_second=73.70628047231781
INFO [           print_timings]           total time =    1395.12 ms | tid="140562966491136" timestamp=1723124773 id_slot=0 id_task=0 t_prompt_processing=391.138 t_token_generation=1003.985 t_total=1395.123
INFO [            update_slots] slot released | tid="140562966491136" timestamp=1723124773 id_slot=0 id_task=0 n_ctx=16128 n_past=980 n_system_tokens=0 n_cache_tokens=512 truncated=false
INFO [            update_slots] all slots are idle | tid="140562966491136" timestamp=1723124773
INFO [      log_server_request] request | tid="140561439375360" timestamp=1723124773 remote_addr="43.153.18.71" remote_port=57628 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="140562966491136" timestamp=1723124774 id_slot=0 id_task=76
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124774 id_slot=0 id_task=76 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124774 id_slot=0 id_task=76 p0=512
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124774 id_slot=0 id_task=76 p0=1024
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124774 id_slot=0 id_task=76 p0=1536
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124774 id_slot=0 id_task=76 p0=2048
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124774 id_slot=0 id_task=76 p0=2560
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124775 id_slot=0 id_task=76 p0=3072
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124775 id_slot=0 id_task=76 p0=3584
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124775 id_slot=0 id_task=76 p0=4096
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124775 id_slot=0 id_task=76 p0=4608
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124776 id_slot=0 id_task=76 p0=5120
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124776 id_slot=0 id_task=76 p0=5632
INFO [           print_timings] prompt eval time     =    2921.58 ms /  5756 tokens (    0.51 ms per token,  1970.17 tokens per second) | tid="140562966491136" timestamp=1723124778 id_slot=0 id_task=76 t_prompt_processing=2921.578 n_prompt_tokens_processed=5756 t_token=0.5075708825573315 n_tokens_second=1970.168176239005
INFO [           print_timings] generation eval time =    2037.78 ms /   133 runs   (   15.32 ms per token,    65.27 tokens per second) | tid="140562966491136" timestamp=1723124778 id_slot=0 id_task=76 t_token_generation=2037.779 n_decoded=133 t_token=15.321646616541353 n_tokens_second=65.26713642647215
INFO [           print_timings]           total time =    4959.36 ms | tid="140562966491136" timestamp=1723124778 id_slot=0 id_task=76 t_prompt_processing=2921.578 t_token_generation=2037.779 t_total=4959.357
INFO [            update_slots] slot released | tid="140562966491136" timestamp=1723124778 id_slot=0 id_task=76 n_ctx=16128 n_past=5888 n_system_tokens=0 n_cache_tokens=5632 truncated=false
INFO [            update_slots] all slots are idle | tid="140562966491136" timestamp=1723124778
INFO [      log_server_request] request | tid="140558945218560" timestamp=1723124778 remote_addr="43.153.18.71" remote_port=57638 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="140562966491136" timestamp=1723124778
INFO [   launch_slot_with_task] slot is processing task | tid="140562966491136" timestamp=1723124779 id_slot=0 id_task=222
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124779 id_slot=0 id_task=222 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124779 id_slot=0 id_task=222 p0=512
INFO [            update_slots] kv cache rm [p0, end) | tid="140562966491136" timestamp=1723124779 id_slot=0 id_task=222 p0=1024
^C^CReceived second interrupt, terminating immediately.


#### CMD that does NOT work normally:

ubuntu@VM-0-16-ubuntu:~/llama.cpp$ CUDA_VISIBLE_DEVICES=0 ./llama-server -m /home/ubuntu/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/1f301d86d760b435a11a56de3863bc0121bfb98f/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf  --gpu-layers 33 -cb --parallel 4 --ctx-size 16128    --flash-attn  --batch-size 512 --chat-template llama3  --port 8866 --host 0.0.0.0     
INFO [                    main] build info | tid="140411292143616" timestamp=1723125078 build=3549 commit="afd27f01"
INFO [                    main] system info | tid="140411292143616" timestamp=1723125078 n_threads=10 n_threads_batch=-1 total_threads=10 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 33 key-value pairs and 291 tensors from /home/ubuntu/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/1f301d86d760b435a11a56de3863bc0121bfb98f/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models Meta Llama Meta Llama 3.1 8B I...
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = models-meta-llama-Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = smaug-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = ./Meta-Llama-3.1-8B-Instruct-GGUF_ima...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = group_40.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 68
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Models Meta Llama Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.33 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 16128
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  2016.00 MiB
llama_new_context_with_model: KV self size  = 2016.00 MiB, K (f16): 1008.00 MiB, V (f16): 1008.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.45 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    39.51 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 2
INFO [                    init] initializing slots | tid="140411292143616" timestamp=1723125081 n_slots=4
INFO [                    init] new slot | tid="140411292143616" timestamp=1723125081 id_slot=0 n_ctx_slot=4032
INFO [                    init] new slot | tid="140411292143616" timestamp=1723125081 id_slot=1 n_ctx_slot=4032
INFO [                    init] new slot | tid="140411292143616" timestamp=1723125081 id_slot=2 n_ctx_slot=4032
INFO [                    init] new slot | tid="140411292143616" timestamp=1723125081 id_slot=3 n_ctx_slot=4032
INFO [                    main] model loaded | tid="140411292143616" timestamp=1723125081
INFO [                    main] chat template | tid="140411292143616" timestamp=1723125081 chat_example="<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" built_in=false
INFO [                    main] HTTP server listening | tid="140411292143616" timestamp=1723125081 n_threads_http="9" port="8866" hostname="0.0.0.0"
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125081
INFO [   launch_slot_with_task] slot is processing task | tid="140411292143616" timestamp=1723125094 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125094 id_slot=0 id_task=0 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125094 id_slot=0 id_task=0 p0=512
INFO [           print_timings] prompt eval time     =     391.33 ms /   907 tokens (    0.43 ms per token,  2317.74 tokens per second) | tid="140411292143616" timestamp=1723125096 id_slot=0 id_task=0 t_prompt_processing=391.329 n_prompt_tokens_processed=907 t_token=0.4314542447629548 n_tokens_second=2317.7428710880104
INFO [           print_timings] generation eval time =    1014.42 ms /    74 runs   (   13.71 ms per token,    72.95 tokens per second) | tid="140411292143616" timestamp=1723125096 id_slot=0 id_task=0 t_token_generation=1014.416 n_decoded=74 t_token=13.708324324324325 n_tokens_second=72.94837620857714
INFO [           print_timings]           total time =    1405.75 ms | tid="140411292143616" timestamp=1723125096 id_slot=0 id_task=0 t_prompt_processing=391.329 t_token_generation=1014.416 t_total=1405.7450000000001
INFO [            update_slots] slot released | tid="140411292143616" timestamp=1723125096 id_slot=0 id_task=0 n_ctx=16128 n_past=980 n_system_tokens=0 n_cache_tokens=512 truncated=false
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125096
INFO [      log_server_request] request | tid="140409762209792" timestamp=1723125096 remote_addr="43.153.18.71" remote_port=36174 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="140411292143616" timestamp=1723125096 id_slot=1 id_task=76
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125096 id_slot=1 id_task=76 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125096 id_slot=1 id_task=76 p0=512
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125096 id_slot=1 id_task=76 p0=1024
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125096 id_slot=1 id_task=76 p0=1536
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125097 id_slot=1 id_task=76 p0=2048
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125097 id_slot=1 id_task=76 p0=2560
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125097 id_slot=1 id_task=76 p0=3072
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125097 id_slot=1 id_task=76 p0=3584
INFO [           print_timings] prompt eval time     =    1927.38 ms /  3740 tokens (    0.52 ms per token,  1940.46 tokens per second) | tid="140411292143616" timestamp=1723125099 id_slot=1 id_task=76 t_prompt_processing=1927.375 n_prompt_tokens_processed=3740 t_token=0.5153409090909091 n_tokens_second=1940.463065049614
INFO [           print_timings] generation eval time =    1145.59 ms /    76 runs   (   15.07 ms per token,    66.34 tokens per second) | tid="140411292143616" timestamp=1723125099 id_slot=1 id_task=76 t_token_generation=1145.587 n_decoded=76 t_token=15.073513157894737 n_tokens_second=66.34153495107748
INFO [           print_timings]           total time =    3072.96 ms | tid="140411292143616" timestamp=1723125099 id_slot=1 id_task=76 t_prompt_processing=1927.375 t_token_generation=1145.587 t_total=3072.962
INFO [            update_slots] slot released | tid="140411292143616" timestamp=1723125099 id_slot=1 id_task=76 n_ctx=16128 n_past=3815 n_system_tokens=0 n_cache_tokens=3584 truncated=true
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125099
INFO [      log_server_request] request | tid="140409753817088" timestamp=1723125099 remote_addr="43.153.18.71" remote_port=36190 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125099
INFO [   launch_slot_with_task] slot is processing task | tid="140411292143616" timestamp=1723125099 id_slot=2 id_task=161
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125099 id_slot=2 id_task=161 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125099 id_slot=2 id_task=161 p0=512
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125099 id_slot=2 id_task=161 p0=1024
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125100 id_slot=2 id_task=161 p0=1536
INFO [           print_timings] prompt eval time     =    1117.78 ms /  1543 tokens (    0.72 ms per token,  1380.41 tokens per second) | tid="140411292143616" timestamp=1723125115 id_slot=2 id_task=161 t_prompt_processing=1117.784 n_prompt_tokens_processed=1543 t_token=0.7244225534672716 n_tokens_second=1380.409810840019
INFO [           print_timings] generation eval time =   14567.23 ms /   906 runs   (   16.08 ms per token,    62.19 tokens per second) | tid="140411292143616" timestamp=1723125115 id_slot=2 id_task=161 t_token_generation=14567.234 n_decoded=906 t_token=16.07862472406181 n_tokens_second=62.19437403147364
INFO [           print_timings]           total time =   15685.02 ms | tid="140411292143616" timestamp=1723125115 id_slot=2 id_task=161 t_prompt_processing=1117.784 t_token_generation=14567.234 t_total=15685.018
INFO [            update_slots] slot released | tid="140411292143616" timestamp=1723125115 id_slot=2 id_task=161 n_ctx=16128 n_past=2448 n_system_tokens=0 n_cache_tokens=1536 truncated=false
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125115
INFO [      log_server_request] request | tid="140409762209792" timestamp=1723125115 remote_addr="43.153.18.71" remote_port=36174 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125115
INFO [   launch_slot_with_task] slot is processing task | tid="140411292143616" timestamp=1723125115 id_slot=2 id_task=1072
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125115 id_slot=2 id_task=1072 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125115 id_slot=2 id_task=1072 p0=512
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125115 id_slot=2 id_task=1072 p0=1024
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125115 id_slot=2 id_task=1072 p0=1536
INFO [           print_timings] prompt eval time     =    1318.40 ms /  1781 tokens (    0.74 ms per token,  1350.88 tokens per second) | tid="140411292143616" timestamp=1723125123 id_slot=2 id_task=1072 t_prompt_processing=1318.401 n_prompt_tokens_processed=1781 t_token=0.7402588433464347 n_tokens_second=1350.8788297338974
INFO [           print_timings] generation eval time =    7254.65 ms /   450 runs   (   16.12 ms per token,    62.03 tokens per second) | tid="140411292143616" timestamp=1723125123 id_slot=2 id_task=1072 t_token_generation=7254.651 n_decoded=450 t_token=16.121446666666667 n_tokens_second=62.02917273346437
INFO [           print_timings]           total time =    8573.05 ms | tid="140411292143616" timestamp=1723125123 id_slot=2 id_task=1072 t_prompt_processing=1318.401 t_token_generation=7254.651 t_total=8573.052
INFO [            update_slots] slot released | tid="140411292143616" timestamp=1723125123 id_slot=2 id_task=1072 n_ctx=16128 n_past=2230 n_system_tokens=0 n_cache_tokens=1536 truncated=false
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125123
INFO [      log_server_request] request | tid="140409745424384" timestamp=1723125123 remote_addr="43.153.18.71" remote_port=42030 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125123
INFO [   launch_slot_with_task] slot is processing task | tid="140411292143616" timestamp=1723125123 id_slot=3 id_task=1527
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125123 id_slot=3 id_task=1527 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125124 id_slot=3 id_task=1527 p0=512
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125124 id_slot=3 id_task=1527 p0=1024
INFO [           print_timings] prompt eval time     =    1365.03 ms /  1534 tokens (    0.89 ms per token,  1123.78 tokens per second) | tid="140411292143616" timestamp=1723125133 id_slot=3 id_task=1527 t_prompt_processing=1365.034 n_prompt_tokens_processed=1534 t_token=0.8898526727509779 n_tokens_second=1123.7815321816158
INFO [           print_timings] generation eval time =    7954.98 ms /   471 runs   (   16.89 ms per token,    59.21 tokens per second) | tid="140411292143616" timestamp=1723125133 id_slot=3 id_task=1527 t_token_generation=7954.977 n_decoded=471 t_token=16.889547770700638 n_tokens_second=59.208216441103474
INFO [           print_timings]           total time =    9320.01 ms | tid="140411292143616" timestamp=1723125133 id_slot=3 id_task=1527 t_prompt_processing=1365.034 t_token_generation=7954.977 t_total=9320.011
INFO [            update_slots] slot released | tid="140411292143616" timestamp=1723125133 id_slot=3 id_task=1527 n_ctx=16128 n_past=2004 n_system_tokens=0 n_cache_tokens=1024 truncated=false
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125133
INFO [      log_server_request] request | tid="140409745424384" timestamp=1723125133 remote_addr="43.153.18.71" remote_port=42030 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125133
INFO [   launch_slot_with_task] slot is processing task | tid="140411292143616" timestamp=1723125133 id_slot=0 id_task=2002
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125133 id_slot=0 id_task=2002 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125133 id_slot=0 id_task=2002 p0=512
INFO [           print_timings] prompt eval time     =     907.58 ms /   906 tokens (    1.00 ms per token,   998.26 tokens per second) | tid="140411292143616" timestamp=1723125135 id_slot=0 id_task=2002 t_prompt_processing=907.579 n_prompt_tokens_processed=906 t_token=1.001742825607064 n_tokens_second=998.2602065495125
INFO [           print_timings] generation eval time =    1150.37 ms /    68 runs   (   16.92 ms per token,    59.11 tokens per second) | tid="140411292143616" timestamp=1723125135 id_slot=0 id_task=2002 t_token_generation=1150.367 n_decoded=68 t_token=16.91716176470588 n_tokens_second=59.11157048142028
INFO [           print_timings]           total time =    2057.95 ms | tid="140411292143616" timestamp=1723125135 id_slot=0 id_task=2002 t_prompt_processing=907.579 t_token_generation=1150.367 t_total=2057.946
INFO [            update_slots] slot released | tid="140411292143616" timestamp=1723125135 id_slot=0 id_task=2002 n_ctx=16128 n_past=973 n_system_tokens=0 n_cache_tokens=512 truncated=false
INFO [            update_slots] all slots are idle | tid="140411292143616" timestamp=1723125135
INFO [      log_server_request] request | tid="140409745424384" timestamp=1723125135 remote_addr="43.153.18.71" remote_port=42030 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="140411292143616" timestamp=1723125135 id_slot=1 id_task=2072
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125135 id_slot=1 id_task=2072 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125135 id_slot=1 id_task=2072 p0=512
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125136 id_slot=1 id_task=2072 p0=1024
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125136 id_slot=1 id_task=2072 p0=1536
INFO [            update_slots] kv cache rm [p0, end) | tid="140411292143616" timestamp=1723125137 id_slot=1 id_task=2072 p0=2048
^C^CReceived second interrupt, terminating immediately.
@hzgdeerHo added the bug-unconfirmed and high severity labels on Aug 8, 2024
@ngxson (Collaborator) commented Aug 8, 2024

Can you try with --ctx-size 16384 instead of --ctx-size 16128? (I'm not sure whether it fixes the problem or not.)
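
Applied to the failing command above, only the --ctx-size value would change; the rest of the invocation is unchanged from the report:

CUDA_VISIBLE_DEVICES=0 ./llama-server -m /home/ubuntu/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/1f301d86d760b435a11a56de3863bc0121bfb98f/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf --gpu-layers 33 -cb --parallel 4 --ctx-size 16384 --flash-attn --batch-size 512 --chat-template llama3 --port 8866 --host 0.0.0.0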

@hzgdeerHo (Author)

It does not work with --ctx-size 16384, but if I set --ctx-size 32000 it works. I think this is related to truncation being enabled. How can I disable truncation? THANKS!

@ngxson (Collaborator) commented Aug 8, 2024

I'm not sure what you mean by "truncated process".

Keep in mind that the actual context size per slot is --ctx-size divided by --parallel. For example, with --ctx-size 16384 you get 16384 / 4 = 4096 tokens per slot, so it's normal to need a larger --ctx-size when you set a high value for --parallel.
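
A quick check of this arithmetic against the logs above (a sketch of the calculation, not llama.cpp code):

echo $(( 16128 / 4 ))   # 4032, matches "n_ctx_slot=4032" in the --parallel 4 log
echo $(( 16384 / 4 ))   # 4096, still smaller than the 5756-token prompt seen in the single-slot run
echo $(( 32000 / 4 ))   # 8000, large enough for that prompt, which is why 32000 works

This would also explain the degraded output: the request that processed 5756 prompt tokens in the single-slot run shows n_prompt_tokens_processed=3740 and truncated=true under --parallel 4, so the prompt was silently cut down to fit the 4032-token slot.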

@hzgdeerHo (Author)

THANKS!

@github-actions bot added the stale label on Sep 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
