
Bug: llama-server with --system-prompt-file stops abruptly without any error #8975

Closed
pritam-dey3 opened this issue Aug 10, 2024 · 2 comments · Fixed by #8987
Labels
bug-unconfirmed · high severity (used to report high severity bugs in llama.cpp: malfunctioning hinders an important workflow)

Comments

pritam-dey3 commented Aug 10, 2024

What happened?

Following @ggerganov's advice here, I tried to run the server with the following command:

docker run -v /home/pritam/llm/models:/models -v /home/pritam/lipika/:/lipika -p 8080:8080 ghcr.io/ggerganov/llama.cpp:server -m /models/gemma-2-2b-it-Q4_K_M.gguf --port 8080 --host 0.0.0.0 --system-prompt-file /lipika/system.txt --logdir /lipika/logs --log-enable --log-new

The system.txt file contains around 2.3k tokens. Shortly after the server starts, the process stops abruptly without showing any error. I specified --logdir and enabled logging, but no logs were generated.

This is the full output before the server stops:

docker run -v /home/pritam/llm/models:/models -v /home/pritam/lipika/:/lipika -p 8080:8080 ghcr.io/ggerganov/llama.cpp:server -m /models/gemma-2-2b-it-Q4_K_M.gguf --port 8080 --host 0.0.0.0 --system-prompt-file /lipika/system.txt --logdir /lipika/logs --log-enable --log-new
INFO [                    main] build info | tid="140734278524576" timestamp=1723315495 build=0 commit="unknown"
INFO [                    main] system info | tid="140734278524576" timestamp=1723315495 n_threads=4 n_threads_batch=-1 total_threads=4 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 39 key-value pairs and 288 tensors from /models/gemma-2-2b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2 2b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-2
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                               general.tags arr[str,2]       = ["conversational", "text-generation"]
llama_model_loader: - kv   8:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   9:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv  10:                         gemma2.block_count u32              = 26
llama_model_loader: - kv  11:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  12:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  13:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  14:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  16:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  19:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  20:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  24:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  33:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                      quantize.imatrix.file str              = /models_out/gemma-2-2b-it-GGUF/gemma-...
llama_model_loader: - kv  36:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  37:             quantize.imatrix.entries_count i32              = 182
llama_model_loader: - kv  38:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q4_K:  156 tensors
llama_model_loader: - type q6_K:   27 tensors
llm_load_vocab: special tokens cache size = 249
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 9216
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 2.61 B
llm_load_print_meta: model size       = 1.59 GiB (5.21 BPW)
llm_load_print_meta: general.name     = Gemma 2 2b It
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.13 MiB
llm_load_tensors:        CPU buffer size =  1623.67 MiB
..........................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  =  832.00 MiB, K (f16):  416.00 MiB, V (f16):  416.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     1.95 MiB
llama_new_context_with_model:        CPU compute buffer size =   504.50 MiB
llama_new_context_with_model: graph nodes  = 1050
llama_new_context_with_model: graph splits = 1
INFO [                    init] initializing slots | tid="140734278524576" timestamp=1723315496 n_slots=1
INFO [                    init] new slot | tid="140734278524576" timestamp=1723315496 id_slot=0 n_ctx_slot=8192
INFO [                    main] model loaded | tid="140734278524576" timestamp=1723315496
INFO [                    main] chat template | tid="140734278524576" timestamp=1723315496 chat_example="<start_of_turn>user\nYou are a helpful assistant\n\nHello<end_of_turn>\n<start_of_turn>model\nHi there<end_of_turn>\n<start_of_turn>user\nHow are you?<end_of_turn>\n<start_of_turn>model\n" built_in=true
INFO [                    main] HTTP server listening | tid="140734278524576" timestamp=1723315496 n_threads_http="3" port="8080" hostname="0.0.0.0"
VERB [              start_loop] new task may arrive | tid="140732260343456" timestamp=1723367556
VERB [              start_loop] update_multitasks | tid="140732260343456" timestamp=1723367556
VERB [              start_loop] callback_update_slots | tid="140732260343456" timestamp=1723367556
VERB [    system_prompt_update] system prompt update | tid="140732260343456" timestamp=1723367556 system_prompt="You are..."
VERB [          kv_cache_clear] clearing KV cache | tid="140732260343456" timestamp=1723367556

If I don't provide the system-prompt file, the server runs just fine.

Name and Version

I pulled the most recent ghcr.io/ggerganov/llama.cpp:server image and tested it on my laptop and on a Raspberry Pi 5. In both cases the server stopped after about 30 seconds when system.txt was provided.

Windows laptop:

version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Raspberry Pi 5:

version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

What operating system are you seeing the problem on?

Windows, Other? (Please let us know in description)

Relevant log output

No response

pritam-dey3 added the bug-unconfirmed and high severity labels on Aug 10, 2024

pritam-dey3 commented Aug 11, 2024

I tried to do some debugging, and it seems this line is causing a segfault (Segmentation fault (core dumped)).
I'm not experienced enough in C++ to figure out why.

Edit: the default maximum batch size is 2048, whereas my system.txt is 2330 tokens long, which caused the segmentation fault.
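
To illustrate what I think is happening (this is only a rough sketch in my own words, not the actual server code; the function and variable names below are made up), using the llama.h batch API:

// Rough illustration of the crash pattern, NOT the real llama.cpp server code.
// eval_system_prompt_unsafe and tokens_system are made-up names for illustration.
#include "llama.h"
#include <vector>

static void eval_system_prompt_unsafe(llama_context * ctx,
                                      const std::vector<llama_token> & tokens_system,
                                      int n_batch) {
    // the batch only has room for n_batch (2048 by default) tokens ...
    llama_batch batch = llama_batch_init(n_batch, /*embd*/ 0, /*n_seq_max*/ 1);

    // ... but this loop runs once per system-prompt token (2330 in my case),
    // so once i >= n_batch it writes past the end of batch.token / batch.pos
    for (int i = 0; i < (int) tokens_system.size(); ++i) {
        batch.token[i]     = tokens_system[i];
        batch.pos[i]       = i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = false;
    }
    batch.n_tokens = (int) tokens_system.size();

    llama_decode(ctx, batch);
    llama_batch_free(batch);
}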


pritam-dey3 commented Aug 11, 2024

@compilade can you please tell me why n_tokens is set to n_batch in this line?

In this case, since I provided a system prompt longer than n_batch tokens, it caused a segmentation fault.
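
For illustration, here is how I imagine the overflow could be avoided by feeding the prompt in n_batch-sized chunks (again just a sketch with made-up names, not a claim about what the actual fix does):

// Sketch only: evaluate the prompt in chunks of at most n_batch tokens so a
// single llama_batch never holds more tokens than it was allocated for.
#include "llama.h"
#include <algorithm>
#include <vector>

static int eval_system_prompt_chunked(llama_context * ctx,
                                      const std::vector<llama_token> & tokens_system,
                                      int n_batch) {
    llama_batch batch = llama_batch_init(n_batch, /*embd*/ 0, /*n_seq_max*/ 1);

    for (int start = 0; start < (int) tokens_system.size(); start += n_batch) {
        const int n_eval = std::min(n_batch, (int) tokens_system.size() - start);

        batch.n_tokens = n_eval;
        for (int i = 0; i < n_eval; ++i) {
            batch.token[i]     = tokens_system[start + i];
            batch.pos[i]       = start + i;   // absolute position within the prompt
            batch.n_seq_id[i]  = 1;
            batch.seq_id[i][0] = 0;
            batch.logits[i]    = false;       // no logits needed while prefilling
        }

        if (llama_decode(ctx, batch) != 0) {
            llama_batch_free(batch);
            return -1;                        // decode failed
        }
    }

    llama_batch_free(batch);
    return 0;
}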
