Bug: Kompute exits before loading model when offloading to GPU #8932

Open
mi4code opened this issue Aug 8, 2024 · 4 comments
Labels
bug-unconfirmed, medium severity

Comments

mi4code commented Aug 8, 2024

What happened?

I wanted to use the Kompute build to run on my GPU (Radeon RX 570 4 GB), but whenever I use the -ngl argument to offload to the GPU, llama-cli silently exits before loading the model. When I ran the exact same command in an MSYS2 MinGW environment, I got the same result (same log output) plus a "Segmentation fault" message, so I assume that's what is happening.

The same model runs fine on my GPU with GPT4All (which, as far as I understand, uses the same backend).
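In case it helps with triage, this is roughly how I would try to capture a backtrace for the silent exit under MSYS2 (a minimal sketch, assuming gdb is installed via pacman; the MSVC build may not carry debug symbols gdb understands, so a debug CMake build might be needed):

gdb --args ./llama-cli -m ../llama.cpp/models/Phi-3-mini-4k-instruct.Q4_0.gguf -ngl 10 -p "test"
(gdb) run
(gdb) bt    # print the backtrace once the segmentation fault is reported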

Output of vulkaninfo --summary:

==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.204


Instance Extensions: count = 11
-------------------------------
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_win32_surface                   : extension revision 6

Instance Layers: count = 3
--------------------------
VK_LAYER_AMD_switchable_graphics AMD switchable graphics layer 1.3.217  version 1
VK_LAYER_VALVE_steam_fossilize   Steam Pipeline Caching Layer  1.2.136  version 1
VK_LAYER_VALVE_steam_overlay     Steam Overlay Layer           1.2.136  version 1

Devices:
========
GPU0:
        apiVersion         = 4206809 (1.3.217)
        driverVersion      = 8388841 (0x8000e9)
        vendorID           = 0x1002
        deviceID           = 0x67df
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = Radeon RX 570 Series
        driverID           = DRIVER_ID_AMD_PROPRIETARY
        driverName         = AMD proprietary driver
        driverInfo         = 22.20.27.09
        conformanceVersion = 1.3.0.0
        deviceUUID         = 00000000-0600-0000-0000-000000000000
        driverUUID         = 414d442d-5749-4e2d-4452-560000000000

Name and Version

llama-cli --version
version: 3547 (e44a561)
built with MSVC 19.29.30154.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

[1723122079] Log start
[1723122079] Cmd: llama-cli -m ../llama.cpp/models/Phi-3-mini-4k-instruct.Q4_0.gguf -n 512 --color -ngl 10 -i --interactive-first --in-prefix <|im_end|>\n<|im_start|>user\n --in-suffix <|im_end|>\n<|im_start|>assistant\n --reverse-prompt <|im_end|> -p "<|im_start|>system\nYou are a helpful assistant<|im_end|>"
[1723122079] main: build = 3547 (e44a561a)
[1723122079] main: built with MSVC 19.29.30154.0 for x64
[1723122079] main: seed  = 1723122079
[1723122079] main: llama backend init
[1723122079] main: load the model and apply lora adapter, if any
[1723122079] llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from ../llama.cpp/models/Phi-3-mini-4k-instruct.Q4_0.gguf (version GGUF V3 (latest))
[1723122079] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[1723122079] llama_model_loader: - kv   0:                       general.architecture str              = llama
[1723122079] llama_model_loader: - kv   1:                               general.name str              = phi3
[1723122079] llama_model_loader: - kv   2:                          llama.block_count u32              = 32
[1723122079] llama_model_loader: - kv   3:                       llama.context_length u32              = 4096
[1723122079] llama_model_loader: - kv   4:                     llama.embedding_length u32              = 3072
[1723122079] llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 8192
[1723122079] llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
[1723122079] llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 32
[1723122079] llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 10000.000000
[1723122079] llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
[1723122079] llama_model_loader: - kv  10:                          general.file_type u32              = 2
[1723122079] llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32064
[1723122079] llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 96
[1723122079] llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
[1723122079] llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
[1723122079] llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32064]   = [0.000000, 0.000000, 0.000000, 0.0000...
[1723122079] llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32064]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
[1723122079] llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
[1723122079] llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 32007
[1723122079] llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
[1723122079] llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 32000
[1723122079] llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
[1723122079] llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
[1723122079] llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
[1723122079] llama_model_loader: - kv  24:               general.quantization_version u32              = 2
[1723122079] llama_model_loader: - type  f32:   65 tensors
[1723122079] llama_model_loader: - type q4_0:  225 tensors
[1723122079] llama_model_loader: - type q6_K:    1 tensors
[1723122079] llm_load_vocab: special tokens cache size = 67
[1723122079] llm_load_vocab: token to piece cache size = 0.1691 MB
[1723122079] llm_load_print_meta: format           = GGUF V3 (latest)
[1723122079] llm_load_print_meta: arch             = llama
[1723122079] llm_load_print_meta: vocab type       = SPM
[1723122079] llm_load_print_meta: n_vocab          = 32064
[1723122079] llm_load_print_meta: n_merges         = 0
[1723122079] llm_load_print_meta: vocab_only       = 0
[1723122079] llm_load_print_meta: n_ctx_train      = 4096
[1723122079] llm_load_print_meta: n_embd           = 3072
[1723122079] llm_load_print_meta: n_layer          = 32
[1723122079] llm_load_print_meta: n_head           = 32
[1723122079] llm_load_print_meta: n_head_kv        = 32
[1723122079] llm_load_print_meta: n_rot            = 96
[1723122079] llm_load_print_meta: n_swa            = 0
[1723122079] llm_load_print_meta: n_embd_head_k    = 96
[1723122079] llm_load_print_meta: n_embd_head_v    = 96
[1723122079] llm_load_print_meta: n_gqa            = 1
[1723122079] llm_load_print_meta: n_embd_k_gqa     = 3072
[1723122079] llm_load_print_meta: n_embd_v_gqa     = 3072
[1723122079] llm_load_print_meta: f_norm_eps       = 0.0e+00
[1723122079] llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
[1723122079] llm_load_print_meta: f_clamp_kqv      = 0.0e+00
[1723122079] llm_load_print_meta: f_max_alibi_bias = 0.0e+00
[1723122079] llm_load_print_meta: f_logit_scale    = 0.0e+00
[1723122079] llm_load_print_meta: n_ff             = 8192
[1723122079] llm_load_print_meta: n_expert         = 0
[1723122079] llm_load_print_meta: n_expert_used    = 0
[1723122079] llm_load_print_meta: causal attn      = 1
[1723122079] llm_load_print_meta: pooling type     = 0
[1723122079] llm_load_print_meta: rope type        = 0
[1723122079] llm_load_print_meta: rope scaling     = linear
[1723122079] llm_load_print_meta: freq_base_train  = 10000.0
[1723122079] llm_load_print_meta: freq_scale_train = 1
[1723122079] llm_load_print_meta: n_ctx_orig_yarn  = 4096
[1723122079] llm_load_print_meta: rope_finetuned   = unknown
[1723122079] llm_load_print_meta: ssm_d_conv       = 0
[1723122079] llm_load_print_meta: ssm_d_inner      = 0
[1723122079] llm_load_print_meta: ssm_d_state      = 0
[1723122079] llm_load_print_meta: ssm_dt_rank      = 0
[1723122079] llm_load_print_meta: model type       = 7B
[1723122079] llm_load_print_meta: model ftype      = Q4_0
[1723122079] llm_load_print_meta: model params     = 3.82 B
[1723122079] llm_load_print_meta: model size       = 2.03 GiB (4.55 BPW) 
[1723122079] llm_load_print_meta: general.name     = phi3
[1723122079] llm_load_print_meta: BOS token        = 1 '<s>'
[1723122079] llm_load_print_meta: EOS token        = 32007 '<|end|>'
[1723122079] llm_load_print_meta: UNK token        = 0 '<unk>'
[1723122079] llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
[1723122079] llm_load_print_meta: LF token         = 13 '<0x0A>'
[1723122079] llm_load_print_meta: EOT token        = 32007 '<|end|>'
[1723122079] llm_load_print_meta: max token length = 48
[1723122079] llama_default_buffer_type_offload: cannot use GPU 0, check `vulkaninfo --summary`
[1723122079] llama_default_buffer_type_offload: cannot use GPU 0, check `vulkaninfo --summary`
[1723122079] llama_default_buffer_type_offload: cannot use GPU 0, check `vulkaninfo --summary`
[1723122079] llama_default_buffer_type_offload: cannot use GPU 0, check `vulkaninfo --summary`
[1723122079] llama_default_buffer_type_offload: cannot use GPU 0, check `vulkaninfo --summary`
[1723122079] llama_default_buffer_type_offload: cannot use GPU 0, check `vulkaninfo --summary`
[1723122079] llama_default_buffer_type_offload: cannot use GPU 0, check `vulkaninfo --summary`
[1723122079] llama_default_buffer_type_offload: cannot use GPU 0, check `vulkaninfo --summary`
[1723122079] llama_default_buffer_type_offload: cannot use GPU 0, check `vulkaninfo --summary`
[1723122079] llama_default_buffer_type_offload: cannot use GPU 0, check `vulkaninfo --summary`
[1723122079] llm_load_tensors: ggml ctx size =    0.14 MiB
[1723122079] llm_load_tensors: offloading 10 repeating layers to GPU
[1723122079] llm_load_tensors: offloaded 10/33 layers to GPU
[1723122079] llm_load_tensors:        CPU buffer size =  2074.66 MiB
[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] .[1723122079] 
[1723122079] llama_new_context_with_model: n_ctx      = 4096
[1723122079] llama_new_context_with_model: n_batch    = 2048
[1723122079] llama_new_context_with_model: n_ubatch   = 512
[1723122079] llama_new_context_with_model: flash_attn = 0
[1723122079] llama_new_context_with_model: freq_base  = 10000.0
[1723122079] llama_new_context_with_model: freq_scale = 1
[1723122079] llama_kv_cache_init:        CPU KV buffer size =  1536.00 MiB
[1723122079] llama_new_context_with_model: KV self size  = 1536.00 MiB, K (f16):  768.00 MiB, V (f16):  768.00 MiB
[1723122079] llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
mi4code added the bug-unconfirmed and medium severity labels on Aug 8, 2024

bdj34 commented Aug 8, 2024

I get the same type of error (sometimes a bus error, sometimes a segfault) when offloading to the GPU on an M2 Max. It doesn't happen every time, though, and it doesn't happen with CPU-only inference.
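If it's useful, this is roughly how I'd grab a backtrace on macOS when it does crash (a sketch, assuming the Xcode command line tools are installed; the model path and -ngl value are placeholders):

lldb -- ./llama-cli -m models/model.gguf -ngl 999 -p "test"
(lldb) run
(lldb) bt    # backtrace once the bus error / segfault is reported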


mi4code commented Sep 7, 2024

For anyone having the same issue:
Go to Nomic AI's fork at https://github.com/nomic-ai/llama.cpp and download the Kompute build from the releases there.
(It worked perfectly for me.)
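If you'd rather build the fork yourself than download the release, something like this should work (a sketch only; the CMake option name for the Kompute backend is an assumption on my part, so check the fork's README/CMakeLists.txt for the exact flag):

git clone --recurse-submodules https://github.com/nomic-ai/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_KOMPUTE=ON     # flag name assumed; verify in CMakeLists.txt
cmake --build build --config Release
./build/bin/llama-cli -m path/to/model.gguf -ngl 10 -p "hello"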

github-actions bot added the stale label on Oct 8, 2024
dermotfix commented

I get the same vulkaninfo message with nomic-ai's fork (latest version) on Android 12 with a Vulkan-compliant GPU.

github-actions bot removed the stale label on Oct 13, 2024
dermotfix commented

$ /data/local/tmp/llama-cli -m /sdcard/models/gg.gguf -ngl 999 --prompt "she once told me in bed"
build: 3868 (58a55ef) with clang version 19.1.1 for armv7a-unknown-linux-android24
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 21 key-value pairs and 201 tensors from /sdcard/models/gg.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = models
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q4_0: 155 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 606.53 MiB (4.63 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
[Oct 11 2024 18:46:29] [debug] [Manager.cpp:157] Kompute Manager creating instance
[Oct 11 2024 18:46:29] [debug] [Manager.cpp:260] Kompute Manager Instance Created
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llama_default_buffer_type_offload: cannot use GPU 0, check vulkaninfo --summary
llm_load_tensors: ggml ctx size = 0.07 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: CPU buffer size = 606.53 MiB
.....................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 44.00 MiB
llama_new_context_with_model: KV self size = 44.00 MiB, K (f16): 22.00 MiB, V (f16): 22.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 148.01 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 1027837595
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = -1, n_keep = 1

she once told me in bed, and I would not dream of sleeping with anyone else, least of all an ex.

I had
llama_perf_sampler_print: sampling time = 4.85 ms / 30 runs ( 0.16 ms per token, 6189.40 tokens per second)
llama_perf_context_print: load time = 3473.57 ms
llama_perf_context_print: prompt eval time = 4282.63 ms / 7 tokens ( 611.80 ms per token, 1.63 tokens per second)
llama_perf_context_print: eval time = 15772.93 ms / 22 runs ( 716.95 ms per token, 1.39 tokens per second)
llama_perf_context_print: total time = 20663.09 ms / 29 tokens

(This is at debug log level.)
I get the same output in Termux and in an adb shell.
