Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
llama.cpp was cloned and compiled on 11/25/2023
main: build = 1550 (8e672ef)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.1.0
Running pip list shows llama_cpp_python == 0.2.19.
- I carefully followed the README.md.
I installed it using the following commands to (I believe) ensure an arm64 target; a quick sanity check is sketched after this checklist:
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U git+https://github.com/abetlen/llama-cpp-python.git --no-cache-dir
and
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U --no-cache-dir 'llama-cpp-python[server]'
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
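A minimal sanity check (not something I originally ran, and assuming llama_cpp exposes __version__ in 0.2.19) to confirm the interpreter and the installed wheel are what I expect:

```python
# Sanity-check sketch: confirm the interpreter is arm64 (not x86_64 under
# Rosetta) and that the 0.2.19 wheel is the one actually being imported.
import platform
import llama_cpp  # assumes llama_cpp.__version__ is available in 0.2.19

print(platform.machine())     # expected: "arm64"
print(llama_cpp.__version__)  # expected: "0.2.19", matching `pip list`
```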
Expected Behavior
Running the server module with command-line options similar to those used with llama.cpp's main executable should result in similar startup behavior.
Current Behavior
This will crash:
export HOST=0.0.0.0 && python -m llama_cpp.server --model wizardcoder-python-34b-v1.0.Q4_K_M.gguf --n_gpu_layers 1 --n_ctx 16384
because it appears to fail while allocating the Metal data buffer:
llm_load_tensors: mem required = 19282.65 MiB
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MiB
ggml_metal_add_buffer: error: failed to allocate 'data ' buffer, size = 19283.21 MiB
Meanwhile, this works fine:
./main -m wizardcoder-python-34b-v1.0.Q4_K_M.gguf -ngl 1 --ctx_size 16384
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MiB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 2131.07 MiB
llama_new_context_with_model: max tensor size = 205.08 MiB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 19283.22 MiB, (19283.84 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 3072.02 MiB, (22355.86 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 2128.02 MiB, (24483.88 / 49152.00)
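The same failure can be reproduced without the server by constructing Llama directly with the equivalent options. This is only a sketch of what I believe the server does internally, assuming the model file sits in the working directory:

```python
# Repro sketch: constructing Llama directly with the same options the server
# receives should hit the same Metal 'data' buffer allocation failure and
# end in the AssertionError shown in the traceback below.
from llama_cpp import Llama

llm = Llama(
    model_path="wizardcoder-python-34b-v1.0.Q4_K_M.gguf",
    n_gpu_layers=1,
    n_ctx=16384,
)
```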
Environment and Context
Mac with an M1 Max processor and 64 GB of RAM, running macOS Sonoma (14.1.1)
Using a Conda-created virtual environment (llama.cpp)
$ python3 --version
Python 3.10.13
$ make --version
GNU Make 3.81
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
This program built for i386-apple-darwin11.3.0
$ g++ --version
Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: arm64-apple-darwin23.1.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Failure Information (for bugs)
Failed execution using the server module:
$ export HOST=0.0.0.0 && python -m llama_cpp.server --model wizardcoder-python-34b-v1.0.Q4_K_M.gguf --n_gpu_layers 1 --n_ctx 16384
llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from wizardcoder-python-34b-v1.0.Q4_K_M.gguf (version GGUF V2)
...
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = wizardlm_wizardcoder-python-34b-v1.0
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 48
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 22016
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32001] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32001] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32001] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type q4_K: 289 tensors
llama_model_loader: - type q6_K: 49 tensors
llm_load_vocab: special tokens definition check successful ( 260/32001 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32001
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 22016
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 34B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 33.74 B
llm_load_print_meta: model size = 18.83 GiB (4.79 BPW)
llm_load_print_meta: general.name = wizardlm_wizardcoder-python-34b-v1.0
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.16 MiB
llm_load_tensors: mem required = 19282.65 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 3072.00 MiB
llama_build_graph: non-view tensors processed: 1108/1108
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MiB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 2131.07 MiB
llama_new_context_with_model: max tensor size = 205.08 MiB
ggml_metal_add_buffer: error: failed to allocate 'data ' buffer, size = 19283.21 MiB
llama_new_context_with_model: failed to add buffer
ggml_metal_free: deallocating
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Traceback (most recent call last):
File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/server/__main__.py", line 96, in <module>
app = create_app(settings=settings)
File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/server/app.py", line 380, in create_app
llama = llama_cpp.Llama(
File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/llama.py", line 924, in __init__
self._n_ctx = self.n_ctx()
File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/llama.py", line 2176, in n_ctx
return self._ctx.n_ctx()
File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/llama.py", line 428, in n_ctx
assert self.ctx is not None
AssertionError
Working execution using main:
$ ./main -m wizardcoder-python-34b-v1.0.Q4_K_M.gguf -ngl 1 --ctx_size 16384
Log start
main: build = 1550 (8e672ef)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.1.0
main: seed = 1700951976
llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from wizardcoder-python-34b-v1.0.Q4_K_M.gguf (version GGUF V2)
...
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = wizardlm_wizardcoder-python-34b-v1.0
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 8192
llama_model_loader: - kv 4: llama.block_count u32 = 48
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 22016
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 64
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32001] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32001] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32001] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type q4_K: 289 tensors
llama_model_loader: - type q6_K: 49 tensors
llm_load_vocab: special tokens definition check successful ( 260/32001 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32001
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 22016
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 34B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 33.74 B
llm_load_print_meta: model size = 18.83 GiB (4.79 BPW)
llm_load_print_meta: general.name = wizardlm_wizardcoder-python-34b-v1.0
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.16 MiB
llm_load_tensors: mem required = 19282.65 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 3072.00 MiB
llama_build_graph: non-view tensors processed: 1108/1108
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/Ian/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MiB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 2131.07 MiB
llama_new_context_with_model: max tensor size = 205.08 MiB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 19283.22 MiB, (19283.84 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 3072.02 MiB, (22355.86 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 2128.02 MiB, (24483.88 / 49152.00)
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 16384, n_batch = 512, n_predict = -1, n_keep = 0
#!/usr/bin/python3
Doing a side-by-side comparison of the two outputs, all of the parameters appear to match exactly except for the location of the ggml-metal.metal file. I ran diff -b against the two files and they are identical.
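For reference, a Python sketch of the same whitespace-insensitive comparison (equivalent in spirit to diff -b), using the two paths from the logs above:

```python
# Comparison sketch: collapse whitespace per line so only non-whitespace
# differences between the two ggml-metal.metal files would be reported.
import re

PATHS = [
    "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal",
    "/Users/Ian/github/llama.cpp/ggml-metal.metal",
]

def normalized_lines(path):
    with open(path) as f:
        return [re.sub(r"\s+", " ", line).strip() for line in f]

print(normalized_lines(PATHS[0]) == normalized_lines(PATHS[1]))  # True -> identical, as observed
```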
I had to truncate the output of both runs because of GitHub's 65,536-character limit.