
Different GPU buffer behavior using llama-cpp-python[server] vs llama-cpp #948

Open
@zestysoft

Description


Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
    llama.cpp was cloned and compiled on 11/25/2023
main: build = 1550 (8e672ef)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.1.0

pip list shows llama_cpp_python == 0.2.19

  • I carefully followed the README.md.
    I installed it using these commands, intending to build for arm64 targets (I'm not 100% sure it did; see the architecture check sketched after this list):
    CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U git+https://github.com/abetlen/llama-cpp-python.git --no-cache-dir
    and
    CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U --no-cache-dir 'llama-cpp-python[server]'
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.
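
To double-check that the install actually produced an arm64/Metal build (the uncertainty noted in the install step above), something like the following should print the architectures of the compiled library shipped inside the installed package. This is only a sketch: it assumes macOS's lipo tool is available and that the native library sits in the llama_cpp package directory with a .dylib or .so suffix, which may differ between versions.

# Sketch: print the architectures of the compiled llama library inside the
# installed llama-cpp-python wheel. The library filename is found by glob
# because it may differ between versions.
import glob, os, subprocess
import llama_cpp

pkg_dir = os.path.dirname(llama_cpp.__file__)
libs = glob.glob(os.path.join(pkg_dir, "*.dylib")) + glob.glob(os.path.join(pkg_dir, "*.so"))
for lib in libs:
    result = subprocess.run(["lipo", "-archs", lib], capture_output=True, text=True)
    print(lib, "->", result.stdout.strip() or result.stderr.strip())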

Expected Behavior

Running the server module with command-line options equivalent to those passed to llama.cpp's main executable should result in similar startup behavior.

Current Behavior

This will crash:
export HOST=0.0.0.0 && python -m llama_cpp.server --model wizardcoder-python-34b-v1.0.Q4_K_M.gguf --n_gpu_layers 1 --n_ctx 16384

because it seems to have trouble allocating memory for the buffer:

llm_load_tensors: mem required  = 19282.65 MiB
ggml_metal_init: recommendedMaxWorkingSetSize  = 49152.00 MiB
ggml_metal_add_buffer: error: failed to allocate 'data            ' buffer, size = 19283.21 MiB

Meanwhile, this works fine:
./main -m wizardcoder-python-34b-v1.0.Q4_K_M.gguf -ngl 1 --ctx_size 16384

ggml_metal_init: recommendedMaxWorkingSetSize  = 49152.00 MiB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 2131.07 MiB
llama_new_context_with_model: max tensor size =   205.08 MiB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 19283.22 MiB, (19283.84 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  3072.02 MiB, (22355.86 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =  2128.02 MiB, (24483.88 / 49152.00)
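
To help isolate whether the difference comes from the llama_cpp.server layer or from the binding itself, a minimal direct-binding load can be tried with the same settings as the failing server command. This is only a sketch; the keyword arguments (model_path, n_gpu_layers, n_ctx) are the ones the llama_cpp.Llama constructor accepts in 0.2.19:

# Sketch: load the same model through the binding directly, bypassing
# llama_cpp.server, with the same settings as the failing server invocation.
from llama_cpp import Llama

llm = Llama(
    model_path="wizardcoder-python-34b-v1.0.Q4_K_M.gguf",
    n_gpu_layers=1,
    n_ctx=16384,
)
print(llm.n_ctx())

If this fails with the same ggml_metal_add_buffer error, the problem is in the binding rather than in the server's settings handling.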

Environment and Context

Mac with an M1 Max processor and 64 GB of RAM, running macOS Sonoma (14.1.1)
Using a Conda-created virtual environment (llama.cpp)

$ python3 --version
Python 3.10.13
$ make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This program built for i386-apple-darwin11.3.0
$ g++ --version
Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: arm64-apple-darwin23.1.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Failure Information (for bugs)

Failed execution using the server module:

$ export HOST=0.0.0.0 && python -m llama_cpp.server --model wizardcoder-python-34b-v1.0.Q4_K_M.gguf --n_gpu_layers 1 --n_ctx 16384

llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from wizardcoder-python-34b-v1.0.Q4_K_M.gguf (version GGUF V2)
...
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = wizardlm_wizardcoder-python-34b-v1.0
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 22016
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32001]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32001]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32001]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 260/32001 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32001
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 22016
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = mostly Q4_K - Medium
llm_load_print_meta: model params     = 33.74 B
llm_load_print_meta: model size       = 18.83 GiB (4.79 BPW) 
llm_load_print_meta: general.name   = wizardlm_wizardcoder-python-34b-v1.0
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.16 MiB
llm_load_tensors: mem required  = 19282.65 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 3072.00 MiB
llama_build_graph: non-view tensors processed: 1108/1108
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 49152.00 MiB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 2131.07 MiB
llama_new_context_with_model: max tensor size =   205.08 MiB
ggml_metal_add_buffer: error: failed to allocate 'data            ' buffer, size = 19283.21 MiB
llama_new_context_with_model: failed to add buffer
ggml_metal_free: deallocating
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
Traceback (most recent call last):
  File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/server/__main__.py", line 96, in <module>
    app = create_app(settings=settings)
  File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/server/app.py", line 380, in create_app
    llama = llama_cpp.Llama(
  File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/llama.py", line 924, in __init__
    self._n_ctx = self.n_ctx()
  File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/llama.py", line 2176, in n_ctx
    return self._ctx.n_ctx()
  File "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/llama.py", line 428, in n_ctx
    assert self.ctx is not None
AssertionError

Working execution using main:

$ ./main -m wizardcoder-python-34b-v1.0.Q4_K_M.gguf -ngl 1 --ctx_size 16384

Log start
main: build = 1550 (8e672ef)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.1.0
main: seed  = 1700951976
llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from wizardcoder-python-34b-v1.0.Q4_K_M.gguf (version GGUF V2)
...
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = wizardlm_wizardcoder-python-34b-v1.0
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 22016
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32001]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32001]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32001]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 260/32001 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32001
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 22016
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = mostly Q4_K - Medium
llm_load_print_meta: model params     = 33.74 B
llm_load_print_meta: model size       = 18.83 GiB (4.79 BPW)
llm_load_print_meta: general.name   = wizardlm_wizardcoder-python-34b-v1.0
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.16 MiB
llm_load_tensors: mem required  = 19282.65 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 3072.00 MiB
llama_build_graph: non-view tensors processed: 1108/1108
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/Ian/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 49152.00 MiB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 2131.07 MiB
llama_new_context_with_model: max tensor size =   205.08 MiB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 19283.22 MiB, (19283.84 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  3072.02 MiB, (22355.86 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =  2128.02 MiB, (24483.88 / 49152.00)

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 16384, n_batch = 512, n_predict = -1, n_keep = 0


 #!/usr/bin/python3

Doing a side-by-side comparison of the two outputs, all of the parameters appear to match exactly except for the location of the ggml-metal.metal file. I ran diff -b against the two files and they are identical.
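
For reference, the whitespace-insensitive comparison of the two ggml-metal.metal files can be reproduced with something like the sketch below; the two paths are the ones printed in the logs above, and the normalization roughly mirrors what diff -b does.

# Sketch: compare the two ggml-metal.metal files from the logs, ignoring
# whitespace differences (roughly equivalent to `diff -b`).
import re

def normalized_lines(path):
    with open(path) as f:
        return [re.sub(r"\s+", " ", line).strip() for line in f]

a = "/Users/Ian/opt/miniconda3/envs/llama.cpp/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal"
b = "/Users/Ian/github/llama.cpp/ggml-metal.metal"
print("identical ignoring whitespace:", normalized_lines(a) == normalized_lines(b))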

I had to truncate the output of both executions due to reaching GitHub's 65,536-character limit.

Labels: bug, build
