Description
Summary:
When testing the latest version of llama-cpp-python (0.1.64) alongside the corresponding commit of llama.cpp, I observed that llama.cpp performs significantly faster than llama-cpp-python in terms of total time taken to execute. Additionally, GPU utilization is consistently higher for llama.cpp compared to llama-cpp-python.
Environment:
- Processor: AMD Ryzen 5 5600
- GPU: NVIDIA 4090 (Single)
- OS: Ubuntu 22.04
- Python: 3.10.9
- GNU Make: 4.3
- Compiler: g++ 11.3.0
Background
First, I updated the textgen-webui requirement to include the latest version of llama-cpp-python (0.1.64) manually. After installing the update, I ran tests and saw that the speed improved, but it was still much slower than llama.cpp.
To focus on llama-cpp-python's role, I wrote code to test llama-cpp-python separately.
Steps to Reproduce:
llama-cpp-python
- Reinstall llama-cpp-python 0.1.64 with cuBLAS support:
pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.64 --no-cache-dir
conda list llama-cpp-python
Make sure the version is 0.1.64.
- Write a test.py file using the following code, change the model path to any GGML model on your local machine, then run python test.py.
import sys
from llama_cpp import Llama
params = {
    'model_path': "/home/wsy/Projects/text-generation-webui/models/guanaco-33B.ggmlv3.q4_K_M.bin",
    'n_ctx': 1024,
    'seed': 4,
    'n_threads': 1,
    'n_batch': 256,
    'n_gpu_layers': 128,  # offload all layers to the GPU
}
llm = Llama(**params)
# Stream the completion token by token
stream = llm(
    "Write an essay about american history",
    max_tokens=1000,
    stream=True,
)
for output in stream:
    print(output['choices'][0]['text'], end='')
    sys.stdout.flush()
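For reference, the generation loop can also be wrapped in a wall-clock measurement and compared against the numbers printed by llama_print_timings; the gap between the two should make the missing time visible. A minimal sketch (same parameters as test.py, stream consumed without printing so terminal I/O is excluded):
import time
from llama_cpp import Llama

params = {
    'model_path': "/home/wsy/Projects/text-generation-webui/models/guanaco-33B.ggmlv3.q4_K_M.bin",
    'n_ctx': 1024,
    'seed': 4,
    'n_threads': 1,
    'n_batch': 256,
    'n_gpu_layers': 128,
}
llm = Llama(**params)

t0 = time.perf_counter()
# Consume the stream without printing so terminal I/O does not skew the result
for _ in llm("Write an essay about american history", max_tokens=1000, stream=True):
    pass
t1 = time.perf_counter()

print(f"wall-clock generation time: {(t1 - t0) * 1000:.2f} ms")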
llama.cpp
- Go to the llama.cpp folder, then build at the matching commit:
git pull
git checkout 8596af427722775f0df4a7c90b9af067ba90d4ef
make clean
make LLAMA_CUBLAS=1
- Run llama.cpp with the exact same parameters using the following command:
./main -m ../models/guanaco-33B.ggmlv3.q4_K_M.bin -p "Write an essay about american history" -ngl 128 -s 4 -n 1000 -t 1 --ctx-size 1024 --batch-size 256
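To script the side-by-side comparison, something along these lines can run the llama.cpp binary with the same parameters and pull the total time out of the timing report (a rough sketch; the regex just matches the llama_print_timings line shown under Actual Outcome below):
import re
import subprocess

cmd = [
    "./main", "-m", "../models/guanaco-33B.ggmlv3.q4_K_M.bin",
    "-p", "Write an essay about american history",
    "-ngl", "128", "-s", "4", "-n", "1000", "-t", "1",
    "--ctx-size", "1024", "--batch-size", "256",
]
result = subprocess.run(cmd, capture_output=True, text=True)

# The timing report may land on stdout or stderr depending on the build,
# so search both for the "total time" line.
match = re.search(r"total time\s*=\s*([\d.]+) ms", result.stdout + result.stderr)
if match:
    print(f"llama.cpp total time: {match.group(1)} ms")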
Expected Outcome:
Similar performance and GPU utilization between llama-cpp-python and llama.cpp.
Actual Outcome:
Output of llama-cpp-python:
llama_print_timings: load time = 450.16 ms
llama_print_timings: sample time = 412.64 ms / 1000 runs ( 0.41 ms per token)
llama_print_timings: prompt eval time = 450.12 ms / 9 tokens ( 50.01 ms per token)
llama_print_timings: eval time = 30622.88 ms / 999 runs ( 30.65 ms per token)
llama_print_timings: total time = 39541.67 ms
Output of llama.cpp:
llama_print_timings: load time = 2480.53 ms
llama_print_timings: sample time = 426.18 ms / 1000 runs ( 0.43 ms per token)
llama_print_timings: prompt eval time = 447.96 ms / 9 tokens ( 49.77 ms per token)
llama_print_timings: eval time = 29871.26 ms / 999 runs ( 29.90 ms per token)
llama_print_timings: total time = 30938.72 ms
- llama.cpp's total execution time was almost 9 seconds shorter than llama-cpp-python's (about 28% faster).
- GPU utilization stayed constant at around 93% for llama.cpp, while for llama-cpp-python it started at around 80% and gradually dropped below 60%, which might be indicative of the performance discrepancy (a rough way to log utilization over a run is sketched after this list).
- In llama-cpp-python, the total time is significantly larger than the sum of sample time + prompt eval time + eval time (39541.67 ms vs. 412.64 + 450.12 + 30622.88 ≈ 31485.64 ms, leaving roughly 8 seconds unaccounted for). In llama.cpp, the total time matches that sum closely (30938.72 ms vs. ≈ 30745.40 ms).
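For the utilization numbers, one simple way to log them during a run is to poll nvidia-smi from a background thread. A rough sketch (single-GPU setup assumed; the generation loop from test.py goes where the placeholder comment is):
import subprocess
import threading
import time

samples = []
stop = threading.Event()

def poll_gpu_util(interval=1.0):
    # Record overall GPU utilization (%) once per interval
    while not stop.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        samples.append(int(out.stdout.strip()))
        time.sleep(interval)

t = threading.Thread(target=poll_gpu_util, daemon=True)
t.start()
# ... run the generation loop from test.py here ...
stop.set()
t.join()
print("GPU utilization samples (%):", samples)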
Updated Findings
I conducted more tests and discovered additional facts that could be useful in solving the problem:
- 0.1.64 is significantly faster than 0.1.63, which indicates that the llama.cpp code has indeed been updated.
- Earlier versions, such as 0.1.61, also suffer from the total time != sample time + prompt eval time + eval time issue.
It seems that the problem has existed for quite some time; when llama.cpp itself was slower, the overhead wasn't very noticeable, but now that llama.cpp is fast, it is much more evident.
I would appreciate it if this performance discrepancy could be investigated and addressed. Thank you!