When, and only when, running a GGUF-format Qwen model (e.g., Qwen-Math, Qwen-1M), LocalAI produces garbled output. #4734

Open
zenyanbo opened this issue Feb 2, 2025 · 0 comments
Labels: bug (Something isn't working), unconfirmed

zenyanbo commented Feb 2, 2025

Hello. Thank you for your outstanding work.

LocalAI version:
localai/localai:master-cublas-cuda12-ffmpeg (Latest Version)

Environment, CPU architecture, OS, and Version:
Linux 102 6.1.0-30-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.124-1 (2025-01-12) x86_64 GNU/Linux
Linux KVM
CUDA: 12.4

Describe the bug
When, and only when, running a GGUF-format Qwen model (e.g., Qwen-Math, Qwen-1M), LocalAI produces garbled output even though it calls llama.cpp with almost the same parameters, whereas inference directly with llama.cpp is normal.

I tested different providers and quantizations:

Qwen/Qwen2.5-14B-Instruct-1M: Q5_K_M, Q6_K, Q8_0, F16 (quantized with llama.cpp)
unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF: Q5_K_M, Q6_K
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B: Q5_K_M, Q6_K, Q8_0, F16 (quantized with llama.cpp)
Qwen/Qwen2.5-Math-7B-Instruct: Q5_K_M, Q6_K, Q8_0, F16 (quantized with llama.cpp)

LocalAI request for Qwen/Qwen2.5-14B-Instruct-1M:

curl http://192.168.0.102:10000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-test" \
  -d '{
     "model": "Qwen",
     "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Introduce GPT in depth."}]
   }'

The response contains only repeated special tokens:

<|im_start|><|im_start|><|im_start|><|im_start|><|im_start|><|im_start|><|im_start|><|im_start|>......
(In the raw JSON response this appears as \u003c|im_start|\u003e repeated.)

If I don't set the system role, the output is somewhat more normal, but it still contains formatting errors and is of lower quality than llama.cpp's.
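To see exactly what prompt string LocalAI hands to the llama.cpp backend after template expansion, debug logging can be enabled. The sketch below is an assumption-laden example: it assumes the DEBUG environment variable of the standard Docker image, the default internal port 8080, and the /build/models mount point; adjust paths and ports to your setup.

# Sketch: run the same image with debug logging enabled (assumes DEBUG=true
# is honoured by this image and that LocalAI listens on 8080 internally).
docker run --gpus all -p 10000:8080 \
  -e DEBUG=true \
  -v /data/models:/build/models \
  localai/localai:master-cublas-cuda12-ffmpeg

# Re-send the request above, then look for the rendered prompt in the logs
# to check whether <|im_start|>/<|im_end|> markers are duplicated or missing.
docker logs -f <container-name> 2>&1 | grep -i -A 5 "im_start"

Comparing that rendered prompt with the ChatML prompt llama.cpp builds from the GGUF's tokenizer template should show whether the duplicated <|im_start|> tokens are introduced at the templating stage.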

llama.cpp inference for Qwen/Qwen2.5-14B-Instruct-1M:

./llama-cli -m /data/models/Qwen2.5-14B-Instruct-1m-Q6_K.gguf --gpu-layers 65 --main-gpu 0 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --split-mode row -p "You are a helpful assistant." 

Output:

Certainly! GPT, which stands for Generative Pre-trained Transformer, is a series of large-scale language models developed by OpenAI. These models are designed to generate human-like text based on the input they receive. Here's a more in-depth look at GPT:\n\n### Architecture\nGPT models are based on the transformer architecture, which was introduced in the paper \"Attention is All You Need\" by Vaswani et al. in 2017. The transformer architecture uses self-attention mechanisms to process input data, allowing the model to focus on different parts of the input when generating output. This architecture is particularly effective for handling sequential data like text, as it can capture long-range dependencies and context.\n\n### Pre-training\nThe GPT models are pre-trained on large corpora of text data, which allows them to learn general language patterns and representations. This pre-training process involves predicting the next word in a sequence, given the previous words. The vast amount of data used in this process (often in the order of billions of words) enables the model to learn a wide range of language patterns and structures.\n\n### Fine-tuning\nAfter pre-training, GPT models can be fine-tuned on specific tasks or datasets. Fine-tuning involves training the model on a smaller, task-specific dataset, allowing it to adapt to the nuances of the particular task at hand. This process can significantly improve the model's performance on tasks like language translation, text summarization, and question answering.\n\n### Capabilities\nGPT models are capable of generating coherent and contextually relevant text. They can be used for a variety of applications, including:\n- **Text Generation**: Creating new text based on a given prompt.\n- **Translation**: Translating text from one language to another.\n- **Summarization**: Summarizing long pieces of text into shorter, more concise versions.\n- **Question Answering**: Providing answers to questions based on given context.\n- **Dialogue Systems**: Creating conversational agents that can engage in meaningful dialogue with users.\n\n### Limitations\nWhile GPT models are powerful, they also have limitations:\n- **Bias**: The models can sometimes generate biased or inappropriate content, reflecting the biases present in the training data.\n- **Factuality**: The generated text may not always be factually accurate, as the model does not have access to real-time information.\n- **Overfitting**: There is a risk of overfitting to the training data, especially if the model is too large or the training data is not diverse enough.\n\n### Ethical Considerations\nGiven the capabilities of GPT models, there are important ethical considerations to keep in mind:\n- **Misuse**: The models can be used to generate misleading or harmful content.\n- **Privacy**: The models may inadvertently reveal sensitive information if trained on datasets containing such information.\n- **Fairness**: Ensuring that the models do not perpetuate or exacerbate existing biases is a critical concern.\n\n### Future Directions\nThe development of GPT models is ongoing, with newer versions (like GPT-3 and beyond) continually improving in terms of size, performance, and capabilities. 
Future research may focus on addressing the limitations of these models, improving their ethical considerations, and exploring new applications in areas like healthcare, education, and creative writing.\n\nIn summary, GPT models represent a significant advancement in the field of natural language processing, offering powerful tools for generating and understanding human language. However, their use must be approached with careful consideration of the ethical and practical implications.

Qwen-Math behaves similarly. For DeepSeek-R1-Distill-Qwen, LocalAI can output fluent sentences, but formatting such as special tokens is broken (e.g., tokens are not closed), and the output quality is much lower than with llama.cpp.

Note: the output of llama.cpp is always correct, with or without the system role.
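For a closer apples-to-apples comparison, the same GGUF can also be served through llama.cpp's own OpenAI-compatible server and queried with the same chat payload. This is only a sketch; flag names may differ slightly between llama.cpp builds:

# Serve the identical model file with llama.cpp's built-in HTTP server
./llama-server -m /data/models/Qwen2.5-14B-Instruct-1m-Q6_K.gguf \
  --gpu-layers 65 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 \
  --port 8081

# Same chat payload as the LocalAI request above
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Introduce GPT in depth."}
    ]
  }'

If llama-server also answers correctly with the system role set, the divergence is most likely in how LocalAI renders the chat template rather than in llama.cpp itself.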

To Reproduce
Below is the model configuration I am currently using. According to my tests, the problem persists even if I load the GGUF directly or change some of these parameters (an explicit-template variant is sketched after the config).

name: Qwen
backend: llama
threads: 4
mmap: true
cuda: true

context_size: 128000
parameters:
  model: Qwen2.5-14B-Instruct-1m-Q6_K.gguf
  temperature: 0.1
  top_p: 0.95
  top_k: 40
  min_p: 0.05
  typical_p: 1.0
  repeat_last_n: 64
  repeat_penalty: 1.0
  presence_penalty: 0.0
  frequency_penalty: 0.0
  dynatemp_range: 0.0
  dynatemp_exponent: 1.0

# LocalAI/core/config/backend_config.go
no_kv_offloading: false
cache_type_k: q8_0
cache_type_v: q8_0
flash_attention: true

template:
  use_tokenizer_template: true
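An additional isolation step is to replace use_tokenizer_template with an explicit ChatML template in the YAML, to rule out problems in extracting the template from the GGUF metadata. The sketch below assumes LocalAI's Go-template fields .RoleName, .Content, and .Input; these may differ between versions:

# Sketch: explicit ChatML template instead of use_tokenizer_template
# (assumes the .RoleName/.Content/.Input template variables exist in this version)
template:
  chat_message: |
    <|im_start|>{{ .RoleName }}
    {{ .Content }}<|im_end|>
  chat: |
    {{ .Input }}
    <|im_start|>assistant

If the output becomes correct with an explicit template, the bug is likely in the tokenizer-template path rather than in sampling parameters.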

Expected behavior
I tried to control variables as much as possible, but I still can't find where the error is. My guess is that it is a combination of issues involving the chat template, the reinforcement-learning post-training applied to some Qwen models (not all Qwen models have it), and differences between LocalAI's default parameters and llama.cpp's. Perhaps I'm overlooking some subtle difference between the behavior of LocalAI and llama.cpp. I expect LocalAI's output to be correct, similar to llama.cpp's.

Logs
localai_20250202.log
