diff --git a/README.md b/README.md
index 542d78ba7d5..618b0148881 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@
**`IPEX-LLM`** is a PyTorch library for running **LLM** on Intel CPU and GPU *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)* with very low latency[^1].
> [!NOTE]
> - *It is built on top of the excellent work of **`llama.cpp`**, **`transformers`**, **`bitsandbytes`**, **`vLLM`**, **`qlora`**, **`AutoGPTQ`**, **`AutoAWQ`**, etc.*
-> - *It provides seamless integration with [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md), [Ollama](docs/mddocs/Quickstart/ollama_quickstart.md), [Text-Generation-WebUI](docs/mddocs/Quickstart/webui_quickstart.md), [HuggingFace transformers](python/llm/example/GPU/HF-Transformers-AutoModels), [LangChain](python/llm/example/GPU/LangChain), [LlamaIndex](python/llm/example/GPU/LlamaIndex), [DeepSpeed-AutoTP](python/llm/example/GPU/Deepspeed-AutoTP), [vLLM](docs/mddocs/Quickstart/vLLM_quickstart.md), [FastChat](docs/mddocs/Quickstart/fastchat_quickstart.md), [Axolotl](docs/mddocs/Quickstart/axolotl_quickstart.md), [HuggingFace PEFT](python/llm/example/GPU/LLM-Finetuning), [HuggingFace TRL](python/llm/example/GPU/LLM-Finetuning/DPO), [AutoGen](python/llm/example/CPU/Applications/autogen), [ModeScope](python/llm/example/GPU/ModelScope-Models), etc.*
+> - *It provides seamless integration with [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md), [Ollama](docs/mddocs/Quickstart/ollama_quickstart.md), [Text-Generation-WebUI](docs/mddocs/Quickstart/webui_quickstart.md), [HuggingFace transformers](python/llm/example/GPU/HuggingFace), [LangChain](python/llm/example/GPU/LangChain), [LlamaIndex](python/llm/example/GPU/LlamaIndex), [DeepSpeed-AutoTP](python/llm/example/GPU/Deepspeed-AutoTP), [vLLM](docs/mddocs/Quickstart/vLLM_quickstart.md), [FastChat](docs/mddocs/Quickstart/fastchat_quickstart.md), [Axolotl](docs/mddocs/Quickstart/axolotl_quickstart.md), [HuggingFace PEFT](python/llm/example/GPU/LLM-Finetuning), [HuggingFace TRL](python/llm/example/GPU/LLM-Finetuning/DPO), [AutoGen](python/llm/example/CPU/Applications/autogen), [ModelScope](python/llm/example/GPU/ModelScope-Models), etc.*
> - ***50+ models** have been optimized/verified on `ipex-llm` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list [here](#verified-models).*
## Latest Update 🔥
@@ -23,20 +23,20 @@
- [2024/04] You can now run **Open WebUI** on Intel GPU using `ipex-llm`; see the quickstart [here](docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md).
- [2024/04] You can now run **Llama 3** on Intel GPU using `llama.cpp` and `ollama` with `ipex-llm`; see the quickstart [here](docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md).
-- [2024/04] `ipex-llm` now supports **Llama 3** on both Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3).
+- [2024/04] `ipex-llm` now supports **Llama 3** on both Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/llama3) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3).
- [2024/04] `ipex-llm` now provides C++ interface, which can be used as an accelerated backend for running [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md) and [ollama](docs/mddocs/Quickstart/ollama_quickstart.md) on Intel GPU.
- [2024/03] `bigdl-llm` has now become `ipex-llm` (see the migration guide [here](docs/mddocs/Quickstart/bigdl_llm_migration.md)); you may find the original `BigDL` project [here](https://github.com/intel-analytics/bigdl-2.x).
- [2024/02] `ipex-llm` now supports directly loading model from [ModelScope](python/llm/example/GPU/ModelScope-Models) ([魔搭](python/llm/example/CPU/ModelScope-Models)).
-- [2024/02] `ipex-llm` added initial **INT2** support (based on llama.cpp [IQ2](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2) mechanism), which makes it possible to run large-sized LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
+- [2024/02] `ipex-llm` added initial **INT2** support (based on llama.cpp [IQ2](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2) mechanism), which makes it possible to run large-sized LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
- [2024/02] Users can now use `ipex-llm` through [Text-Generation-WebUI](https://github.com/intel-analytics/text-generation-webui) GUI.
- [2024/02] `ipex-llm` now supports *[Self-Speculative Decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md)*, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel [GPU](python/llm/example/GPU/Speculative-Decoding) and [CPU](python/llm/example/CPU/Speculative-Decoding) respectively.
- [2024/02] `ipex-llm` now supports a comprehensive list of LLM **finetuning** on Intel GPU (including [LoRA](python/llm/example/GPU/LLM-Finetuning/LoRA), [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), [DPO](python/llm/example/GPU/LLM-Finetuning/DPO), [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) and [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora)).
- [2024/01] Using `ipex-llm` [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for [Stanford-Alpaca](python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora) (see the blog [here](https://www.intel.com/content/www/us/en/developer/articles/technical/finetuning-llms-on-intel-gpus-using-bigdl-llm.html)).
- [2023/12] `ipex-llm` now supports [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora) (see *["ReLoRA: High-Rank Training Through Low-Rank Updates"](https://arxiv.org/abs/2307.05695)*).
-- [2023/12] `ipex-llm` now supports [Mixtral-8x7B](python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral) on both Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral).
+- [2023/12] `ipex-llm` now supports [Mixtral-8x7B](python/llm/example/GPU/HuggingFace/LLM/mixtral) on both Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/mixtral) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral).
- [2023/12] `ipex-llm` now supports [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) (see *["QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models"](https://arxiv.org/abs/2309.14717)*).
-- [2023/12] `ipex-llm` now supports [FP8 and FP4 inference](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types) on Intel ***GPU***.
-- [2023/11] Initial support for directly loading [GGUF](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF), [AWQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ) and [GPTQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ) models into `ipex-llm` is available.
+- [2023/12] `ipex-llm` now supports [FP8 and FP4 inference](python/llm/example/GPU/HuggingFace/More-Data-Types) on Intel ***GPU***.
+- [2023/11] Initial support for directly loading [GGUF](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF), [AWQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ) and [GPTQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ) models into `ipex-llm` is available.
- [2023/11] `ipex-llm` now supports [vLLM continuous batching](python/llm/example/GPU/vLLM-Serving) on both Intel [GPU](python/llm/example/GPU/vLLM-Serving) and [CPU](python/llm/example/CPU/vLLM-Serving).
- [2023/10] `ipex-llm` now supports [QLoRA finetuning](python/llm/example/GPU/LLM-Finetuning/QLoRA) on both Intel [GPU](python/llm/example/GPU/LLM-Finetuning/QLoRA) and [CPU](python/llm/example/CPU/QLoRA-FineTuning).
- [2023/10] `ipex-llm` now supports [FastChat serving](python/llm/src/ipex_llm/llm/serving) on both Intel CPU and GPU.
@@ -197,10 +197,10 @@ Please see the **Perplexity** result below (tested on Wikitext dataset using the
### Code Examples
- Low bit inference
- - [INT4 inference](python/llm/example/GPU/HF-Transformers-AutoModels/Model): **INT4** LLM inference on Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/Model) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model)
- - [FP8/FP4 inference](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types): **FP8** and **FP4** LLM inference on Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types)
- - [INT8 inference](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types): **INT8** LLM inference on Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types)
- - [INT2 inference](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2): **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2)
+  - [INT4 inference](python/llm/example/GPU/HuggingFace/LLM): **INT4** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/LLM) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model) *(see the sketch after this section)*
+  - [FP8/FP4 inference](python/llm/example/GPU/HuggingFace/More-Data-Types): **FP8** and **FP4** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types)
+  - [INT8 inference](python/llm/example/GPU/HuggingFace/More-Data-Types): **INT8** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types)
+ - [INT2 inference](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2): **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel [GPU](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2)
- FP16/BF16 inference
- **FP16** LLM inference on Intel [GPU](python/llm/example/GPU/Speculative-Decoding), with possible [self-speculative decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md) optimization
- **BF16** LLM inference on Intel [CPU](python/llm/example/CPU/Speculative-Decoding), with possible [self-speculative decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md) optimization
@@ -209,14 +209,14 @@ Please see the **Perplexity** result below (tested on Wikitext dataset using the
- **DeepSpeed AutoTP** inference on Intel [GPU](python/llm/example/GPU/Deepspeed-AutoTP)
- Save and load
- [Low-bit models](python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load): saving and loading `ipex-llm` low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.)
- - [GGUF](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF): directly loading GGUF models into `ipex-llm`
- - [AWQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ): directly loading AWQ models into `ipex-llm`
- - [GPTQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ): directly loading GPTQ models into `ipex-llm`
+ - [GGUF](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF): directly loading GGUF models into `ipex-llm`
+ - [AWQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ): directly loading AWQ models into `ipex-llm`
+ - [GPTQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ): directly loading GPTQ models into `ipex-llm`
- Finetuning
- LLM finetuning on Intel [GPU](python/llm/example/GPU/LLM-Finetuning), including [LoRA](python/llm/example/GPU/LLM-Finetuning/LoRA), [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), [DPO](python/llm/example/GPU/LLM-Finetuning/DPO), [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) and [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora)
- QLoRA finetuning on Intel [CPU](python/llm/example/CPU/QLoRA-FineTuning)
- Integration with community libraries
- - [HuggingFace transformers](python/llm/example/GPU/HF-Transformers-AutoModels)
+ - [HuggingFace transformers](python/llm/example/GPU/HuggingFace)
- [Standard PyTorch model](python/llm/example/GPU/PyTorch-Models)
- [LangChain](python/llm/example/GPU/LangChain)
- [LlamaIndex](python/llm/example/GPU/LlamaIndex)
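To make the low bit inference entries above concrete, here is a minimal sketch of the `transformers`-style INT4 flow these GPU examples follow; the model path and prompt are placeholders, and an Intel GPU (`xpu`) device is assumed:

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in AutoModel API

model_path = "/path/to/model"  # placeholder: any supported HuggingFace checkpoint

# load_in_4bit=True applies ipex-llm's INT4 optimization while loading the weights
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to("xpu")  # Intel GPU; omit this line for CPU-only inference

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```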
@@ -240,69 +240,69 @@ Over 50 models have been optimized/verified on `ipex-llm`, including *LLaMA/LLaM
| Model | CPU Example | GPU Example |
|------------|----------------------------------------------------------------|-----------------------------------------------------------------|
-| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/vicuna) |[link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna)|
-| LLaMA 2 | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2) |
-| LLaMA 3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3) |
+| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/vicuna) |[link](python/llm/example/GPU/HuggingFace/LLM/vicuna)|
+| LLaMA 2 | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2) | [link](python/llm/example/GPU/HuggingFace/LLM/llama2) |
+| LLaMA 3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3) | [link](python/llm/example/GPU/HuggingFace/LLM/llama3) |
| ChatGLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm) | |
-| ChatGLM2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2) |
-| ChatGLM3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm3) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3) |
-| GLM-4 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm4) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4) |
-| GLM-4V | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm-4v) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm-4v) |
-| Mistral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mistral) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral) |
-| Mixtral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral) |
-| Falcon | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/falcon) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon) |
-| MPT | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt) |
-| Dolly-v1 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1) |
-| Dolly-v2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2) |
-| Replit Code| [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit) |
+| ChatGLM2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) | [link](python/llm/example/GPU/HuggingFace/LLM/chatglm2) |
+| ChatGLM3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm3) | [link](python/llm/example/GPU/HuggingFace/LLM/chatglm3) |
+| GLM-4 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm4) | [link](python/llm/example/GPU/HuggingFace/LLM/glm4) |
+| GLM-4V | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm-4v) | [link](python/llm/example/GPU/HuggingFace/Multimodal/glm-4v) |
+| Mistral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mistral) | [link](python/llm/example/GPU/HuggingFace/LLM/mistral) |
+| Mixtral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral) | [link](python/llm/example/GPU/HuggingFace/LLM/mixtral) |
+| Falcon | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/falcon) | [link](python/llm/example/GPU/HuggingFace/LLM/falcon) |
+| MPT | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) | [link](python/llm/example/GPU/HuggingFace/LLM/mpt) |
+| Dolly-v1 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) | [link](python/llm/example/GPU/HuggingFace/LLM/dolly-v1) |
+| Dolly-v2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) | [link](python/llm/example/GPU/HuggingFace/LLM/dolly-v2) |
+| Replit Code| [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) | [link](python/llm/example/GPU/HuggingFace/LLM/replit) |
| RedPajama | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/redpajama) | |
| Phoenix | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phoenix) | |
-| StarCoder | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/starcoder) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder) |
-| Baichuan | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan) |
-| Baichuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2) |
-| InternLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm) |
-| Qwen | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen) |
-| Qwen1.5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen1.5) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5) |
-| Qwen2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2) |
-| Qwen-VL | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl) |
-| Aquila | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila) |
-| Aquila2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2) |
+| StarCoder | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/starcoder) | [link](python/llm/example/GPU/HuggingFace/LLM/starcoder) |
+| Baichuan | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) | [link](python/llm/example/GPU/HuggingFace/LLM/baichuan) |
+| Baichuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan2) | [link](python/llm/example/GPU/HuggingFace/LLM/baichuan2) |
+| InternLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm) | [link](python/llm/example/GPU/HuggingFace/LLM/internlm) |
+| Qwen | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen) | [link](python/llm/example/GPU/HuggingFace/LLM/qwen) |
+| Qwen1.5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen1.5) | [link](python/llm/example/GPU/HuggingFace/LLM/qwen1.5) |
+| Qwen2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen2) | [link](python/llm/example/GPU/HuggingFace/LLM/qwen2) |
+| Qwen-VL | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl) | [link](python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl) |
+| Aquila | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila) | [link](python/llm/example/GPU/HuggingFace/LLM/aquila) |
+| Aquila2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila2) | [link](python/llm/example/GPU/HuggingFace/LLM/aquila2) |
| MOSS | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/moss) | |
-| Whisper | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/whisper) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper) |
-| Phi-1_5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-1_5) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5) |
-| Flan-t5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/flan-t5) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5) |
+| Whisper | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/whisper) | [link](python/llm/example/GPU/HuggingFace/Multimodal/whisper) |
+| Phi-1_5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-1_5) | [link](python/llm/example/GPU/HuggingFace/LLM/phi-1_5) |
+| Flan-t5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/flan-t5) | [link](python/llm/example/GPU/HuggingFace/LLM/flan-t5) |
| LLaVA | [link](python/llm/example/CPU/PyTorch-Models/Model/llava) | [link](python/llm/example/GPU/PyTorch-Models/Model/llava) |
-| CodeLlama | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codellama) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama) |
+| CodeLlama | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codellama) | [link](python/llm/example/GPU/HuggingFace/LLM/codellama) |
| Skywork | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/skywork) | |
| InternLM-XComposer | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer) | |
| WizardCoder-Python | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/wizardcoder-python) | |
| CodeShell | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codeshell) | |
| Fuyu | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/fuyu) | |
-| Distil-Whisper | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/distil-whisper) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper) |
-| Yi | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yi) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi) |
-| BlueLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/bluelm) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm) |
+| Distil-Whisper | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/distil-whisper) | [link](python/llm/example/GPU/HuggingFace/Multimodal/distil-whisper) |
+| Yi | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yi) | [link](python/llm/example/GPU/HuggingFace/LLM/yi) |
+| BlueLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/bluelm) | [link](python/llm/example/GPU/HuggingFace/LLM/bluelm) |
| Mamba | [link](python/llm/example/CPU/PyTorch-Models/Model/mamba) | [link](python/llm/example/GPU/PyTorch-Models/Model/mamba) |
-| SOLAR | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/solar) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar) |
-| Phixtral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phixtral) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral) |
-| InternLM2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2) |
-| RWKV4 | | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4) |
-| RWKV5 | | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5) |
+| SOLAR | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/solar) | [link](python/llm/example/GPU/HuggingFace/LLM/solar) |
+| Phixtral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phixtral) | [link](python/llm/example/GPU/HuggingFace/LLM/phixtral) |
+| InternLM2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm2) | [link](python/llm/example/GPU/HuggingFace/LLM/internlm2) |
+| RWKV4 | | [link](python/llm/example/GPU/HuggingFace/LLM/rwkv4) |
+| RWKV5 | | [link](python/llm/example/GPU/HuggingFace/LLM/rwkv5) |
| Bark | [link](python/llm/example/CPU/PyTorch-Models/Model/bark) | [link](python/llm/example/GPU/PyTorch-Models/Model/bark) |
| SpeechT5 | | [link](python/llm/example/GPU/PyTorch-Models/Model/speech-t5) |
| DeepSeek-MoE | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/deepseek-moe) | |
| Ziya-Coding-34B-v1.0 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/ziya) | |
-| Phi-2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2) |
-| Phi-3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-3) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3) |
-| Phi-3-vision | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-3-vision) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3-vision) |
-| Yuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2) |
-| Gemma | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/gemma) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma) |
-| DeciLM-7B | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/deciLM-7b) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/deciLM-7b) |
-| Deepseek | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/deepseek) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/deepseek) |
-| StableLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/stablelm) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/stablelm) |
-| CodeGemma | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codegemma) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegemma) |
-| Command-R/cohere | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/cohere) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/cohere) |
-| CodeGeeX2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codegeex2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegeex2) |
-| MiniCPM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/minicpm) |
+| Phi-2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-2) | [link](python/llm/example/GPU/HuggingFace/LLM/phi-2) |
+| Phi-3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-3) | [link](python/llm/example/GPU/HuggingFace/LLM/phi-3) |
+| Phi-3-vision | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-3-vision) | [link](python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision) |
+| Yuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2) | [link](python/llm/example/GPU/HuggingFace/LLM/yuan2) |
+| Gemma | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/gemma) | [link](python/llm/example/GPU/HuggingFace/LLM/gemma) |
+| DeciLM-7B | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/deciLM-7b) | [link](python/llm/example/GPU/HuggingFace/LLM/deciLM-7b) |
+| Deepseek | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/deepseek) | [link](python/llm/example/GPU/HuggingFace/LLM/deepseek) |
+| StableLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/stablelm) | [link](python/llm/example/GPU/HuggingFace/LLM/stablelm) |
+| CodeGemma | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codegemma) | [link](python/llm/example/GPU/HuggingFace/LLM/codegemma) |
+| Command-R/cohere | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/cohere) | [link](python/llm/example/GPU/HuggingFace/LLM/cohere) |
+| CodeGeeX2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codegeex2) | [link](python/llm/example/GPU/HuggingFace/LLM/codegeex2) |
+| MiniCPM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm) | [link](python/llm/example/GPU/HuggingFace/LLM/minicpm) |
## Get Support
- Please report a bug or raise a feature request by opening a [GitHub Issue](https://github.com/intel-analytics/ipex-llm/issues)
diff --git a/docker/llm/inference/xpu/docker/Dockerfile b/docker/llm/inference/xpu/docker/Dockerfile
index 89064cb0a2e..7a812482db7 100644
--- a/docker/llm/inference/xpu/docker/Dockerfile
+++ b/docker/llm/inference/xpu/docker/Dockerfile
@@ -53,7 +53,7 @@ RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRO
# Download all-in-one benchmark and examples
git clone https://github.com/intel-analytics/ipex-llm && \
cp -r ./ipex-llm/python/llm/dev/benchmark/ ./benchmark && \
- cp -r ./ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model ./examples && \
+ cp -r ./ipex-llm/python/llm/example/GPU/HuggingFace/LLM ./examples && \
# Install vllm dependencies
pip install --upgrade fastapi && \
pip install --upgrade "uvicorn[standard]" && \
diff --git a/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md b/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md
index 4e5b2cdaaea..7278f2988ca 100644
--- a/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md
+++ b/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md
@@ -94,7 +94,7 @@ Start ipex-llm-xpu Docker Container. Choose one of the following commands to sta
Press F1 to bring up the Command Palette and type in `Dev Containers: Attach to Running Container...` and select it and then select `my_container`
-Now you are in a running Docker Container, Open folder `/ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model/`.
+Now you are inside a running Docker container. Open the folder `/ipex-llm/python/llm/example/GPU/HuggingFace/LLM`.
diff --git a/docs/mddocs/Overview/FAQ/faq.md b/docs/mddocs/Overview/FAQ/faq.md
index ab8f0df3385..2d57971afef 100644
--- a/docs/mddocs/Overview/FAQ/faq.md
+++ b/docs/mddocs/Overview/FAQ/faq.md
@@ -4,7 +4,7 @@
### GGUF format usage with IPEX-LLM?
-IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations).
+IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Advanced-Quantizations).
Please also refer to [here](https://github.com/intel-analytics/ipex-llm?tab=readme-ov-file#latest-update-) for our latest support.
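As a rough illustration of the direct-loading path, the linked GGUF example follows roughly this shape; the `from_gguf` helper returning a (model, tokenizer) pair is taken from that example and should be treated as an assumption, and the file path is a placeholder:

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

# from_gguf converts a GGUF checkpoint and returns the model together with
# its tokenizer (assumed per the linked GGUF example; path is a placeholder)
model, tokenizer = AutoModelForCausalLM.from_gguf("/path/to/model-q4_0.gguf")
model = model.to("xpu")  # or keep the model on CPU

with torch.inference_mode():
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```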
diff --git a/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md b/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md
index 373d7e6c4be..3f6d3b9cc1b 100644
--- a/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md
+++ b/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md
@@ -23,7 +23,7 @@ output = tokenizer.batch_decode(output_ids)
```
> [!TIP]
-> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels).
+> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace).
> [!NOTE]
> You may apply more low bit optimizations (including INT8, INT5 and INT4) as follows:
@@ -32,7 +32,7 @@ output = tokenizer.batch_decode(output_ids)
> model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
> ```
>
-> See the CPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types) and GPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types).
+> See the CPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types) and GPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/More-Data-Types).
## Save & Load
@@ -45,4 +45,4 @@ new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
> [!TIP]
-> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load).
+> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Save-Load).
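A minimal save/load sketch tying the two snippets above together; `save_low_bit` is assumed to mirror the `load_low_bit` API shown in this document, and the paths are placeholders:

```python
from ipex_llm.transformers import AutoModelForCausalLM

# Quantize once at load time (sym_int5, as in the note above) ...
model = AutoModelForCausalLM.from_pretrained("/path/to/model/",
                                             load_in_low_bit="sym_int5")
# ... then persist the low-bit weights so later runs skip re-quantization
model.save_low_bit("/path/to/low-bit-model/")  # assumed counterpart of load_low_bit

# Reload directly in low-bit form (API shown in the Save & Load section above)
new_model = AutoModelForCausalLM.load_low_bit("/path/to/low-bit-model/")
```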
diff --git a/docs/readthedocs/source/doc/LLM/DockerGuides/docker_run_pytorch_inference_in_vscode.md b/docs/readthedocs/source/doc/LLM/DockerGuides/docker_run_pytorch_inference_in_vscode.md
index 9a07609dc53..1b7fe28c0f8 100644
--- a/docs/readthedocs/source/doc/LLM/DockerGuides/docker_run_pytorch_inference_in_vscode.md
+++ b/docs/readthedocs/source/doc/LLM/DockerGuides/docker_run_pytorch_inference_in_vscode.md
@@ -99,7 +99,7 @@ Start ipex-llm-xpu Docker Container:
Press F1 to bring up the Command Palette and type in `Dev Containers: Attach to Running Container...` and select it and then select `my_container`
-Now you are in a running Docker Container, Open folder `/ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model/`.
+Now you are in a running Docker Container, Open folder `/ipex-llm/python/llm/example/GPU/HuggingFace/LLM/`.
diff --git a/docs/readthedocs/source/doc/LLM/Overview/FAQ/faq.md b/docs/readthedocs/source/doc/LLM/Overview/FAQ/faq.md
index caf8bd51648..d62517c955d 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/FAQ/faq.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/FAQ/faq.md
@@ -4,7 +4,7 @@
### GGUF format usage with IPEX-LLM?
-IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations).
+IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Advanced-Quantizations).
Please also refer to [here](https://github.com/intel-analytics/ipex-llm?tab=readme-ov-file#latest-update-) for our latest support.
## How to Resolve Errors
diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/hugging_face_format.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/hugging_face_format.md
index 0eee498f671..1e7aae9d16a 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/hugging_face_format.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/hugging_face_format.md
@@ -25,7 +25,7 @@ output = tokenizer.batch_decode(output_ids)
```eval_rst
.. seealso::
- See the complete CPU examples `here `_ and GPU examples `here `_.
+ See the complete CPU examples `here `_ and GPU examples `here `_.
.. note::
@@ -35,7 +35,7 @@ output = tokenizer.batch_decode(output_ids)
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
- See the CPU example `here `_ and GPU example `here `_.
+ See the CPU example `here `_ and GPU example `here `_.
```
## Save & Load
@@ -50,5 +50,5 @@ new_model = AutoModelForCausalLM.load_low_bit(model_path)
```eval_rst
.. seealso::
- See the CPU example `here `_ and GPU example `here `_
+ See the CPU example `here `_ and GPU example `here `_
```
\ No newline at end of file
diff --git a/docs/readthedocs/source/doc/LLM/Overview/examples_gpu.md b/docs/readthedocs/source/doc/LLM/Overview/examples_gpu.md
index 8eea9f9f865..4ba6f453481 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/examples_gpu.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/examples_gpu.md
@@ -37,29 +37,29 @@ The following models have been verified on either servers or laptops with Intel
| Model | Example of `transformers`-style API |
|------------|-------------------------------------------------------|
-| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* |[link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna)|
-| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2) |
-| ChatGLM2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2) |
-| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral) |
-| Falcon | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon) |
+| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* |[link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/vicuna)|
+| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/llama2) |
+| ChatGLM2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/chatglm2) |
+| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/mistral) |
+| Falcon | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/falcon) |
| MPT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) |
| Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) |
| Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) |
| Replit | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) |
-| StarCoder | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder) |
+| StarCoder | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/starcoder) |
| Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) |
-| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2) |
-| InternLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm) |
-| Qwen | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen) |
-| Aquila | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila) |
-| Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper) |
-| Chinese Llama2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2) |
-| GPT-J | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j) |
+| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/baichuan2) |
+| InternLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/internlm) |
+| Qwen | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/qwen) |
+| Aquila | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/aquila) |
+| Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Multimodal/whisper) |
+| Chinese Llama2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/chinese-llama2) |
+| GPT-J | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/gpt-j) |
```eval_rst
.. important::
- In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through ``transformers``-style API as `example `_.
+ In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through ``transformers``-style API as `example `_.
```
diff --git a/docs/readthedocs/source/index.rst b/docs/readthedocs/source/index.rst
index b7125664597..dbd830d9660 100644
--- a/docs/readthedocs/source/index.rst
+++ b/docs/readthedocs/source/index.rst
@@ -33,7 +33,7 @@
It is built on top of the excellent work of llama.cpp, transformers, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc.
- It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModeScope, etc.
+ It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
50+ models have been optimized/verified on ipex-llm
(including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
@@ -47,11 +47,11 @@ Latest update 🔥
* [2024/05] ``ipex-llm`` now supports **Axolotl** for LLM finetuning on Intel GPU; see the quickstart `here `_.
* [2024/04] You can now run **Open WebUI** on Intel GPU using ``ipex-llm``; see the quickstart `here `_.
* [2024/04] You can now run **Llama 3** on Intel GPU using ``llama.cpp`` and ``ollama``; see the quickstart `here `_.
-* [2024/04] ``ipex-llm`` now supports **Llama 3** on Intel `GPU `_ and `CPU `_.
+* [2024/04] ``ipex-llm`` now supports **Llama 3** on Intel `GPU `_ and `CPU `_.
* [2024/04] ``ipex-llm`` now provides C++ interface, which can be used as an accelerated backend for running `llama.cpp `_ and `ollama `_ on Intel GPU.
* [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.
* [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_).
-* [2024/02] ``ipex-llm`` added inital **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large-size LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
+* [2024/02] ``ipex-llm`` added initial **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large-sized LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
* [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI.
* [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively.
* [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
@@ -62,10 +62,10 @@ Latest update 🔥
:color: primary
* [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
- * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
+ * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
* [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
- * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
- * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models in to ``ipex-llm`` is available.
+ * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
+ * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available.
* [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_.
* [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
* [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU.
@@ -197,10 +197,10 @@ Code Examples
============================================
* Low bit inference
- * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_
- * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_
- * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_
- * `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_
+ * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_
+ * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_
+ * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_
+ * `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_
* FP16/BF16 inference
@@ -210,9 +210,9 @@ Code Examples
* Save and load
* `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models
- * `GGUF `_: directly loading GGUF models into ``ipex-llm``
- * `AWQ `_: directly loading AWQ models into ``ipex-llm``
- * `GPTQ `_: directly loading GPTQ models into ``ipex-llm``
+ * `GGUF `_: directly loading GGUF models into ``ipex-llm``
+ * `AWQ `_: directly loading AWQ models into ``ipex-llm``
+ * `GPTQ `_: directly loading GPTQ models into ``ipex-llm``
* Finetuning
@@ -221,7 +221,7 @@ Code Examples
* Integration with community libraries
- * `HuggingFace transformers `_
+ * `HuggingFace transformers `_
* `Standard PyTorch model `_
* `DeepSpeed-AutoTP `_
* `HuggingFace PEFT `_
@@ Verified Models @@
(The Verified Models table in index.rst is updated the same way as the README table above: GPU example links move from ``python/llm/example/GPU/HF-Transformers-AutoModels/Model/<model>`` to ``python/llm/example/GPU/HuggingFace/LLM/<model>``, with the multimodal entries (GLM-4V, Qwen-VL, Whisper, Distil-Whisper and Phi-3-vision) moving to ``python/llm/example/GPU/HuggingFace/Multimodal/<model>``.)
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/README.md
deleted file mode 100644
index ba1b370ad9f..00000000000
--- a/python/llm/example/GPU/HF-Transformers-AutoModels/README.md
+++ /dev/null
@@ -1,8 +0,0 @@
-# Running HuggingFace `transformers` model using IPEX-LLM on Intel GPU
-
-This folder contains examples of running any HuggingFace `transformers` model on IPEX-LLM (using the standard AutoModel APIs):
-
-- [Model](Model): examples of running HuggingFace transformers models (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) using INT4 optimizations
-- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (FP8/INT8/FP4, etc.)
-- [Save-Load](Save-Load): examples of saving and loading low-bit models
-- [Advanced-Quantizations](Advanced-Quantizations): examples of loading GGUF/AWQ/GPTQ models
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ/README.md b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ/README.md
rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ/generate.py b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ/generate.py
rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2/README.md b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2/README.md
rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2/generate.py b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2/generate.py
rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF/README.md b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF/README.md
rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF/generate.py b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF/generate.py
rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ/README.md b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ/README.md
rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ/generate.py b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ/generate.py
rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/README.md b/python/llm/example/GPU/HuggingFace/LLM/README.md
similarity index 99%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/README.md
index 4e60a656dd3..6d0a8967b73 100644
--- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/README.md
+++ b/python/llm/example/GPU/HuggingFace/LLM/README.md
@@ -1,5 +1,2 @@
# IPEX-LLM Transformers INT4 Optimization for Large Language Model on Intel GPUs
You can use IPEX-LLM to run almost every HuggingFace Transformers model with INT4 optimizations on your laptop with Intel GPUs. This directory contains example scripts to help you quickly get started using IPEX-LLM to run some popular open-source models in the community. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it.
-
-
-
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/README.md b/python/llm/example/GPU/HuggingFace/LLM/aquila/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/aquila/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/generate.py b/python/llm/example/GPU/HuggingFace/LLM/aquila/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/aquila/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/README.md b/python/llm/example/GPU/HuggingFace/LLM/aquila2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/aquila2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/aquila2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/aquila2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/README.md b/python/llm/example/GPU/HuggingFace/LLM/baichuan/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/baichuan/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/generate.py b/python/llm/example/GPU/HuggingFace/LLM/baichuan/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/baichuan/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/README.md b/python/llm/example/GPU/HuggingFace/LLM/baichuan2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/baichuan2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/baichuan2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/baichuan2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/README.md b/python/llm/example/GPU/HuggingFace/LLM/bluelm/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/bluelm/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/generate.py b/python/llm/example/GPU/HuggingFace/LLM/bluelm/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/bluelm/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/README.md b/python/llm/example/GPU/HuggingFace/LLM/chatglm2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/chatglm2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/chatglm2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/chatglm2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/streamchat.py b/python/llm/example/GPU/HuggingFace/LLM/chatglm2/streamchat.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/streamchat.py
rename to python/llm/example/GPU/HuggingFace/LLM/chatglm2/streamchat.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/README.md b/python/llm/example/GPU/HuggingFace/LLM/chatglm3/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/chatglm3/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/generate.py b/python/llm/example/GPU/HuggingFace/LLM/chatglm3/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/chatglm3/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/streamchat.py b/python/llm/example/GPU/HuggingFace/LLM/chatglm3/streamchat.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/streamchat.py
rename to python/llm/example/GPU/HuggingFace/LLM/chatglm3/streamchat.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/README.md b/python/llm/example/GPU/HuggingFace/LLM/chinese-llama2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/chinese-llama2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/chinese-llama2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/chinese-llama2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegeex2/README.md b/python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegeex2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegeex2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/codegeex2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegeex2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/codegeex2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegemma/README.md b/python/llm/example/GPU/HuggingFace/LLM/codegemma/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegemma/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/codegemma/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegemma/generate.py b/python/llm/example/GPU/HuggingFace/LLM/codegemma/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegemma/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/codegemma/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/generate.py b/python/llm/example/GPU/HuggingFace/LLM/codellama/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/codellama/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/readme.md b/python/llm/example/GPU/HuggingFace/LLM/codellama/readme.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/readme.md
rename to python/llm/example/GPU/HuggingFace/LLM/codellama/readme.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codeshell/README.md b/python/llm/example/GPU/HuggingFace/LLM/codeshell/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codeshell/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/codeshell/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codeshell/server.py b/python/llm/example/GPU/HuggingFace/LLM/codeshell/server.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codeshell/server.py
rename to python/llm/example/GPU/HuggingFace/LLM/codeshell/server.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/cohere/README.md b/python/llm/example/GPU/HuggingFace/LLM/cohere/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/cohere/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/cohere/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/cohere/generate.py b/python/llm/example/GPU/HuggingFace/LLM/cohere/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/cohere/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/cohere/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/deciLM-7b/README.md b/python/llm/example/GPU/HuggingFace/LLM/deciLM-7b/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/deciLM-7b/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/deciLM-7b/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/deciLM-7b/generate.py b/python/llm/example/GPU/HuggingFace/LLM/deciLM-7b/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/deciLM-7b/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/deciLM-7b/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/deepseek/README.md b/python/llm/example/GPU/HuggingFace/LLM/deepseek/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/deepseek/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/deepseek/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/deepseek/generate.py b/python/llm/example/GPU/HuggingFace/LLM/deepseek/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/deepseek/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/deepseek/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/README.md b/python/llm/example/GPU/HuggingFace/LLM/dolly-v1/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/dolly-v1/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/generate.py b/python/llm/example/GPU/HuggingFace/LLM/dolly-v1/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/dolly-v1/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/README.md b/python/llm/example/GPU/HuggingFace/LLM/dolly-v2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/dolly-v2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/dolly-v2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/dolly-v2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/README.md b/python/llm/example/GPU/HuggingFace/LLM/falcon/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/falcon/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/falcon-7b-instruct/modelling_RW.py b/python/llm/example/GPU/HuggingFace/LLM/falcon/falcon-7b-instruct/modelling_RW.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/falcon-7b-instruct/modelling_RW.py
rename to python/llm/example/GPU/HuggingFace/LLM/falcon/falcon-7b-instruct/modelling_RW.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/generate.py b/python/llm/example/GPU/HuggingFace/LLM/falcon/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/falcon/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/README.md b/python/llm/example/GPU/HuggingFace/LLM/flan-t5/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/flan-t5/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/generate.py b/python/llm/example/GPU/HuggingFace/LLM/flan-t5/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/flan-t5/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma/README.md b/python/llm/example/GPU/HuggingFace/LLM/gemma/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/gemma/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma/generate.py b/python/llm/example/GPU/HuggingFace/LLM/gemma/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/gemma/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/README.md b/python/llm/example/GPU/HuggingFace/LLM/glm4/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/glm4/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/generate.py b/python/llm/example/GPU/HuggingFace/LLM/glm4/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/glm4/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/streamchat.py b/python/llm/example/GPU/HuggingFace/LLM/glm4/streamchat.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/streamchat.py
rename to python/llm/example/GPU/HuggingFace/LLM/glm4/streamchat.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/generate.py b/python/llm/example/GPU/HuggingFace/LLM/gpt-j/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/gpt-j/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/readme.md b/python/llm/example/GPU/HuggingFace/LLM/gpt-j/readme.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/readme.md
rename to python/llm/example/GPU/HuggingFace/LLM/gpt-j/readme.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/README.md b/python/llm/example/GPU/HuggingFace/LLM/internlm/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/internlm/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/generate.py b/python/llm/example/GPU/HuggingFace/LLM/internlm/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/internlm/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2/README.md b/python/llm/example/GPU/HuggingFace/LLM/internlm2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/internlm2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/internlm2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/internlm2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/README.md b/python/llm/example/GPU/HuggingFace/LLM/llama2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/llama2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/llama2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/llama2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3/README.md b/python/llm/example/GPU/HuggingFace/LLM/llama3/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/llama3/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3/generate.py b/python/llm/example/GPU/HuggingFace/LLM/llama3/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/llama3/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/minicpm/README.md b/python/llm/example/GPU/HuggingFace/LLM/minicpm/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/minicpm/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/minicpm/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/minicpm/generate.py b/python/llm/example/GPU/HuggingFace/LLM/minicpm/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/minicpm/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/minicpm/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/README.md b/python/llm/example/GPU/HuggingFace/LLM/mistral/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/mistral/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/generate.py b/python/llm/example/GPU/HuggingFace/LLM/mistral/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/mistral/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/README.md b/python/llm/example/GPU/HuggingFace/LLM/mixtral/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/mixtral/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/generate.py b/python/llm/example/GPU/HuggingFace/LLM/mixtral/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/mixtral/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/README.md b/python/llm/example/GPU/HuggingFace/LLM/mpt/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/mpt/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/generate.py b/python/llm/example/GPU/HuggingFace/LLM/mpt/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/mpt/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/README.md b/python/llm/example/GPU/HuggingFace/LLM/phi-1_5/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/phi-1_5/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/generate.py b/python/llm/example/GPU/HuggingFace/LLM/phi-1_5/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/phi-1_5/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/README.md b/python/llm/example/GPU/HuggingFace/LLM/phi-2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/phi-2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/phi-2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/phi-2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3/README.md b/python/llm/example/GPU/HuggingFace/LLM/phi-3/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/phi-3/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3/generate.py b/python/llm/example/GPU/HuggingFace/LLM/phi-3/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/phi-3/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral/README.md b/python/llm/example/GPU/HuggingFace/LLM/phixtral/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/phixtral/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral/generate.py b/python/llm/example/GPU/HuggingFace/LLM/phixtral/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/phixtral/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/qwen/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/generate.py b/python/llm/example/GPU/HuggingFace/LLM/qwen/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/qwen/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen1.5/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/qwen1.5/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5/generate.py b/python/llm/example/GPU/HuggingFace/LLM/qwen1.5/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/qwen1.5/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/qwen2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/qwen2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/redpajama/README.md b/python/llm/example/GPU/HuggingFace/LLM/redpajama/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/redpajama/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/redpajama/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/redpajama/generate.py b/python/llm/example/GPU/HuggingFace/LLM/redpajama/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/redpajama/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/redpajama/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/README.md b/python/llm/example/GPU/HuggingFace/LLM/replit/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/replit/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/generate.py b/python/llm/example/GPU/HuggingFace/LLM/replit/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/replit/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4/README.md b/python/llm/example/GPU/HuggingFace/LLM/rwkv4/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/rwkv4/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4/generate.py b/python/llm/example/GPU/HuggingFace/LLM/rwkv4/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/rwkv4/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5/README.md b/python/llm/example/GPU/HuggingFace/LLM/rwkv5/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/rwkv5/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5/generate.py b/python/llm/example/GPU/HuggingFace/LLM/rwkv5/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/rwkv5/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/README.md b/python/llm/example/GPU/HuggingFace/LLM/solar/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/solar/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/generate.py b/python/llm/example/GPU/HuggingFace/LLM/solar/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/solar/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/stablelm/README.md b/python/llm/example/GPU/HuggingFace/LLM/stablelm/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/stablelm/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/stablelm/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/stablelm/generate.py b/python/llm/example/GPU/HuggingFace/LLM/stablelm/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/stablelm/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/stablelm/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/generate.py b/python/llm/example/GPU/HuggingFace/LLM/starcoder/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/starcoder/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/readme.md b/python/llm/example/GPU/HuggingFace/LLM/starcoder/readme.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/readme.md
rename to python/llm/example/GPU/HuggingFace/LLM/starcoder/readme.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/README.md b/python/llm/example/GPU/HuggingFace/LLM/vicuna/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/vicuna/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/generate.py b/python/llm/example/GPU/HuggingFace/LLM/vicuna/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/vicuna/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/README.md b/python/llm/example/GPU/HuggingFace/LLM/yi/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/yi/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/generate.py b/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/yi/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/README.md b/python/llm/example/GPU/HuggingFace/LLM/yuan2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/yuan2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/yuan2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/yuan2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/config.json b/python/llm/example/GPU/HuggingFace/LLM/yuan2/yuan2-2B-instruct/config.json
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/config.json
rename to python/llm/example/GPU/HuggingFace/LLM/yuan2/yuan2-2B-instruct/config.json
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py b/python/llm/example/GPU/HuggingFace/LLM/yuan2/yuan2-2B-instruct/yuan_hf_model.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py
rename to python/llm/example/GPU/HuggingFace/LLM/yuan2/yuan2-2B-instruct/yuan_hf_model.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types/README.md b/python/llm/example/GPU/HuggingFace/More-Data-Types/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types/README.md
rename to python/llm/example/GPU/HuggingFace/More-Data-Types/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types/transformers_low_bit_pipeline.py b/python/llm/example/GPU/HuggingFace/More-Data-Types/transformers_low_bit_pipeline.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types/transformers_low_bit_pipeline.py
rename to python/llm/example/GPU/HuggingFace/More-Data-Types/transformers_low_bit_pipeline.py
diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/README.md
new file mode 100644
index 00000000000..56a57505d49
--- /dev/null
+++ b/python/llm/example/GPU/HuggingFace/Multimodal/README.md
@@ -0,0 +1,3 @@
+# Running HuggingFace multimodal models using IPEX-LLM on Intel GPU
+
+This folder contains examples of running multimodal models on IPEX-LLM. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it.
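As one concrete illustration, a speech-recognition example in this folder looks roughly like the sketch below. It assumes `ipex_llm.transformers` exposes `AutoModelForSpeechSeq2Seq`, as the whisper examples here use; the checkpoint and audio file names are illustrative placeholders.

```python
# A minimal sketch of low-bit speech recognition on an Intel GPU; the
# checkpoint and audio file names are illustrative placeholders.
import torch
import librosa
from transformers import WhisperProcessor
from ipex_llm.transformers import AutoModelForSpeechSeq2Seq

model_path = "openai/whisper-tiny"  # illustrative

processor = WhisperProcessor.from_pretrained(model_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path, load_in_4bit=True)
model = model.to("xpu")

# Whisper expects 16 kHz mono audio
audio, sr = librosa.load("sample.wav", sr=16000)  # hypothetical input file
features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

with torch.inference_mode():
    ids = model.generate(input_features=features.to("xpu"), max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```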
diff --git a/python/llm/example/GPU/StableDiffusion/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/README.md
similarity index 100%
rename from python/llm/example/GPU/StableDiffusion/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/README.md
diff --git a/python/llm/example/GPU/StableDiffusion/lora-lcm.py b/python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/lora-lcm.py
similarity index 100%
rename from python/llm/example/GPU/StableDiffusion/lora-lcm.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/lora-lcm.py
diff --git a/python/llm/example/GPU/StableDiffusion/sdxl.py b/python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/sdxl.py
similarity index 100%
rename from python/llm/example/GPU/StableDiffusion/sdxl.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/sdxl.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/distil-whisper/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/distil-whisper/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/recognize.py b/python/llm/example/GPU/HuggingFace/Multimodal/distil-whisper/recognize.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/recognize.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/distil-whisper/recognize.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm-4v/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/glm-4v/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm-4v/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/glm-4v/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm-4v/generate.py b/python/llm/example/GPU/HuggingFace/Multimodal/glm-4v/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm-4v/generate.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/glm-4v/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3-vision/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3-vision/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3-vision/generate.py b/python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3-vision/generate.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/chat.py b/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/chat.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/chat.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/chat.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/generate.py b/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/generate.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/readme.md b/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/readme.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/recognize.py b/python/llm/example/GPU/HuggingFace/Multimodal/whisper/recognize.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/recognize.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/whisper/recognize.py
diff --git a/python/llm/example/GPU/HuggingFace/README.md b/python/llm/example/GPU/HuggingFace/README.md
new file mode 100644
index 00000000000..8dbae40a24f
--- /dev/null
+++ b/python/llm/example/GPU/HuggingFace/README.md
@@ -0,0 +1,9 @@
+# Running HuggingFace models using IPEX-LLM on Intel GPU
+
+This folder contains examples of running any HuggingFace model on IPEX-LLM:
+
+- [LLM](LLM): examples of running large language models (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) using IPEX-LLM optimizations
+- [Multimodal](Multimodal): examples of running large multimodal models (StableDiffusion models, Qwen-VL-Chat, glm-4v, etc.) using IPEX-LLM optimizations
+- [More-Data-Types](More-Data-Types): examples of applying other low-bit optimizations (FP8/INT8/FP4, etc.; see the sketch after this list)
+- [Save-Load](Save-Load): examples of saving and loading low-bit models
+- [Advanced-Quantizations](Advanced-Quantizations): examples of loading GGUF/AWQ/GPTQ models
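The More-Data-Types and Save-Load folders share a small API surface. Below is a minimal sketch of that flow, assuming the `load_in_low_bit`, `save_low_bit`, and `load_low_bit` APIs behave as in the examples here; the model path, precision string, and save directory are illustrative placeholders.

```python
# A minimal sketch of the low-bit and save/load flow; the model path,
# precision string, and save directory are illustrative placeholders.
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # illustrative

# More-Data-Types: pick a precision other than the default INT4,
# e.g. "sym_int8", "fp8", or "fp4"
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_low_bit="sym_int8",
                                             trust_remote_code=True)

# Save-Load: persist the already-quantized weights, then reload them
# later without repeating the conversion
save_dir = "./llama-2-sym_int8"
model.save_low_bit(save_dir)
model = AutoModelForCausalLM.load_low_bit(save_dir).to("xpu")
```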
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load/README.md b/python/llm/example/GPU/HuggingFace/Save-Load/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load/README.md
rename to python/llm/example/GPU/HuggingFace/Save-Load/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load/generate.py b/python/llm/example/GPU/HuggingFace/Save-Load/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load/generate.py
rename to python/llm/example/GPU/HuggingFace/Save-Load/generate.py
diff --git a/python/llm/example/GPU/README.md b/python/llm/example/GPU/README.md
index ab13bf95485..dc7600c6bf0 100644
--- a/python/llm/example/GPU/README.md
+++ b/python/llm/example/GPU/README.md
@@ -3,7 +3,7 @@
This folder contains examples of running IPEX-LLM on Intel GPU:
- [Applications](Applications): running LLM applications (such as autogen) on IPEX-LLM
-- [HF-Transformers-AutoModels](HF-Transformers-AutoModels): running any ***Hugging Face Transformers*** model on IPEX-LLM (using the standard AutoModel APIs)
+- [HuggingFace](HuggingFace): running ***HuggingFace*** models on IPEX-LLM (using the standard AutoModel APIs), including both language models and multimodal models
- [LLM-Finetuning](LLM-Finetuning): running ***finetuning*** (such as LoRA, QLoRA, QA-LoRA, etc) using IPEX-LLM on Intel GPUs
- [vLLM-Serving](vLLM-Serving): running ***vLLM*** serving framework on intel GPUs (with IPEX-LLM low-bit optimized models)
- [Deepspeed-AutoTP](Deepspeed-AutoTP): running distributed inference using ***DeepSpeed AutoTP*** (with IPEX-LLM low-bit optimized models) on Intel GPUs
@@ -15,7 +15,6 @@ This folder contains examples of running IPEX-LLM on Intel GPU:
- [Speculative-Decoding](Speculative-Decoding): running any ***Hugging Face Transformers*** model with ***self-speculative decoding*** on Intel GPUs
- [ModelScope-Models](ModelScope-Models): running ***ModelScope*** model with IPEX-LLM on Intel GPUs
- [Long-Context](Long-Context): running **long-context** generation with IPEX-LLM on Intel Arc™ A770 Graphics.
-- [StableDiffusion](StableDiffusion): running **stable diffusion** with IPEX-LLM on Intel GPUs.
## System Support