diff --git a/README.md b/README.md index 542d78ba7d5..618b0148881 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ **`IPEX-LLM`** is a PyTorch library for running **LLM** on Intel CPU and GPU *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)* with very low latency[^1]. > [!NOTE] > - *It is built on top of the excellent work of **`llama.cpp`**, **`transformers`**, **`bitsandbytes`**, **`vLLM`**, **`qlora`**, **`AutoGPTQ`**, **`AutoAWQ`**, etc.* -> - *It provides seamless integration with [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md), [Ollama](docs/mddocs/Quickstart/ollama_quickstart.md), [Text-Generation-WebUI](docs/mddocs/Quickstart/webui_quickstart.md), [HuggingFace transformers](python/llm/example/GPU/HF-Transformers-AutoModels), [LangChain](python/llm/example/GPU/LangChain), [LlamaIndex](python/llm/example/GPU/LlamaIndex), [DeepSpeed-AutoTP](python/llm/example/GPU/Deepspeed-AutoTP), [vLLM](docs/mddocs/Quickstart/vLLM_quickstart.md), [FastChat](docs/mddocs/Quickstart/fastchat_quickstart.md), [Axolotl](docs/mddocs/Quickstart/axolotl_quickstart.md), [HuggingFace PEFT](python/llm/example/GPU/LLM-Finetuning), [HuggingFace TRL](python/llm/example/GPU/LLM-Finetuning/DPO), [AutoGen](python/llm/example/CPU/Applications/autogen), [ModeScope](python/llm/example/GPU/ModelScope-Models), etc.* +> - *It provides seamless integration with [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md), [Ollama](docs/mddocs/Quickstart/ollama_quickstart.md), [Text-Generation-WebUI](docs/mddocs/Quickstart/webui_quickstart.md), [HuggingFace transformers](python/llm/example/GPU/HuggingFace), [LangChain](python/llm/example/GPU/LangChain), [LlamaIndex](python/llm/example/GPU/LlamaIndex), [DeepSpeed-AutoTP](python/llm/example/GPU/Deepspeed-AutoTP), [vLLM](docs/mddocs/Quickstart/vLLM_quickstart.md), [FastChat](docs/mddocs/Quickstart/fastchat_quickstart.md), [Axolotl](docs/mddocs/Quickstart/axolotl_quickstart.md), [HuggingFace PEFT](python/llm/example/GPU/LLM-Finetuning), [HuggingFace TRL](python/llm/example/GPU/LLM-Finetuning/DPO), [AutoGen](python/llm/example/CPU/Applications/autogen), [ModelScope](python/llm/example/GPU/ModelScope-Models), etc.* > - ***50+ models** have been optimized/verified on `ipex-llm` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list [here](#verified-models).* ## Latest Update 🔥 @@ -23,20 +23,20 @@ - [2024/04] You can now run **Open WebUI** on Intel GPU using `ipex-llm`; see the quickstart [here](docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md). - [2024/04] You can now run **Llama 3** on Intel GPU using `llama.cpp` and `ollama` with `ipex-llm`; see the quickstart [here](docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md). -- [2024/04] `ipex-llm` now supports **Llama 3** on both Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3). +- [2024/04] `ipex-llm` now supports **Llama 3** on both Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/llama3) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3). - [2024/04] `ipex-llm` now provides a C++ interface, which can be used as an accelerated backend for running [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md) and [ollama](docs/mddocs/Quickstart/ollama_quickstart.md) on Intel GPU. 
- [2024/03] `bigdl-llm` has now become `ipex-llm` (see the migration guide [here](docs/mddocs/Quickstart/bigdl_llm_migration.md)); you may find the original `BigDL` project [here](https://github.com/intel-analytics/bigdl-2.x). - [2024/02] `ipex-llm` now supports directly loading model from [ModelScope](python/llm/example/GPU/ModelScope-Models) ([魔搭](python/llm/example/CPU/ModelScope-Models)). -- [2024/02] `ipex-llm` added initial **INT2** support (based on llama.cpp [IQ2](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2) mechanism), which makes it possible to run large-sized LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM. +- [2024/02] `ipex-llm` added initial **INT2** support (based on llama.cpp [IQ2](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2) mechanism), which makes it possible to run large-sized LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM. - [2024/02] Users can now use `ipex-llm` through [Text-Generation-WebUI](https://github.com/intel-analytics/text-generation-webui) GUI. - [2024/02] `ipex-llm` now supports *[Self-Speculative Decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md)*, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel [GPU](python/llm/example/GPU/Speculative-Decoding) and [CPU](python/llm/example/CPU/Speculative-Decoding) respectively. - [2024/02] `ipex-llm` now supports a comprehensive list of LLM **finetuning** on Intel GPU (including [LoRA](python/llm/example/GPU/LLM-Finetuning/LoRA), [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), [DPO](python/llm/example/GPU/LLM-Finetuning/DPO), [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) and [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora)). - [2024/01] Using `ipex-llm` [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for [Stanford-Alpaca](python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora) (see the blog [here](https://www.intel.com/content/www/us/en/developer/articles/technical/finetuning-llms-on-intel-gpus-using-bigdl-llm.html)). - [2023/12] `ipex-llm` now supports [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora) (see *["ReLoRA: High-Rank Training Through Low-Rank Updates"](https://arxiv.org/abs/2307.05695)*). -- [2023/12] `ipex-llm` now supports [Mixtral-8x7B](python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral) on both Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral). +- [2023/12] `ipex-llm` now supports [Mixtral-8x7B](python/llm/example/GPU/HuggingFace/LLM/mixtral) on both Intel [GPU](python/llm/example/GPU/HuggingFace/LLM/mixtral) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral). - [2023/12] `ipex-llm` now supports [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) (see *["QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models"](https://arxiv.org/abs/2309.14717)*). -- [2023/12] `ipex-llm` now supports [FP8 and FP4 inference](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types) on Intel ***GPU***. 
-- [2023/11] Initial support for directly loading [GGUF](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF), [AWQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ) and [GPTQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ) models into `ipex-llm` is available. +- [2023/12] `ipex-llm` now supports [FP8 and FP4 inference](python/llm/example/GPU/HuggingFace/More-Data-Types) on Intel ***GPU***. +- [2023/11] Initial support for directly loading [GGUF](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF), [AWQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ) and [GPTQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ) models into `ipex-llm` is available. - [2023/11] `ipex-llm` now supports [vLLM continuous batching](python/llm/example/GPU/vLLM-Serving) on both Intel [GPU](python/llm/example/GPU/vLLM-Serving) and [CPU](python/llm/example/CPU/vLLM-Serving). - [2023/10] `ipex-llm` now supports [QLoRA finetuning](python/llm/example/GPU/LLM-Finetuning/QLoRA) on both Intel [GPU](python/llm/example/GPU/LLM-Finetuning/QLoRA) and [CPU](python/llm/example/CPU/QLoRA-FineTuning). - [2023/10] `ipex-llm` now supports [FastChat serving](python/llm/src/ipex_llm/llm/serving) on both Intel CPU and GPU. @@ -197,10 +197,10 @@ Please see the **Perplexity** result below (tested on Wikitext dataset using the ### Code Examples - Low bit inference - - [INT4 inference](python/llm/example/GPU/HF-Transformers-AutoModels/Model): **INT4** LLM inference on Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/Model) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model) - - [FP8/FP4 inference](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types): **FP8** and **FP4** LLM inference on Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types) - - [INT8 inference](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types): **INT8** LLM inference on Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types) - - [INT2 inference](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2): **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel [GPU](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2) + - [INT4 inference](python/llm/example/GPU/HuggingFace/LLM): **INT4** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/LLM) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/Model) + - [FP8/FP4 inference](python/llm/example/GPU/HuggingFace/More-Data-Types): **FP8** and **FP4** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types) + - [INT8 inference](python/llm/example/GPU/HuggingFace/More-Data-Types): **INT8** LLM inference on Intel [GPU](python/llm/example/GPU/HuggingFace/More-Data-Types) and [CPU](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types) + - [INT2 inference](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2): **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel [GPU](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2) - FP16/BF16 inference - **FP16** LLM inference on Intel [GPU](python/llm/example/GPU/Speculative-Decoding), with possible [self-speculative decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md) optimization - **BF16** LLM 
inference on Intel [CPU](python/llm/example/CPU/Speculative-Decoding), with possible [self-speculative decoding](docs/mddocs/Inference/Self_Speculative_Decoding.md) optimization @@ -209,14 +209,14 @@ Please see the **Perplexity** result below (tested on Wikitext dataset using the - **DeepSpeed AutoTP** inference on Intel [GPU](python/llm/example/GPU/Deepspeed-AutoTP) - Save and load - [Low-bit models](python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load): saving and loading `ipex-llm` low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.) - - [GGUF](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF): directly loading GGUF models into `ipex-llm` - - [AWQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ): directly loading AWQ models into `ipex-llm` - - [GPTQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ): directly loading GPTQ models into `ipex-llm` + - [GGUF](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF): directly loading GGUF models into `ipex-llm` + - [AWQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ): directly loading AWQ models into `ipex-llm` + - [GPTQ](python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ): directly loading GPTQ models into `ipex-llm` - Finetuning - LLM finetuning on Intel [GPU](python/llm/example/GPU/LLM-Finetuning), including [LoRA](python/llm/example/GPU/LLM-Finetuning/LoRA), [QLoRA](python/llm/example/GPU/LLM-Finetuning/QLoRA), [DPO](python/llm/example/GPU/LLM-Finetuning/DPO), [QA-LoRA](python/llm/example/GPU/LLM-Finetuning/QA-LoRA) and [ReLoRA](python/llm/example/GPU/LLM-Finetuning/ReLora) - QLoRA finetuning on Intel [CPU](python/llm/example/CPU/QLoRA-FineTuning) - Integration with community libraries - - [HuggingFace transformers](python/llm/example/GPU/HF-Transformers-AutoModels) + - [HuggingFace transformers](python/llm/example/GPU/HuggingFace) - [Standard PyTorch model](python/llm/example/GPU/PyTorch-Models) - [LangChain](python/llm/example/GPU/LangChain) - [LlamaIndex](python/llm/example/GPU/LlamaIndex) @@ -240,69 +240,69 @@ Over 50 models have been optimized/verified on `ipex-llm`, including *LLaMA/LLaM | Model | CPU Example | GPU Example | |------------|----------------------------------------------------------------|-----------------------------------------------------------------| -| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/vicuna) |[link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna)| -| LLaMA 2 | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2) | -| LLaMA 3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3) | +| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/vicuna) |[link](python/llm/example/GPU/HuggingFace/LLM/vicuna)| +| LLaMA 2 | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2) | [link](python/llm/example/GPU/HuggingFace/LLM/llama2) | +| LLaMA 3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3) | 
[link](python/llm/example/GPU/HuggingFace/LLM/llama3) | | ChatGLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm) | | -| ChatGLM2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2) | -| ChatGLM3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm3) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3) | -| GLM-4 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm4) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4) | -| GLM-4V | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm-4v) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm-4v) | -| Mistral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mistral) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral) | -| Mixtral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral) | -| Falcon | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/falcon) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon) | -| MPT | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt) | -| Dolly-v1 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1) | -| Dolly-v2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2) | -| Replit Code| [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit) | +| ChatGLM2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) | [link](python/llm/example/GPU/HuggingFace/LLM/chatglm2) | +| ChatGLM3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm3) | [link](python/llm/example/GPU/HuggingFace/LLM/chatglm3) | +| GLM-4 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm4) | [link](python/llm/example/GPU/HuggingFace/LLM/glm4) | +| GLM-4V | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm-4v) | [link](python/llm/example/GPU/HuggingFace/Multimodal/glm-4v) | +| Mistral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mistral) | [link](python/llm/example/GPU/HuggingFace/LLM/mistral) | +| Mixtral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mixtral) | [link](python/llm/example/GPU/HuggingFace/LLM/mixtral) | +| Falcon | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/falcon) | [link](python/llm/example/GPU/HuggingFace/LLM/falcon) | +| MPT | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) | [link](python/llm/example/GPU/HuggingFace/LLM/mpt) | +| Dolly-v1 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) | [link](python/llm/example/GPU/HuggingFace/LLM/dolly-v1) | +| Dolly-v2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) | [link](python/llm/example/GPU/HuggingFace/LLM/dolly-v2) | +| Replit Code| [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) | [link](python/llm/example/GPU/HuggingFace/LLM/replit) | | RedPajama | [link1](python/llm/example/CPU/Native-Models), 
[link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/redpajama) | | | Phoenix | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phoenix) | | -| StarCoder | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/starcoder) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder) | -| Baichuan | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan) | -| Baichuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2) | -| InternLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm) | -| Qwen | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen) | -| Qwen1.5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen1.5) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5) | -| Qwen2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2) | -| Qwen-VL | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl) | -| Aquila | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila) | -| Aquila2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2) | +| StarCoder | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/starcoder) | [link](python/llm/example/GPU/HuggingFace/LLM/starcoder) | +| Baichuan | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) | [link](python/llm/example/GPU/HuggingFace/LLM/baichuan) | +| Baichuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan2) | [link](python/llm/example/GPU/HuggingFace/LLM/baichuan2) | +| InternLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm) | [link](python/llm/example/GPU/HuggingFace/LLM/internlm) | +| Qwen | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen) | [link](python/llm/example/GPU/HuggingFace/LLM/qwen) | +| Qwen1.5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen1.5) | [link](python/llm/example/GPU/HuggingFace/LLM/qwen1.5) | +| Qwen2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen2) | [link](python/llm/example/GPU/HuggingFace/LLM/qwen2) | +| Qwen-VL | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl) | [link](python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl) | +| Aquila | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila) | [link](python/llm/example/GPU/HuggingFace/LLM/aquila) | +| Aquila2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila2) | [link](python/llm/example/GPU/HuggingFace/LLM/aquila2) | | MOSS | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/moss) | | -| Whisper | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/whisper) | 
[link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper) | -| Phi-1_5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-1_5) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5) | -| Flan-t5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/flan-t5) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5) | +| Whisper | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/whisper) | [link](python/llm/example/GPU/HuggingFace/Multimodal/whisper) | +| Phi-1_5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-1_5) | [link](python/llm/example/GPU/HuggingFace/LLM/phi-1_5) | +| Flan-t5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/flan-t5) | [link](python/llm/example/GPU/HuggingFace/LLM/flan-t5) | | LLaVA | [link](python/llm/example/CPU/PyTorch-Models/Model/llava) | [link](python/llm/example/GPU/PyTorch-Models/Model/llava) | -| CodeLlama | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codellama) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama) | +| CodeLlama | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codellama) | [link](python/llm/example/GPU/HuggingFace/LLM/codellama) | | Skywork | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/skywork) | | | InternLM-XComposer | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer) | | | WizardCoder-Python | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/wizardcoder-python) | | | CodeShell | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codeshell) | | | Fuyu | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/fuyu) | | -| Distil-Whisper | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/distil-whisper) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper) | -| Yi | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yi) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi) | -| BlueLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/bluelm) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm) | +| Distil-Whisper | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/distil-whisper) | [link](python/llm/example/GPU/HuggingFace/Multimodal/distil-whisper) | +| Yi | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yi) | [link](python/llm/example/GPU/HuggingFace/LLM/yi) | +| BlueLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/bluelm) | [link](python/llm/example/GPU/HuggingFace/LLM/bluelm) | | Mamba | [link](python/llm/example/CPU/PyTorch-Models/Model/mamba) | [link](python/llm/example/GPU/PyTorch-Models/Model/mamba) | -| SOLAR | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/solar) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar) | -| Phixtral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phixtral) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral) | -| InternLM2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2) | -| RWKV4 | | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4) | -| RWKV5 | | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5) | +| SOLAR | 
[link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/solar) | [link](python/llm/example/GPU/HuggingFace/LLM/solar) | +| Phixtral | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phixtral) | [link](python/llm/example/GPU/HuggingFace/LLM/phixtral) | +| InternLM2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm2) | [link](python/llm/example/GPU/HuggingFace/LLM/internlm2) | +| RWKV4 | | [link](python/llm/example/GPU/HuggingFace/LLM/rwkv4) | +| RWKV5 | | [link](python/llm/example/GPU/HuggingFace/LLM/rwkv5) | | Bark | [link](python/llm/example/CPU/PyTorch-Models/Model/bark) | [link](python/llm/example/GPU/PyTorch-Models/Model/bark) | | SpeechT5 | | [link](python/llm/example/GPU/PyTorch-Models/Model/speech-t5) | | DeepSeek-MoE | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/deepseek-moe) | | | Ziya-Coding-34B-v1.0 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/ziya) | | -| Phi-2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2) | -| Phi-3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-3) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3) | -| Phi-3-vision | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-3-vision) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3-vision) | -| Yuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2) | -| Gemma | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/gemma) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma) | -| DeciLM-7B | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/deciLM-7b) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/deciLM-7b) | -| Deepseek | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/deepseek) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/deepseek) | -| StableLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/stablelm) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/stablelm) | -| CodeGemma | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codegemma) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegemma) | -| Command-R/cohere | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/cohere) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/cohere) | -| CodeGeeX2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codegeex2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegeex2) | -| MiniCPM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/minicpm) | +| Phi-2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-2) | [link](python/llm/example/GPU/HuggingFace/LLM/phi-2) | +| Phi-3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-3) | [link](python/llm/example/GPU/HuggingFace/LLM/phi-3) | +| Phi-3-vision | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-3-vision) | [link](python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision) | +| Yuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2) | [link](python/llm/example/GPU/HuggingFace/LLM/yuan2) | +| Gemma | 
[link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/gemma) | [link](python/llm/example/GPU/HuggingFace/LLM/gemma) | +| DeciLM-7B | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/deciLM-7b) | [link](python/llm/example/GPU/HuggingFace/LLM/deciLM-7b) | +| Deepseek | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/deepseek) | [link](python/llm/example/GPU/HuggingFace/LLM/deepseek) | +| StableLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/stablelm) | [link](python/llm/example/GPU/HuggingFace/LLM/stablelm) | +| CodeGemma | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codegemma) | [link](python/llm/example/GPU/HuggingFace/LLM/codegemma) | +| Command-R/cohere | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/cohere) | [link](python/llm/example/GPU/HuggingFace/LLM/cohere) | +| CodeGeeX2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codegeex2) | [link](python/llm/example/GPU/HuggingFace/LLM/codegeex2) | +| MiniCPM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/minicpm) | [link](python/llm/example/GPU/HuggingFace/LLM/minicpm) | ## Get Support - Please report a bug or raise a feature request by opening a [GitHub Issue](https://github.com/intel-analytics/ipex-llm/issues) diff --git a/docker/llm/inference/xpu/docker/Dockerfile b/docker/llm/inference/xpu/docker/Dockerfile index 89064cb0a2e..7a812482db7 100644 --- a/docker/llm/inference/xpu/docker/Dockerfile +++ b/docker/llm/inference/xpu/docker/Dockerfile @@ -53,7 +53,7 @@ RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRO # Download all-in-one benchmark and examples git clone https://github.com/intel-analytics/ipex-llm && \ cp -r ./ipex-llm/python/llm/dev/benchmark/ ./benchmark && \ - cp -r ./ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model ./examples && \ + cp -r ./ipex-llm/python/llm/example/GPU/HuggingFace/LLM ./examples && \ # Install vllm dependencies pip install --upgrade fastapi && \ pip install --upgrade "uvicorn[standard]" && \ diff --git a/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md b/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md index 4e5b2cdaaea..7278f2988ca 100644 --- a/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md +++ b/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md @@ -94,7 +94,7 @@ Start ipex-llm-xpu Docker Container. Choose one of the following commands to sta Press F1 to bring up the Command Palette and type in `Dev Containers: Attach to Running Container...` and select it and then select `my_container` -Now you are in a running Docker Container, Open folder `/ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model/`. +Now you are in a running Docker container. Open the folder `/ipex-llm/python/llm/example/GPU/HuggingFace/LLM`. diff --git a/docs/mddocs/Overview/FAQ/faq.md b/docs/mddocs/Overview/FAQ/faq.md index ab8f0df3385..2d57971afef 100644 --- a/docs/mddocs/Overview/FAQ/faq.md +++ b/docs/mddocs/Overview/FAQ/faq.md @@ -4,7 +4,7 @@ ### GGUF format usage with IPEX-LLM? -IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations). 
+IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Advanced-Quantizations). Please also refer to [here](https://github.com/intel-analytics/ipex-llm?tab=readme-ov-file#latest-update-) for our latest support. diff --git a/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md b/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md index 373d7e6c4be..3f6d3b9cc1b 100644 --- a/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md +++ b/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md @@ -23,7 +23,7 @@ output = tokenizer.batch_decode(output_ids) ``` > [!TIP] -> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels). +> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace). > [!NOTE] > You may apply more low bit optimizations (including INT8, INT5 and INT4) as follows: @@ -32,7 +32,7 @@ output = tokenizer.batch_decode(output_ids) > model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5") > ``` > -> See the CPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types) and GPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types). +> See the CPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types) and GPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/More-Data-Types). ## Save & Load @@ -45,4 +45,4 @@ new_model = AutoModelForCausalLM.load_low_bit(model_path) ``` > [!TIP] -> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load). +> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Save-Load). 
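To make the `hugging_face_format.md` snippets above easier to follow end to end, here is a minimal sketch that combines them: INT4 loading through the `ipex-llm` drop-in `AutoModelForCausalLM`, generation, and low-bit save/load. The model path, prompt, save directory, and the `'xpu'` device are illustrative assumptions rather than part of this change (skip the `.to('xpu')` calls on CPU-only machines); the renamed `GPU/HuggingFace` examples contain the tested, model-specific scripts.

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for the HF class

model_path = '/path/to/model/'  # placeholder: any Hugging Face checkpoint directory

# Load with INT4 optimization; load_in_low_bit="sym_int5"/"sym_int8"/... selects other data types
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
model = model.to('xpu')  # move to the Intel GPU; omit on CPU-only machines
tokenizer = AutoTokenizer.from_pretrained(model_path)

with torch.inference_mode():
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')
    output_ids = model.generate(input_ids, max_new_tokens=32)
    output = tokenizer.batch_decode(output_ids)

# Persist the converted low-bit weights once, then reload them directly later
save_dir = './model-sym-int4'  # placeholder path
model.save_low_bit(save_dir)
new_model = AutoModelForCausalLM.load_low_bit(save_dir)
```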
diff --git a/docs/readthedocs/source/doc/LLM/DockerGuides/docker_run_pytorch_inference_in_vscode.md b/docs/readthedocs/source/doc/LLM/DockerGuides/docker_run_pytorch_inference_in_vscode.md index 9a07609dc53..1b7fe28c0f8 100644 --- a/docs/readthedocs/source/doc/LLM/DockerGuides/docker_run_pytorch_inference_in_vscode.md +++ b/docs/readthedocs/source/doc/LLM/DockerGuides/docker_run_pytorch_inference_in_vscode.md @@ -99,7 +99,7 @@ Start ipex-llm-xpu Docker Container: Press F1 to bring up the Command Palette and type in `Dev Containers: Attach to Running Container...` and select it and then select `my_container` -Now you are in a running Docker Container, Open folder `/ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model/`. +Now you are in a running Docker container. Open the folder `/ipex-llm/python/llm/example/GPU/HuggingFace/LLM/`. diff --git a/docs/readthedocs/source/doc/LLM/Overview/FAQ/faq.md b/docs/readthedocs/source/doc/LLM/Overview/FAQ/faq.md index caf8bd51648..d62517c955d 100644 --- a/docs/readthedocs/source/doc/LLM/Overview/FAQ/faq.md +++ b/docs/readthedocs/source/doc/LLM/Overview/FAQ/faq.md @@ -4,7 +4,7 @@ ### GGUF format usage with IPEX-LLM? -IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations). +IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Advanced-Quantizations). Please also refer to [here](https://github.com/intel-analytics/ipex-llm?tab=readme-ov-file#latest-update-) for our latest support. ## How to Resolve Errors diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/hugging_face_format.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/hugging_face_format.md index 0eee498f671..1e7aae9d16a 100644 --- a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/hugging_face_format.md +++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/hugging_face_format.md @@ -25,7 +25,7 @@ output = tokenizer.batch_decode(output_ids) ```eval_rst .. seealso:: - See the complete CPU examples `here `_ and GPU examples `here `_. + See the complete CPU examples `here `_ and GPU examples `here `_. .. note:: @@ -35,7 +35,7 @@ output = tokenizer.batch_decode(output_ids) model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5") - See the CPU example `here `_ and GPU example `here `_. + See the CPU example `here `_ and GPU example `here `_. ``` ## Save & Load @@ -50,5 +50,5 @@ new_model = AutoModelForCausalLM.load_low_bit(model_path) ```eval_rst .. 
seealso:: - See the CPU example `here `_ and GPU example `here `_ + See the CPU example `here `_ and GPU example `here `_ ``` \ No newline at end of file diff --git a/docs/readthedocs/source/doc/LLM/Overview/examples_gpu.md b/docs/readthedocs/source/doc/LLM/Overview/examples_gpu.md index 8eea9f9f865..4ba6f453481 100644 --- a/docs/readthedocs/source/doc/LLM/Overview/examples_gpu.md +++ b/docs/readthedocs/source/doc/LLM/Overview/examples_gpu.md @@ -37,29 +37,29 @@ The following models have been verified on either servers or laptops with Intel | Model | Example of `transformers`-style API | |------------|-------------------------------------------------------| -| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* |[link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna)| -| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2) | -| ChatGLM2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2) | -| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral) | -| Falcon | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon) | +| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* |[link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/vicuna)| +| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/llama2) | +| ChatGLM2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/chatglm2) | +| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/mistral) | +| Falcon | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/falcon) | | MPT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) | | Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) | | Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) | | Replit | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) | -| StarCoder | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder) | +| StarCoder | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/starcoder) | | Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) | -| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2) | -| InternLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm) | -| Qwen | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen) | -| Aquila | 
[link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila) | -| Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper) | -| Chinese Llama2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2) | -| GPT-J | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j) | +| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/baichuan2) | +| InternLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/internlm) | +| Qwen | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/qwen) | +| Aquila | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/aquila) | +| Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/Multimodal/whisper) | +| Chinese Llama2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/chinese-llama2) | +| GPT-J | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/gpt-j) | ```eval_rst .. important:: - In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through ``transformers``-style API as `example `_. + In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through ``transformers``-style API as `example `_. ``` diff --git a/docs/readthedocs/source/index.rst b/docs/readthedocs/source/index.rst index b7125664597..dbd830d9660 100644 --- a/docs/readthedocs/source/index.rst +++ b/docs/readthedocs/source/index.rst @@ -33,7 +33,7 @@ It is built on top of the excellent work of llama.cpp, transformers, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc.
  • - It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModeScope, etc. + It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
  • 50+ models have been optimized/verified on ipex-llm (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here. @@ -47,11 +47,11 @@ Latest update 🔥 * [2024/05] ``ipex-llm`` now supports **Axolotl** for LLM finetuning on Intel GPU; see the quickstart `here `_. * [2024/04] You can now run **Open WebUI** on Intel GPU using ``ipex-llm``; see the quickstart `here `_. * [2024/04] You can now run **Llama 3** on Intel GPU using ``llama.cpp`` and ``ollama``; see the quickstart `here `_. -* [2024/04] ``ipex-llm`` now supports **Llama 3** on Intel `GPU `_ and `CPU `_. +* [2024/04] ``ipex-llm`` now supports **Llama 3** on Intel `GPU `_ and `CPU `_. * [2024/04] ``ipex-llm`` now provides a C++ interface, which can be used as an accelerated backend for running `llama.cpp `_ and `ollama `_ on Intel GPU. * [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_. * [2024/02] ``ipex-llm`` now supports directly loading model from `ModelScope `_ (`魔搭 `_). -* [2024/02] ``ipex-llm`` added inital **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large-size LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM. +* [2024/02] ``ipex-llm`` added initial **INT2** support (based on llama.cpp `IQ2 `_ mechanism), which makes it possible to run large-sized LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM. * [2024/02] Users can now use ``ipex-llm`` through `Text-Generation-WebUI `_ GUI. * [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively. * [2024/02] ``ipex-llm`` now supports a comprehensive list of LLM finetuning on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_). @@ -62,10 +62,10 @@ :color: primary * [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_). - * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_. + * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_. * [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_). - * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**. - * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models in to ``ipex-llm`` is available. + * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**. + * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available. * [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_. * [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_. * [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU. 
@@ -197,10 +197,10 @@ Code Examples ============================================ * Low bit inference - * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_ - * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_ - * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_ - * `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_ + * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_ + * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_ + * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_ + * `INT2 inference `_: **INT2** LLM inference (based on llama.cpp IQ2 mechanism) on Intel `GPU `_ * FP16/BF16 inference @@ -210,9 +210,9 @@ Code Examples * Save and load * `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models - * `GGUF `_: directly loading GGUF models into ``ipex-llm`` - * `AWQ `_: directly loading AWQ models into ``ipex-llm`` - * `GPTQ `_: directly loading GPTQ models into ``ipex-llm`` + * `GGUF `_: directly loading GGUF models into ``ipex-llm`` + * `AWQ `_: directly loading AWQ models into ``ipex-llm`` + * `GPTQ `_: directly loading GPTQ models into ``ipex-llm`` * Finetuning @@ -221,7 +221,7 @@ Code Examples * Integration with community libraries - * `HuggingFace transformers `_ + * `HuggingFace transformers `_ * `Standard PyTorch model `_ * `DeepSpeed-AutoTP `_ * `HuggingFace PEFT `_ @@ -267,8 +267,8 @@ Verified Models link1, link2 - link - link + link + link LLaMA 2 @@ -276,15 +276,15 @@ Verified Models link1, link2 - link - link + link + link LLaMA 3 link - link + link ChatGLM @@ -297,77 +297,77 @@ Verified Models link - link + link ChatGLM3 link - link + link GLM-4 link - link + link GLM-4V link - link + link Mistral link - link + link Mixtral link - link + link Falcon link - link + link MPT link - link + link Dolly-v1 link - link + link Dolly-v2 link - link + link Replit Code link - link + link RedPajama @@ -389,70 +389,70 @@ Verified Models link1, link2 - link + link Baichuan link - link + link Baichuan2 link - link + link InternLM link - link + link Qwen link - link + link Qwen1.5 link - link + link Qwen2 link - link + link Qwen-VL link - link + link Aquila link - link + link Aquila2 link - link + link MOSS @@ -465,21 +465,21 @@ Verified Models link - link + link Phi-1_5 link - link + link Flan-t5 link - link + link LLaVA @@ -493,7 +493,7 @@ Verified Models link - link + link Skywork @@ -530,21 +530,21 @@ Verified Models link - link + link Yi link - link + link BlueLM link - link + link Mamba @@ -558,33 +558,33 @@ Verified Models link - link + link Phixtral link - link + link InternLM2 link - link + link RWKV4 - link + link RWKV5 - link + link Bark @@ -616,84 +616,84 @@ Verified Models link - link + link Phi-3 link - link + link Phi-3-vision link - link + link Yuan2 link - link + link Gemma link - link + link DeciLM-7B link - link + link Deepseek link - link + link StableLM link - link + link CodeGemma link - link + link Command-R/cohere link - link + link CodeGeeX2 link - link + link MiniCPM link - link + link diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/README.md b/python/llm/example/GPU/HF-Transformers-AutoModels/README.md deleted file mode 100644 index ba1b370ad9f..00000000000 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/README.md +++ /dev/null @@ -1,8 +0,0 @@ -# Running HuggingFace `transformers` model using IPEX-LLM on Intel GPU - -This folder contains examples of 
running any HuggingFace `transformers` model on IPEX-LLM (using the standard AutoModel APIs): - -- [Model](Model): examples of running HuggingFace transformers models (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) using INT4 optimizations -- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (FP8/INT8/FP4, etc.) -- [Save-Load](Save-Load): examples of saving and loading low-bit models -- [Advanced-Quantizations](Advanced-Quantizations): examples of loading GGUF/AWQ/GPTQ models diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ/README.md b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ/README.md rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ/generate.py b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ/generate.py rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/AWQ/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2/README.md b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2/README.md rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2/generate.py b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF-IQ2/generate.py rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF-IQ2/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF/README.md b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF/README.md rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF/generate.py b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GGUF/generate.py rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GGUF/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ/README.md b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ/README.md rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ/generate.py b/python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ/generate.py 
rename to python/llm/example/GPU/HuggingFace/Advanced-Quantizations/GPTQ/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/README.md b/python/llm/example/GPU/HuggingFace/LLM/README.md similarity index 99% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/README.md rename to python/llm/example/GPU/HuggingFace/LLM/README.md index 4e60a656dd3..6d0a8967b73 100644 --- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/README.md +++ b/python/llm/example/GPU/HuggingFace/LLM/README.md @@ -1,5 +1,2 @@ # IPEX-LLM Transformers INT4 Optimization for Large Language Model on Intel GPUs You can use IPEX-LLM to run almost every HuggingFace Transformers model with INT4 optimizations on your laptop with Intel GPUs. This directory contains example scripts to help you quickly get started using IPEX-LLM to run some popular open-source models in the community. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it. - - - diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/README.md b/python/llm/example/GPU/HuggingFace/LLM/aquila/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/README.md rename to python/llm/example/GPU/HuggingFace/LLM/aquila/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/generate.py b/python/llm/example/GPU/HuggingFace/LLM/aquila/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/aquila/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/README.md b/python/llm/example/GPU/HuggingFace/LLM/aquila2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/README.md rename to python/llm/example/GPU/HuggingFace/LLM/aquila2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/aquila2/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/aquila2/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/README.md b/python/llm/example/GPU/HuggingFace/LLM/baichuan/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/README.md rename to python/llm/example/GPU/HuggingFace/LLM/baichuan/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/generate.py b/python/llm/example/GPU/HuggingFace/LLM/baichuan/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/baichuan/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/README.md b/python/llm/example/GPU/HuggingFace/LLM/baichuan2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/README.md rename to python/llm/example/GPU/HuggingFace/LLM/baichuan2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/baichuan2/generate.py similarity index 100% rename from 
python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/baichuan2/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/README.md b/python/llm/example/GPU/HuggingFace/LLM/bluelm/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/README.md rename to python/llm/example/GPU/HuggingFace/LLM/bluelm/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/generate.py b/python/llm/example/GPU/HuggingFace/LLM/bluelm/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/bluelm/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/bluelm/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/README.md b/python/llm/example/GPU/HuggingFace/LLM/chatglm2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/README.md rename to python/llm/example/GPU/HuggingFace/LLM/chatglm2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/chatglm2/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/chatglm2/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/streamchat.py b/python/llm/example/GPU/HuggingFace/LLM/chatglm2/streamchat.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2/streamchat.py rename to python/llm/example/GPU/HuggingFace/LLM/chatglm2/streamchat.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/README.md b/python/llm/example/GPU/HuggingFace/LLM/chatglm3/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/README.md rename to python/llm/example/GPU/HuggingFace/LLM/chatglm3/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/generate.py b/python/llm/example/GPU/HuggingFace/LLM/chatglm3/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/chatglm3/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/streamchat.py b/python/llm/example/GPU/HuggingFace/LLM/chatglm3/streamchat.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm3/streamchat.py rename to python/llm/example/GPU/HuggingFace/LLM/chatglm3/streamchat.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/README.md b/python/llm/example/GPU/HuggingFace/LLM/chinese-llama2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/README.md rename to python/llm/example/GPU/HuggingFace/LLM/chinese-llama2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/chinese-llama2/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/chinese-llama2/generate.py diff --git 
a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegeex2/README.md b/python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegeex2/README.md rename to python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegeex2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/codegeex2/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegeex2/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/codegeex2/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegemma/README.md b/python/llm/example/GPU/HuggingFace/LLM/codegemma/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegemma/README.md rename to python/llm/example/GPU/HuggingFace/LLM/codegemma/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegemma/generate.py b/python/llm/example/GPU/HuggingFace/LLM/codegemma/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codegemma/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/codegemma/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/generate.py b/python/llm/example/GPU/HuggingFace/LLM/codellama/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/codellama/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/readme.md b/python/llm/example/GPU/HuggingFace/LLM/codellama/readme.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama/readme.md rename to python/llm/example/GPU/HuggingFace/LLM/codellama/readme.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codeshell/README.md b/python/llm/example/GPU/HuggingFace/LLM/codeshell/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codeshell/README.md rename to python/llm/example/GPU/HuggingFace/LLM/codeshell/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/codeshell/server.py b/python/llm/example/GPU/HuggingFace/LLM/codeshell/server.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/codeshell/server.py rename to python/llm/example/GPU/HuggingFace/LLM/codeshell/server.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/cohere/README.md b/python/llm/example/GPU/HuggingFace/LLM/cohere/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/cohere/README.md rename to python/llm/example/GPU/HuggingFace/LLM/cohere/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/cohere/generate.py b/python/llm/example/GPU/HuggingFace/LLM/cohere/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/cohere/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/cohere/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/deciLM-7b/README.md b/python/llm/example/GPU/HuggingFace/LLM/deciLM-7b/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/deciLM-7b/README.md rename 
to python/llm/example/GPU/HuggingFace/LLM/deciLM-7b/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/deciLM-7b/generate.py b/python/llm/example/GPU/HuggingFace/LLM/deciLM-7b/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/deciLM-7b/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/deciLM-7b/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/deepseek/README.md b/python/llm/example/GPU/HuggingFace/LLM/deepseek/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/deepseek/README.md rename to python/llm/example/GPU/HuggingFace/LLM/deepseek/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/deepseek/generate.py b/python/llm/example/GPU/HuggingFace/LLM/deepseek/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/deepseek/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/deepseek/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/README.md b/python/llm/example/GPU/HuggingFace/LLM/dolly-v1/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/README.md rename to python/llm/example/GPU/HuggingFace/LLM/dolly-v1/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/generate.py b/python/llm/example/GPU/HuggingFace/LLM/dolly-v1/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v1/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/dolly-v1/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/README.md b/python/llm/example/GPU/HuggingFace/LLM/dolly-v2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/README.md rename to python/llm/example/GPU/HuggingFace/LLM/dolly-v2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/dolly-v2/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/dolly-v2/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/dolly-v2/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/README.md b/python/llm/example/GPU/HuggingFace/LLM/falcon/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/README.md rename to python/llm/example/GPU/HuggingFace/LLM/falcon/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/falcon-7b-instruct/modelling_RW.py b/python/llm/example/GPU/HuggingFace/LLM/falcon/falcon-7b-instruct/modelling_RW.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/falcon-7b-instruct/modelling_RW.py rename to python/llm/example/GPU/HuggingFace/LLM/falcon/falcon-7b-instruct/modelling_RW.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/generate.py b/python/llm/example/GPU/HuggingFace/LLM/falcon/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/falcon/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/README.md 
b/python/llm/example/GPU/HuggingFace/LLM/flan-t5/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/README.md rename to python/llm/example/GPU/HuggingFace/LLM/flan-t5/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/generate.py b/python/llm/example/GPU/HuggingFace/LLM/flan-t5/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/flan-t5/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/flan-t5/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma/README.md b/python/llm/example/GPU/HuggingFace/LLM/gemma/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma/README.md rename to python/llm/example/GPU/HuggingFace/LLM/gemma/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma/generate.py b/python/llm/example/GPU/HuggingFace/LLM/gemma/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/gemma/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/gemma/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/README.md b/python/llm/example/GPU/HuggingFace/LLM/glm4/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/README.md rename to python/llm/example/GPU/HuggingFace/LLM/glm4/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/generate.py b/python/llm/example/GPU/HuggingFace/LLM/glm4/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/glm4/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/streamchat.py b/python/llm/example/GPU/HuggingFace/LLM/glm4/streamchat.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4/streamchat.py rename to python/llm/example/GPU/HuggingFace/LLM/glm4/streamchat.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/generate.py b/python/llm/example/GPU/HuggingFace/LLM/gpt-j/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/gpt-j/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/readme.md b/python/llm/example/GPU/HuggingFace/LLM/gpt-j/readme.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j/readme.md rename to python/llm/example/GPU/HuggingFace/LLM/gpt-j/readme.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/README.md b/python/llm/example/GPU/HuggingFace/LLM/internlm/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/README.md rename to python/llm/example/GPU/HuggingFace/LLM/internlm/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/generate.py b/python/llm/example/GPU/HuggingFace/LLM/internlm/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/internlm/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2/README.md 
b/python/llm/example/GPU/HuggingFace/LLM/internlm2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2/README.md rename to python/llm/example/GPU/HuggingFace/LLM/internlm2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/internlm2/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm2/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/internlm2/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/README.md b/python/llm/example/GPU/HuggingFace/LLM/llama2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/README.md rename to python/llm/example/GPU/HuggingFace/LLM/llama2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/llama2/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/llama2/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3/README.md b/python/llm/example/GPU/HuggingFace/LLM/llama3/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3/README.md rename to python/llm/example/GPU/HuggingFace/LLM/llama3/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3/generate.py b/python/llm/example/GPU/HuggingFace/LLM/llama3/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/llama3/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/minicpm/README.md b/python/llm/example/GPU/HuggingFace/LLM/minicpm/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/minicpm/README.md rename to python/llm/example/GPU/HuggingFace/LLM/minicpm/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/minicpm/generate.py b/python/llm/example/GPU/HuggingFace/LLM/minicpm/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/minicpm/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/minicpm/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/README.md b/python/llm/example/GPU/HuggingFace/LLM/mistral/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/README.md rename to python/llm/example/GPU/HuggingFace/LLM/mistral/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/generate.py b/python/llm/example/GPU/HuggingFace/LLM/mistral/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/mistral/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/README.md b/python/llm/example/GPU/HuggingFace/LLM/mixtral/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/README.md rename to python/llm/example/GPU/HuggingFace/LLM/mixtral/README.md diff --git 
a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/generate.py b/python/llm/example/GPU/HuggingFace/LLM/mixtral/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mixtral/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/mixtral/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/README.md b/python/llm/example/GPU/HuggingFace/LLM/mpt/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/README.md rename to python/llm/example/GPU/HuggingFace/LLM/mpt/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/generate.py b/python/llm/example/GPU/HuggingFace/LLM/mpt/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/mpt/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/mpt/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/README.md b/python/llm/example/GPU/HuggingFace/LLM/phi-1_5/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/README.md rename to python/llm/example/GPU/HuggingFace/LLM/phi-1_5/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/generate.py b/python/llm/example/GPU/HuggingFace/LLM/phi-1_5/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-1_5/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/phi-1_5/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/README.md b/python/llm/example/GPU/HuggingFace/LLM/phi-2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/README.md rename to python/llm/example/GPU/HuggingFace/LLM/phi-2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/phi-2/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/phi-2/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3/README.md b/python/llm/example/GPU/HuggingFace/LLM/phi-3/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3/README.md rename to python/llm/example/GPU/HuggingFace/LLM/phi-3/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3/generate.py b/python/llm/example/GPU/HuggingFace/LLM/phi-3/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/phi-3/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral/README.md b/python/llm/example/GPU/HuggingFace/LLM/phixtral/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral/README.md rename to python/llm/example/GPU/HuggingFace/LLM/phixtral/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral/generate.py b/python/llm/example/GPU/HuggingFace/LLM/phixtral/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phixtral/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/phixtral/generate.py diff --git 
a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/README.md rename to python/llm/example/GPU/HuggingFace/LLM/qwen/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/generate.py b/python/llm/example/GPU/HuggingFace/LLM/qwen/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/qwen/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen1.5/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5/README.md rename to python/llm/example/GPU/HuggingFace/LLM/qwen1.5/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5/generate.py b/python/llm/example/GPU/HuggingFace/LLM/qwen1.5/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/qwen1.5/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2/README.md rename to python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/qwen2/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/qwen2/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/redpajama/README.md b/python/llm/example/GPU/HuggingFace/LLM/redpajama/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/redpajama/README.md rename to python/llm/example/GPU/HuggingFace/LLM/redpajama/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/redpajama/generate.py b/python/llm/example/GPU/HuggingFace/LLM/redpajama/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/redpajama/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/redpajama/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/README.md b/python/llm/example/GPU/HuggingFace/LLM/replit/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/README.md rename to python/llm/example/GPU/HuggingFace/LLM/replit/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/generate.py b/python/llm/example/GPU/HuggingFace/LLM/replit/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/replit/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/replit/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4/README.md b/python/llm/example/GPU/HuggingFace/LLM/rwkv4/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4/README.md rename to python/llm/example/GPU/HuggingFace/LLM/rwkv4/README.md diff --git 
a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4/generate.py b/python/llm/example/GPU/HuggingFace/LLM/rwkv4/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv4/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/rwkv4/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5/README.md b/python/llm/example/GPU/HuggingFace/LLM/rwkv5/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5/README.md rename to python/llm/example/GPU/HuggingFace/LLM/rwkv5/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5/generate.py b/python/llm/example/GPU/HuggingFace/LLM/rwkv5/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/rwkv5/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/rwkv5/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/README.md b/python/llm/example/GPU/HuggingFace/LLM/solar/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/README.md rename to python/llm/example/GPU/HuggingFace/LLM/solar/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/generate.py b/python/llm/example/GPU/HuggingFace/LLM/solar/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/solar/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/solar/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/stablelm/README.md b/python/llm/example/GPU/HuggingFace/LLM/stablelm/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/stablelm/README.md rename to python/llm/example/GPU/HuggingFace/LLM/stablelm/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/stablelm/generate.py b/python/llm/example/GPU/HuggingFace/LLM/stablelm/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/stablelm/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/stablelm/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/generate.py b/python/llm/example/GPU/HuggingFace/LLM/starcoder/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/starcoder/generate.py diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/readme.md b/python/llm/example/GPU/HuggingFace/LLM/starcoder/readme.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder/readme.md rename to python/llm/example/GPU/HuggingFace/LLM/starcoder/readme.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/README.md b/python/llm/example/GPU/HuggingFace/LLM/vicuna/README.md similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/README.md rename to python/llm/example/GPU/HuggingFace/LLM/vicuna/README.md diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/generate.py b/python/llm/example/GPU/HuggingFace/LLM/vicuna/generate.py similarity index 100% rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna/generate.py rename to python/llm/example/GPU/HuggingFace/LLM/vicuna/generate.py diff --git 
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/yi/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/generate.py b/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yi/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/yi/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/README.md b/python/llm/example/GPU/HuggingFace/LLM/yuan2/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/README.md
rename to python/llm/example/GPU/HuggingFace/LLM/yuan2/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/yuan2/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/generate.py
rename to python/llm/example/GPU/HuggingFace/LLM/yuan2/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/config.json b/python/llm/example/GPU/HuggingFace/LLM/yuan2/yuan2-2B-instruct/config.json
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/config.json
rename to python/llm/example/GPU/HuggingFace/LLM/yuan2/yuan2-2B-instruct/config.json
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py b/python/llm/example/GPU/HuggingFace/LLM/yuan2/yuan2-2B-instruct/yuan_hf_model.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py
rename to python/llm/example/GPU/HuggingFace/LLM/yuan2/yuan2-2B-instruct/yuan_hf_model.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types/README.md b/python/llm/example/GPU/HuggingFace/More-Data-Types/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types/README.md
rename to python/llm/example/GPU/HuggingFace/More-Data-Types/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types/transformers_low_bit_pipeline.py b/python/llm/example/GPU/HuggingFace/More-Data-Types/transformers_low_bit_pipeline.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types/transformers_low_bit_pipeline.py
rename to python/llm/example/GPU/HuggingFace/More-Data-Types/transformers_low_bit_pipeline.py
diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/README.md
new file mode 100644
index 00000000000..56a57505d49
--- /dev/null
+++ b/python/llm/example/GPU/HuggingFace/Multimodal/README.md
@@ -0,0 +1,3 @@
+# Running HuggingFace multimodal models using IPEX-LLM on Intel GPU
+
+This folder contains examples of running multimodal models on IPEX-LLM. Each model has its own dedicated folder, where you can find detailed instructions on how to install and run it.
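For orientation, the relocated `generate.py` examples above all follow the same AutoModel-style INT4 pattern that the moved READMEs describe. Below is a minimal sketch of that flow, not taken from any file in this PR; it assumes `ipex-llm` is installed with XPU support, and the model id, prompt, and token budget are illustrative placeholders:

```python
# Minimal sketch of the AutoModel-style INT4 flow used by the generate.py
# examples; the model id and prompt are placeholders.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in AutoModel API

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id

# load_in_4bit=True converts the weights to INT4 while loading
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to("xpu")  # run the optimized model on the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```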
diff --git a/python/llm/example/GPU/StableDiffusion/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/README.md
similarity index 100%
rename from python/llm/example/GPU/StableDiffusion/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/README.md
diff --git a/python/llm/example/GPU/StableDiffusion/lora-lcm.py b/python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/lora-lcm.py
similarity index 100%
rename from python/llm/example/GPU/StableDiffusion/lora-lcm.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/lora-lcm.py
diff --git a/python/llm/example/GPU/StableDiffusion/sdxl.py b/python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/sdxl.py
similarity index 100%
rename from python/llm/example/GPU/StableDiffusion/sdxl.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/StableDiffusion/sdxl.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/distil-whisper/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/distil-whisper/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/recognize.py b/python/llm/example/GPU/HuggingFace/Multimodal/distil-whisper/recognize.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/distil-whisper/recognize.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/distil-whisper/recognize.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm-4v/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/glm-4v/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm-4v/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/glm-4v/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm-4v/generate.py b/python/llm/example/GPU/HuggingFace/Multimodal/glm-4v/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm-4v/generate.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/glm-4v/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3-vision/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3-vision/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3-vision/generate.py b/python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-3-vision/generate.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/phi-3-vision/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/chat.py b/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/chat.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl/chat.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/chat.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/README.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/generate.py b/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/voiceassistant/generate.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/generate.py
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/readme.md b/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/readme.md
rename to python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/recognize.py b/python/llm/example/GPU/HuggingFace/Multimodal/whisper/recognize.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper/recognize.py
rename to python/llm/example/GPU/HuggingFace/Multimodal/whisper/recognize.py
diff --git a/python/llm/example/GPU/HuggingFace/README.md b/python/llm/example/GPU/HuggingFace/README.md
new file mode 100644
index 00000000000..8dbae40a24f
--- /dev/null
+++ b/python/llm/example/GPU/HuggingFace/README.md
@@ -0,0 +1,9 @@
+# Running HuggingFace models using IPEX-LLM on Intel GPU
+
+This folder contains examples of running any HuggingFace model on IPEX-LLM:
+
+- [LLM](LLM): examples of running large language models (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) using IPEX-LLM optimizations
+- [Multimodal](Multimodal): examples of running large multimodal models (StableDiffusion models, Qwen-VL-Chat, glm-4v, etc.) using IPEX-LLM optimizations
+- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (FP8/INT8/FP4, etc.)
+- [Save-Load](Save-Load): examples of saving and loading low-bit models
+- [Advanced-Quantizations](Advanced-Quantizations): examples of loading GGUF/AWQ/GPTQ models
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load/README.md b/python/llm/example/GPU/HuggingFace/Save-Load/README.md
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load/README.md
rename to python/llm/example/GPU/HuggingFace/Save-Load/README.md
diff --git a/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load/generate.py b/python/llm/example/GPU/HuggingFace/Save-Load/generate.py
similarity index 100%
rename from python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load/generate.py
rename to python/llm/example/GPU/HuggingFace/Save-Load/generate.py
diff --git a/python/llm/example/GPU/README.md b/python/llm/example/GPU/README.md
index ab13bf95485..dc7600c6bf0 100644
--- a/python/llm/example/GPU/README.md
+++ b/python/llm/example/GPU/README.md
@@ -3,7 +3,7 @@
 This folder contains examples of running IPEX-LLM on Intel GPU:
 
 - [Applications](Applications): running LLM applications (such as autogen) on IPEX-LLM
-- [HF-Transformers-AutoModels](HF-Transformers-AutoModels): running any ***Hugging Face Transformers*** model on IPEX-LLM (using the standard AutoModel APIs)
+- [HuggingFace](HuggingFace): running ***HuggingFace*** models on IPEX-LLM (using the standard AutoModel APIs), including language models and multimodal models.
 - [LLM-Finetuning](LLM-Finetuning): running ***finetuning*** (such as LoRA, QLoRA, QA-LoRA, etc) using IPEX-LLM on Intel GPUs
 - [vLLM-Serving](vLLM-Serving): running ***vLLM*** serving framework on intel GPUs (with IPEX-LLM low-bit optimized models)
 - [Deepspeed-AutoTP](Deepspeed-AutoTP): running distributed inference using ***DeepSpeed AutoTP*** (with IPEX-LLM low-bit optimized models) on Intel GPUs
@@ -15,7 +15,6 @@ This folder contains examples of running IPEX-LLM on Intel GPU:
 - [Speculative-Decoding](Speculative-Decoding): running any ***Hugging Face Transformers*** model with ***self-speculative decoding*** on Intel GPUs
 - [ModelScope-Models](ModelScope-Models): running ***ModelScope*** model with IPEX-LLM on Intel GPUs
 - [Long-Context](Long-Context): running **long-context** generation with IPEX-LLM on Intel Arc™ A770 Graphics.
-- [StableDiffusion](StableDiffusion): running **stable diffusion** with IPEX-LLM on Intel GPUs.
 
 ## System Support
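As context for the More-Data-Types and Save-Load folders listed in the new HuggingFace/README.md above, here is a hedged sketch of that flow. The directory names and the FP8 choice are illustrative placeholders; the `save_low_bit`/`load_low_bit` calls follow the IPEX-LLM API as documented elsewhere in the repo, not code from this PR:

```python
# Sketch of applying a non-default low-bit format and persisting the result;
# the paths and the "fp8" choice are placeholders.
from ipex_llm.transformers import AutoModelForCausalLM

# Pick another low-bit format at load time (FP8 here instead of the INT4 default)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_low_bit="fp8",
                                             trust_remote_code=True)

model.save_low_bit("./llama2-7b-fp8")   # save the already-converted weights

# Later runs reload the low-bit checkpoint directly, skipping reconversion
model = AutoModelForCausalLM.load_low_bit("./llama2-7b-fp8")
model = model.to("xpu")
```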