This page describes the flow for running LLMs on an AMD NPU using PyTorch. This is a general-purpose flow providing functional support for a broad set of LLMs. It is intended for prototyping and early development activities. This flow is not optimized for performance and should not be used for benchmarking purposes.
For benchmarking and deployment purposes, a set of performance-optimized models is available upon request on the AMD secure download site: Optimized LLMs on RyzenAI
The following models are supported on RyzenAI with the four quantization recipes described here.
Model Name | SmoothQuant | AWQ | AWQPlus | PerGroup | Quant Model Size (GB) |
---|---|---|---|---|---|
facebook/opt-125m | ✓ | ✓ | ✓ | ✓ | 0.07 |
facebook/opt-1.3b | ✓ | ✓ | ✓ | ✓ | 0.8 |
facebook/opt-2.7b | ✓ | ✓ | ✓ | ✓ | 1.4 |
facebook/opt-6.7b | ✓ | ✓ | ✓ | ✓ | 3.8 |
facebook/opt-13b | ✓ | ✓ | ✓ | | 7.5 |
llama-2-7b* | ✓ | ✓ | | | 3.9 |
llama-2-7b-chat* | ✓ | ✓ | ✓ | ✓ | 3.9 |
llama-2-13b* | ✓ | | | | 7.2 |
llama-2-13b-chat* | ✓ | ✓ | ✓ | | 7.2 |
Meta-Llama-3-8B-Instruct* | ✓ | | | | 4.8 |
bigcode/starcoder | ✓ | ✓ | ✓ | | 8.0 |
code-llama-2-7b* | ✓ | ✓ | ✓ | | 3.9 |
codellama/CodeLlama-7b-hf | ✓ | ✓ | ✓ | | 3.9 |
codellama/CodeLlama-7b-instruct-hf | ✓ | ✓ | ✓ | | 3.9 |
google/gemma-2b** | ✓ | ✓ | ✓ | | 1.2 |
google/gemma-7b** | ✓ | ✓ | ✓ | | 4.0 |
THUDM/chatglm-6b | ✓ | | | | 3.3 |
THUDM/chatglm3-6b | ✓ | ✓ | ✓ | | 4.1 |
The above list is just a representative collection of the models supported by the transformers-based flow.
📌 Important
* These models require locally downloaded weights.
** These models require transformers==4.39.1:
pip install transformers==4.39.1
Then follow the same run_awq.py commands.
Create conda environment:
cd <transformers>
set TRANSFORMERS_ROOT=%CD%
conda env create --file=env.yaml
conda activate ryzenai-transformers
build_dependencies.bat
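To verify that the environment was created correctly, a quick Python check can help. This is a minimal sketch; it only assumes that torch and transformers are installed by env.yaml.
# Sanity check: confirm PyTorch and Hugging Face Transformers are importable
# in the activated ryzenai-transformers environment, and print their versions.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)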
The AWQ model zoo provides precomputed scales, clips, and zeros for various LLMs, including OPT and Llama. Download the precomputed results:
git lfs install
cd %TRANSFORMERS_ROOT%\ext
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
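As a quick check that the cache was cloned, the snippet below counts the entries in the awq_cache directory. This is a minimal sketch; the exact file names inside awq_cache are not assumed.
# Verify that the AWQ cache was cloned under %TRANSFORMERS_ROOT%\ext\awq_cache
import os

cache_dir = os.path.join(os.environ["TRANSFORMERS_ROOT"], "ext", "awq_cache")
print(f"{len(os.listdir(cache_dir))} entries in {cache_dir}")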
Optionally, map the repository to a virtual drive to keep paths short and avoid Windows path-length issues.
On Command Prompt
@REM use any unused drive letter, Z: for example
subst Z: %cd%
@REM switch to the Z: drive
Z:
You can remove the virtual drive with:
On Command Prompt
subst /d Z:
Set up the environment for your target device (PHX or STX).
On Anaconda Command Prompt
## For PHX
.\setup_phx.bat
## For STX
.\setup_stx.bat
On Anaconda PowerShell
## For PHX
.\setup_phx.ps1
## For STX
.\setup_stx.ps1
Remember to set up the target environment again if you switch to or from a virtual drive!
Install the ops packages:
pip install ops\cpp --force-reinstall
pip install ops\torch_cpp --force-reinstall
When using locally downloaded weights, pass the model directory name as the argument to --model_name. Only certain model names are supported by default; make sure the model directory name matches one of the supported model names.
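For illustration, the sketch below shows how Hugging Face Transformers resolves a local model directory. The folder name llama-2-7b-chat is only an example, and the run_* scripts handle model loading internally, so their behavior may differ from this sketch.
from transformers import AutoModelForCausalLM, AutoTokenizer

# A local directory whose name matches a supported model; it must contain
# config.json, the tokenizer files, and the model weights.
model_dir = "llama-2-7b-chat"

# from_pretrained() resolves an existing local directory before attempting a
# download from the Hugging Face Hub, which is why the directory name passed
# to --model_name must match a supported model name exactly.
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)
print(model.config.model_type)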
cd %TRANSFORMERS_ROOT%\models\llm
📌 w8a16 is only supported on STX.
To see all available options for the SmoothQuant flow:
python run_smoothquant.py --help
# CPU - bf16
python run_smoothquant.py --model_name llama-2-7b --task benchmark --target cpu --precision bf16
# AIE (w8a16 only supported on STX)
python run_smoothquant.py --model_name llama-2-7b --task quantize
python run_smoothquant.py --model_name llama-2-7b --task benchmark --target aie --precision w8a8
python run_smoothquant.py --model_name llama-2-7b --task benchmark --target aie --precision w8a16
python run_smoothquant.py --model_name llama-2-7b --task benchmark_long --target aie
python run_smoothquant.py --model_name llama-2-7b --task decode --target aie
python run_smoothquant.py --model_name llama-2-7b --task perplexity --target aie
To see all available options for the AWQ flow:
python run_awq.py --help
# CPU
python run_awq.py --model_name llama-2-7b-chat --task benchmark --target cpu --precision bf16
# AIE
python run_awq.py --model_name llama-2-7b-chat --task quantize
python run_awq.py --model_name llama-2-7b-chat --task benchmark --target aie
python run_awq.py --model_name llama-2-7b-chat --task benchmark --target aie --flash_attention
python run_awq.py --model_name llama-2-7b-chat --task benchmark --target aie --flash_attention --fast_mlp
python run_awq.py --model_name llama-2-7b-chat --task quantize
python run_awq.py --model_name llama-2-7b-chat --task decode --target aie
Note: A known issue related to the kernel driver shows up when using --fast_mlp.
Quantize and run with AWQPlus:
python run_awq.py --model_name llama-2-7b-chat --task quantize --algorithm awqplus
python run_awq.py --model_name llama-2-7b-chat --task decode --algorithm awqplus
Quantize and run with per-group (PerGroup) quantization:
python run_awq.py --model_name llama-2-7b-chat --task quantize --algorithm pergrp
python run_awq.py --model_name llama-2-7b-chat --task decode --algorithm pergrp
To profile the model, temporarily replace modeling_llama.py in the installed transformers package with the patched version provided here, run the profiling task, and then restore the original:
@REM back up the original modeling_llama.py from the transformers package
xcopy /f /y %CONDA_PREFIX%\Lib\site-packages\transformers\models\llama\modeling_llama.py modeling_llama_bak.py
@REM copy the patched modeling_llama.py into the transformers package
xcopy /f /y modeling_llama.py %CONDA_PREFIX%\Lib\site-packages\transformers\models\llama
@REM run layer profiling with fast attention
python run_awq.py --model_name llama-2-7b --task profilemodel --fast_attention --profile_layer True
@REM restore the original modeling_llama.py
xcopy /f /y modeling_llama_bak.py %CONDA_PREFIX%\Lib\site-packages\transformers\models\llama\modeling_llama.py
Note: Each run generates a log file named log_<model_name>.log in the ./logs directory.
The time-to-first-token (TTFT) and token-time calculations are illustrated in the figure below.
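As a rough sketch of the underlying arithmetic, the snippet below uses the conventional definitions (an assumption on my part; the figure is the authoritative description) with hypothetical timestamps.
# Hypothetical timestamps (in seconds) for a single generation run.
t_start = 0.0    # prompt submitted to the model
t_first = 0.85   # first output token produced (end of prefill)
t_end = 4.05     # last output token produced
num_tokens = 65  # total number of generated tokens

ttft = t_first - t_start                            # time-to-first-token
token_time = (t_end - t_first) / (num_tokens - 1)   # average time per subsequent token

print(f"TTFT: {ttft:.2f} s, token time: {token_time * 1000:.1f} ms/token")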