Based on llama.cpp build 7371.
See SCRIPT_llama_bench.sh for llama-bench configuration and SCRIPT_launch_server_MI50.sh for server launch settings.
The core modifications live in the ggml-cuda/gfx906 folder.
mmq.cuh: software pipelining for Q8_0 MMQ loads
mmq.cuh: optimized Q8 MMQ need_check path to avoid LDS conflicts
mmq.cuh: MXFP4 load pipeline with e8m0 conversion optimization
vecdotq.cuh: fast Q8_0 load path using memcpy
vecdotq.cuh: software-pipelined MXFP4 MMVQ for v_perm latency hiding
vecdotq.cuh: MXFP4 lookup with 2-perm + arithmetic sign
mmq.cu/mmid.cu: MoE sub-warp shuffle fix for wavefront64 (fixes gpt-oss loading problems)
common.cuh: DPP-based warp reductions with unified shuffle-XOR dispatch
fattn-common.cuh: GCN-optimized thread counts and tile configurations
fattn.cu: Q8-optimized tile kernel selection for GFX906 flash attention
mmq.cu: integrated GFX906 vectorized loads for Q4_0/Q4_1 quantizations
gfx906/: new directory with MI50/MI60-specific kernel implementations
Optional, but sometimes required: set your ROCm and device-library paths if they are not under /opt/rocm/.
export ROCM_PATH=/opt/rocm-7.1.0 # optional
export HIP_DEVICE_LIB_PATH=/opt/rocm-7.1.0/amdgcn/bitcode # optional

git clone https://github.com/iacopPBK/llama.cpp-gfx906.git
cd llama.cpp-gfx906
./SCRIPT_compile_MI50.sh # edit ROCM_PATH if not using /opt/rocm
./SCRIPT_launch_server_MI50.sh # edit MODEL_PATH to your model file
./SCRIPT_llama_bench.sh # edit MODEL_PATH to your model file, performs the bench shown above
Tested with ROCm 7.1.1 and GFX906 GPU (MI50/MI60).
Performance scales with the power limit; SCRIPT_overclock_upp_MI50.sh overclocks the MI50 via UPP (Powerplay Table Editor). Results were gathered using the 2511 release.
Props to these users for the time they've put into the repo.
@fuutott ・ @mircoboschi ・ @skyne98
AMD GCN ISA ・ llama.cpp ・ ROCm ・ GFX906 DISCORD ・ wiki-gfx906 ・ llama-labs-gfx906
Built for the GFX906 community