iacopPBK/llama.cpp-gfx906

 
 


llama.cpp-gfx906-2512

Based on llama.cpp build 7371.

Benchmark Results

(Figure: benchmark results)

See SCRIPT_llama_bench.sh for llama-bench configuration and SCRIPT_launch_server_MI50.sh for server launch settings.

What Changed

The core modifications are implemented in the ggml-cuda/gfx906 folder.

2512

mmq.cuh              Software pipelining for Q8_0 MMQ loads
mmq.cuh              Optimized Q8 MMQ need_check path to avoid LDS conflicts
mmq.cuh              MXFP4 load pipeline with e8m0 conversion optimization
vecdotq.cuh          Fast Q8_0 load path using memcpy
vecdotq.cuh          Software pipeline MXFP4 MMVQ for v_perm latency hiding
vecdotq.cuh          MXFP4 lookup with 2-perm + arithmetic sign
mmq.cu/mmid.cu       MoE sub-warp shuffle fix for wavefront64 (fixes gpt-oss loading problems)

2511

common.cuh           DPP-based warp reductions with unified shuffle XOR dispatch
fattn-common.cuh     GCN-optimized thread counts and tile configurations
fattn.cu             Q8-optimized tile kernel selection for GFX906 flash attention
mmq.cu               Integrated GFX906 vectorized loads for Q4_0/Q4_1 quantizations
gfx906/              New directory with MI50/MI60-specific kernel implementations

Quick Start

If ROCm and its device libraries are not installed under /opt/rocm/, set their paths first (optional otherwise):

export ROCM_PATH=/opt/rocm-7.1.0 #optional
export HIP_DEVICE_LIB_PATH=/opt/rocm-7.1.0/amdgcn/bitcode #optional
git clone https://github.com/iacopPBK/llama.cpp-gfx906.git
cd llama.cpp-gfx906
./SCRIPT_compile_MI50.sh      # edit ROCM_PATH if not using /opt/rocm
./SCRIPT_launch_server_MI50.sh # edit MODEL_PATH to your model file
./SCRIPT_llama_bench.sh # edit MODEL_PATH to your model file, performs the bench shown above
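
Before running the compile script, it can help to confirm that the two paths mentioned above actually exist. The following is a minimal sketch, not part of this repository: `check_rocm_paths` is a hypothetical helper, and the fallback of `$ROCM_PATH/amdgcn/bitcode` for the device libraries is an assumption based on the example exports above.

```shell
# Hypothetical helper (not shipped with this repo): report whether the
# ROCm root and device library directories exist.
# Arguments (both optional): ROCm root, device library directory.
# Defaults fall back to the environment, then to /opt/rocm (assumed layout).
check_rocm_paths() {
    local rocm_path="${1:-${ROCM_PATH:-/opt/rocm}}"
    local lib_path="${2:-${HIP_DEVICE_LIB_PATH:-$rocm_path/amdgcn/bitcode}}"
    local status=0
    local dir
    for dir in "$rocm_path" "$lib_path"; do
        if [ -d "$dir" ]; then
            echo "ok: $dir"
        else
            echo "missing: $dir"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_rocm_paths` with no arguments checks the exported variables, or /opt/rocm if none are set. A non-zero exit status means at least one directory was not found.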

Tested with ROCm 7.1.1 and GFX906 GPU (MI50/MI60).
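
To confirm that the machine actually exposes a GFX906 device, one option is to search the output of `rocminfo` (a standard ROCm tool) for the architecture name. This is a sketch under that assumption. `has_gfx906` and `check_gpu` are hypothetical helper names, not scripts from this repository.

```shell
# Hypothetical helper: succeeds if the text on stdin mentions gfx906,
# the architecture name rocminfo reports for MI50/MI60 cards.
has_gfx906() {
    grep -q 'gfx906'
}

# Hypothetical wrapper: print a short verdict for the local machine.
check_gpu() {
    if ! command -v rocminfo >/dev/null 2>&1; then
        echo "rocminfo not found; is ROCm installed and on PATH?"
        return 2
    fi
    if rocminfo | has_gfx906; then
        echo "gfx906 GPU detected"
    else
        echo "no gfx906 GPU reported by rocminfo"
        return 1
    fi
}
```

Run `check_gpu` after sourcing the snippet. Separating `has_gfx906` from the wrapper keeps the matching logic usable on saved `rocminfo` output as well.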

Power Scaling

Performance scales with the power limit. SCRIPT_overclock_upp_MI50.sh overclocks the MI50 via UPP (Powerplay Table Editor). The results below were gathered with the 2511 release.

PP Performance

TG Performance

Special Thanks and Links

Props to these users for spending time on the repo.

@fuutott @mircoboschi @skyne98


AMD GCN ISA, llama.cpp, ROCm, GFX906 DISCORD, wiki-gfx906, llama-labs-gfx906

Built for the GFX906 community
