- Overview
- Documentation
- Quick start
- Supported Models
- Supported Platforms
- Framework specific configuration options
- Shared libraries build parameter
- Known Issue with llama.cpp
- llama cpp model
- onnxruntime genai model
- mediapipe model
- mnn model
- To build an executable benchmark binary
- arm llm benchmark
- Troubleshooting
- Contributions
- Trademarks
- License
This repo builds an Arm® KleidiAI™ enabled LLM library using the CMake build system. It provides a single API (Java and C++) to the LLM frameworks that Arm® KleidiAI™ kernels have been integrated into. Currently, it supports the llama.cpp, mediapipe, onnxruntime-genai, and MNN backends. The backend library (selected at the CMake configuration stage) is wrapped by this project's thin C++ layer, which can be used directly for testing and evaluation. JNI bindings are also provided for developers targeting Android™ based applications.
This guide covers the recommended build and run flows for supported platforms. For configuration options and model details, see docs/build_and_config_guide.md.
- A Linux®-based operating system is recommended (this repo is tested on Ubuntu® 22.04.4 LTS)
- An Android™ or Linux® device with an Arm® CPU is recommended as a deployment target, but this library can be built for any native machine.
- CMake 3.28 or above installed
- Python 3.9 or above installed. Python is used to download test resources and models.
- Android™ NDK (if building for Android™). The recommended minimum version is 29.0.14206865, which can be downloaded from here.
- Building on macOS requires Xcode Command Line Tools, Android Studio installed and configured (NDK, CMake as above) and Clang (tested with 16.0.0)
- Bazelisk or Bazel 7.4.1 to build the mediapipe backend
- AArch64 GNU toolchain (version 14.1 or later) if cross-compiling from a Linux®-based system. It can be downloaded from here.
- Java Development Kit, required for building the JNI wrapper library needed to use this module in an Android/Java application.
- Create a Hugging Face account and obtain a Hugging Face access token.
The project can be built and the LLM tests run with the following commands on supported platforms:
cmake --preset=native -B build
cmake --build ./build
ctest --test-dir ./build
The commands above use the default LLM framework (llama.cpp) and download a small number of LLM models. The tests exercise both vision and text queries. See LlmTest.cpp and LlmTestJNI.java for details.
The ctest --test-dir ./build command should produce results similar to those given below (timings may vary):
Internal ctest changing into directory: /home/user/llm/build
Test project /home/user/llm/build
Start 1: llm-cpp-ctest
1/2 Test #1: llm-cpp-ctest .................... Passed 4.16 sec
Start 2: llama-jni-ctest
2/2 Test #2: llama-jni-ctest .................. Passed 3.25 sec
100% tests passed, 0 tests failed out of 2
| Document | Purpose |
|---|---|
| docs/README.md | Documentation index and update guidance. |
| docs/build_and_config_guide.md | Build/run steps, platform matrix, and common build commands. |
| docs/architecture.md | Architecture overview, components, and execution flow. |
| docs/benchmarking.md | Benchmarking and profiling guidance. |
| docs/integration.md | Configuration options and integration notes. |
| docs/troubleshooting.md | Common errors and fixes. |
| docs/contributing.md | Contribution process and SPDX guidance. |
| Source Folder | Purpose |
|---|---|
| src/cpp/ | Core C++ wrapper implementing the LLM-Runner abstraction layer and backend integration. |
| src/java/ | Java/JNI bindings. |
| scripts/py/ | Python utilities for downloading models, test resources, and performing data preparation tasks. |
| scripts/cmake/ | Toolchains and CMake helper scripts for cross-compilation and platform configuration. |
| model_configuration_files/ | Model configuration files used by the build system and runtime. |
| resources_downloaded/ | Default directory where models and example assets are downloaded. |
| test/ | C++/Java unit tests and supporting test resources. |
| Framework / Backend | Supported Models | Licenses |
|---|---|---|
| llama.cpp | phi-2, qwen-2-VL, llama-3.2-1B | mit, apache-2.0, Llama-3.2-1B |
| onnxruntime-genai | phi4-mini-instruct | mit |
| mediapipe | gemma-2B | Gemma |
| mnn | qwen-2.5-VL, llama-3.2-1B | apache-2.0, Llama-3.2-1B |
The matrix of supported build platforms and CMake presets is given below. The CMake presets (build targets) are listed in the first column and the host build platforms in the first row. For example, native builds have been tested on Linux-x86_64, Linux-aarch64 and macOS-aarch64, while x-android-aarch64 builds (targeting Android™ devices running on aarch64) have only been tested on Linux-x86_64 and macOS-aarch64.
| cmake-preset / Host Platform | Linux-x86_64 | Linux-aarch64 | macOS-aarch64 | Android™ |
|---|---|---|---|---|
| native | ✅ | ✅ * | ✅ | - |
| x-android-aarch64 | ✅ | - | ✅ | - |
| x-linux-aarch64 | ✅ * | ✅ † | - | - |
* When targeting the Linux-aarch64 platform and the llama.cpp backend (using either the native or the x-linux-aarch64 preset), the CPU_ARCH build flag must be specified. See the CPU_ARCH table for supported configurations.
† Use the 'native' preset.
Configuration options can be combined with CMake presets.
For example, KleidiAI acceleration can be disabled by setting USE_KLEIDIAI=OFF. This is useful when measuring the performance uplift due to Arm CPU hardware acceleration:
cmake --preset=native -B build -DUSE_KLEIDIAI=OFF
cmake --build ./build
ctest --test-dir ./build
LLM_FRAMEWORK can be used to select the LLM framework, e.g.
cmake --preset=native -B build -DLLM_FRAMEWORK=onnxruntime-genai
cmake --build ./build
ctest --test-dir ./build
Details of configurable build options are given below:
| Flag name | Default | Values | Description |
|---|---|---|---|
| LLM_FRAMEWORK | llama.cpp | llama.cpp / mediapipe / onnxruntime-genai / mnn | Specifies the backend framework to be used. |
| BUILD_DEBUG | OFF | ON/OFF | If set to ON a debug build is configured. |
| ENABLE_STREAMLINE | OFF | ON/OFF | Enables Arm Streamline timeline annotations for analyzing LLM initialization, encode, decode, and control-path performance. |
| BUILD_LLM_TESTING | ON | ON/OFF | Builds the project's functional tests when ON. |
| BUILD_BENCHMARK | OFF | ON/OFF | Builds the framework's benchmark binaries and arm-llm-bench-cli for the project when ON. |
| BUILD_JNI_LIB | ON | ON/OFF | Builds the JNI bindings for the project. |
| LOG_LEVEL | INFO/DEBUG | DEBUG, INFO, WARN & ERROR | For BUILD_DEBUG=OFF the default value is INFO. For BUILD_DEBUG=ON, the default value is DEBUG. |
| USE_KLEIDIAI | ON | ON/OFF | Build the project with KLEIDIAI CPU optimizations; if set to OFF, optimizations are turned off. |
| CPU_ARCH | Not defined | Armv8.2_1, Armv8.2_2, Armv8.2_3, Armv8.2_4, Armv8.2_5, Armv8.6_1, Armv9.0_1_1, armv9.2_1_1, armv9.2_2_1 | Sets the target ISA architecture (AArch64) to ensure SVE is not enabled when LLM_FRAMEWORK=llama.cpp (issue affects aarch64 only). |
| GGML_METAL | OFF | ON/OFF | macOS specific. Enables Apple Metal backend in ggml for GPU acceleration (Apple Silicon only). |
| GGML_BLAS | OFF | ON/OFF | macOS specific. Enables Accelerate/BLAS backend in ggml for CPU-optimized linear algebra kernels. |
DOWNLOADS_LOCK_TIMEOUT: A timeout value in seconds indicating how long to keep trying to acquire the download lock. Resources are downloaded once, when CMake configuration runs, unless they have already been downloaded by the user directly or by an earlier CMake configuration. The lock prevents multiple CMake configuration processes running in parallel from downloading files to the same location.
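The locking behaviour described above can be illustrated with a small sketch. This is not the project's implementation: the function names, the lock-file approach and the polling interval are assumptions made for illustration, with `timeout_s` playing the role of DOWNLOADS_LOCK_TIMEOUT.

```python
import os
import time


def acquire_lock(lock_path, timeout_s, poll_s=0.05):
    """Try to create lock_path exclusively until timeout_s seconds have passed.

    Hypothetical helper: returns True if the lock was taken, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # process can succeed in creating the lock file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)


def release_lock(lock_path):
    """Remove the lock file so that another process can take the lock."""
    os.remove(lock_path)
```

A process that gets `False` back has waited for the full timeout while another process held the lock, which is the situation the timeout is meant to bound.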
Each framework has its own conditional options.
For the llama.cpp framework, these configuration parameters can be set:
- LLAMA_SRC_DIR: Source directory path that will be populated by CMake configuration.
- LLAMA_GIT_URL: Git URL to clone the sources from.
- LLAMA_GIT_SHA: Git SHA for checkout.
- LLAMA_BUILD_COMMON: Build llama's dependency Common, enabled by default.
- LLAMA_CURL: Enable HTTP transport via libcurl for remote models or features requiring network communication, disabled by default.
When using onnxruntime-genai, the onnxruntime dependency will be built from source. To customize
the versions of both onnxruntime and onnxruntime-genai, the following configuration parameters
can be used:
onnxruntime:
- ONNXRUNTIME_SRC_DIR: Source directory path that will be populated by CMake configuration.
- ONNXRUNTIME_GIT_URL: Git URL to clone the sources from.
- ONNXRUNTIME_GIT_TAG: Git SHA for checkout.
onnxruntime-genai:
- ONNXRT_GENAI_SRC_DIR: Source directory path that will be populated by CMake configuration.
- ONNXRT_GENAI_GIT_URL: Git URL to clone the sources from.
- ONNXRT_GENAI_GIT_TAG: Git SHA for checkout.
NOTE: This repository has been tested with onnxruntime version v1.24.2 and onnxruntime-genai version v0.12.0.
To customise the mediapipe framework, the following parameters can be used:
- MEDIAPIPE_SRC_DIR: Source directory path that will be populated by CMake configuration.
- MEDIAPIPE_GIT_URL: Git URL to clone the sources from.
- MEDIAPIPE_GIT_TAG: Git SHA for checkout.
Building mediapipe for aarch64 on an x86_64 Linux-based system requires the AArch64 GNU toolchain, which can be downloaded from here. The following configuration flag needs to be provided for the build:
BASE_PATH: The top-level directory of the AArch64 GNU toolchain. If not provided, the build script downloads the latest Arm GNU Toolchain for cross-compilation.
NOTE: Support for mediapipe is experimental and the current focus is on supporting the Android™ platform. Note that the latest Arm GNU Toolchain version (14.3) may depend on libraries present in Ubuntu® 24.04.4 LTS when cross-compiling.
Support for macOS® and Windows is not included in this release.
To customise the MNN framework, the following parameters can be used:
- MNN_SRC_DIR: Source directory path that will be populated by CMake configuration.
- MNN_GIT_URL: Git URL to clone the sources from.
- MNN_GIT_TAG: Git SHA for checkout.
NOTE: This repository has been tested with MNN version v3.3.0.
KleidiAI™ NOTE: Although MNN can be built with USE_KLEIDIAI defined, the current MNN implementation does not fully enable KleidiAI™ optimizations at runtime. This limitation is due to the current MNN runtime initialization logic and will be resolved once full support is implemented upstream in MNN.
When targeting the llama.cpp LLM backend and Android (--preset=x-android-aarch64), BUILD_SHARED_LIBS=ON is automatically configured. This ensures the build generates shared libraries, allowing the optimal hardware accelerated libraries to be loaded for the particular device at runtime.
Currently there are issues with the integration of a specific architecture feature (SVE) in the llama.cpp backend on aarch64. To ensure this feature is not enabled, we enforce the use of one of our provided CPU_ARCH presets, which ensure that the compiler flags do not enable SVE at build time.
The table below maps our CPU_ARCH presets to some common CPU feature flag sets.
Other permutations are also supported and can be tailored accordingly. If you intend to use specific features, you must ensure your CPU implements them; i8mm, for example, was optional in v8.2. The compiler must also support any chosen features.
| CPU_ARCH | C/C++ compiler flags |
|---|---|
| Armv8.2_1 | -march=armv8.2-a+dotprod |
| Armv8.2_2 | -march=armv8.2-a+dotprod+fp16 |
| Armv8.2_3 | -march=armv8.2-a+dotprod+fp16+sve |
| Armv8.2_4 | -march=armv8.2-a+dotprod+i8mm |
| Armv8.2_5 | -march=armv8.2-a+dotprod+i8mm+sve+sme |
| Armv8.6_1 | -march=armv8.6-a+dotprod+fp16+i8mm |
| Armv9.0_1_1 | -march=armv8.6-a+dotprod+fp16+i8mm+nosve |
| *armv9.2_1_1 | -march=armv9.2-a+dotprod+fp16+nosve+i8mm+sme |
| *armv9.2_2_1 | -march=armv9.2-a+dotprod+fp16+nosve+i8mm+sme |
* Note: The v9.2 presets are capitalised differently from the others (lowercase "armv").
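Because the preset names are case-sensitive and easy to mistype, a small helper can validate a CPU_ARCH value before invoking CMake. The sketch below is illustrative only: `configure_command` is a hypothetical helper, not part of this repo, and the flag strings are copied from the table above.

```python
# Preset names and compiler flags copied from the CPU_ARCH table above.
CPU_ARCH_FLAGS = {
    "Armv8.2_1": "-march=armv8.2-a+dotprod",
    "Armv8.2_2": "-march=armv8.2-a+dotprod+fp16",
    "Armv8.2_3": "-march=armv8.2-a+dotprod+fp16+sve",
    "Armv8.2_4": "-march=armv8.2-a+dotprod+i8mm",
    "Armv8.2_5": "-march=armv8.2-a+dotprod+i8mm+sve+sme",
    "Armv8.6_1": "-march=armv8.6-a+dotprod+fp16+i8mm",
    "Armv9.0_1_1": "-march=armv8.6-a+dotprod+fp16+i8mm+nosve",
    "armv9.2_1_1": "-march=armv9.2-a+dotprod+fp16+nosve+i8mm+sme",
    "armv9.2_2_1": "-march=armv9.2-a+dotprod+fp16+nosve+i8mm+sme",
}


def configure_command(preset, cpu_arch=None, build_dir="build"):
    """Return the cmake configure command as an argument list (hypothetical helper)."""
    cmd = ["cmake", f"--preset={preset}", "-B", build_dir]
    if cpu_arch is not None:
        if cpu_arch not in CPU_ARCH_FLAGS:
            known = ", ".join(CPU_ARCH_FLAGS)
            raise ValueError(f"Unknown CPU_ARCH {cpu_arch!r}; expected one of: {known}")
        cmd.append(f"-DCPU_ARCH={cpu_arch}")
    return cmd
```

For example, `configure_command("native", "Armv8.2_4")` yields the arguments for `cmake --preset=native -B build -DCPU_ARCH=Armv8.2_4`, while `"Armv9.2_1_1"` is rejected because the v9.2 presets start with a lowercase "a".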
NOTE: If you need a specific version of Java, set its path in the JAVA_HOME environment variable.
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Failure to locate "jni.h" occurs if a compatible JDK is not on the system path. If you want to experiment with the repository without the JNI libs, turn the BUILD_JNI_LIB option off by configuring with -DBUILD_JNI_LIB=OFF.
On first LLM initialization, the module also emits a build metadata line to the CLI logs and Android logcat containing the selected backend, pinned backend dependency revisions, module version/git SHA, and build timestamp.
This project uses the phi-2 model as its default network for the llama.cpp framework.
The model is distributed in the Q4_0 quantization format, which is highly recommended as it delivers effective inference times by striking a balance between computational efficiency and model performance.
- You can access the model from Hugging Face.
- The default model configuration is declared in the requirements.json file.
However, any model supported by the backend library can be used.
NOTE: Currently only Q4_0 models are accelerated by Arm® KleidiAI™ kernels in llama.cpp.
The llama.cpp backend also supports multimodal (image + text) inference in this project.
What you need
- A compatible text model (GGUF).
- A matching vision projection (mmproj) file (GGUF) for your chosen text model
How to enable
Use these fields in your configuration file:
- llmModelName: text model (GGUF)
- llmMmProjModelName: vision projection (GGUF) for multimodal
- isVision: set "true" to enable multimodal
If "isVision" is true, a valid llmMmProjModelName is required; omitting "image" runs the backend in text-only mode.
You can find an example of multimodal settings in llamaVisionConfig-qwen2-vl-2B.json.
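The rule that a vision-enabled configuration needs a projection model can be checked before launching the backend. The sketch below is an assumption-laden illustration, not project code: `vision_config_problems` is a hypothetical helper, and it assumes the three fields sit at the top level of the parsed configuration and that isVision may be either the string "true" or a JSON boolean.

```python
def vision_config_problems(config):
    """Return a list of problems found in a parsed llama.cpp configuration dict.

    Hypothetical helper; field names are taken from the text above, their
    position in the real configuration file is an assumption.
    """
    problems = []
    if not config.get("llmModelName"):
        problems.append("llmModelName is missing")
    # Accept both "true" (string) and true (JSON boolean).
    is_vision = str(config.get("isVision", "false")).lower() == "true"
    if is_vision and not config.get("llmMmProjModelName"):
        problems.append("isVision is true but llmMmProjModelName is missing")
    return problems
```

An empty list means the configuration satisfies the rule stated above; it says nothing about whether the referenced GGUF files actually exist or match each other.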
This project uses Phi-4-mini-instruct-onnx as its default network for the onnxruntime-genai framework.
The model is distributed in the int4 quantization format with a block size of 32, which is highly recommended as it
delivers effective inference times by striking a balance between computational efficiency and model performance.
- You can access the model from Hugging Face.
- The default model configuration is declared in the requirements.json file.
However, any model supported by the backend library can be used.
To use an ONNX model with this framework, the following files are required:
- genai_config.json: Configuration file
- model_name.onnx: ONNX model
- model_name.onnx.data: ONNX model data
- tokenizer.json: Tokenizer file
- tokenizer_config.json: Tokenizer config file
These files are essential for loading and running ONNX models effectively.
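A quick pre-flight check can confirm that a model directory contains the files listed above. This is a minimal sketch: `missing_onnx_files` is a hypothetical helper, and it assumes that "model_name" stands for an arbitrary file stem, so the model files are matched with wildcards.

```python
from pathlib import Path

# File names from the list above; "model_name" is treated as a wildcard.
REQUIRED_PATTERNS = [
    "genai_config.json",
    "*.onnx",
    "*.onnx.data",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_onnx_files(model_dir):
    """Return the required patterns that match no file in model_dir."""
    model_dir = Path(model_dir)
    return [pattern for pattern in REQUIRED_PATTERNS
            if not any(model_dir.glob(pattern))]
```

An empty result only shows that the files are present; it does not validate their contents.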
NOTE: Currently only int4 models with block size 32 are accelerated by Arm® KleidiAI™ kernels in onnxruntime-genai.
To use the Gemma 2B model, add your Hugging Face access token to the build environment after accepting the Gemma license.
export HF_TOKEN=<your hugging-face access token>
Alternatively, append the following lines to your ~/.netrc file:
machine huggingface.co
login <your-username-or-email>
password <your-huggingface-access-token>
Ensure the .netrc file is secured with the correct permissions.
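The text does not state which permissions are expected. A common convention for files holding credentials is owner-only read/write (mode 0600), which the following sketch applies. The helper name is hypothetical and the choice of 0600 is an assumption.

```python
import os
import stat


def secure_netrc(path):
    """Restrict the file to owner read/write (0600) and return the resulting mode."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)
```

Typical usage would be `secure_netrc(os.path.expanduser("~/.netrc"))`.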
Alternatively, you can quantize other models listed in the conversion colab from Hugging Face to TensorFlow Lite™ (.tflite) format. Copy the resulting 4-bit models to resources_downloaded/models/mediapipe.
The mediapipe Python package version 0.10.15 is recommended for stable conversion to 4-bit models.
This project uses the Llama 3.2 1B model as its default network for the MNN framework. The model is distributed using the 4-bit quantization format, which is highly recommended as it delivers efficient inference performance while maintaining strong text generation quality on Arm® CPUs.
- You can access the text model from Hugging Face
- The model configuration is declared in the requirements.json file.
However, any model supported by the MNN backend library can be used.
To use an MNN model with this framework, the following files are required:
- config.json: Model configuration file
- llm.mnn: Main MNN model file
- llm.mnn.json: Model metadata file generated by the MNN conversion process
- llm.mnn.weight: Model weight file (used when weights are stored separately)
- llm_config.json: Model-specific configuration file
- tokenizer.txt: Tokenizer definition file
- embeddings_bf16.bin: (optional) Used by some models that store embeddings separately. If this file exists, download it; otherwise, embeddings are already included in the main weights.
These files are essential for loading and running MNN models effectively.
The MNN backend also supports multimodal (image + text) inference in this project.
- You can access the vision model from Hugging Face
What you need
- visual.mnn: Vision model metadata file generated by the MNN conversion process
- visual.mnn.weight: Vision model weight file (used when weights are stored separately)
NOTE: The MNN backend determines whether multimodal mode is active from the is_visual field inside the model’s llm_config.json.
You can find an example multimodal configuration in mnnVisionConfig-qwen2.5-3B.json
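As a rough illustration of how the is_visual field and the vision files relate, the sketch below reads llm_config.json and reports any vision files that are missing when multimodal mode is on. It is not the MNN backend's logic: `missing_visual_files` is a hypothetical helper, it assumes is_visual is a top-level boolean, and it treats both vision files as required, which may be stricter than necessary when weights are not stored separately.

```python
import json
from pathlib import Path


def missing_visual_files(model_dir):
    """Return the vision files absent from model_dir when is_visual is set.

    Assumes llm_config.json holds a top-level boolean "is_visual".
    """
    model_dir = Path(model_dir)
    with open(model_dir / "llm_config.json", encoding="utf-8") as f:
        llm_config = json.load(f)
    if not llm_config.get("is_visual", False):
        return []  # text-only model: no vision files expected
    required = ["visual.mnn", "visual.mnn.weight"]
    return [name for name in required if not (model_dir / name).exists()]
```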
To build a standalone benchmark binary add the configuration option -DBUILD_BENCHMARK=ON
to any of the build commands above. For example:
On AArch64:
cmake -B build --preset=native -DCPU_ARCH=Armv8.2_4 -DBUILD_BENCHMARK=ON
cmake --build ./build
The benchmark summary and JSON output report model_size as a formatted value, e.g. 1.23 GB.
The size is derived from the total configured model package for the benchmarked model path.
If the configured path is a directory, the size is computed recursively.
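The size calculation described above can be approximated as follows. This is a sketch rather than the benchmark's code: the function names are hypothetical, and it assumes 1 GB means 1024³ bytes, which the text does not confirm.

```python
import os


def model_size_bytes(path):
    """Return the size of a file, or the recursive total for a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def format_gb(num_bytes):
    """Format a byte count with two decimals, assuming 1 GB = 1024**3 bytes."""
    return f"{num_bytes / 1024**3:.2f} GB"
```

Combining the two, `format_gb(model_size_bytes(model_path))` produces a string in the same shape as the reported model_size value.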
The Arm LLM Benchmark tool (arm-llm-bench-cli) is a framework-agnostic, standalone executable designed to measure both prompt-processing and token-generation performance across all supported LLM backends.
Supported Frameworks
- llama.cpp
- onnxruntime-genai
- MNN
- mediapipe
Instead of writing your own prompts or relying on framework-specific benchmarking tools, arm-llm-bench-cli provides a unified benchmarking pipeline. It automatically detects the backend specified in the LLM configuration file and benchmarks it consistently. The tool repeatedly runs the LLM prompt-processing and token-generation operations and reports timing and throughput metrics in a standardized format.
NOTE: To build arm-llm-bench-cli, ensure the benchmarking flag is set in CMake by setting -DBUILD_BENCHMARK=ON.
Measures
- Encode time and encode tokens/s
- Decode time and decode tokens/s
- Time-to-first-token (TTFT)
- Total latency per iteration
- Supports warm-up iterations (ignored in statistics)
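To make the relationship between iterations, warm-ups and the reported throughput concrete, here is a minimal sketch of how a tokens/s mean and spread could be derived from per-iteration timings. It is an illustration only: `throughput_summary` is a hypothetical function, and whether the tool reports a sample or a population standard deviation is not stated in the text (sample standard deviation is assumed here).

```python
import statistics


def throughput_summary(token_count, durations_s, warmup=0):
    """Return (mean, spread) of tokens/s, skipping the first `warmup` runs."""
    measured = durations_s[warmup:]
    rates = [token_count / d for d in measured]
    mean = statistics.mean(rates)
    # Sample standard deviation needs at least two measurements.
    spread = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return mean, spread
```

With `warmup=1`, the first timing is discarded before any statistics are computed, mirroring the statement that warm-up iterations are ignored.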
Usage
./build/bin/arm-llm-bench-cli \
--model <model_path> | -m <model_path> \
--input <tokens> | -i <tokens> \
--output <tokens> | -o <tokens> \
--threads <num_threads> | -t <num_threads> \
--iterations <num_iterations> | -n <num_iterations> \
[ --context <tokens> | -c <tokens> ] \
[ --json-output <path> | -J <path> ] \
[ --warmup <warmup_iterations> | -w <warmup_iterations> ]
NOTE: On-device execution requires that arm-llm-bench-cli and its backend shared libraries reside in the same directory. Builds using GGML_OPENMP=ON additionally require libomp.so to be placed in that directory as well.
Example
./build/bin/arm-llm-bench-cli \
-m ./resources_downloaded/models/llama.cpp/llama-3.2-1b/Llama-3.2-1B-Instruct-Q4_0.gguf \
-i 128 \
-o 64 \
-c 2048 \
-t 4 \
-n 3 \
-w 1 \
-J /path/to/result.json
Terminal Output:
INFO : Running 1 warmup iteration(s) (results ignored)...
=== ARM LLM Benchmark ===
Parameters:
model_path : ./resources_downloaded/models/llama.cpp/llama-3.2-1b/Llama-3.2-1B-Instruct-Q4_0.gguf
model_size : 0.77 GB
num_input_tokens : 128
num_output_tokens : 64
context_size : 2048
num_threads : 4
num_iterations : 3
num_warmup : 1
======= Results =========
| Framework | Threads | Test | Performance |
| ------------------ | ------- | ------ | -------------------------- |
| llama.cpp | 5 | pp128 | 204.149 ± 4.316 (t/s) |
| llama.cpp | 5 | tg64 | 48.029 ± 0.080 (t/s) |
| llama.cpp | 5 | TTFT | 648.401 ± 13.798 (ms) |
| llama.cpp | 5 | Total | 1959.827 ± 14.433 (ms) |
JSON output written to: /path/to/result.json
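If only the terminal output is available, the results table can be turned into structured data with a small parser. The sketch below is based solely on the table layout shown above; `parse_results` is a hypothetical helper, and the JSON report written via -J is likely the more robust source, although its schema is not described here.

```python
import re

# Matches data rows such as:
# | llama.cpp | 5 | pp128 | 204.149 ± 4.316 (t/s) |
ROW = re.compile(
    r"^\|\s*(?P<framework>[^|]+?)\s*\|\s*(?P<threads>\d+)\s*\|"
    r"\s*(?P<test>[^|]+?)\s*\|\s*(?P<mean>[\d.]+)\s*±\s*(?P<spread>[\d.]+)"
    r"\s*\((?P<unit>[^)]+)\)\s*\|\s*$"
)


def parse_results(text):
    """Extract the result rows from the benchmark's terminal output."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # header and separator rows do not match and are skipped
            rows.append({
                "framework": match["framework"],
                "threads": int(match["threads"]),
                "test": match["test"],
                "mean": float(match["mean"]),
                "spread": float(match["spread"]),
                "unit": match["unit"],
            })
    return rows
```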
For a list of common errors and their fixes, see docs/troubleshooting.md.
The LLM-Runner welcomes contributions. For more details on contributing to the repo please see the contributors guide.
- Arm® and KleidiAI™ are registered trademarks or trademarks of Arm® Limited (or its subsidiaries) in the US and/or elsewhere.
- Android™ and TensorFlow Lite™ are trademarks of Google LLC.
- macOS® is a trademark of Apple Inc.
This project is distributed under the software licenses in the LICENSES directory. The licenses of the supported models are listed in the Supported Models section.