Fence Inference Engine

Low-level CUDA/C++ LLM inference engine using GGUF quantised models. No external ML frameworks — just raw CUDA kernels.

Supported Architectures

Architecture	Chat Format	Example Models
Qwen3	ChatML	Qwen3-4B, Qwen3-8B, Qwen3-14B
Qwen2	ChatML	Qwen2-7B, Qwen2.5-7B
LLaMA 3 / 3.1 / 3.2	LLaMA-3	Meta-Llama-3-8B-Instruct, Llama-3.1-8B-Instruct
LLaMA 2	LLaMA-2	Llama-2-7b-chat, CodeLlama-7b-Instruct
Mistral / Mixtral	Mistral	Mistral-7B-Instruct-v0.3, Mixtral-8x7B-Instruct

All models must be in GGUF format (Q6_K or Q8_0 quantisation). The chat format is auto-detected from the tokenizer.chat_template field in the GGUF file.

Build

make fence          # optimised build
make DEBUG=1 fence  # debug build with CUDA device-side symbols

Requires CUDA 12+ and an NVIDIA GPU (Makefile defaults to SM 8.9 / RTX 4060; edit SM in the Makefile for other GPUs).

Rust migration workspace (in progress)

The repository now also includes a Rust workspace that mirrors major module boundaries for the migration effort:

crates/common — shared errors/types/config/chat-format detection
crates/gguf — GGUF parser with mmap and metadata/tensor indexing
crates/tokenizer — tokenizer/chat prompt formatting scaffold
crates/kernels — quantized type layouts + Q6_K swizzled extraction checks
crates/model — model runtime scaffold and config wiring
crates/fence-cli — Rust CLI entrypoint

Build and test the Rust workspace:

cargo fmt --all
cargo test --workspace

Run the Rust CLI (scaffolded):

cargo run -p fence-cli -- --model /path/to/model.gguf --prompt "Hello"

Note: the Rust implementation is currently a migration baseline and does not yet provide full CUDA inference parity with the C++/CUDA engine.

Run

# Interactive chat
./build/fence --model /path/to/model.gguf

# Single prompt
./build/fence --model /path/to/model.gguf --prompt "What is the capital of France?"

# All options
./build/fence --help

CLI Options

Flag	Default	Description
`--model PATH`	(required)	GGUF model file
`--prompt TEXT`	—	Single prompt, then exit
`--system TEXT`	You are a helpful assistant.	System prompt
`--max-tokens N`	512	Max new tokens per response
`--max-ctx N`	4096	Context window size
`--temp F`	0.7	Sampling temperature
`--top-p F`	0.8	Nucleus sampling cutoff
`--top-k N`	50	Top-K candidates
`--rep-penalty F`	1.15	Repetition penalty
`--debug`	off	Print per-layer activation stats

Tests

make test_q6k_dot                               # kernel unit test (no model needed)
make test_gguf_parser MODEL=/path/to/model.gguf # GGUF parser integration test
make test_gemv_q6k    MODEL=/path/to/model.gguf # GEMV kernel integration test
make debug_forward && ./build/debug_forward /path/to/model.gguf

Source Layout

src/
  main.cu                   — CLI entry point, interactive chat loop
  model/
    qwen3.h / qwen3.cu      — Model loading, forward pass, sampling
    chat_template.h         — Chat format detection (ChatML / LLaMA-3 / Mistral / LLaMA-2)
  tokenizer/
    tokenizer.h / tokenizer.cpp  — GPT-2 BPE tokenizer, multi-format prompt building
  gguf/
    gguf_parser.h / gguf_parser.cpp  — Zero-copy mmap GGUF v2/v3 parser
  kernels/
    ops.cuh                 — RMSNorm, RoPE, SwiGLU, GQA attention, softmax
    gemv_q6k.cuh            — Q6_K matrix-vector multiply
    gemv_q8_0.cuh           — Q8_0 matrix-vector multiply (embedding lookup)
    q6k_dot.cuh             — Q6_K dot-product primitives
    q6k_types.h / q8_0_types.h  — Quantised block structs
tests/
  test_q6k_dot.cu           — Q6_K kernel correctness tests
  test_gguf_parser.cpp      — GGUF parser integration tests
  test_gemv_q6k.cu          — GEMV Q6_K integration tests
  debug_forward.cu          — Layer-by-layer activation statistics

Credits & Licenses

GGUF format / quantisation types: llama.cpp / GGML — MIT License
GPT-2 byte-level BPE: OpenAI tiktoken — MIT License
Engine source code: MIT License
Model weights: see each model's individual licence (typically Apache 2.0 or Llama Community Licence)

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
crates		crates
src		src
tests		tests
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fence Inference Engine

Supported Architectures

Build

Rust migration workspace (in progress)

Run

CLI Options

Tests

Source Layout

Credits & Licenses

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Fence Inference Engine

Supported Architectures

Build

Rust migration workspace (in progress)

Run

CLI Options

Tests

Source Layout

Credits & Licenses

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages