Inference without ONNX / usage of WONNX as backend for LLMs #169

philpax · 2023-05-17T07:31:40Z

Is your feature request related to a problem? Please describe.
I'm one of the maintainers of the llm project, and we're looking for a robust, cross-platform GPU inferencing solutions for our LLM models. We currently have computation graphs for GGML, but are planning on installing some kind of abstraction for use with other backends.

I'm investigating the use of wonnx as a potential backend, but it is (understandably!) coupled to ONNX. I was wondering if it would be possible to specify a computation graph directly for compilation/inference without going through ONNX.

Describe the solution you'd like
A builder API for computation graphs, or something similar, so that a wonnx::Session could be created without the use of ONNX.

Describe alternatives you've considered
I've considered constructing a wonnx::onnx::ModelProto at runtime, but the ONNX format contains a lot of things we don't need or don't have.

It's designed for self-contained models; however, we are loading weights from arbitrary locations and supplying our own computation graph, making it difficult for us to synthesize a complete ONNX model.

Additional context
There's no particular hurry on this. We'd love to have GPU inference as soon as possible - especially truly cross-platform, non-CUDA (!) inference - but I assume this would be a large body of work.

I'm also not sure what operations would need to be implemented for our use case, but we would file PRs as required to implement any missing operations.

The text was updated successfully, but these errors were encountered:

pixelspark · 2023-05-17T08:07:42Z

Hi @philpax, thanks for bringing this up here. I have given the idea of running LLM’s through wonnx some thought over the past few days and I think it would actually be a great addition (as you say it would provide cross-platform GPU-inference even for non-NVIDIA hardware).

Adding a builder API would be a good first step. Instead of constructing an ONNX model, it could construct a WONNX IR graph directly (the IR currently is based on nodes that are enums containing mostly ONNX structs, so behind the scenes we would still partially build an ONNX graph, but with a much simpler interface. Eventually we can replace the ONNX structs with our own containing just the bits we need/support).

Ideally LLM makes similar calls to wonnx as it does currently to ggml.

In short I think we need to implement the following:

Builder API to construct WONNX IR directly without (visible) reliance on ONNX.
A way to load tensors (initializers in ONNX parlance) from GGML format. This should be easy (at least to implement a slow version) and can be optimized later.
The various ops used in LLM, probably as ‘custom’ non-ONNX standardized operators (i.e. an operator named “ggml.Rope”).
How to handle caches between inference runs.
Quantization support (not sure if WGSL and ONNX supports int4 or we need to get creative). For the MVP we could restrict ourselves to fp8 models of course.

I will not be able to put in significant amounts of work into this over the next weeks but would be highly interested in working on this later and together. Let me know what you think!

philpax · 2023-05-17T10:47:52Z

That sounds fantastic! Glad to see you're as interested as I am 🙂

A way to load tensors (initializers in ONNX parlance) from GGML format. This should be easy (at least to implement a slow version) and can be optimized later.

Yep, that's reasonable. I'd imagine this would look something like giving wonnx a buffer to the raw data and having it upload it to the GPU.

How to handle caches between inference runs.

We split up our model and inference, so that a Model can remain resident in memory and inference can be done against that Model using an InferenceSession. A similar change might be worthwhile?

Quantization support (not sure if WGSL and ONNX supports int4 or we need to get creative). For the MVP we could restrict ourselves to fp8 models of course.

Yes, this is a little complicated as GGML defines its own quantization formats. You can see what llama.cpp's CUDA code for unpacking/operations looks like here:

https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu

I will not be able to put in significant amounts of work into this over the next weeks but would be highly interested in working on this later and together. Let me know what you think!

This all sounds great to me. Happy to work with you on it - just let me know what you need!

pixelspark · 2023-05-17T11:11:23Z

That sounds fantastic! Glad to see you're as interested as I am 🙂

A way to load tensors (initializers in ONNX parlance) from GGML format. This should be easy (at least to implement a slow version) and can be optimized later.

Yep, that's reasonable. I'd imagine this would look something like giving wonnx a buffer to the raw data and having it upload it to the GPU.

How to handle caches between inference runs.

We split up our model and inference, so that a Model can remain resident in memory and inference can be done against that Model using an InferenceSession. A similar change might be worthwhile?

Wonnx does something similar but cannot (yet) share a model and its constant tensors between sessions. It is a good idea to make this split (maybe not for an MVP but still).

Quantization support (not sure if WGSL and ONNX supports int4 or we need to get creative). For the MVP we could restrict ourselves to fp8 models of course.

Yes, this is a little complicated as GGML defines its own quantization formats. You can see what llama.cpp's CUDA code for unpacking/operations looks like here:

https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu

As long as we have a way to quantize/dequantize in WGSL it can be made to work I guess!

I will not be able to put in significant amounts of work into this over the next weeks but would be highly interested in working on this later and together. Let me know what you think!

This all sounds great to me. Happy to work with you on it - just let me know what you need!

Let’s settle on a target/reference model to work with - that allows to compare CPU output with our output. What do you think would be a good reference? (a specific LLaMA-like, 3B, fp8?)

Also it would be helpful if you could investigate the ops we would need and whether there are equivalents in WONNX/ONNX already. If not, we ideally have a reference implementation somewhere else (e.g. in ggml but we could also have a look at MLC-LLM’s WebLLM, there should be some WGSL there?)

philpax · 2023-05-19T01:31:22Z

Sorry about the delay in getting back to you!

Wonnx does something similar but cannot (yet) share a model and its constant tensors between sessions. It is a good idea to make this split (maybe not for an MVP but still).

Yeah, I noticed that. Nice to have, but not a showstopper.

As long as we have a way to quantize/dequantize in WGSL it can be made to work I guess!

It should all be possible, but I'm not sure what the best way to handle the changing GGML quantization formats is. Does it make sense to have support for the formats directly in wonnx?

Let’s settle on a target/reference model to work with - that allows to compare CPU output with our output. What do you think would be a good reference? (a specific LLaMA-like, 3B, fp8?)

Agreed - there are lots of LLaMA models out there, but it's best to go for something unburdened. I'd suggest something like the RedPajama models, which are based on the GPT-NeoX architecture, have 3B variants, and can easily be quantized to whatever format: https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1

There's already readily-available GGML support, so I'm not too worried about it from our end. Ideally, we can compare outputs by sampling the most probable token each time, but I suspect the differences between GPU and CPU computation will lead to inconsistent results anyway. Perhaps we can measure perplexity?

Also it would be helpful if you could investigate the ops we would need and whether there are equivalents in WONNX/ONNX already.

The operations used by our existing suite of models are the following, with the ones used by GPT-NeoX bolded:

add: Adds two tensors together.
alibi: Attention with LInear BIases
cont: Unsure, need to figure this out
cpy: Copies the contents of one tensor to another.
diag_mask_inf: Creates a tensor where the elements above the diagonal are -inf.
gelu: Gaussian Error Linear Units
get_rows: Copies the rows from one tensor to another. Not sure how this is different to cpy.
mul: Element-wise multiplication of tensors.
mul_mat: Matrix multiplication.
norm: Normalise each row of the tensor.
repeat: Not sure.
rms_norm: RMS norm each row of the tensor.
rope: ROtary Positional Encoding
scale: Scales a tensor by a scalar.
silu: SiLU activation function
soft_max: Softmax function

Unfortunately, some of these are quite unclear as GGML's documentation on the actual operations is sparse and the implementations are quite dense. I'll have to further investigate.

From these, I can say that add, mul, mul_mat, softmax are listed in the README (and supported). I'm not sure about the others, but some of them should be trivial to compose from existing operations.

pixelspark · 2023-05-21T04:37:14Z

I took a quick look at the ops in bold and I think most will be rather easy to implement. Some ops may not even be needed:

All cont appears to do is make a contiguous copy of some other tensor, which typically is a ‘view’ of some tensor. The existing Gather op may be able to provide similar functionality, and if not this is very easy to implement).
cpy is likely not necessary as in wonnx, intermediate buffers are immutable and only reused between nodes if their lifetimes allow for it. The only reason to use copy is between iterations (copy output to some input).
repeat repeats one tensor as often as necessary to fill a certain shape. Trivial to inplement as new op.
scale and gelu look like simple element-wise mappings (just need to write the WGSL for the specific mappings)
permute just seems to shuffle dimension metadata. Optimized away in wonnx IR.

I will make a first attempt at the builder API later today (jet lag permitting).

FL33TW00D · 2023-05-21T09:08:44Z

@philpax @pixelspark An example of a quantized GEMM in WGSL:
https://github.com/FL33TW00D/wgpu-mm/blob/feature/kernel-series/shaders/gemv/qgemv_1.wgsl

WGSL features some handy packing/unpacking functions to make quantisation easier, however these don't extend to INT4, but it is quite trivial:

fn unpackInt4x8(value: i32, absmax: f32) -> array<vec4<f32>, 2> {
    let x = f32((value << 28) >> 28) / 7.0 * absmax;
    let y = f32((value << 24) >> 28) / 7.0 * absmax;
    let z = f32((value << 20) >> 28) / 7.0 * absmax;
    let w = f32((value << 16) >> 28) / 7.0 * absmax;
    let a = f32((value << 12) >> 28) / 7.0 * absmax;
    let b = f32((value << 8) >> 28) / 7.0 * absmax;
    let c = f32((value << 4) >> 28) / 7.0 * absmax;
    let d = f32((value >> 28)) / 7.0 * absmax;
    return array<vec4<f32>, 2>(vec4<f32>(x, y, z, w), vec4<f32>(a, b, c, d));
}

pixelspark · 2023-05-23T21:42:02Z

Agreed - there are lots of LLaMA models out there, but it's best to go for something unburdened. I'd suggest something like the RedPajama models, which are based on the GPT-NeoX architecture, have 3B variants, and can easily be quantized to whatever format: https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1

@philpax I am trying to get RedPajama up and running with llm just to familiarize myself with it a bit, but can't get it to work, possibly due to the recent changes in ggml quantization formats. The link above contains a Torch model (which I am sure can be converted to GGML but haven't managed to do). Other repositories have GGML versions but with different file versions, it appears.

Could you perhaps point me to a specific .bin file that should work with latest llm?

LLukas22 · 2023-05-24T07:45:31Z

@pixelspark Here are some converted repajama models which should work with the latest main branch. (I havent created the readme yet).

I can also recommend, MPT based models which are also openly licensed. (Instructions can be found in the HF repository).

pixelspark · 2023-05-24T11:16:24Z

@pixelspark Here are some converted repajama models which should work with the latest main branch. (I havent created the readme yet).

That link shows a 404 for me?

LLukas22 · 2023-05-24T11:43:11Z

Sorry, the Repository was still private. But i would still recommend to use MPT as some GptNeoX based models (Including Redpajama) have problems with added BOS tokens. (see rustformers/llm#270)

pixelspark · 2023-05-24T11:56:53Z

Thanks @LLukas22, will try this later. If it works I can start investigating the different ops it uses (as listed by @philpax) and check if we can implement those.

philpax · 2023-05-25T00:57:58Z

Apologies for the confusion there, it's been a bit hectic. We now target GGJT v3/QNT2 exclusively, as of five minutes ago 😅

Yes, RedPajama models are sensitive to BOS, but that shouldn't impact your experimentation too much. I'd suggest sticking with GPT-NeoX as it's a relatively well-established architecture with several models being built on top of it (Pythia, StableLM, RedPajama, etc).

I also realized that my list excludes a few operations that I thought were no-ops in GGML, but I've since realized still create new tensors (with the reshaping happening at the point of tensor creation, and not at the point of graph computation). These operations are:

permute: Creates a view tensor with permuted dimensions, such that the dimensions are swapped around in accordance with the axes
reshape_2d: Creates a view tensor with dimensions Y*Z from a tensor W*X, where W*X == Y*Z
reshape_3d: Same as 2D, but for three dimensions
transpose: Creates a view tensor for a 2D tensor in which the two dimensions are swapped
view_1d: Creates a 1D view tensor of length ne0 from offset bytes into the tensor
view_2d: Creates a 2D view tensor of size ne0 * ne1 from offset bytes into the tensor, with a stride of nb1 bytes
view_3d: Creates a 3D view tensor of size ne0 * ne1 * ne2 from offset bytes into the tensor, with a stride of nb1 and nb2 bytes

Yes, GGML can be annoyingly low-level at times :/

I'd maybe suggest skipping the GGML implementation for now and going straight for reimplementing the original Python implementations. They're less likely to encode details like that.

pixelspark · 2023-05-25T08:35:53Z

Apologies for the confusion there, it's been a bit hectic. We now target GGJT v3/QNT2 exclusively, as of five minutes ago 😅

So I finally got llm working with mpt and RedPajama yesterday, now it doesn't anymore... very confusing indeed!

@LLukas22 could you point me at the model files I should use now? Below are my results with the RedPajama files currently in HF:

RedPajama-INCITE-Chat-3B-v1-f16.bin: loads, but produces gibberish:

RedPajama-INCITE-Base-3B-v1-q4_0-ggjt: does not load:

RedPajama-INCITE-Chat-3B-v1-q4_0.bin: does not load:

RedPajama-INCITE-Chat-3B-v1-q4_0-ggjt.bin: same

Yes, RedPajama models are sensitive to BOS, but that shouldn't impact your experimentation too much. I'd suggest sticking with GPT-NeoX as it's a relatively well-established architecture with several models being built on top of it (Pythia, StableLM, RedPajama, etc).

OK, seems like a good idea

I also realized that my list excludes a few operations that I thought were no-ops in GGML, but I've since realized still create new tensors (with the reshaping happening at the point of tensor creation, and not at the point of graph computation). These operations are:

permute: Creates a view tensor with permuted dimensions, such that the dimensions are swapped around in accordance with the axes

reshape_2d: Creates a view tensor with dimensions Y*Z from a tensor W*X, where W*X == Y*Z

reshape_3d: Same as 2D, but for three dimensions

transpose: Creates a view tensor for a 2D tensor in which the two dimensions are swapped

view_1d: Creates a 1D view tensor of length ne0 from offset bytes into the tensor

view_2d: Creates a 2D view tensor of size ne0 * ne1 from offset bytes into the tensor, with a stride of nb1 bytes

view_3d: Creates a 3D view tensor of size ne0 * ne1 * ne2 from offset bytes into the tensor, with a stride of nb1 and nb2 bytes

Yes, I noticed these when browsing the llm code that builds the graph. Thanks for digging into these and the above descriptions, very helpful! These ops are probably all trivial to implement (and/or not necessary).

Yes, GGML can be annoyingly low-level at times :/

I'd maybe suggest skipping the GGML implementation for now and going straight for reimplementing the original Python implementations. They're less likely to encode details like that.

Yes, but ideally we do load the model weights from the GGML quantized model formats (as that is what llm reads), right?

So let's do the exercise one more time then: what is a good reference Python implementation for GPT-NEOX, what ops does it use, and how do they map to the currently supported ops in wonnx? (In researching this it could be interesting to attempt conversion from Torch to ONNX, not because we want to use the ONNX model, but to see how it maps to ONNX ops, as that is wat we currently follow for wonnx IR).

Ideally we would have a picture like this one (linked from here) for GPT-NEOX and with the ops we are going to use/need (preferably existing ONNX implemented ops but I'm open to adding new custom ops to wonnx as well, even if just to speed up inference for LLM, to support quantization, etc.).

LLukas22 · 2023-05-25T08:56:25Z

@pixelspark GGML recently updated their quantization format (see ggerganov/llama.cpp#1508). Yesterday these changes were merged into llm. This means all quantized models (marked with qX_Y) need to be reconverted. Currently im setting up Github actions to automate these conversions. I will reupload the models to the Rustfomers Organization and add instructions on how to use them. This will take some time as i have to upload around ~200GB of model files. Will let you know when the redpajama models are ready 👍

LLukas22 · 2023-05-25T11:57:00Z

Alright the models are converted and uploaded. I also added Pythia models, which are smaller GptNeoX models we could use for development.
RedPajama: https://huggingface.co/Rustformers/redpajama-ggml
Pythia: https://huggingface.co/Rustformers/pythia-ggml

pixelspark · 2023-05-25T12:18:32Z

@LLukas22 still seeing the following:

SHA-256 Hash of the file as I see it (PowerShell Get-FileHash):

D0E1BC0DEA48252CCE95552DBCA4E74DE9D49024C4583DEDD497359A89B2F9A2

As for MPT:

Do I need to use other files?

LLukas22 · 2023-05-25T12:52:29Z

Hm strange, i'm using the exact same model and same git revision and its working as expected (First few tokens are garbled for redpajama, because of the BOS issue). Maybe you have to run a cargo update in your clone? @philpax Do you have any idea why this could happen?

MPT results:

pixelspark · 2023-05-25T13:00:14Z

@LLukas22 sorry I am stupid - forgot to do a git submodule update. Now it all seems to work!

Still getting some weird results though:

MPT is fine apparently although... very elaborate (?):

LLukas22 · 2023-05-25T13:08:52Z

As previously mentioned the strange results from Redpajama are expected as the CLI uses the wrong BOS token atm.
Tbh i dont exactly know what the chat feature in the CLI does, but i'm guessing it was build for LLama based models. MPT-Chat probably expects a different prompt format, which the current chat implementation does not provide. I would recommend sticking to the infer command if you want to play around with the models.

antimora · 2023-05-26T02:37:22Z

Just wanted to add that the Burn team love to use parts of WONNX as a backend for Burn (without ONNX). We are tracking this work here (tracel-ai/burn#243). CCing @nathanielsimard since he has started doing research in this area.

pixelspark · 2023-05-26T06:48:12Z

@antimora great to hear! If you haven't seen it yet, you might want to have a look at #170. It is work in progress to offer a non-ONNX builder API from WONNX (it will primarily offer access to the implemented ONNX ops but possibly others in the future). Contributions are welcomed!

philpax mentioned this issue May 17, 2023

Non-ggml backend rustformers/llm#31

Open

pixelspark changed the title ~~Inference without ONNX~~ Inference without ONNX / usage of WONNX as backend for LLMs May 17, 2023

philpax mentioned this issue May 20, 2023

Build and execute our own computation graph rustformers/llm#137

Open

pixelspark mentioned this issue May 21, 2023

Builder API and further removal of ONNX from the IR #170

Draft

9 tasks

wsxiaoys mentioned this issue May 30, 2023

Vulkan Backend Support for improved device compatibility TabbyML/tabby#124

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inference without ONNX / usage of WONNX as backend for LLMs #169

Inference without ONNX / usage of WONNX as backend for LLMs #169

philpax commented May 17, 2023

pixelspark commented May 17, 2023

philpax commented May 17, 2023

pixelspark commented May 17, 2023

philpax commented May 19, 2023

pixelspark commented May 21, 2023 •

edited

Loading

FL33TW00D commented May 21, 2023

pixelspark commented May 23, 2023

LLukas22 commented May 24, 2023 •

edited

Loading

pixelspark commented May 24, 2023

LLukas22 commented May 24, 2023

pixelspark commented May 24, 2023

philpax commented May 25, 2023

pixelspark commented May 25, 2023

LLukas22 commented May 25, 2023

LLukas22 commented May 25, 2023

pixelspark commented May 25, 2023 •

edited

Loading

LLukas22 commented May 25, 2023 •

edited

Loading

pixelspark commented May 25, 2023 •

edited

Loading

LLukas22 commented May 25, 2023

antimora commented May 26, 2023

pixelspark commented May 26, 2023

Inference without ONNX / usage of WONNX as backend for LLMs #169

Inference without ONNX / usage of WONNX as backend for LLMs #169

Comments

philpax commented May 17, 2023

pixelspark commented May 17, 2023

philpax commented May 17, 2023

pixelspark commented May 17, 2023

philpax commented May 19, 2023

pixelspark commented May 21, 2023 • edited Loading

FL33TW00D commented May 21, 2023

pixelspark commented May 23, 2023

LLukas22 commented May 24, 2023 • edited Loading

pixelspark commented May 24, 2023

LLukas22 commented May 24, 2023

pixelspark commented May 24, 2023

philpax commented May 25, 2023

pixelspark commented May 25, 2023

LLukas22 commented May 25, 2023

LLukas22 commented May 25, 2023

pixelspark commented May 25, 2023 • edited Loading

LLukas22 commented May 25, 2023 • edited Loading

pixelspark commented May 25, 2023 • edited Loading

LLukas22 commented May 25, 2023

antimora commented May 26, 2023

pixelspark commented May 26, 2023

pixelspark commented May 21, 2023 •

edited

Loading

LLukas22 commented May 24, 2023 •

edited

Loading

pixelspark commented May 25, 2023 •

edited

Loading

LLukas22 commented May 25, 2023 •

edited

Loading

pixelspark commented May 25, 2023 •

edited

Loading