Inference without ONNX / usage of WONNX as backend for LLMs #169
Hi @philpax, thanks for bringing this up here. I have given the idea of running LLMs through wonnx some thought over the past few days and I think it would actually be a great addition (as you say, it would provide cross-platform GPU inference even for non-NVIDIA hardware). Adding a builder API would be a good first step. Instead of constructing an ONNX model, it could construct a WONNX IR graph directly (the IR currently is based on nodes that are enums containing mostly ONNX structs, so behind the scenes we would still partially build an ONNX graph, but with a much simpler interface. Eventually we can replace the ONNX structs with our own containing just the bits we need/support). Ideally `llm` makes similar calls to wonnx as it currently does to ggml. In short, I think we need to implement the following:
I will not be able to put significant amounts of work into this over the next few weeks, but I would be highly interested in working on this together later on. Let me know what you think!
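For illustration, a graph-builder API along these lines might look roughly like the sketch below. All names here (`GraphBuilder`, `Node`, `TensorDesc`) are hypothetical placeholders, not the actual wonnx API.

```rust
// Hypothetical sketch of a graph-builder API for wonnx; none of these types
// exist today, they only illustrate the proposed shape of the interface.
struct GraphBuilder {
    nodes: Vec<Node>,
    inputs: Vec<TensorDesc>,
    outputs: Vec<String>,
}

// Behind the scenes these variants could still wrap the ONNX structs the
// current wonnx IR is built on.
enum Node {
    MatMul { a: String, b: String, out: String },
    Add { a: String, b: String, out: String },
    Softmax { input: String, axis: i64, out: String },
}

struct TensorDesc {
    name: String,
    shape: Vec<usize>,
}

impl GraphBuilder {
    fn new() -> Self {
        Self { nodes: Vec::new(), inputs: Vec::new(), outputs: Vec::new() }
    }
    fn input(mut self, name: &str, shape: &[usize]) -> Self {
        self.inputs.push(TensorDesc { name: name.into(), shape: shape.to_vec() });
        self
    }
    fn matmul(mut self, a: &str, b: &str, out: &str) -> Self {
        self.nodes.push(Node::MatMul { a: a.into(), b: b.into(), out: out.into() });
        self
    }
    fn output(mut self, name: &str) -> Self {
        self.outputs.push(name.into());
        self
    }
    // A real implementation would hand the finished graph to a wonnx session
    // for shader compilation, analogous to today's ONNX-based entry points.
}

fn main() {
    let _graph = GraphBuilder::new()
        .input("x", &[1, 4096])
        .matmul("x", "w_q", "q")
        .output("q");
}
```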
That sounds fantastic! Glad to see you're as interested as I am 🙂
Yep, that's reasonable. I'd imagine this would look something like giving `wonnx` the computation graph directly.
We split up our model and inference, so that a model can be shared between multiple inference sessions.
Yes, this is a little complicated as GGML defines its own quantization formats. You can see what the CUDA implementation does in https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu
This all sounds great to me. Happy to work with you on it - just let me know what you need!
Wonnx does something similar but cannot (yet) share a model and its constant tensors between sessions. It is a good idea to make this split (maybe not for an MVP but still).
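A minimal sketch of what such a split could look like, using purely hypothetical types (today's `wonnx::Session` owns its model data itself):

```rust
use std::sync::Arc;

// Hypothetical sketch: a loaded model (graph, compiled shaders, weight
// buffers) that can be shared, with lightweight per-inference sessions
// layered on top of it.
struct Model {
    // compiled pipelines, constant tensors, graph description, ...
}

struct Session {
    model: Arc<Model>,
    // per-session state: KV cache, intermediate buffers, ...
}

impl Session {
    fn new(model: Arc<Model>) -> Self {
        Session { model }
    }
}

fn main() {
    let model = Arc::new(Model {});
    // Several sessions reuse the same weights without re-uploading them.
    let _a = Session::new(Arc::clone(&model));
    let _b = Session::new(Arc::clone(&model));
}
```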
As long as we have a way to quantize/dequantize in WGSL it can be made to work I guess!
Let’s settle on a target/reference model to work with - that allows us to compare CPU output with our output. What do you think would be a good reference? (a specific LLaMA-like, 3B, fp8?) Also it would be helpful if you could investigate the ops we would need and whether there are equivalents in WONNX/ONNX already. If not, we ideally have a reference implementation somewhere else (e.g. in ggml, but we could also have a look at MLC-LLM’s WebLLM - there should be some WGSL there?)
Sorry about the delay in getting back to you!
Yeah, I noticed that. Nice to have, but not a showstopper.
It should all be possible, but I'm not sure what the best way to handle the changing GGML quantization formats is. Does it make sense to have support for the formats directly in wonnx?
Agreed - there are lots of LLaMA models out there, but it's best to go for something unburdened. I'd suggest something like the RedPajama models, which are based on the GPT-NeoX architecture, have 3B variants, and can easily be quantized to whatever format (https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1). There's already readily-available GGML support, so I'm not too worried about it from our end. Ideally, we can compare outputs by sampling the most probable token each time, but I suspect the differences between GPU and CPU computation will lead to inconsistent results anyway. Perhaps we can measure perplexity?
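For reference, perplexity over N held-out tokens is just the exponentiated average negative log-likelihood, which makes it comparable across backends even when greedy sampling diverges:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$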
The operations used by our existing suite of models are the following, with the ones used by GPT-NeoX bolded:
Unfortunately, some of these are quite unclear, as GGML's documentation on the actual operations is sparse and the implementations are quite dense. I'll have to investigate further. From these, I can say that
I took a quick look at the ops in bold and I think most will be rather easy to implement. Some ops may not even be needed:
I will make a first attempt at the builder API later today (jet lag permitting).
@philpax @pixelspark An example of a quantized GEMM in WGSL: WGSL features some handy packing/unpacking functions to make quantisation easier; however, these don't extend to INT4, but unpacking those manually is quite trivial:
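The WGSL snippet itself is not preserved in this thread. As a stand-in, here is a small host-side Rust sketch of the INT4 unpacking idea, assuming a GGML Q4_0-style block (an f16 scale followed by 32 weights packed two per byte); a WGSL port would do the same masks and shifts on `u32` words, since the built-in `unpack4x8snorm`/`unpack4x8unorm` functions only cover 8-bit lanes.

```rust
// Stand-in sketch (not the WGSL from the comment above): dequantize one
// GGML Q4_0-style block of 32 weights. Each block stores a scale `d`
// (f16 in the file, f32 here) followed by 16 bytes, every byte packing two
// 4-bit quants; a value is recovered as (q - 8) * d.
fn dequantize_q4_0_block(d: f32, qs: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for j in 0..16 {
        let lo = (qs[j] & 0x0F) as i32 - 8; // low nibble -> first half of block
        let hi = (qs[j] >> 4) as i32 - 8;   // high nibble -> second half
        out[j] = lo as f32 * d;
        out[j + 16] = hi as f32 * d;
    }
    out
}

fn main() {
    // A block of all-zero nibbles dequantizes to -8 * d everywhere.
    let block = dequantize_q4_0_block(0.5, &[0u8; 16]);
    assert!(block.iter().all(|&w| w == -4.0));
}
```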
@philpax I am trying to get RedPajama up and running with `llm`. Could you perhaps point me to a specific model file I should use?
@pixelspark Here are some converted RedPajama models which should work with the latest version of `llm`. I can also recommend MPT-based models, which are also openly licensed (instructions can be found in the HF repository).
That link shows a 404 for me?
Sorry, the repository was still private. But I would still recommend using MPT, as some GPT-NeoX-based models (including RedPajama) have problems with added BOS tokens (see rustformers/llm#270).
Apologies for the confusion there, it's been a bit hectic. We now target GGJT v3/QNT2 exclusively, as of five minutes ago 😅 Yes, RedPajama models are sensitive to BOS, but that shouldn't impact your experimentation too much. I'd suggest sticking with GPT-NeoX as it's a relatively well-established architecture with several models built on top of it (Pythia, StableLM, RedPajama, etc). I also realized that my list excludes a few operations that I thought were no-ops in GGML, but which I've since realized still create new tensors (with the reshaping happening at the point of tensor creation, not at the point of graph computation). These operations are:
Yes, GGML can be annoyingly low-level at times :/ I'd maybe suggest skipping the GGML implementation for now and going straight to reimplementing the original Python implementations. They're less likely to encode details like that.
So I finally got around to trying this. @LLukas22, could you point me at the model files I should use now? Below are my results with the RedPajama files currently on HF:
OK, seems like a good idea
Yes, I noticed these when browsing the GGML code.
Yes, but ideally we do load the model weights from the GGML quantized model formats (as that is what `llm` uses). So let's do the exercise one more time then: what is a good reference Python implementation for GPT-NeoX, what ops does it use, and how do they map to the ops currently supported in `wonnx`? Ideally we would have a picture like this one (linked from here) for GPT-NeoX, with the ops we are going to use/need (preferably existing ONNX-implemented ops, but I'm open to adding new custom ops to `wonnx`).
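Not an answer to the reference-implementation question, but as a starting point for the op inventory: one GPT-NeoX layer with the default parallel residual roughly decomposes as in the sketch below. The `Tensor` type and helper functions are hypothetical placeholders; the comments note plausible ONNX counterparts.

```rust
// Rough op inventory for one GPT-NeoX transformer layer (parallel residual).
// `Tensor` and all helpers are placeholders for illustration only.
struct Tensor;

fn gpt_neox_layer(x: &Tensor) -> Tensor {
    // 1. Input LayerNorm           -> LayerNormalization (or a ReduceMean/Sub/... composite)
    let ln1 = layer_norm(x);
    // 2. Fused QKV projection      -> MatMul + Add, then Split/Reshape/Transpose into heads
    let (q, k, v) = qkv_projection(&ln1);
    // 3. Rotary embeddings on part of each head dim
    //                              -> Mul/Add against precomputed Sin/Cos tables, Concat (or a custom op)
    let (q, k) = apply_rotary(&q, &k);
    // 4. Scaled dot-product attention with causal mask
    //                              -> MatMul, Div, Add (mask), Softmax, MatMul
    let attn = attention(&q, &k, &v);
    // 5. Attention output projection -> MatMul + Add
    let attn_out = dense(&attn);
    // 6. Post-attention LayerNorm, applied to the *input* x (parallel residual)
    let ln2 = layer_norm(x);
    // 7. MLP: Linear -> GELU -> Linear
    //                              -> MatMul + Add, Gelu (Erf-based), MatMul + Add
    let mlp_out = mlp(&ln2);
    // 8. Parallel residual: x + attn_out + mlp_out   -> Add, Add
    add(&add(x, &attn_out), &mlp_out)
}

// Placeholder signatures only, so the outline compiles in isolation.
fn layer_norm(_x: &Tensor) -> Tensor { Tensor }
fn qkv_projection(_x: &Tensor) -> (Tensor, Tensor, Tensor) { (Tensor, Tensor, Tensor) }
fn apply_rotary(_q: &Tensor, _k: &Tensor) -> (Tensor, Tensor) { (Tensor, Tensor) }
fn attention(_q: &Tensor, _k: &Tensor, _v: &Tensor) -> Tensor { Tensor }
fn dense(_x: &Tensor) -> Tensor { Tensor }
fn mlp(_x: &Tensor) -> Tensor { Tensor }
fn add(_a: &Tensor, _b: &Tensor) -> Tensor { Tensor }

fn main() {
    let _out = gpt_neox_layer(&Tensor);
}
```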
@pixelspark GGML recently updated their quantization format (see ggerganov/llama.cpp#1508). Yesterday these changes were merged into `llm`.
Alright, the models are converted and uploaded. I also added Pythia models, which are smaller GPT-NeoX models we could use for development.
@LLukas22 still seeing the same issue. SHA-256 hash of the file as I see it (computed with PowerShell): D0E1BC0DEA48252CCE95552DBCA4E74DE9D49024C4583DEDD497359A89B2F9A2. As for MPT: do I need to use other files?
Hm, strange - I'm using the exact same model and the same git revision and it's working as expected (the first few tokens are garbled for RedPajama because of the BOS issue). Maybe you have to run a fresh build?
@LLukas22 sorry, I am stupid - forgot to do that. Still getting some weird results though. MPT is fine apparently, although... very elaborate (?):
As previously mentioned, the strange results from RedPajama are expected, as the CLI uses the wrong BOS token at the moment.
Just wanted to add that the Burn team would love to use parts of WONNX as a backend for Burn (without ONNX). We are tracking this work here (tracel-ai/burn#243). CCing @nathanielsimard since he has started doing research in this area.
Is your feature request related to a problem? Please describe.
I'm one of the maintainers of the llm project, and we're looking for a robust, cross-platform GPU inferencing solution for our LLM models. We currently have computation graphs for GGML, but are planning on introducing some kind of abstraction for use with other backends.
I'm investigating the use of `wonnx` as a potential backend, but it is (understandably!) coupled to ONNX. I was wondering if it would be possible to specify a computation graph directly for compilation/inference without going through ONNX.

Describe the solution you'd like
A builder API for computation graphs, or something similar, so that a `wonnx::Session` could be created without the use of ONNX.

Describe alternatives you've considered
I've considered constructing a `wonnx::onnx::ModelProto` at runtime, but the ONNX format contains a lot of things we don't need or don't have. It's designed for self-contained models; however, we are loading weights from arbitrary locations and supplying our own computation graph, making it difficult for us to synthesize a complete ONNX model.
Additional context
There's no particular hurry on this. We'd love to have GPU inference as soon as possible - especially truly cross-platform, non-CUDA (!) inference - but I assume this would be a large body of work.
I'm also not sure what operations would need to be implemented for our use case, but we would file PRs as required to implement any missing operations.