
llama : add RWKV models support #846

Closed
multimediaconverter opened this issue Apr 8, 2023 · 36 comments · Fixed by #8980
Labels
good first issue (Good for newcomers) · help wanted (Extra attention is needed) · model (Model specific)

Comments

@multimediaconverter

multimediaconverter commented Apr 8, 2023

RWKV is a 100% RNN language model, and currently the only RNN that can match transformers in quality and scaling while being faster and using less memory.

Info: https://github.com/BlinkDL/ChatRWKV

RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to Transformers with O(n^2) attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly at large context lengths.
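
To illustrate, here is a minimal sketch of the recurrent decoding pattern: only a fixed-size state is carried between tokens, so the per-token cost does not grow with context length. The `model` and `tokenizer` objects and their methods are placeholders for illustration, not an actual rwkv.cpp API.

```python
# Minimal sketch of RNN-style decoding: per-token cost is O(1) in context
# length because only a fixed-size state is carried forward.
# `model` and `tokenizer` are hypothetical placeholder objects.

def generate(model, tokenizer, prompt: str, n_tokens: int) -> str:
    state = model.new_state()                 # fixed-size recurrent state
    logits = None
    for tok in tokenizer.encode(prompt):      # feed the prompt token by token
        logits, state = model.forward(tok, state)
    out = []
    for _ in range(n_tokens):
        tok = int(logits.argmax())            # greedy sampling, for simplicity
        out.append(tok)
        logits, state = model.forward(tok, state)
    return tokenizer.decode(out)
```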

Experimental GGML port: https://github.com/saharNooby/rwkv.cpp

The latest "Raven"-series Alpaca-style-tuned RWKV 14B & 7B models are very good.
Online demo: https://huggingface.co/spaces/BlinkDL/Raven-RWKV-7B
Download: https://huggingface.co/BlinkDL/rwkv-4-raven


Edit by @ggerganov:

Adding @BlinkDL's comment below to OP for visibility:

v4 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py

v5 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py

fast v4 & v5.2 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py

v5.2 1.5B demo (great for its size): https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio

v5.2 1.5B benchmarks: https://twitter.com/BlinkDL_AI/status/1717543614434402661

a few remarks:

  • RWKV models have an RNN-style "one" mode and a GPT-style "seq" mode
  • I am actually using exp(-exp(w)) as the decay (sketched below)
  • it seems good to precompute embedding+emb_layernorm in bf16
  • when using fp16, I divide by 2 every 6 layers to avoid overflow
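
For reference, a compact numpy paraphrase of the v4 time-mixing recurrence from RWKV_in_150_lines.py, showing where the exp(-exp(w)) decay enters; variable names are illustrative, and the linked file remains the authoritative version.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Paraphrase of RWKV v4 time mixing (see RWKV_in_150_lines.py for the original).
# The recurrent state is (last_x, num, den); `decay` is the learned w, and the
# decay factor applied per step is exp(-exp(decay)), which stays in (0, 1).
def time_mixing(x, last_x, num, den, decay, bonus,
                mix_k, mix_v, mix_r, Wk, Wv, Wr, Wout):
    # token shift: blend the current input with the previous token's input
    k = Wk @ (x * mix_k + last_x * (1 - mix_k))
    v = Wv @ (x * mix_v + last_x * (1 - mix_v))
    r = Wr @ (x * mix_r + last_x * (1 - mix_r))

    # WKV: a decaying weighted average over past values, computed recurrently
    wkv = (num + np.exp(bonus + k) * v) / (den + np.exp(bonus + k))
    rwkv = sigmoid(r) * wkv

    # state update with the exp(-exp(decay)) factor
    num = np.exp(-np.exp(decay)) * num + np.exp(k) * v
    den = np.exp(-np.exp(decay)) * den + np.exp(k)

    return Wout @ rwkv, (x, num, den)
```
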
@Green-Sky
Collaborator

closing this in favor of ggerganov/ggml#21

also https://github.com/saharNooby/rwkv.cpp seems to be it.

@someone13574

someone13574 commented Nov 1, 2023

Now that support for other models is being added directly to llama.cpp, would RWKV support be reconsidered? It would be very nice to support it, since it would then get all the benefits that llama.cpp has over a separate project for RWKV only.

@ggerganov
Owner

We should try to add it. It will probably be the most different from all the other arches that we support, as it is RNN-based, so it will be a good exercise to see how easily it fits into the existing framework.

@ggerganov ggerganov changed the title [Feature request] Add RWKV models support llm : add RWKV models support Nov 1, 2023
@ggerganov ggerganov added the help wanted and good first issue labels Nov 1, 2023
@ggerganov ggerganov reopened this Nov 1, 2023
@ggerganov ggerganov changed the title llm : add RWKV models support llama : add RWKV models support Nov 1, 2023
@BlinkDL

BlinkDL commented Nov 1, 2023

@ggerganov Please check these :)

v4 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py

v5 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py

fast v4 & v5.2 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py

v5.2 1.5B demo (great for its size): https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio

v5.2 1.5B benchmarks: https://twitter.com/BlinkDL_AI/status/1717543614434402661

a few remarks:

  • RWKV models have an RNN-style "one" mode and a GPT-style "seq" mode
  • I am actually using exp(-exp(w)) as the decay
  • it seems good to precompute embedding+emb_layernorm in bf16
  • when using fp16, I divide by 2 every 6 layers to avoid overflow

@KerfuffleV2
Collaborator

Not sure if it helps, but I have a GGML-based Rust implementation here: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/smolrwkv/src/ggml/graph.rs (that's just v4 inference)

This is actually the reason I made my first contribution to the project: trying to get the map ops (now superseded) to work around what GGML didn't support. I think that's mostly still the case, so the majority of these will probably still need custom mapping: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/smolrwkv/src/ggml/map_ops.rs (the one_minus op is mainly just an optimization).

@saharNooby

saharNooby commented Nov 2, 2023

Hi all! Maintainer of rwkv.cpp here.

Indeed, having a separate repository for RWKV leads to ggml version lag, a lack of computation backends that I can't commit to supporting with my limited time, and other issues.

That said, I like the compactness and simplicity of the rwkv.cpp repository; huge repos like llama.cpp, with 10K+ line C++ files, scare me, though this is a subjective preference. I would not be able to commit to supporting an RWKV implementation in the llama.cpp repo.

In the end, users will decide :)


On a more practical note:

If support for RWKV is added to llama.cpp, I also suggest implementing a conversion script for handling model files in the rwkv.cpp format. The format is documented here. There are models hosted on Hugging Face in this format -- for example, here. kobold.cpp also supports this format.

Furthermore, if support for both RWKV v4 and RWKV v5 is implemented in llama.cpp, including conversion from the rwkv.cpp format, and there is a reasonable commitment from the maintainers of llama.cpp to fix bugs and add new versions of RWKV, I will be OK with marking rwkv.cpp as deprecated, adding a link to llama.cpp, and no longer maintaining the repo.

Until then, my plan is to continue supporting rwkv.cpp, including adding RWKV v5 support sometime later.

I won't be able to help with migrating rwkv.cpp code to llama.cpp, but of course anyone is free to use rwkv.cpp as a reference (or even copy-paste code -- not sure how licensing works).

@ggerganov
Owner

Hi @saharNooby - great work with rwkv.cpp

I'm mainly interested in seeing what llama.cpp would need in order to add support for a new arch that is more different from what we are used to. It turned out that all the LLMs we support so far are pretty much 99% the same thing, with a bias here and a norm there. So I'm not sure how well the framework would accommodate a model that is fundamentally different, assuming RWKV is one (I haven't even looked at the details, so I don't really know if this statement is true).

I'm looking forward to contributions, as I doubt I will have the time to implement it myself. So we will have to see whether RWKV support ends up in llama.cpp at all. In any case, it's too early; definitely do not deprecate rwkv.cpp at this point.

Alternatively, we should also look for other LLM architectures that would present some sort of a challenge and try to integrate them as well, in the same spirit, to understand what llama.cpp needs to become more general-purpose.

@saharNooby

what llama.cpp would need in order to add support for a new arch that is more different from what we are used to

Regarding ggml: for a long time rwkv.cpp used vanilla ggml, and only recently was ggml forked and a crutch added to support very large cgraphs: Increase GGML_MAX_NODES from 4096 to 80000. But it looks like you've recently removed this node limit altogether. Overall, I don't expect any changes to ggml will be required in order to support RWKV.

Regarding the llama.cpp file: I think I see what you mean -- supporting a new architecture in that file and the surrounding infra (scripts, etc.) can indeed be difficult. Can't comment on that :)

that is fundamentally different, assuming RWKV is one

The only difference is that attention was replaced with WKV, which can be computed in a recurrent manner. Everything else -- layer structure, MLP, embed/unembed -- is the same as in Transformers. Some early versions of RWKV even use the popular 20B_tokenizer, although later ones use a custom World tokenizer, which would need to be implemented (it's simple, and does not even require Unicode normalization).
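
For illustration, the World tokenizer can be thought of as a greedy longest-match over raw bytes; below is a rough sketch under that assumption, with a simplified vocab representation (the reference trie-based implementation lives in ChatRWKV's rwkv_tokenizer.py).

```python
# Rough sketch of a greedy longest-match byte tokenizer, in the spirit of the
# RWKV World tokenizer. `vocab` is assumed to map token bytes -> token id;
# the reference implementation in ChatRWKV uses a trie for speed.

def encode(text: str, vocab: dict) -> list:
    data = text.encode("utf-8")
    max_len = max(len(t) for t in vocab)
    ids, i = [], 0
    while i < len(data):
        # try the longest candidate first, then fall back to shorter ones
        for l in range(min(max_len, len(data) - i), 0, -1):
            tok_id = vocab.get(data[i:i + l])
            if tok_id is not None:
                ids.append(tok_id)
                i += l
                break
        else:
            raise ValueError("no match; single bytes are expected to be in the vocab")
    return ids
```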

definitely do not deprecate rwkv.cpp at this point

Yep!

@BlinkDL

BlinkDL commented Nov 2, 2023

I'm mainly interested in seeing what llama.cpp would need in order to add support for a new arch that is more different from what we are used to. It turned out that all the LLMs we support so far are pretty much 99% the same thing, with a bias here and a norm there. So I'm not sure how well the framework would accommodate a model that is fundamentally different, assuming RWKV is one (I haven't even looked at the details, so I don't really know if this statement is true).

The real difference is that RWKV (and other "linear attention" models) uses a fixed-size state instead of a growing KV cache :)

so it's like:

output, new_state = model.forward(input, current_state)

and you can clone & save states, to make a "state cache" for various inputs to accelerate inference.
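
A minimal sketch of that "state cache" idea, assuming a hypothetical model interface (new_state / forward) rather than any real rwkv.cpp or llama.cpp API: snapshots of the fixed-size state are stored per prompt prefix and re-used later.

```python
import copy

# Hypothetical "state cache": snapshot the fixed-size RWKV state after a common
# prefix, then branch from that snapshot without re-reading the prefix.
# `model.new_state()` / `model.forward()` are placeholders, not a real API.
state_cache = {}

def prefill(model, tokenizer, prefix: str):
    state = model.new_state()
    for tok in tokenizer.encode(prefix):
        _, state = model.forward(tok, state)
    state_cache[prefix] = copy.deepcopy(state)   # cost is O(state size), not O(tokens)

def continue_from(model, tokenizer, prefix: str, suffix: str):
    state = copy.deepcopy(state_cache[prefix])   # restore the cached prefix state
    logits = None
    for tok in tokenizer.encode(suffix):
        logits, state = model.forward(tok, state)
    return logits, state
```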

@BlinkDL

BlinkDL commented Nov 2, 2023

RWKV v4 in 100 lines (using numpy): https://johanwind.github.io/2023/03/23/rwkv_details.html

another blogpost: https://fullstackdeeplearning.com/blog/posts/rwkv-explainer/

v4 details: https://ben.bolte.cc/rwkv-model

RWKV zoom talk (TUE, NOV 7 · 9:30 AM CST): https://www.meetup.com/silicon-valley-generative-ai/events/296395124/

RWKV sf meet (Saturday, Nov 11 1:00pm PT): https://partiful.com/e/bi6lGCvZXCzZQNN5FjXW

@Cyberhan123

I'm excited to see RWKV's progress; I love this model.

@KerfuffleV2
Collaborator

Is there a way to make RWKV's state stuff fit in with the current concept of sequences and KV cache manipulation? Can you do parallel generation with multiple independent sequences?

@KerfuffleV2
Collaborator

KerfuffleV2 commented Nov 21, 2023

If it's helpful, I asked some questions in the RWKV discord:


[2:06 AM] Kerfuffle: This might be a pretty dumb question, but just thinking about how RWKV could fit into llama.cpp. Probably the biggest thing is figuring out how it can work with llama.cpp's idea of batches and sequences and parallel generation.
When doing generation, the API lets you add items to the batch, each one has: token id, sequence id, and position in the sequence. Then you call decode and it can run decode on all the items in the batch in parallel.
The API also includes KV cache manipulation stuff, so for example you can undo generation of the last N tokens and that kind of thing.
So now the actual question: Can you evaluate multiple independent sequences in parallel with RWKV? And also, can you edit the state kind of like the KV cache stuff when you are able to do something like remove some previously generated tokens from it?

[3:12 AM] Tomeno: you can run rwkv in parallel, but you can't edit the state like that - what you can do though is save and roll back to previous versions of the state cheaply

[3:20 AM] Kerfuffle: Thanks for the answer. Is there a way to save/roll back the state just for specific sequences when doing parallel generation?

[3:30 AM] Tomeno: well, i should say, save and load the state - the state is a "compressed" version of the entire context/sequence up to that point

[3:45 AM] Tomeno: so no, once it's processed, you can't separate the tokens that went into it

[3:46 AM] Tomeno: what you could do is something like save the state after every reply of a chatbot, and then you could load any point in that conversation back up and continue from there

[3:47 AM] Tomeno: or save a number of states to disk and load them back up at any time, no matter how long the input sequence was, the state is about the same size

[3:52 AM] Kerfuffle: Thanks again. I guess the main issue is keeping the state of sequences separate which I guess actually isn't possible.

[3:53 AM] Kerfuffle: Seems like it would be really hard to fit RWKV into llama.cpp as an alternative model architecture.

[4:17 AM] Kerfuffle: I feel like there's got to be a way to do separate sequences in general otherwise it's a HUGE strike against RWKV. Just for example, suppose I have an RWKV model that works as well as ChatGPT. I want to set up a website where people can query it. A service like that requires submitting queries in huge batches, doing a completely separate decode for each individual user just wouldn't work.

[4:20 AM] Tomeno: oh wait, i misunderstood what you meant

[4:20 AM] Tomeno: when you process multiple sequences in parallel, each of them has its own associated state

[4:21 AM] Tomeno: put very simply, the input to rwkv is state + next token

[4:23 AM] Kerfuffle: Ah, okay, good. Yeah, I have a vague idea of how it probably works then.

[4:23 AM] Tomeno: i thought when you wrote "roll back the state for specific sequences" you meant, like, take out a set of tokens from the context

[4:23 AM] Kerfuffle: You could just let each sequence have its own state and somehow do the calculation so the correct state is involved for each sequence.

[4:23 AM] Kerfuffle: You were correct. :) I was actually asking about both things.

[4:24 AM] Kerfuffle: I'm just generally trying to figure out how practical it is (or practical within my capabilities) to try to add RWKV support to llama.cpp

[4:24 AM] Tomeno: there were some demos of parallel inference posted recently though i have no idea how to find it

[4:25 AM] Kerfuffle: Well, the first step is knowing it's even possible, so that definitely helps.

[4:26 AM] Mathmagician: I think web-rwkv lets you inference multiple sequences in parallel


This is the web-rwkv implementation that was mentioned: https://github.com/cryscan/web-rwkv/

From that conversation, it seems like parallel generation wouldn't be too much of a problem. However, KV-editing operations like rewinding seem like they would be extremely difficult. Tomeno mentioned saving the RWKV sequence state per token, which may be possible, but I'm guessing the per-token state is going to be too large to make that practical. So I think the only way it could really work with llama.cpp's KV cache manipulation ops is to only allow completely clearing a sequence and nothing else. A toy sketch of the per-sequence-state approach is below.
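
To make that conclusion concrete, here is a toy sketch of parallel decoding where each sequence id simply owns its own fixed-size state, and the only cache operation that maps cleanly onto it is clearing a whole sequence. The model interface (new_state, forward_batch) is a placeholder, not an actual llama.cpp or rwkv.cpp API.

```python
# Toy sketch: parallel decoding with one independent fixed-size state per
# sequence id. `model.new_state()` / `model.forward_batch()` are hypothetical;
# a real implementation would batch the matrix multiplications across sequences.

states = {}  # seq_id -> recurrent state

def decode_batch(model, batch):
    # batch: list of (seq_id, token) pairs, one token per sequence
    for seq_id, _ in batch:
        states.setdefault(seq_id, model.new_state())
    tokens    = [tok for _, tok in batch]
    in_states = [states[seq_id] for seq_id, _ in batch]
    logits, out_states = model.forward_batch(tokens, in_states)
    for (seq_id, _), new_state in zip(batch, out_states):
        states[seq_id] = new_state
    return logits

def clear_sequence(seq_id):
    # the one cache operation that maps cleanly onto a recurrent state:
    # throw the whole sequence state away
    states.pop(seq_id, None)
```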

On an unrelated note, a WebGPU backend seems like an interesting idea... web-rwkv uses WebGPU as its GPU backend. It actually ran pretty fast for me when I tried the example, and it probably would be possible to interface with the Rust wgpu crate from C++.

@BlinkDL

BlinkDL commented Nov 21, 2023

You can save the RWKV state every n tokens, and you can save states to RAM / disk.

@KerfuffleV2
Collaborator

You can save the RWKV state every n tokens, and you can save states to RAM / disk.

I'm looking at it from the perspective of how it can be integrated into llama.cpp's existing architecture. How big is the state? For the 3B World v5 model, is it 2560x2560?

@BlinkDL

BlinkDL commented Nov 22, 2023

(2+64)*2560 numbers for each block

32*(2+64)*2560 numbers for the full model
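
For a rough sense of scale, plugging those numbers into a quick back-of-the-envelope calculation (assuming 32 blocks and n_embd = 2560, i.e. the 3B model, and fp16 or fp32 storage; treat the exact figures as an estimate):

```python
# Back-of-the-envelope RWKV v5 state size, using BlinkDL's figure of
# (2 + 64) * 2560 numbers per block and 32 blocks (assumed 3B model shape).
n_layer, n_embd, head_size = 32, 2560, 64

per_block = (2 + head_size) * n_embd      # 168,960 numbers per block
full      = n_layer * per_block           # 5,406,720 numbers for the whole model

print(f"{full * 2 / 1024**2:.1f} MiB in fp16")  # ~10.3 MiB
print(f"{full * 4 / 1024**2:.1f} MiB in fp32")  # ~20.6 MiB
```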

@19h

19h commented Jan 29, 2024

There's been renewed progress in the RWKV space with Eagle-7b: https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers.

@sorasoras

RWKV support should be reconsidered for llama.cpp, given the recent merge of Mamba SSM.

@compilade
Collaborator

RWKV support should be reconsidered for llama.cpp, given the recent merge of Mamba SSM.

If nobody else does it, I'll have time to work on RWKV in llama.cpp starting in May (in a month and a half).

Mamba took me a bit more than a month to implement in llama.cpp (though basic inference (with --batch-size 1) was working after the first week). I expect RWKV will be slightly easier to implement, since part of the work has already been thought through (KV cache API compatibility with recurrent models). It would be nice if simultaneous state processing with recurrent models did not require a custom ggml operator for each state type, though. I'll think about ways to make it simpler when I get to it.

If anyone reading this is interested in working on this before I have more time, feel free to go ahead.

@LaylBongers
Contributor

I've been taking up the task of implementing support for the RWKV v5 architecture. I've had some issues getting the included Python conversion code adapted for RWKV, however; of course, this is the first step to getting RWKV working.
I've been working on a conversion tool this week that I'll likely be publishing soon, after which I'll start implementing the architecture within llama.cpp. I'll keep everyone up to date as I work on it.

@hiepxanh

Great to know that 🥰🥰🥰

@BlinkDL

BlinkDL commented Mar 29, 2024

Please try the much stronger v6.0 World 2.1 model :) The design is similar to v5. The 1B6 model is done; 3B and 7B are coming soon.

https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-1

https://twitter.com/BlinkDL_AI/status/1773503808221712722

@LaylBongers

The difference between v6 and v5:
https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v6_demo.py
vs
https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py

@LaylBongers
Contributor

Over Easter we've got a long weekend here, but I figured I'd give a few updates on my work on this:

  • I had issues adapting the Python conversion code, so I've created and published a toolset to handle RWKV conversion for now. It's a bit too early and unverified to use practically just yet, but it's on the recursal org. I'll be using this as the basis for RWKV GGUF testing.
  • I've cloned the repo and started hacking away at it to add support; not much progress there yet.

On RWKV v6, I hadn't seen that demo yet! It looks straightforward to add both once one of the two is working.

@LaylBongers
Contributor

I've got the tokenizer in and functional so far, working with the "tokenize" example. I'm considering submitting the tokenizer by itself as a small PR to reduce review load. Any thoughts on this?

@ggerganov
Owner

Either way would be fine. The tokenizer alone might not be useful for anything other than RWKV, so there is no point in merging it on its own.

@LaylBongers
Contributor

I'm hitting some issues with the KV cache initialization, so I'm taking this moment to give an update on the work done so far.

WIP code available here: https://github.com/RWKV/llama.cpp
Right now it contains just the tokenizer and an attempt at placeholder model loading and graph initialization.

This can be tested using a partially generated GGUF over here, generated using gguf-swiss:
https://huggingface.co/LaylBongers/temp-rwkvgguf-partial/tree/main

Currently I'm trying to track down an initialization issue:

ggml_backend_alloc_ctx_tensors_from_buft: all tensors in the context are already allocated
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache

@compilade
Collaborator

compilade commented Apr 18, 2024

I'm hitting some issues with the KV cache initialization

The KV cache for recurrent models is sized from the GGUF metadata keys {model}.ssm.state_size, {model}.ssm.inner_size, and {model}.ssm.kernel_size. These get read into hparams.ssm_d_state, hparams.ssm_d_inner and hparams.ssm_d_conv, respectively.

The following are used to size the kv_self.k_l and kv_self.v_l tensors for recurrent models:

llama.cpp/llama.cpp, lines 1865 to 1875 at 0d56246:

uint32_t n_embd_k_s() const { // dimension of the rolling state embeddings
    // corresponds to Mamba's conv_states size
    // TODO: maybe support other convolution strides than 1
    // NOTE: since the first column of the conv_state is shifted out each time, it's not actually needed
    return (ssm_d_conv > 0 ? ssm_d_conv - 1 : 0) * ssm_d_inner;
}

uint32_t n_embd_v_s() const { // dimension of the recurrent state embeddings
    // corresponds to Mamba's ssm_states size
    return ssm_d_state * ssm_d_inner;
}

If RWKV uses 2 different recurrent states (e.g. one for time mix and the other for channel mix, though I'm not yet sure how they are used), it might be useful to add a new metadata key for the stride of the convolution and make it 0 for RWKV (possibly called {model}.ssm.conv_stride). Otherwise, if only a single recurrent state is required, it should be enough to only use {model}.ssm.state_size and {model}.ssm.inner_size and the v_l tensors. I'd like to make it less Mamba-centric, and re-using metadata keys across RWKV and Mamba could achieve this, though it might make hybrids of the two harder in the future (though such hybrids don't seem likely, I think?).

Re-using k_l and v_l for recurrent states isn't ideal and will be changed soon-ish (work-in-progress at master...compilade/refactor-kv-cache, which will be advancing further once I find more free time) to support hybrid recurrent Transformer models, and so recurrent models will be identified by their use of the relevant metadata keys for the recurrent state size. Parallel sequence management for recurrent models is also slightly simpler in that branch. This is a preview of what is coming next month.
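
As a purely illustrative aside, here is a small Python mirror of the two sizing functions quoted above, together with one hypothetical way RWKV v5/v6 state could be split across them: two token-shift vectors per block as a small "rolling" state and the per-head WKV matrices as the main recurrent state. The mapping, names, and numbers are assumptions for illustration, not the final design.

```python
# Python mirror of n_embd_k_s() / n_embd_v_s() quoted above, plus one
# hypothetical way RWKV v5/v6 per-block state could be split across them.
# All RWKV-specific names and numbers here are illustrative assumptions.

def n_embd_k_s(ssm_d_conv: int, ssm_d_inner: int) -> int:
    # "rolling" state (Mamba: conv_states)
    return (ssm_d_conv - 1 if ssm_d_conv > 0 else 0) * ssm_d_inner

def n_embd_v_s(ssm_d_state: int, ssm_d_inner: int) -> int:
    # recurrent state (Mamba: ssm_states)
    return ssm_d_state * ssm_d_inner

# Hypothetical RWKV mapping for n_embd = 2560, head_size = 64:
n_embd, head_size = 2560, 64
shift_state = 2 * n_embd          # time-mix + channel-mix token-shift vectors
wkv_state   = head_size * n_embd  # n_head * head_size * head_size
print(shift_state + wkv_state)    # (2 + 64) * 2560, matching BlinkDL's figure above
```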

@LaylBongers
Contributor

Another update; thanks for the notes! I've resolved the initial crash issues on initialization, though mostly with hacky temporary placeholders (like re-using the ssm scope keys). I'll put up a new version of the temporary GGUF file on Monday. The remaining work is to fill in the rest of the network graph, link it up with the KV cache hack for tracking state, and then start handling all the individual hacks one by one.

@BlinkDL

BlinkDL commented May 27, 2024

More reference:
https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/rwkv_v6_demo.py
https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v6_demo_cuda_bf16.py

@BlinkDL

BlinkDL commented Jun 15, 2024

I've got the tokenizer in and functional so far, working with the "tokenize" example. I'm considering submitting the tokenizer by itself as a small PR to reduce review load. Any thoughts on this?

Please check the unit tests in https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py (against the reference tokenizer),
and please verify the byte length of each token (it must equal the number at the end of each line in the vocab file).
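
A small sketch of such a check, assuming the vocab file format of rwkv_vocab_v20230424.txt (one line per token: an index, a Python-literal representation of the token, and its byte length); the parsing details here are assumptions, so treat the reference tokenizer as authoritative.

```python
from ast import literal_eval

# Check that the number at the end of each vocab line equals the byte length of
# the token on that line. Assumes the rwkv_vocab_v20230424.txt line format:
#   <id> <python literal for the token> <byte length>
def check_vocab(path: str = "rwkv_vocab_v20230424.txt") -> None:
    with open(path, encoding="utf-8") as f:
        for line in f:
            idx, rest = line.rstrip("\n").split(" ", 1)
            literal, length = rest.rsplit(" ", 1)
            tok = literal_eval(literal)  # str or bytes literal
            data = tok.encode("utf-8") if isinstance(tok, str) else tok
            assert len(data) == int(length), f"length mismatch for token {idx}"
    print("all token byte lengths match")
```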

@BlinkDL

BlinkDL commented Jul 8, 2024

https://github.com/RWKV/rwkv.cpp supports v6 now

@MoonRide303

Conversion and quantization using b3651 worked fine (src HF model: https://huggingface.co/RWKV/v6-Finch-7B-HF).

Conversation (using llama-server) initially produced some output, but over 6 attempts it crashed 3 times after the 1st or 2nd message, ending with

llama.cpp:3628: GGML_ASSERT(cell.has_seq_id(seq_id)) failed

It doesn't look fully supported / working yet.

@MollySophia
Contributor

Conversion and quantization using b3651 worked fine (src HF model: https://huggingface.co/RWKV/v6-Finch-7B-HF).

Conversation (using llama-server) initially produced some output, but over 6 attempts it crashed 3 times after the 1st or 2nd message, ending with

llama.cpp:3628: GGML_ASSERT(cell.has_seq_id(seq_id)) failed

It doesn't look fully supported / working yet.

Thanks for your testing.
I'll try to reproduce this and see what's wrong later.

@compilade
Collaborator

@MoonRide303 @MollySophia This should be fixed in #9249

@MoonRide303

@MoonRide303 @MollySophia This should be fixed in #9249

I briefly tested the Q6_K quant of Finch 7B using llama-server b3658 - it seems to be okay (no longer crashing).

@SinanAkkoyun

What tps speeds are you getting on a GPU?
