Enhancement: Add text-generation-webui as a back-end to support local models #6
Comments
Yes, we definitely want to support more than just OpenAI - and especially open source (ish...) models like LLaMA! The main question here, I think, is how to make sure characters are "consistent enough" whether people are using LLaMA via a paid API or locally. E.g. if someone is running a 4-bit 13B LLaMA model and tunes their character nicely, but then someone else gets that character via a sharing link and uses an 8-bit version of the model, or uses inference code with a slightly different sampling method, they might get significantly different results.

So ideally we'll soon have a very "standard" interface to the LLaMA models (likely Hugging Face transformers), which would give us a standard/"objective" set of parameters to use - whether someone is running locally, via the Hugging Face Inference API, with a Docker image on runpod.io, etc. The pace at which the community is moving with LLaMA suggests that this won't be too far away - if the licensing situation doesn't get in the way. People should feel free to add relevant community developments to this thread so we can track feasibility here. Relevant:
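To make the "standard/objective set of parameters" idea concrete, here is a minimal sketch of what a shared, backend-agnostic sampling preset could look like if Hugging Face transformers were the common interface. The checkpoint name, prompt format, and parameter values are placeholders for illustration, not project defaults.

```python
# Sketch: a single "character preset" of sampling parameters that could be shared
# alongside a character, so local and hosted backends sample the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

CHARACTER_SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.15,
    "max_new_tokens": 200,
    "do_sample": True,
}

checkpoint = "decapoda-research/llama-13b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")  # needs accelerate

prompt = "### Character: Ada\n### User: Hello!\n### Ada:"  # hypothetical prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, **CHARACTER_SAMPLING)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The same `CHARACTER_SAMPLING` dict could be serialized with the character's sharing link, so whoever imports the character at least starts from identical sampling settings, even if their quantization level differs.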
KoboldAI Main / KoboldAI "United" Beta already handle dozens of models for chat AI. Kobold is used as a back-end for TavernAI, which is the most popular chat front end next to the billion-dollar company CharacterAI. With a proper, large character description, characters do not lose continuity or act significantly differently when changing models, as long as the model is sufficiently powerful (say 6B+ parameters). As for parameters, KoboldAI has a large number of presets (https://github.com/henk717/KoboldAI/tree/united/presets) for the dozens of models it supports. Oobabooga's text-generation-webui took this list of parameter presets and reduced it to the most distinct entries to arrive at a shorter list (https://github.com/oobabooga/text-generation-webui/tree/main/presets). The TavernAI front-end also has its own parameter presets for Kobold and for NovelAI, but defaults to "use KoboldUI settings" since most people run it with KoboldUI in the background. These presets could be copied, and/or OpenCharacters could default to accepting text-generation-webui's generation settings, similar to how TavernAI defaults to using KoboldAI's generation settings.
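As a rough illustration of reusing such presets, here is a sketch of loading a preset file and turning it into default generation settings for a character. The exact file format and key names vary between the KoboldAI and text-generation-webui preset folders linked above; the simple "key=value per line" layout and the example filename assumed here should be checked against the actual repos.

```python
# Sketch: parse a generation preset (assumed to be simple "key=value" lines)
# into a dict that can be reused as a character's default generation settings.
from pathlib import Path

def load_preset(path: str) -> dict:
    """Read a preset file of key=value lines into a settings dict."""
    settings = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if value.lower() in ("true", "false"):
            settings[key] = value.lower() == "true"
        else:
            try:
                settings[key] = float(value)
            except ValueError:
                settings[key] = value  # leave anything non-numeric as a raw string
    return settings

# Usage (filename is hypothetical):
# settings = load_preset("presets/NovelAI-Storywriter.txt")
# request = {"prompt": character_prompt, **settings}
```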
Bit of a tangent, but 4-bit LLaMA has near-identical output quality to fp16 LLaMA and even better inference performance. State-of-the-art 4-bit quantization methods ensure no meaningful loss of output quality when quantizing down to 8-bit and even 4-bit from 16-bit. This is discussed in "The case for 4-bit precision" and "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers". Even more off-topic: this means you can get better-than-GPT-3-175B-level transformer performance (LLaMA-13B) out of a 10GB+ consumer GPU, and run the current state-of-the-art LLaMA-65B, which rivals PaLM 540B, on 2x3090 or 2x4090 consumer GPUs.
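For reference, 8-bit loading is already a one-flag option in Hugging Face transformers (via bitsandbytes); a minimal sketch is below. The checkpoint name is a placeholder, the VRAM figures above are the commenter's, and 4-bit GPTQ loading needs separate tooling (e.g. the GPTQ reference code), which is not shown here.

```python
# Sketch: loading a quantized checkpoint with transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-13b-hf"  # placeholder LLaMA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # requires bitsandbytes; roughly halves memory vs fp16
    device_map="auto",   # requires accelerate; spreads layers across available GPUs
)
```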
Does this mean there could be a Google Colab page sometime for people to run models there for free?
Text-generation-webui has a nice Colab notebook which opens in chat mode and runs models up to 13B on the free Colab tier: https://colab.research.google.com/github/oobabooga/AI-Notebooks/blob/main/Colab-TextGen-GPU.ipynb It also works with LLaMA with these changes: oobabooga/text-generation-webui#217
Potentially another option? https://github.com/alexanderatallah/window.ai
The next update will have support for ~all Hugging Face LLMs via https://github.com/hyperonym/basaran
Preliminary doc: https://github.com/josephrocca/OpenCharacters/blob/main/docs/custom-models.md
Just waiting on this:
And to be clear, the system is not tied to Basaran - it works with any OpenAI API-compatible endpoint.
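For anyone wanting to test such a setup, here is a sketch of pointing the stock OpenAI Python client at a local OpenAI-compatible server (Basaran, or anything else speaking the same API). The base URL, port, and model name are assumptions for illustration; use whatever your server actually exposes.

```python
# Sketch: talking to a local OpenAI API-compatible endpoint with the openai client.
import openai

openai.api_key = "not-needed-for-a-local-server"   # most local servers ignore the key
openai.api_base = "http://localhost:80/v1"         # wherever the compatible server listens

response = openai.Completion.create(
    model="bigscience/bloomz-560m",  # placeholder: whatever model the server is hosting
    prompt="User: Hello!\nCharacter:",
    max_tokens=120,
    temperature=0.7,
    top_p=0.9,
)
print(response["choices"][0]["text"])
```

Because only the base URL changes, the same client code works against OpenAI itself, Basaran, or any other compatible endpoint.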
https://github.com/oobabooga/text-generation-webui/ can be used as a back-end to run dozens of different local models, including the latest LLaMA model. (LLaMA-13B beats GPT-3.5 in benchmarks while fitting in only 10GB of VRAM. LLaMA-30B goes well beyond GPT-3's capabilities while requiring only 20GB of VRAM to run locally.)
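As a rough sketch of what using text-generation-webui as a back-end could look like over HTTP: the route, port, and payload keys below are assumptions based on the project's API extension and have changed between versions, so they should be checked against the current text-generation-webui repo before relying on them.

```python
# Sketch: calling a local text-generation-webui instance as a character back-end.
# Endpoint path, port, and payload/response keys are assumptions, not confirmed API.
import requests

payload = {
    "prompt": "### Character: Ada\n### User: Hello!\n### Ada:",  # hypothetical prompt format
    "max_new_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9,
}
resp = requests.post("http://localhost:5000/api/v1/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```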