
Enhancement: Add text-generation-webui as a back-end to support local models #6

Open
MarkSchmidty opened this issue Mar 10, 2023 · 6 comments

Comments


MarkSchmidty commented Mar 10, 2023

https://github.com/oobabooga/text-generation-webui/ can be used as a back-end to run dozens of different local models, including the latest LLaMA model. (LLaMA-13B beats GPT-3.5 in benchmarks while fitting in only 10GB of VRAM. LLaMA-30B goes well beyond GPT-3's capabilities while requiring only 20GB of VRAM to run locally.)

@josephrocca (Owner)

Yes, definitely want to support more than just OpenAI - and especially open source (ish...) models like llama!

The main question here, I think, is how to make sure characters are "consistent enough" whether people are using LLaMA via a paid API or locally. E.g. if someone is running a 4-bit 13B LLaMA model and they tune their character nicely, but then someone else gets that character via a sharing link and uses an 8-bit version of the model, or uses inference code with a slightly different sampling method, or something like that, then they might get significantly different results.

So I guess ideally we'll soon have a very "standard" interface to the LLaMA models (likely Hugging Face transformers), which would give us a standard/"objective" set of parameters to use whether someone is running locally, via the Hugging Face inference API, with a Docker image on runpod.io, etc. Something like the sketch below.
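To make that concrete, here is a minimal sketch (not OpenCharacters code) of what a transformers-based "standard" interface could look like, with an explicit, shareable bundle of sampling parameters. The model id and parameter values are placeholders for illustration, and `device_map="auto"` assumes `accelerate` is installed:

```python
# Minimal sketch: Hugging Face transformers as a common interface, with the
# generation settings kept as plain data so they can travel with a character.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-13b-hf"  # placeholder hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# If these parameters are embedded in the character's sharing link, anyone
# reproducing the character (locally or via an API) gets the same decoding setup.
generation_settings = {
    "max_new_tokens": 200,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.15,
}

prompt = "You are Ada, a dry-witted librarian. User: Hello!\nAda:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, **generation_settings)
# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```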

The pace at which the community is moving with llama suggests that this won't be too far away - if the licensing stuff doesn't get in the way. People should feel free to add relevant community developments to this thread so we can track feasibility here.

Relevant:


MarkSchmidty commented Mar 10, 2023

KoboldAI Main / KoboldAI "United" Beta already handle dozens of models for chat AI. Kobold is used as a back-end for TavernAI, which is the most popular chat front-end next to the billion-dollar company CharacterAI.

With a properly detailed character description, characters do not lose continuity or behave significantly differently across models, so long as the model is sufficiently powerful (say, 6B+ parameters).

As for parameters, KoboldAI has a large number of presets (https://github.com/henk717/KoboldAI/tree/united/presets) for the dozens of models they support. Oobabooga's textgen webUI took this list of parameter presets and reduced it to the most distinct entries, arriving at a shorter list (https://github.com/oobabooga/text-generation-webui/tree/main/presets). The TavernAI front-end also has its own parameter presets for Kobold and for NovelAI, but defaults to "use KoboldAI settings" since most people run it with KoboldAI in the background.

These presets could be copied, and/or OpenCharacters could default to accepting textgen webui's generation settings, similar to how TavernAI defaults to using KoboldAI's generation settings. See the sketch below for what such a preset amounts to.
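As a hypothetical illustration (not copied from any actual KoboldAI or text-generation-webui preset file), a preset is just a named bundle of sampling parameters that a front-end ships and applies when a character doesn't carry its own settings. The preset names and values below are invented:

```python
# Invented example presets: named bundles of sampling parameters a front-end
# could ship as defaults, in the spirit of KoboldAI / textgen webUI presets.
PRESETS = {
    "balanced-default": {
        "temperature": 0.7,
        "top_p": 0.9,
        "top_k": 40,
        "typical_p": 1.0,
        "repetition_penalty": 1.15,
        "max_new_tokens": 200,
    },
    "creative": {
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 0,
        "typical_p": 1.0,
        "repetition_penalty": 1.1,
        "max_new_tokens": 250,
    },
}

def settings_for(character: dict, default_preset: str = "balanced-default") -> dict:
    """Use the character's own generation settings if its sharing link carries
    any, otherwise fall back to the front-end's default preset."""
    return {**PRESETS[default_preset], **character.get("generation_settings", {})}
```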

E.g. if someone is running a 4-bit 13B LLaMA model and they tune their character nicely, but then someone else gets that character via a sharing link and uses an 8-bit version of the model

Bit of a tangent, but 4-bit LLaMA has near-identical output quality to fp16 LLaMA and even better inference performance. State-of-the-art 4-bit quantization methods ensure no meaningful loss of output quality when quantizing down to 8-bit and even 4-bit from 16-bit. This is discussed in "The case for 4-bit precision" and "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers".

Even more off-topic: this means you can get better-than-GPT-3-175B-level transformer performance (LLaMA-13B) out of a 10GB+ consumer GPU, and run the current state-of-the-art LLaMA-65B, which rivals PaLM 540B, on 2x3090 or 2x4090 consumer GPUs.


cce1 commented Mar 12, 2023

Yes, definitely want to support more than just OpenAI - and especially open source (ish...) models like llama!

Does this mean there could be a Google Colab page sometime for people to run models there for free?

@MarkSchmidty (Author)

Does this mean there could be a Google Colab page sometime for people to run models there for free?

Text-generation-webui has a nice Colab notebook which opens in chat mode and runs models up to 13B on the free tier: https://colab.research.google.com/github/oobabooga/AI-Notebooks/blob/main/Colab-TextGen-GPU.ipynb

It also works with LLaMA with these changes: oobabooga/text-generation-webui#217

@josephrocca (Owner)

Potentially another option? https://github.com/alexanderatallah/window.ai


josephrocca commented Apr 16, 2023

The next update will have support for ~all Hugging Face LLMs via https://github.com/hyperonym/basaran

Preliminary doc: https://github.com/josephrocca/OpenCharacters/blob/main/docs/custom-models.md

Just waiting on this:

And to be clear, the system is not tied to Basaran; it works with any OpenAI-API-compatible endpoint. A rough sketch of what that looks like is below.
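For illustration, here is a minimal sketch of pointing a client at a local OpenAI-compatible endpoint such as one served by Basaran. The host, port, and model name are assumptions; the custom-models doc above describes the actual configuration OpenCharacters expects:

```python
# Minimal sketch: calling a local OpenAI-API-compatible completions endpoint.
import requests

API_BASE = "http://localhost:8000/v1"  # assumed address of the local server

response = requests.post(
    f"{API_BASE}/completions",
    json={
        "model": "decapoda-research/llama-13b-hf",  # placeholder model id
        "prompt": "You are Ada, a dry-witted librarian. User: Hello!\nAda:",
        "max_tokens": 200,
        "temperature": 0.7,
        "top_p": 0.9,
    },
    timeout=120,
)
# Standard OpenAI completions response shape: choices[0].text holds the output.
print(response.json()["choices"][0]["text"])
```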
