Commit 0db7eef

Merge remote-tracking branch 'origin/master' into transformers
2 parents: 44d48a3 + 39b3541

13 files changed: +286 −206 lines

13 files changed

+286
-206
lines changed

.github/FUNDING.yml

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+ko_fi: turboderp

README.md

Lines changed: 44 additions & 70 deletions
@@ -7,8 +7,11 @@ Disclaimer: The project is coming along, but it's still a work in progress!

 ## Hardware requirements

-I am developing on an RTX 4090 and an RTX 3090-Ti. Both cards support the CUDA kernels, but there might be
-incompatibilities with older cards.
+I am developing on an RTX 4090 and an RTX 3090-Ti. 30-series and later NVIDIA GPUs should be well supported, but
+anything Pascal or older with poor FP16 support isn't going to perform well.
+[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) or [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa)
+are better options at the moment for older GPUs. ROCm is also theoretically supported (via HIP) though I currently
+have no AMD devices to test or optimize on.

 ## Dependencies

@@ -43,13 +46,13 @@ Compute Platform version).

 ## How to

-Install dependencies, clone repo and run benchmark:
-
-pip install -r requirements.txt
+Clone repo, install dependencies, and run benchmark:

 git clone https://github.com/turboderp/exllama
 cd exllama

+pip install -r requirements.txt
+
 python test_benchmark_inference.py -d <path_to_model_files> -p -ppl

 The CUDA extension is loaded at runtime so there's no need to install it separately. It will be compiled on the first
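For reference, the hunk above reorders rather than changes the commands; assembled, the updated instructions read (paths are placeholders, as in the original):

    git clone https://github.com/turboderp/exllama
    cd exllama

    pip install -r requirements.txt

    python test_benchmark_inference.py -d <path_to_model_files> -p -ppl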
@@ -60,11 +63,15 @@ Chatbot example:

 python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt

+## Python module
+
+jllllll currently maintains an installable Python module [here](https://github.com/jllllll/exllama) which may be more
+suitable for integrating ExLlama with other projects.
+
 ## Web UI

-I made a simple web UI for it. Like the rest of the project, it's a work in progress. Don't look at the JavaScript,
-it was mostly written by ChatGPT and it will haunt your dreams. But it sort of works, and it's kinda fun, especially
-multibot mode:
+I also made a simple web UI for it. Don't look at the JavaScript, it was mostly written by ChatGPT and it will haunt
+your dreams. But it sort of works, and it's kinda fun, especially multibot mode:

 ![_screenshot.jpg](doc/_screenshot.jpg)


@@ -74,13 +81,14 @@ To run it:

 python webui/app.py -d <path_to_model_files>

-Note that sessions are stored in `~/exllama_sessions/`. You can change the location of the sessions storage with `-sd`
-if you want.
+Note that sessions are stored in `~/exllama_sessions/` by default. You can change that location with `-sd` if you want.

 ## Docker
+
 For security benefits and easier deployment, it is also possible to run the web UI in an isolated docker container. Note: the docker image currently only supports NVIDIA GPUs.

 ### Requirements
+
 - [Docker](https://docs.docker.com/engine/install/)
 - [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

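As a hedged illustration of the `-sd` flag referenced in this hunk, overriding the sessions directory would look something like the following (the directory placeholder is only an example, not something the README prescribes):

    python webui/app.py -d <path_to_model_files> -sd <path_to_sessions_dir>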

@@ -128,19 +136,18 @@ docker run --gpus all -p 5000:5000 -v <path_to_model_dir>:/data/model/ -v <path_
 ## Results so far

 ### New implementation
-| Model | Size | grpsz | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
-|------------|-------|-------|-----------------|----------------------|-----------|------------|---------|---------|------|
-| Llama | 7B | 128 | no | 2,048 t | 5,194 MB | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
-| Llama | 13B | 128 | no | 2,048 t | 9,127 MB | 7,507 t/s | 102 t/s | 86 t/s | 5.60 |
-| Llama | 33B | 128 | no | 2,048 t | 20,795 MB | 2,959 t/s | 47 t/s | 40 t/s | 4.60 |
-| Llama | 33B | 128 | yes | 2,048 t | 20,795 MB | 2,784 t/s | 45 t/s | 37 t/s | 4.55 |
-| Llama | 33B | 32 | yes | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s | 41 t/s | 37 t/s | 4.52 |
-| Koala | 13B | 128 | yes | 2,048 t | 9,127 MB | 5,529 t/s | 93 t/s | 79 t/s | 6.73 |
-| WizardLM | 33B | - | no <sup>2</sup> | 2,048 t | 20,199 MB | 2,313 t/s | 47 t/s | 40 t/s | 5.75 |
-| OpenLlama | 3B | 128 | yes | 2,048 t | 3,128 MB | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |
+| Model | Size | grpsz | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
+|------------|-------|-------|-----|----------------------|-----------|------------|---------|---------|------|
+| Llama | 7B | 128 | no | 2,048 t | 5,194 MB | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
+| Llama | 13B | 128 | no | 2,048 t | 9,127 MB | 7,507 t/s | 102 t/s | 86 t/s | 5.60 |
+| Llama | 33B | 128 | no | 2,048 t | 20,795 MB | 2,959 t/s | 47 t/s | 40 t/s | 4.60 |
+| Llama | 33B | 128 | yes | 2,048 t | 20,795 MB | 2,784 t/s | 45 t/s | 37 t/s | 4.55 |
+| Llama | 33B | 32 | yes | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s | 41 t/s | 37 t/s | 4.52 |
+| Koala | 13B | 128 | yes | 2,048 t | 9,127 MB | 5,529 t/s | 93 t/s | 79 t/s | 6.73 |
+| WizardLM | 33B | - | yes | 2,048 t | 20,199 MB | 2,313 t/s | 47 t/s | 40 t/s | 5.75 |
+| OpenLlama | 3B | 128 | yes | 2,048 t | 3,128 MB | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |

-<sup>1</sup> Can not achieve full sequence length without OoM (yet)
-<sup>2</sup> Not quite sure if this is act-order or not. Weights have no group index, at least
+<sup>1</sup> Can not achieve full sequence length without OoM

 All tests done on stock RTX 4090 / 12900K, running with a desktop environment, with a few other apps also using VRAM.


@@ -154,66 +161,33 @@ probably aiming for 20 GB on a 24 GB GPU to ensure there is room for a desktop e
 internals.

 Perplexity is measured only to verify that the models are working. The dataset used is a particular, small sample from
-WikiText, so scores are not necessarily comparable to other Llama benchmarks.
+WikiText, so scores are not comparable to other Llama benchmarks and only useful for comparing the different Llama
+models to one another.

 ### Dual GPU results

-Since many seem to be interested in running 65B models, I can confirm that this works with two 24 GB GPUs. The
-following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:
-
-| Model | Size | groupsize | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
-|----------|------|-----------|-----|----------------------|-----------|-----------|--------|--------|------|
-| Llama | 65B | 128 | yes | 2,048 t | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s | 4.20 |
-| Llama | 65B | 32 | yes | 2,048 t | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s | 4.11 |
+The following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:

+| Model | Size | groupsize | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
+|---------|------|-----------|-----|----------------|-----------|-----------|--------|---------|-------|
+| Llama | 65B | 128 | yes | 2,048 t | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s | 4.20 |
+| Llama | 65B | 32 | yes | 2,048 t | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s | 4.11 |
+| Llama-2 | 70B | 128 | yes | 2,048 t | 40,680 MB | 914 t/s | 17 t/s | 14 t/s | 4.15 |
+| Llama-2 | 70B | 32 | yes | 2,048 t | 36,815 MB | 874 t/s | 15 t/s | 12 t/s | 4.10 |

-### Testing long sequences
-
-The following tests were all done on **33B/65B, 4bit 128g** with various settings, just to test the max sequence length
-and get a sense of what can be achieved with different or multiple GPUs right now. Llama goes incoherent generating
-past 2048 tokens anyway, but with some fine-tuning, who knows? Note that these tests were run a while ago and the
-speeds are no longer current.
-
-| | Size | Seq. len. | VRAM | Long seq. | Ind. |
-|------------------------|------|-----------|----------------------|-----------|--------|
-| 4090/24GB | 33B | 2,516 t | 22,145 MB | 1140 t/s | 28 t/s |
-| 4090/24GB + 3070Ti/8GB | 33B | 3,932 t | 22,055 MB + 7,377 MB | 840 t/s | 22 t/s |
-| A6000/48GB (headless) | 33B | 9,032 t | 46,863 MB | 645 t/s | 12 t/s |
-| A100/80GB (headless) | 65B | 9,520 t | 79,009 MB | 650 t/s | 9 t/s |
+Note that perplexity scores may not be strictly apples-to-apples between Llama and Llama 2 due to their different
+pretraining datasets.

 ## Todo

 Moved the todo list [here](doc/TODO.md).

 ## Compatibility

-I downloaded a whole bunch of GPTQ models to test compatibility. [Here](doc/model_compatibility.md) is the list of models
-confirmed to be working right now.
+[Here](doc/model_compatibility.md) is a list of models confirmed to be working right now.

 ## Recent updates

-**2023-06-02**: Web UI is now in a fairly working state. Expect it to be a little scuffed in places. There will be a
-rewrite at some point to make the client-side code less seizure-inducing. It has multibot mode, chat rewind and editing
-features, sessions, and more. I'm going to build it out with support for instruct prompting and such, in time.
-
-**2023-06-04**: Refactored a whole bunch to move more of the work into the extension, setting up for more tuning
-options to come soon and eventually auto tuning. Also optimized a little, for about a 5% speedup.
-
-**2023-06-06**: Some minor optimizations. Also it should now compile the extension more easily and run more seamlessly
-on Windows.
-
-**2023-06-09**: Fused most of the self-attention step. More to come. Slight speedup already, but more importantly went
-from 69% actual CPU utilization to 37%. This should do a lot to address the bottleneck on CPUs with lower
-single-threaded performance.
-
-**2023-06-10**: Docker support now! And some minor optimizations. Cleaned up the project a bit.
-
-**2023-06-11**: Added some concurrency a couple of places. It's only beneficial on the 4090, on small models where the
-cores are somewhat underutilized and the L2 cache can keep up. For the 3090 it's detrimental to performance, so it's
-disabled by default. YMMV. Use `-cs` to try it out.
-
-**2023-06-17**: Fixed a nasty bug in the fused attention that was causing slightly incorrect cache states on 13B and
-33B models. You definitely want to update.
-
-**2023-06-18**: LoRA support now. Still needs a lot of testing and some optimization, and currently you can't stack
-multiple LoRAs during the same inference. There's also no support in the web UI yet.
+**2023-07-19**: Added support for grouped-query attention and Llama-2 70b. There's still a bit of optimization to do,
+since it slows down considerably on very long sequences despite GQA having the potential to be faster. Also could use
+some more thorough testing.
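As a hedged aside on the `-gs` flag used in the dual-GPU table above: it splits the model across GPUs, with the two values apparently corresponding to the per-GPU allocation used for the 4090 + 3090-Ti setup. A benchmark run against one of the 65B/70B models might then look like this sketch (the model path is a placeholder):

    python test_benchmark_inference.py -d <path_to_llama2_70b_files> -p -ppl -gs 17.2,24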

doc/TODO.md

Lines changed: 15 additions & 53 deletions
@@ -1,84 +1,46 @@
 ## Model compatibility

-- [x] Support for act-order models ~~(a bit slow for now)~~
-- [x] ~~Support for v1 models without groupsize~~ Nah.
-- [x] Test more models
-- [x] Consider support for loading GGML models (not feasible)
-- [x] Figure out if there are quantized models with irregular groupsize (there are some at least with no groupsize)
+- [ ] Verify compatibility with Llama-2 34B once released

 ## GPU compatibility (etc.)

-- [x] Support for ROCm/AMD GPUs
-- [ ] Optimize more for ROCm
-- [ ] Test that CUDA code works on GTX 10-series and RTX 20-series at some point
-- [x] Test performance on P40 (would be a good GPU to support)
-- [ ] Improve performance on P40
-- [x] Tunable kernel parameters
-- [ ] More tunable kernel parameters
-- [x] Test on Windows
-- [x] Easier extension loading on Windows
-- [x] Setup instructions for Windows
+- [ ] Optimizations for ROCm
+- [ ] Optimizations for RTX 20-series maybe
+- [ ] Look into improving P40 performance

 ## Testing

-- [x] Figure out an apples-to-apples way of comparing perplexity with other implementations
-- [ ] Compile charts of inference speed vs context length for variety of models, compare to other implementations
-- [ ] Test a bunch of LoRAs to make sure all combinations of rank and target layers work
+- [ ] More testing on Llama 2 models

-## VRAM optimization
+## Optimization

-- [x] ~~Fix layer streaming so it isn't unusably slow~~ (removed)
-- [x] ~~Allow layer streaming to integrate with other features like device splitting~~ Nope
-- [x] ~~Provide alternative backend to allow layers on CPU~~ Nah
-
-## Speed optimization
-
-- [x] Support for de-quantizing select matrices at load time
-- [x] ~~Better vector-matrix multiplication for de-quantized matrices~~ (dequant was a dead end)
-- [x] Fused QKV projection
-- [x] Fused MLP
-- [x] Fused RoPE
-- [x] ~~Build attention mask in CUDA rather than PyTorch~~
-- [x] ~~Disable attention mask when it isn't needed~~ (not possible with SDP)
-- [x] Figure out why inference appears to be CPU-bound (kernel launch overhead)
-- [x] Reduce no. kernel launches to minimum (tail launch, fusion etc.)
-- [x] Measure PyTorch module overhead (negligible in eval mode)
-- [x] Examine if scaled_dot_product_attention is actually the best attention method for single tokens (it's not)
-- [ ] Implement attention in CUDA
-- [x] Rewrite at least the quantized matmul kernel. Should be a bunch of special cases to consider
-- [x] Experiment with concurrent streams where possible (fused MLP and QKV proj.)
-- [x] Faster low-rank matmul to speed up LoRAs
+- [ ] Flash Attention 2.0 (?)
+- [ ] Find a way to eliminate `ExLlamaAttention.repeat_kv` (custom attention kernel?)
+- [ ] C++ implementations of sampler functions

 ## Generation

-- [x] Memory-efficient beam search implementation
-- [ ] Optimized beam search
-- [ ] Multi-token censoring/de-censoring
-- [ ] Multi-token repetition penalties
-- [x] (Multi) LoRA support
+- [ ] Optimized/batched beam search
 - [ ] Allow stackable LoRAs
-- [x] Guided generation (chat with multiple bots at once, etc.)
-- [ ] Multiple chat modes with prompt templates (instruct, etc.)
-- [ ] Batched generation
+- [ ] Guidance or equivalent

 ## Interface

-- [x] Simple web interface?
-- [ ] API server
+- [ ] Comprehensive API server (more than `example_flask.py`)

 ## Web UI

 - [ ] Controls to enable beam search
 - [ ] Rewrite/refactor all the JavaScript and CSS
-- [ ] Support for prompt formats/instruct mode
 - [ ] Make it a little prettier
-- [ ] Test various edge cases
 - [ ] Better error handling
 - [ ] LoRA controls
+- [ ] Multiple chat modes with prompt templates (instruct, etc.)

 ## ??

-- [ ] FP8/FP16 overlays
+- [ ] Support for other quantization methods
+- [ ] Support for other LLM architectures
 - [ ] Allow for backpropagation
 - [ ] LoRA training features
 - [ ] Soft prompt training

doc/model_compatibility.md

Lines changed: 7 additions & 8 deletions
@@ -1,6 +1,6 @@
 ## Working models

-As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be working:
+As of **2023-07-19**, the following GPTQ models on HuggingFace all appear to be working:

 - iambestfeed/open_llama_3b_4bit_128g
 - Neko-Institute-of-Science/LLaMA-7B-4bit-128g

@@ -9,6 +9,7 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
 - Neko-Institute-of-Science/LLaMA-30B-4bit-128g
 - Neko-Institute-of-Science/LLaMA-65B-4bit-32g
 - Neko-Institute-of-Science/LLaMA-65B-4bit-128g
+- Panchovix/LLaMA-2-70B-GPTQ-transformers4.32.0.dev0
 - reeducator/bluemoonrp-13b
 - reeducator/bluemoonrp-30b
 - TehVenom/Metharme-13b-4bit-GPTQ

@@ -17,8 +18,11 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
 - TheBloke/GPT4All-13B-snoozy-GPTQ
 - TheBloke/guanaco-33B-GPTQ
 - TheBloke/guanaco-65B-GPTQ
-- TheBloke/h2ogpt-oasst1-512-30B-GPTQ <sup>1</sup>
+- TheBloke/h2ogpt-oasst1-512-30B-GPTQ
 - TheBloke/koala-13B-GPTQ-4bit-128g
+- TheBloke/Llama-2-13B-chat-GPTQ (128g)
+- TheBloke/Llama-2-13B-GPTQ (32g, 64g, 128g)
+- TheBloke/Llama-2-70B-GPTQ (32g, 128g)
 - TheBloke/Manticore-13B-GPTQ
 - TheBloke/medalpaca-13B-GPTQ-4bit
 - TheBloke/medalpaca-13B-GPTQ-4bit (compat version)

@@ -39,11 +43,6 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
 - Yhyu13/chimera-inst-chat-13b-gptq-4bit
 - Yhyu13/oasst-rlhf-2-llama-30b-7k-steps-gptq-4bit

-<sup>1</sup> This particular model, uniquely, shows somewhat worse perplexity when matmul is done by the custom CUDA
-kernel rather than cuBLAS. Maybe it's extra sensitive to rounding errors for some reason? Either way, it does work.
-
 ## Non-working models

-As of **2023-07-02**, I have found no models that don't work.
-
-v1 models are still unsupported, as are pickle files.
+None as of **2023-07-19**.
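As a purely illustrative follow-up, running the chatbot example from the README against one of the newly listed Llama-2 chat models might look like this, assuming the weights have been downloaded to a local directory (the path is hypothetical):

    python example_chatbot.py -d ./models/TheBloke_Llama-2-13B-chat-GPTQ -un "Jeff" -p prompt_chatbort.txt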

exllama/lora.py

Lines changed: 1 addition & 1 deletion
@@ -116,7 +116,7 @@ def __init__(self, model, lora_config_path, lora_path):

 # Move to target device

-device = self.config.device_map.map(target_key, loading = True)
+device = self.config.device_map.map(target_key)
 tensor = tensor.to(device, non_blocking = True)

 # Store adapter tensor
