Commit 0db7eef

Merge remote-tracking branch 'origin/master' into transformers
2 parents: 44d48a3 + 39b3541

13 files changed: +286 −206 lines

13 files changed

+286
-206
lines changed

.github/FUNDING.yml

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+ko_fi: turboderp

README.md

Lines changed: 44 additions & 70 deletions
@@ -7,8 +7,11 @@ Disclaimer: The project is coming along, but it's still a work in progress!

 ## Hardware requirements

-I am developing on an RTX 4090 and an RTX 3090-Ti. Both cards support the CUDA kernels, but there might be
-incompatibilities with older cards.
+I am developing on an RTX 4090 and an RTX 3090-Ti. 30-series and later NVIDIA GPUs should be well supported, but
+anything Pascal or older with poor FP16 support isn't going to perform well.
+[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) or [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa)
+are better options at the moment for older GPUs. ROCm is also theoretically supported (via HIP) though I currently
+have no AMD devices to test or optimize on.

 ## Dependencies

@@ -43,13 +46,13 @@ Compute Platform version).

 ## How to

-Install dependencies, clone repo and run benchmark:
-
-pip install -r requirements.txt
+Clone repo, install dependencies, and run benchmark:

 git clone https://github.com/turboderp/exllama
 cd exllama

+pip install -r requirements.txt
+
 python test_benchmark_inference.py -d <path_to_model_files> -p -ppl

 The CUDA extension is loaded at runtime so there's no need to install it separately. It will be compiled on the first
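For reference, the hunk above reorders rather than changes the commands; assembled, the updated instructions read (paths are placeholders, as in the original):

    git clone https://github.com/turboderp/exllama
    cd exllama

    pip install -r requirements.txt

    python test_benchmark_inference.py -d <path_to_model_files> -p -ppl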
@@ -60,11 +63,15 @@ Chatbot example:

 python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt

+## Python module
+
+jllllll currently maintains an installable Python module [here](https://github.com/jllllll/exllama) which may be more
+suitable for integrating ExLlama with other projects.
+
 ## Web UI

-I made a simple web UI for it. Like the rest of the project, it's a work in progress. Don't look at the JavaScript,
-it was mostly written by ChatGPT and it will haunt your dreams. But it sort of works, and it's kinda fun, especially
-multibot mode:
+I also made a simple web UI for it. Don't look at the JavaScript, it was mostly written by ChatGPT and it will haunt
+your dreams. But it sort of works, and it's kinda fun, especially multibot mode:

 ![_screenshot.jpg](doc/_screenshot.jpg)


@@ -74,13 +81,14 @@ To run it:

 python webui/app.py -d <path_to_model_files>

-Note that sessions are stored in `~/exllama_sessions/`. You can change the location of the sessions storage with `-sd`
-if you want.
+Note that sessions are stored in `~/exllama_sessions/` by default. You can change that location with `-sd` if you want.

 ## Docker
+
 For security benefits and easier deployment, it is also possible to run the web UI in an isolated docker container. Note: the docker image currently only supports NVIDIA GPUs.

 ### Requirements
+
 - [Docker](https://docs.docker.com/engine/install/)
 - [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

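As a hedged illustration of the `-sd` flag referenced in this hunk, overriding the sessions directory would look something like the following (the directory placeholder is only an example, not something the README prescribes):

    python webui/app.py -d <path_to_model_files> -sd <path_to_sessions_dir>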

@@ -128,19 +136,18 @@ docker run --gpus all -p 5000:5000 -v <path_to_model_dir>:/data/model/ -v <path_
 ## Results so far

 ### New implementation
-| Model | Size | grpsz | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
-|------------|-------|-------|-----------------|----------------------|-----------|------------|---------|---------|------|
-| Llama | 7B | 128 | no | 2,048 t | 5,194 MB | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
-| Llama | 13B | 128 | no | 2,048 t | 9,127 MB | 7,507 t/s | 102 t/s | 86 t/s | 5.60 |
-| Llama | 33B | 128 | no | 2,048 t | 20,795 MB | 2,959 t/s | 47 t/s | 40 t/s | 4.60 |
-| Llama | 33B | 128 | yes | 2,048 t | 20,795 MB | 2,784 t/s | 45 t/s | 37 t/s | 4.55 |
-| Llama | 33B | 32 | yes | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s | 41 t/s | 37 t/s | 4.52 |
-| Koala | 13B | 128 | yes | 2,048 t | 9,127 MB | 5,529 t/s | 93 t/s | 79 t/s | 6.73 |
-| WizardLM | 33B | - | no <sup>2</sup> | 2,048 t | 20,199 MB | 2,313 t/s | 47 t/s | 40 t/s | 5.75 |
-| OpenLlama | 3B | 128 | yes | 2,048 t | 3,128 MB | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |
+| Model | Size | grpsz | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
+|------------|-------|-------|-----|----------------------|-----------|------------|---------|---------|------|
+| Llama | 7B | 128 | no | 2,048 t | 5,194 MB | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
+| Llama | 13B | 128 | no | 2,048 t | 9,127 MB | 7,507 t/s | 102 t/s | 86 t/s | 5.60 |
+| Llama | 33B | 128 | no | 2,048 t | 20,795 MB | 2,959 t/s | 47 t/s | 40 t/s | 4.60 |
+| Llama | 33B | 128 | yes | 2,048 t | 20,795 MB | 2,784 t/s | 45 t/s | 37 t/s | 4.55 |
+| Llama | 33B | 32 | yes | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s | 41 t/s | 37 t/s | 4.52 |
+| Koala | 13B | 128 | yes | 2,048 t | 9,127 MB | 5,529 t/s | 93 t/s | 79 t/s | 6.73 |
+| WizardLM | 33B | - | yes | 2,048 t | 20,199 MB | 2,313 t/s | 47 t/s | 40 t/s | 5.75 |
+| OpenLlama | 3B | 128 | yes | 2,048 t | 3,128 MB | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |

-<sup>1</sup> Can not achieve full sequence length without OoM (yet)
-<sup>2</sup> Not quite sure if this is act-order or not. Weights have no group index, at least
+<sup>1</sup> Can not achieve full sequence length without OoM

 All tests done on stock RTX 4090 / 12900K, running with a desktop environment, with a few other apps also using VRAM.


@@ -154,66 +161,33 @@ probably aiming for 20 GB on a 24 GB GPU to ensure there is room for a desktop e
 internals.

 Perplexity is measured only to verify that the models are working. The dataset used is a particular, small sample from
-WikiText, so scores are not necessarily comparable to other Llama benchmarks.
+WikiText, so scores are not comparable to other Llama benchmarks and only useful for comparing the different Llama
+models to one another.

 ### Dual GPU results

-Since many seem to be interested in running 65B models, I can confirm that this works with two 24 GB GPUs. The
-following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:
-
-| Model | Size | groupsize | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
-|----------|------|-----------|-----|----------------------|-----------|-----------|--------|--------|------|
-| Llama | 65B | 128 | yes | 2,048 t | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s | 4.20 |
-| Llama | 65B | 32 | yes | 2,048 t | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s | 4.11 |
+The following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:

+| Model | Size | groupsize | act | Seq. len. | VRAM | Prompt | Best | Worst | Ppl |
+|---------|------|-----------|-----|----------------|-----------|-----------|--------|---------|-------|
+| Llama | 65B | 128 | yes | 2,048 t | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s | 4.20 |
+| Llama | 65B | 32 | yes | 2,048 t | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s | 4.11 |
+| Llama-2 | 70B | 128 | yes | 2,048 t | 40,680 MB | 914 t/s | 17 t/s | 14 t/s | 4.15 |
+| Llama-2 | 70B | 32 | yes | 2,048 t | 36,815 MB | 874 t/s | 15 t/s | 12 t/s | 4.10 |

-### Testing long sequences
-
-The following tests were all done on **33B/65B, 4bit 128g** with various settings, just to test the max sequence length
-and get a sense of what can be achieved with different or multiple GPUs right now. Llama goes incoherent generating
-past 2048 tokens anyway, but with some fine-tuning, who knows? Note that these tests were run a while ago and the
-speeds are no longer current.
-
-| | Size | Seq. len. | VRAM | Long seq. | Ind. |
-|------------------------|------|-----------|----------------------|-----------|--------|
-| 4090/24GB | 33B | 2,516 t | 22,145 MB | 1140 t/s | 28 t/s |
-| 4090/24GB + 3070Ti/8GB | 33B | 3,932 t | 22,055 MB + 7,377 MB | 840 t/s | 22 t/s |
-| A6000/48GB (headless) | 33B | 9,032 t | 46,863 MB | 645 t/s | 12 t/s |
-| A100/80GB (headless) | 65B | 9,520 t | 79,009 MB | 650 t/s | 9 t/s |
+Note that perplexity scores may not be strictly apples-to-apples between Llama and Llama 2 due to their different
+pretraining datasets.

 ## Todo

 Moved the todo list [here](doc/TODO.md).

 ## Compatibility

-I downloaded a whole bunch of GPTQ models to test compatibility. [Here](doc/model_compatibility.md) is the list of models
-confirmed to be working right now.
+[Here](doc/model_compatibility.md) is a list of models confirmed to be working right now.

 ## Recent updates

-**2023-06-02**: Web UI is now in a fairly working state. Expect it to be a little scuffed in places. There will be a
-rewrite at some point to make the client-side code less seizure-inducing. It has multibot mode, chat rewind and editing
-features, sessions, and more. I'm going to build it out with support for instruct prompting and such, in time.
-
-**2023-06-04**: Refactored a whole bunch to move more of the work into the extension, setting up for more tuning
-options to come soon and eventually auto tuning. Also optimized a little, for about a 5% speedup.
-
-**2023-06-06**: Some minor optimizations. Also it should now compile the extension more easily and run more seamlessly
-on Windows.
-
-**2023-06-09**: Fused most of the self-attention step. More to come. Slight speedup already, but more importantly went
-from 69% actual CPU utilization to 37%. This should do a lot to address the bottleneck on CPUs with lower
-single-threaded performance.
-
-**2023-06-10**: Docker support now! And some minor optimizations. Cleaned up the project a bit.
-
-**2023-06-11**: Added some concurrency a couple of places. It's only beneficial on the 4090, on small models where the
-cores are somewhat underutilized and the L2 cache can keep up. For the 3090 it's detrimental to performance, so it's
-disabled by default. YMMV. Use `-cs` to try it out.
-
-**2023-06-17**: Fixed a nasty bug in the fused attention that was causing slightly incorrect cache states on 13B and
-33B models. You definitely want to update.
-
-**2023-06-18**: LoRA support now. Still needs a lot of testing and some optimization, and currently you can't stack
-multiple LoRAs during the same inference. There's also no support in the web UI yet.
+**2023-07-19**: Added support for grouped-query attention and Llama-2 70b. There's still a bit of optimization to do,
+since it slows down considerably on very long sequences despite GQA having the potential to be faster. Also could use
+some more thorough testing.
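As a hedged aside on the `-gs` flag used in the dual-GPU table above: it splits the model across GPUs, with the two values apparently corresponding to the per-GPU allocation used for the 4090 + 3090-Ti setup. A benchmark run against one of the 65B/70B models might then look like this sketch (the model path is a placeholder):

    python test_benchmark_inference.py -d <path_to_llama2_70b_files> -p -ppl -gs 17.2,24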

doc/TODO.md

Lines changed: 15 additions & 53 deletions
@@ -1,84 +1,46 @@
 ## Model compatibility

-- [x] Support for act-order models ~~(a bit slow for now)~~
-- [x] ~~Support for v1 models without groupsize~~ Nah.
-- [x] Test more models
-- [x] Consider support for loading GGML models (not feasible)
-- [x] Figure out if there are quantized models with irregular groupsize (there are some at least with no groupsize)
+- [ ] Verify compatibility with Llama-2 34B once released

 ## GPU compatibility (etc.)

-- [x] Support for ROCm/AMD GPUs
-- [ ] Optimize more for ROCm
-- [ ] Test that CUDA code works on GTX 10-series and RTX 20-series at some point
-- [x] Test performance on P40 (would be a good GPU to support)
-- [ ] Improve performance on P40
-- [x] Tunable kernel parameters
-- [ ] More tunable kernel parameters
-- [x] Test on Windows
-- [x] Easier extension loading on Windows
-- [x] Setup instructions for Windows
+- [ ] Optimizations for ROCm
+- [ ] Optimizations for RTX 20-series maybe
+- [ ] Look into improving P40 performance

 ## Testing

-- [x] Figure out an apples-to-apples way of comparing perplexity with other implementations
-- [ ] Compile charts of inference speed vs context length for variety of models, compare to other implementations
-- [ ] Test a bunch of LoRAs to make sure all combinations of rank and target layers work
+- [ ] More testing on Llama 2 models

-## VRAM optimization
+## Optimization

-- [x] ~~Fix layer streaming so it isn't unusably slow~~ (removed)
-- [x] ~~Allow layer streaming to integrate with other features like device splitting~~ Nope
-- [x] ~~Provide alternative backend to allow layers on CPU~~ Nah
-
-## Speed optimization
-
-- [x] Support for de-quantizing select matrices at load time
-- [x] ~~Better vector-matrix multiplication for de-quantized matrices~~ (dequant was a dead end)
-- [x] Fused QKV projection
-- [x] Fused MLP
-- [x] Fused RoPE
-- [x] ~~Build attention mask in CUDA rather than PyTorch~~
-- [x] ~~Disable attention mask when it isn't needed~~ (not possible with SDP)
-- [x] Figure out why inference appears to be CPU-bound (kernel launch overhead)
-- [x] Reduce no. kernel launches to minimum (tail launch, fusion etc.)
-- [x] Measure PyTorch module overhead (negligible in eval mode)
-- [x] Examine if scaled_dot_product_attention is actually the best attention method for single tokens (it's not)
-- [ ] Implement attention in CUDA
-- [x] Rewrite at least the quantized matmul kernel. Should be a bunch of special cases to consider
-- [x] Experiment with concurrent streams where possible (fused MLP and QKV proj.)
-- [x] Faster low-rank matmul to speed up LoRAs
+- [ ] Flash Attention 2.0 (?)
+- [ ] Find a way to eliminate `ExLlamaAttention.repeat_kv` (custom attention kernel?)
+- [ ] C++ implementations of sampler functions

 ## Generation

-- [x] Memory-efficient beam search implementation
-- [ ] Optimized beam search
-- [ ] Multi-token censoring/de-censoring
-- [ ] Multi-token repetition penalties
-- [x] (Multi) LoRA support
+- [ ] Optimized/batched beam search
 - [ ] Allow stackable LoRAs
-- [x] Guided generation (chat with multiple bots at once, etc.)
-- [ ] Multiple chat modes with prompt templates (instruct, etc.)
-- [ ] Batched generation
+- [ ] Guidance or equivalent

 ## Interface

-- [x] Simple web interface?
-- [ ] API server
+- [ ] Comprehensive API server (more than `example_flask.py`)

 ## Web UI

 - [ ] Controls to enable beam search
 - [ ] Rewrite/refactor all the JavaScript and CSS
-- [ ] Support for prompt formats/instruct mode
 - [ ] Make it a little prettier
-- [ ] Test various edge cases
 - [ ] Better error handling
 - [ ] LoRA controls
+- [ ] Multiple chat modes with prompt templates (instruct, etc.)

 ## ??

-- [ ] FP8/FP16 overlays
+- [ ] Support for other quantization methods
+- [ ] Support for other LLM architectures
 - [ ] Allow for backpropagation
 - [ ] LoRA training features
 - [ ] Soft prompt training

doc/model_compatibility.md

Lines changed: 7 additions & 8 deletions
@@ -1,6 +1,6 @@
 ## Working models

-As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be working:
+As of **2023-07-19**, the following GPTQ models on HuggingFace all appear to be working:

 - iambestfeed/open_llama_3b_4bit_128g
 - Neko-Institute-of-Science/LLaMA-7B-4bit-128g

@@ -9,6 +9,7 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
 - Neko-Institute-of-Science/LLaMA-30B-4bit-128g
 - Neko-Institute-of-Science/LLaMA-65B-4bit-32g
 - Neko-Institute-of-Science/LLaMA-65B-4bit-128g
+- Panchovix/LLaMA-2-70B-GPTQ-transformers4.32.0.dev0
 - reeducator/bluemoonrp-13b
 - reeducator/bluemoonrp-30b
 - TehVenom/Metharme-13b-4bit-GPTQ

@@ -17,8 +18,11 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
 - TheBloke/GPT4All-13B-snoozy-GPTQ
 - TheBloke/guanaco-33B-GPTQ
 - TheBloke/guanaco-65B-GPTQ
-- TheBloke/h2ogpt-oasst1-512-30B-GPTQ <sup>1</sup>
+- TheBloke/h2ogpt-oasst1-512-30B-GPTQ
 - TheBloke/koala-13B-GPTQ-4bit-128g
+- TheBloke/Llama-2-13B-chat-GPTQ (128g)
+- TheBloke/Llama-2-13B-GPTQ (32g, 64g, 128g)
+- TheBloke/Llama-2-70B-GPTQ (32g, 128g)
 - TheBloke/Manticore-13B-GPTQ
 - TheBloke/medalpaca-13B-GPTQ-4bit
 - TheBloke/medalpaca-13B-GPTQ-4bit (compat version)

@@ -39,11 +43,6 @@ As of **2023-07-02**, the following GPTQ models on HuggingFace all appear to be
 - Yhyu13/chimera-inst-chat-13b-gptq-4bit
 - Yhyu13/oasst-rlhf-2-llama-30b-7k-steps-gptq-4bit

-<sup>1</sup> This particular model, uniquely, shows somewhat worse perplexity when matmul is done by the custom CUDA
-kernel rather than cuBLAS. Maybe it's extra sensitive to rounding errors for some reason? Either way, it does work.
-
 ## Non-working models

-As of **2023-07-02**, I have found no models that don't work.
-
-v1 models are still unsupported, as are pickle files.
+None as of **2023-07-19**.
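As a purely illustrative follow-up, running the chatbot example from the README against one of the newly listed Llama-2 chat models might look like this, assuming the weights have been downloaded to a local directory (the path is hypothetical):

    python example_chatbot.py -d ./models/TheBloke_Llama-2-13B-chat-GPTQ -un "Jeff" -p prompt_chatbort.txt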

exllama/lora.py

Lines changed: 1 addition & 1 deletion
@@ -116,7 +116,7 @@ def __init__(self, model, lora_config_path, lora_path):

 # Move to target device

-device = self.config.device_map.map(target_key, loading = True)
+device = self.config.device_map.map(target_key)
 tensor = tensor.to(device, non_blocking = True)

 # Store adapter tensor
