@@ -7,8 +7,11 @@ Disclaimer: The project is coming along, but it's still a work in progress!

## Hardware requirements

- I am developing on an RTX 4090 and an RTX 3090-Ti. Both cards support the CUDA kernels, but there might be
- incompatibilities with older cards.
+ I am developing on an RTX 4090 and an RTX 3090-Ti. 30-series and later NVIDIA GPUs should be well supported, but
+ anything Pascal or older with poor FP16 support isn't going to perform well.
+ [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) or [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa)
+ are better options at the moment for older GPUs. ROCm is also theoretically supported (via HIP) though I currently
+ have no AMD devices to test or optimize on.

## Dependencies

@@ -43,13 +46,13 @@ Compute Platform version).

## How to

- Install dependencies, clone repo and run benchmark:
-
- pip install -r requirements.txt
+ Clone repo, install dependencies, and run benchmark:

git clone https://github.com/turboderp/exllama
cd exllama

+ pip install -r requirements.txt
+
python test_benchmark_inference.py -d <path_to_model_files> -p -ppl

The CUDA extension is loaded at runtime so there's no need to install it separately. It will be compiled on the first
@@ -60,11 +63,15 @@ Chatbot example:

python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt

+ ## Python module
+
+ jllllll currently maintains an installable Python module [here](https://github.com/jllllll/exllama) which may be more
+ suitable for integrating ExLlama with other projects.
+

## Web UI

- I made a simple web UI for it. Like the rest of the project, it's a work in progress. Don't look at the JavaScript,
- it was mostly written by ChatGPT and it will haunt your dreams. But it sort of works, and it's kinda fun, especially
- multibot mode:
+ I also made a simple web UI for it. Don't look at the JavaScript, it was mostly written by ChatGPT and it will haunt
+ your dreams. But it sort of works, and it's kinda fun, especially multibot mode:

![_screenshot.jpg](doc/_screenshot.jpg)

@@ -74,13 +81,14 @@ To run it:

python webui/app.py -d <path_to_model_files>

- Note that sessions are stored in `~/exllama_sessions/`. You can change the location of the sessions storage with `-sd`
- if you want.
+ Note that sessions are stored in `~/exllama_sessions/` by default. You can change that location with `-sd` if you want.

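
For example, to keep sessions somewhere else (the directory path below is just a placeholder), the invocation would presumably look like:

python webui/app.py -d <path_to_model_files> -sd <path_to_session_dir>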
## Docker
+
For security benefits and easier deployment, it is also possible to run the web UI in an isolated docker container. Note: the docker image currently only supports NVIDIA GPUs.

### Requirements
+
- [Docker](https://docs.docker.com/engine/install/)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

@@ -128,19 +136,18 @@ docker run --gpus all -p 5000:5000 -v <path_to_model_dir>:/data/model/ -v <path_

## Results so far

### New implementation
- | Model     | Size | grpsz | act             | Seq. len.            | VRAM      | Prompt     | Best    | Worst   | Ppl  |
- |-----------|------|-------|-----------------|----------------------|-----------|------------|---------|---------|------|
- | Llama     | 7B   | 128   | no              | 2,048 t              | 5,194 MB  | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
- | Llama     | 13B  | 128   | no              | 2,048 t              | 9,127 MB  | 7,507 t/s  | 102 t/s | 86 t/s  | 5.60 |
- | Llama     | 33B  | 128   | no              | 2,048 t              | 20,795 MB | 2,959 t/s  | 47 t/s  | 40 t/s  | 4.60 |
- | Llama     | 33B  | 128   | yes             | 2,048 t              | 20,795 MB | 2,784 t/s  | 45 t/s  | 37 t/s  | 4.55 |
- | Llama     | 33B  | 32    | yes             | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s  | 41 t/s  | 37 t/s  | 4.52 |
- | Koala     | 13B  | 128   | yes             | 2,048 t              | 9,127 MB  | 5,529 t/s  | 93 t/s  | 79 t/s  | 6.73 |
- | WizardLM  | 33B  | -     | no <sup>2</sup> | 2,048 t              | 20,199 MB | 2,313 t/s  | 47 t/s  | 40 t/s  | 5.75 |
- | OpenLlama | 3B   | 128   | yes             | 2,048 t              | 3,128 MB  | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |
+ | Model     | Size | grpsz | act | Seq. len.            | VRAM      | Prompt     | Best    | Worst   | Ppl  |
+ |-----------|------|-------|-----|----------------------|-----------|------------|---------|---------|------|
+ | Llama     | 7B   | 128   | no  | 2,048 t              | 5,194 MB  | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
+ | Llama     | 13B  | 128   | no  | 2,048 t              | 9,127 MB  | 7,507 t/s  | 102 t/s | 86 t/s  | 5.60 |
+ | Llama     | 33B  | 128   | no  | 2,048 t              | 20,795 MB | 2,959 t/s  | 47 t/s  | 40 t/s  | 4.60 |
+ | Llama     | 33B  | 128   | yes | 2,048 t              | 20,795 MB | 2,784 t/s  | 45 t/s  | 37 t/s  | 4.55 |
+ | Llama     | 33B  | 32    | yes | 1,550 t <sup>1</sup> | 21,486 MB | 2,636 t/s  | 41 t/s  | 37 t/s  | 4.52 |
+ | Koala     | 13B  | 128   | yes | 2,048 t              | 9,127 MB  | 5,529 t/s  | 93 t/s  | 79 t/s  | 6.73 |
+ | WizardLM  | 33B  | -     | yes | 2,048 t              | 20,199 MB | 2,313 t/s  | 47 t/s  | 40 t/s  | 5.75 |
+ | OpenLlama | 3B   | 128   | yes | 2,048 t              | 3,128 MB  | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |

- <sup>1</sup> Can not achieve full sequence length without OoM (yet)
- <sup>2</sup> Not quite sure if this is act-order or not. Weights have no group index, at least
+ <sup>1</sup> Can not achieve full sequence length without OoM

All tests done on stock RTX 4090 / 12900K, running with a desktop environment, with a few other apps also using VRAM.

@@ -154,66 +161,33 @@ probably aiming for 20 GB on a 24 GB GPU to ensure there is room for a desktop e

internals.

Perplexity is measured only to verify that the models are working. The dataset used is a particular, small sample from
- WikiText, so scores are not necessarily comparable to other Llama benchmarks.
+ WikiText, so scores are not comparable to other Llama benchmarks and only useful for comparing the different Llama
+ models to one another.

### Dual GPU results

- Since many seem to be interested in running 65B models, I can confirm that this works with two 24 GB GPUs. The
- following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:
-
- | Model | Size | groupsize | act | Seq. len. | VRAM      | Prompt    | Best   | Worst  | Ppl  |
- |-------|------|-----------|-----|-----------|-----------|-----------|--------|--------|------|
- | Llama | 65B  | 128       | yes | 2,048 t   | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s | 4.20 |
- | Llama | 65B  | 32        | yes | 2,048 t   | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s | 4.11 |
+ The following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:

+ | Model   | Size | groupsize | act | Seq. len. | VRAM      | Prompt    | Best   | Worst  | Ppl  |
+ |---------|------|-----------|-----|-----------|-----------|-----------|--------|--------|------|
+ | Llama   | 65B  | 128       | yes | 2,048 t   | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s | 4.20 |
+ | Llama   | 65B  | 32        | yes | 2,048 t   | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s | 4.11 |
+ | Llama-2 | 70B  | 128       | yes | 2,048 t   | 40,680 MB | 914 t/s   | 17 t/s | 14 t/s | 4.15 |
+ | Llama-2 | 70B  | 32        | yes | 2,048 t   | 36,815 MB | 874 t/s   | 15 t/s | 12 t/s | 4.10 |

- ### Testing long sequences
-
- The following tests were all done on **33B/65B, 4bit 128g** with various settings, just to test the max sequence length
- and get a sense of what can be achieved with different or multiple GPUs right now. Llama goes incoherent generating
- past 2048 tokens anyway, but with some fine-tuning, who knows? Note that these tests were run a while ago and the
- speeds are no longer current.
-
- |                        | Size | Seq. len. | VRAM                 | Long seq. | Ind.   |
- |------------------------|------|-----------|----------------------|-----------|--------|
- | 4090/24GB              | 33B  | 2,516 t   | 22,145 MB            | 1140 t/s  | 28 t/s |
- | 4090/24GB + 3070Ti/8GB | 33B  | 3,932 t   | 22,055 MB + 7,377 MB | 840 t/s   | 22 t/s |
- | A6000/48GB (headless)  | 33B  | 9,032 t   | 46,863 MB            | 645 t/s   | 12 t/s |
- | A100/80GB (headless)   | 65B  | 9,520 t   | 79,009 MB            | 650 t/s   | 9 t/s  |
+ Note that perplexity scores may not be strictly apples-to-apples between Llama and Llama 2 due to their different
+ pretraining datasets.

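
As a rough sketch of how such a two-GPU run is launched (assuming the benchmark script accepts the same `-gs` VRAM-split flag as the other entry points), splitting a 65B model across a 4090 and a 3090-Ti would look something like:

python test_benchmark_inference.py -d <path_to_model_files> -p -ppl -gs 17.2,24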
## Todo

Moved the todo list [here](doc/TODO.md).

## Compatibility

- I downloaded a whole bunch of GPTQ models to test compatibility. [Here](doc/model_compatibility.md) is the list of models
- confirmed to be working right now.
+ [Here](doc/model_compatibility.md) is a list of models confirmed to be working right now.

## Recent updates

- **2023-06-02**: Web UI is now in a fairly working state. Expect it to be a little scuffed in places. There will be a
- rewrite at some point to make the client-side code less seizure-inducing. It has multibot mode, chat rewind and editing
- features, sessions, and more. I'm going to build it out with support for instruct prompting and such, in time.
-
- **2023-06-04**: Refactored a whole bunch to move more of the work into the extension, setting up for more tuning
- options to come soon and eventually auto tuning. Also optimized a little, for about a 5% speedup.
-
- **2023-06-06**: Some minor optimizations. Also it should now compile the extension more easily and run more seamlessly
- on Windows.
-
- **2023-06-09**: Fused most of the self-attention step. More to come. Slight speedup already, but more importantly went
- from 69% actual CPU utilization to 37%. This should do a lot to address the bottleneck on CPUs with lower
- single-threaded performance.
-
- **2023-06-10**: Docker support now! And some minor optimizations. Cleaned up the project a bit.
-
- **2023-06-11**: Added some concurrency a couple of places. It's only beneficial on the 4090, on small models where the
- cores are somewhat underutilized and the L2 cache can keep up. For the 3090 it's detrimental to performance, so it's
- disabled by default. YMMV. Use `-cs` to try it out.
-
- **2023-06-17**: Fixed a nasty bug in the fused attention that was causing slightly incorrect cache states on 13B and
- 33B models. You definitely want to update.
-
- **2023-06-18**: LoRA support now. Still needs a lot of testing and some optimization, and currently you can't stack
- multiple LoRAs during the same inference. There's also no support in the web UI yet.
+ **2023-07-19**: Added support for grouped-query attention and Llama-2 70b. There's still a bit of optimization to do,
+ since it slows down considerably on very long sequences despite GQA having the potential to be faster. Also could use
+ some more thorough testing.