
Commit a785000

Add initial support for compressed-tensors checkpoints (#2732)
compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization because:

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight quantizers.
- Exclusions from quantization are configurable.

This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality.

The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.
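For context, a compressed-tensors checkpoint carries its quantization scheme in the model's `config.json` under `quantization_config`. The sketch below is illustrative only, not taken from this PR: the field names follow the compressed-tensors configuration format as understood here, and the concrete values (a 4-bit grouped weight scheme with unquantized activations) are assumptions. It shows the three capabilities listed above: per-target quantizer groups, separate weight/input/output quantizers, and a configurable ignore list.

```python
import json

# Illustrative example (not from this PR) of the kind of quantization_config a
# compressed-tensors checkpoint ships in its config.json. Field names follow the
# compressed-tensors format as understood here; the values are assumptions.
example_config = {
    "quant_method": "compressed-tensors",
    "config_groups": {
        "group_0": {
            # Different quantizer configurations can target different module types.
            "targets": ["Linear"],
            # Weight quantizer: 4-bit symmetric INT with per-group scales (W4A16-style).
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
            },
            # Input/output quantizers can be specified as well; None means the
            # activations stay unquantized.
            "input_activations": None,
            "output_activations": None,
        }
    },
    # Configurable exclusions: modules listed here are left unquantized.
    "ignore": ["lm_head"],
}

print(json.dumps(example_config, indent=2))
```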
1 parent 97f7a22 commit a785000

28 files changed: +2052 −78 lines changed

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -247,7 +247,7 @@ COPY server/Makefile server/Makefile
 RUN cd server && \
     make gen-server && \
     pip install -r requirements_cuda.txt && \
-    pip install ".[bnb, accelerate, marlin, moe, quantize, peft, outlines]" --no-cache-dir && \
+    pip install ".[bnb, accelerate, compressed-tensors, marlin, moe, quantize, peft, outlines]" --no-cache-dir && \
     pip install nvidia-nccl-cu12==2.22.3
 
 ENV LD_PRELOAD=/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/libnccl.so.2

Dockerfile_amd

Lines changed: 1 addition & 1 deletion
@@ -296,7 +296,7 @@ COPY server/Makefile server/Makefile
 RUN cd server && \
     make gen-server && \
     pip install -r requirements_rocm.txt && \
-    pip install ".[accelerate, peft, outlines]" --no-cache-dir
+    pip install ".[accelerate, compressed-tensors, peft, outlines]" --no-cache-dir
 
 # Install benchmarker
 COPY --from=builder /usr/src/target/release-opt/text-generation-benchmark /usr/local/bin/text-generation-benchmark

Dockerfile_intel

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ COPY server/Makefile server/Makefile
 RUN cd server && \
     make gen-server && \
     pip install -r requirements_intel.txt && \
-    pip install ".[accelerate, peft, outlines]" --no-cache-dir
+    pip install ".[accelerate, compressed-tensors, peft, outlines]" --no-cache-dir
 
 ENV CCL_ROOT=/opt/intel/oneapi/ccl/latest
 ENV I_MPI_ROOT=/opt/intel/oneapi/mpi/latest

docs/source/reference/launcher.md

Lines changed: 10 additions & 9 deletions
@@ -62,15 +62,16 @@ Options:
   [env: QUANTIZE=]
 
   Possible values:
-  - awq: 4 bit quantization. Requires a specific AWQ quantized model: <https://hf.co/models?search=awq>. Should replace GPTQ models wherever possible because of the better latency
-  - eetq: 8 bit quantization, doesn't require specific model. Should be a drop-in replacement to bitsandbytes with much better performance. Kernels are from <https://github.com/NetEase-FuXi/EETQ.git>
-  - exl2: Variable bit quantization. Requires a specific EXL2 quantized model: <https://hf.co/models?search=exl2>. Requires exllama2 kernels and does not support tensor parallelism (num_shard > 1)
-  - gptq: 4 bit quantization. Requires a specific GTPQ quantized model: <https://hf.co/models?search=gptq>. text-generation-inference will use exllama (faster) kernels wherever possible, and use triton kernel (wider support) when it's not. AWQ has faster kernels
-  - marlin: 4 bit quantization. Requires a specific Marlin quantized model: <https://hf.co/models?search=marlin>
-  - bitsandbytes: Bitsandbytes 8bit. Can be applied on any model, will cut the memory requirement in half, but it is known that the model will be much slower to run than the native f16
-  - bitsandbytes-nf4: Bitsandbytes 4bit. Can be applied on any model, will cut the memory requirement by 4x, but it is known that the model will be much slower to run than the native f16
-  - bitsandbytes-fp4: Bitsandbytes 4bit. nf4 should be preferred in most cases but maybe this one has better perplexity performance for you model
-  - fp8: [FP8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/) (e4m3) works on H100 and above This dtype has native ops should be the fastest if available. This is currently not the fastest because of local unpacking + padding to satisfy matrix multiplication limitations
+  - awq: 4 bit quantization. Requires a specific AWQ quantized model: <https://hf.co/models?search=awq>. Should replace GPTQ models wherever possible because of the better latency
+  - compressed-tensors: Compressed tensors, which can be a mixture of different quantization methods
+  - eetq: 8 bit quantization, doesn't require specific model. Should be a drop-in replacement to bitsandbytes with much better performance. Kernels are from <https://github.com/NetEase-FuXi/EETQ.git>
+  - exl2: Variable bit quantization. Requires a specific EXL2 quantized model: <https://hf.co/models?search=exl2>. Requires exllama2 kernels and does not support tensor parallelism (num_shard > 1)
+  - gptq: 4 bit quantization. Requires a specific GTPQ quantized model: <https://hf.co/models?search=gptq>. text-generation-inference will use exllama (faster) kernels wherever possible, and use triton kernel (wider support) when it's not. AWQ has faster kernels
+  - marlin: 4 bit quantization. Requires a specific Marlin quantized model: <https://hf.co/models?search=marlin>
+  - bitsandbytes: Bitsandbytes 8bit. Can be applied on any model, will cut the memory requirement in half, but it is known that the model will be much slower to run than the native f16
+  - bitsandbytes-nf4: Bitsandbytes 4bit. Can be applied on any model, will cut the memory requirement by 4x, but it is known that the model will be much slower to run than the native f16
+  - bitsandbytes-fp4: Bitsandbytes 4bit. nf4 should be preferred in most cases but maybe this one has better perplexity performance for you model
+  - fp8: [FP8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/) (e4m3) works on H100 and above This dtype has native ops should be the fastest if available. This is currently not the fastest because of local unpacking + padding to satisfy matrix multiplication limitations
 
 ```
 ## SPECULATE

flake.lock

Lines changed: 4 additions & 3 deletions
Some generated files are not rendered by default.

flake.nix

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
       inputs.nixpkgs.follows = "tgi-nix/nixpkgs";
     };
     nix-filter.url = "github:numtide/nix-filter";
-    tgi-nix.url = "github:huggingface/text-generation-inference-nix";
+    tgi-nix.url = "github:huggingface/text-generation-inference-nix/compressed-tensors-0.7.1";
     nixpkgs.follows = "tgi-nix/nixpkgs";
     flake-utils.url = "github:numtide/flake-utils";
     rust-overlay = {
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
{
  "details": {
    "best_of_sequences": null,
    "finish_reason": "length",
    "generated_tokens": 10,
    "prefill": [
      {
        "id": 128000,
        "logprob": null,
        "text": "<|begin_of_text|>"
      },
      {
        "id": 3923,
        "logprob": -7.609375,
        "text": "What"
      },
      {
        "id": 374,
        "logprob": -0.92529297,
        "text": " is"
      },
      {
        "id": 5655,
        "logprob": -10.0,
        "text": " deep"
      },
      {
        "id": 6975,
        "logprob": -0.94628906,
        "text": " learning"
      },
      {
        "id": 30,
        "logprob": -2.9042969,
        "text": "?"
      }
    ],
    "seed": null,
    "tokens": [
      {
        "id": 18682,
        "logprob": -0.8769531,
        "special": false,
        "text": " Deep"
      },
      {
        "id": 6975,
        "logprob": -0.0076942444,
        "special": false,
        "text": " learning"
      },
      {
        "id": 374,
        "logprob": -0.25073242,
        "special": false,
        "text": " is"
      },
      {
        "id": 264,
        "logprob": -0.097595215,
        "special": false,
        "text": " a"
      },
      {
        "id": 955,
        "logprob": -0.921875,
        "special": false,
        "text": " type"
      },
      {
        "id": 315,
        "logprob": -0.00027918816,
        "special": false,
        "text": " of"
      },
      {
        "id": 21075,
        "logprob": -0.5527344,
        "special": false,
        "text": " artificial"
      },
      {
        "id": 11478,
        "logprob": -0.042541504,
        "special": false,
        "text": " intelligence"
      },
      {
        "id": 320,
        "logprob": -0.38891602,
        "special": false,
        "text": " ("
      },
      {
        "id": 15836,
        "logprob": -0.0011043549,
        "special": false,
        "text": "AI"
      }
    ],
    "top_tokens": null
  },
  "generated_text": " Deep learning is a type of artificial intelligence (AI"
}
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
{
  "details": {
    "best_of_sequences": null,
    "finish_reason": "length",
    "generated_tokens": 10,
    "prefill": [
      {
        "id": 128000,
        "logprob": null,
        "text": "<|begin_of_text|>"
      },
      {
        "id": 3923,
        "logprob": -7.609375,
        "text": "What"
      },
      {
        "id": 374,
        "logprob": -0.92529297,
        "text": " is"
      },
      {
        "id": 5655,
        "logprob": -10.0,
        "text": " deep"
      },
      {
        "id": 6975,
        "logprob": -0.94628906,
        "text": " learning"
      }
    ],
    "seed": 0,
    "tokens": [
      {
        "id": 5380,
        "logprob": -0.23840332,
        "special": false,
        "text": "?\n"
      },
      {
        "id": 34564,
        "logprob": 0.0,
        "special": false,
        "text": "Deep"
      },
      {
        "id": 6975,
        "logprob": 0.0,
        "special": false,
        "text": " learning"
      },
      {
        "id": 11,
        "logprob": 0.0,
        "special": false,
        "text": ","
      },
      {
        "id": 1101,
        "logprob": -1.2011719,
        "special": false,
        "text": " also"
      },
      {
        "id": 3967,
        "logprob": 0.0,
        "special": false,
        "text": " known"
      },
      {
        "id": 439,
        "logprob": 0.0,
        "special": false,
        "text": " as"
      },
      {
        "id": 30828,
        "logprob": 0.0,
        "special": false,
        "text": " neural"
      },
      {
        "id": 4009,
        "logprob": -0.6777344,
        "special": false,
        "text": " network"
      },
      {
        "id": 477,
        "logprob": 0.0,
        "special": false,
        "text": " or"
      }
    ],
    "top_tokens": null
  },
  "generated_text": "What is deep learning?\nDeep learning, also known as neural network or"
}
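The two snapshot files above have the shape of responses from text-generation-inference's `/generate` route with details enabled, one greedy (`"seed": null`) and one sampled (`"seed": 0`). As a rough sketch only, assuming a TGI instance serving a compressed-tensors checkpoint at a hypothetical local address, a request like the following yields a response with the same `prefill`/`tokens`/`generated_text` structure (this is not the PR's actual test harness):

```python
import requests

# Assumed local TGI endpoint; adjust host/port for your deployment.
TGI_URL = "http://127.0.0.1:8080/generate"

payload = {
    "inputs": "What is deep learning?",
    "parameters": {
        "max_new_tokens": 10,
        # Return generation details (finish_reason, generated_tokens, per-token logprobs).
        "details": True,
        # Also return logprobs for the prompt tokens (the "prefill" section in the snapshots).
        "decoder_input_details": True,
    },
}

response = requests.post(TGI_URL, json=payload, timeout=60)
response.raise_for_status()
data = response.json()

print(data["generated_text"])
print(data["details"]["finish_reason"], data["details"]["generated_tokens"])
```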
