Can I run an int8 quantized model on CUDA? #374

shencuifeng · 2025-03-04T02:43:50Z

I want to run a model on CUDA with actual int8 instructions instead of fake-quantized float32 instructions, so that I get the efficiency gains. However, the model is slower with weights=qint8, activations=qint8 than with weights=qint8 alone.

dacorvo · 2025-03-04T08:10:51Z

@shencuifeng this is because the cost of quantizing activations on the fly is not compensated by the faster int8 matmul, especially considering that the float x int8 matmul might benefit from an accelerated kernel depending on the float type.
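
To make the trade-off above concrete, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization done on the fly. It is an illustration under stated assumptions, not the library's actual kernels: the function names (`quantize_dynamic`, `int8_dot`, `mixed_dot`) are hypothetical, and a dot product stands in for a matmul. The point is that the int8 x int8 path has to scan the activations for their range, then divide, round, and clamp them on every call, before the integer accumulate can start, while the float x int8 path skips all of that.

```python
def quantize_dynamic(x, qmax=127):
    """Symmetric per-tensor quantization with a scale computed at call time.

    Hypothetical helper for illustration only.
    """
    absmax = max(abs(v) for v in x)            # extra pass over the activations
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]  # divide, round, clamp
    return q, scale


def int8_dot(qx, sx, qw, sw):
    """int8 x int8 path: integer accumulate, then a single rescale."""
    acc = sum(a * b for a, b in zip(qx, qw))   # exact integer accumulation
    return acc * sx * sw


def mixed_dot(x, qw, sw):
    """float x int8 path: no activation quantization step at all."""
    return sum(a * b for a, b in zip(x, qw)) * sw


x = [0.5, -1.0, 0.25]        # float activations
qw, sw = [127, -64, 32], 0.01  # already-quantized weights and their scale

qx, sx = quantize_dynamic(x)   # this work is repeated for every new input
y_int8 = int8_dot(qx, sx, qw, sw)
y_mixed = mixed_dot(x, qw, sw)
```

Both paths give nearly the same result; the int8 path only pays off when the faster integer accumulate saves more time than `quantize_dynamic` costs.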

shencuifeng · 2025-03-04T10:02:19Z

@dacorvo Is it possible to support static quantization of activations? It seems that https://github.com/mit-han-lab/nunchaku statically quantizes the activations to int4.
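
For readers unfamiliar with the term, static activation quantization generally means fixing the activation scale ahead of time from calibration data, so that inference no longer needs to measure the range of each input. The sketch below is a generic pure-Python illustration of that idea; `calibrate_scale` and `quantize_static` are hypothetical names, and it does not describe how nunchaku or any other library implements it.

```python
def calibrate_scale(batches, qmax=127):
    """Derive one fixed scale from sample activations, done once offline.

    Hypothetical helper; qmax=127 assumes symmetric int8
    (a symmetric int4 scheme would use qmax=7).
    """
    absmax = max(abs(v) for batch in batches for v in batch)
    return absmax / qmax if absmax > 0 else 1.0


def quantize_static(x, scale, qmax=127):
    """Quantize with the precomputed scale: no range scan at inference time."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in x]


# Calibration: run a few representative batches and record the range.
scale = calibrate_scale([[0.5, -2.0], [1.0, 0.25]])

# Inference: values outside the calibrated range are clamped (saturated).
q = quantize_static([2.0, -4.0, 0.0], scale)
```

The trade-off is accuracy: an input that falls outside the calibrated range, such as `-4.0` here, saturates at the clamp limit instead of being represented exactly.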
