Implement I-quants (IQ4XS, IQ4NL) #2785

EricLBuehler · 2025-02-24T11:55:28Z

This PR lightly refactors the quantization code in candle-core and integrates two new I-quants, IQ4XS and IQ4NL.

There is no CUDA or Metal support yet; that could come in a later PR. I have Metal support working on a local branch, and I'm syncing the latest GGML CUDA kernels, which should also give a nice performance boost!

candle-core/src/quantized/iq_quants/mod.rs

EricLBuehler · 2025-02-24T16:56:45Z

This PR is based on the following reference ggml quantization/dequantization functions:

Dequantization:
https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-quants.c#L2434-L2475

Quantization:
https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-quants.c#L4562-L4745

Vec dot:
https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-cpu/ggml-cpu-quants.c#L11670-L12233
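
For readers who don't want to open the ggml sources, here is a minimal sketch of what IQ4NL dequantization does, under some loud assumptions: `BlockIq4NlSketch` and `dequantize_iq4nl_sketch` are hypothetical names for illustration, not the types or functions in this PR, and the block scale is held as `f32` here (ggml stores it as f16) so the snippet needs no extra crates. The codebook is ggml's `kvalues_iq4nl` table. Each block covers 32 weights: the low nibble of byte `j` indexes the codebook for element `j`, and the high nibble for element `j + 16`.

```rust
/// Non-linear 4-bit codebook used by ggml for IQ4_NL (`kvalues_iq4nl`).
const KVALUES_IQ4NL: [i8; 16] = [
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
];

/// Number of weights covered by one IQ4_NL block.
const QK4_NL: usize = 32;

/// Hypothetical simplified block layout (assumption: `d` is f32 here,
/// whereas ggml stores the scale as f16).
struct BlockIq4NlSketch {
    /// Per-block scale.
    d: f32,
    /// 32 packed 4-bit codebook indices, two per byte.
    qs: [u8; QK4_NL / 2],
}

/// Dequantize `blocks` into `ys`; `ys` must hold exactly 32 values per block.
fn dequantize_iq4nl_sketch(
    blocks: &[BlockIq4NlSketch],
    ys: &mut [f32],
) -> Result<(), String> {
    if ys.len() != blocks.len() * QK4_NL {
        return Err(format!(
            "expected {} output values, got {}",
            blocks.len() * QK4_NL,
            ys.len()
        ));
    }
    for (block, out) in blocks.iter().zip(ys.chunks_exact_mut(QK4_NL)) {
        // First half of the output comes from low nibbles, second half from high nibbles.
        let (lo, hi) = out.split_at_mut(QK4_NL / 2);
        for (j, &q) in block.qs.iter().enumerate() {
            lo[j] = block.d * KVALUES_IQ4NL[(q & 0x0f) as usize] as f32;
            hi[j] = block.d * KVALUES_IQ4NL[(q >> 4) as usize] as f32;
        }
    }
    Ok(())
}
```

IQ4XS builds on the same codebook but uses larger super-blocks with additional per-sub-block scales, so it is not covered by this sketch.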

ivarflakstad

Reviewing in chunks. About halfway there.

candle-core/src/quantized/iq_quants/utils.rs

candle-core/src/quantized/iq_quants/mod.rs

ivarflakstad · 2025-03-03T13:19:19Z

candle-core/src/quantized/k_quants/utils.rs

+    let expected_blocks = xs.len() / block_size;
+    let actual_blocks = ys.len();
+
+    // Validate that the input is the right size
+    if expected_blocks != actual_blocks {
+        crate::bail!("quantize {dtype:?}: expected {expected_blocks} blocks but only {actual_blocks} were provided!")
+    }


I am not opposed to doing this check every time, as that is safest, but this may be a decent case for debug_assert. What do you think @LaurentMazare?
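
To make the trade-off concrete, here is a small sketch of the two options. The function names are hypothetical and errors are plain `String`s rather than candle's error type; only the block-count comparison mirrors the diff above. The first variant always validates and returns an error the caller can handle. The second uses `debug_assert_eq!`, which panics on a mismatch in debug builds and is compiled out in release builds (unless debug assertions are explicitly enabled), so a bad input would go undetected there.

```rust
/// Always-on validation: costs one comparison per call and surfaces a
/// recoverable error in every build profile.
fn check_blocks_always(
    n_elems: usize,
    block_size: usize,
    n_blocks: usize,
) -> Result<(), String> {
    let expected_blocks = n_elems / block_size;
    if expected_blocks != n_blocks {
        return Err(format!(
            "expected {expected_blocks} blocks but {n_blocks} were provided"
        ));
    }
    Ok(())
}

/// Debug-only validation: panics on mismatch when debug assertions are
/// enabled, and does nothing at all in a default release build.
fn check_blocks_debug_only(n_elems: usize, block_size: usize, n_blocks: usize) {
    debug_assert_eq!(
        n_elems / block_size,
        n_blocks,
        "block count does not match input length"
    );
}
```

Since the check is a single integer comparison per quantize call rather than per element, the always-on version should be cheap relative to the quantization work itself.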

ivarflakstad · 2025-03-03T13:40:25Z

candle-core/tests/quantized_tests.rs

-        assert_eq!(diff, 0.);
+        assert!(diff < 0.96);


The assert_eq version actually passes on my machine for both the neon and default impl. Is this increase in tolerance because of the avx impl perhaps?

EricLBuehler · 2025-03-03

Hmm, I think that was a remnant from when I did the other neon/default impls. I undid this change. (I'm still working out a bug in the iq4xs avx impl; iq4nl works nicely.)

EricLBuehler added 5 commits February 24, 2025 06:49

Integrate iq quants

c38ee86

Add some tests

48b3f9c

Add todos in metal/cuda

6148daf

Fix wasm tests

919dddc

Remove SUPPORTS_I8MM

72264fa

EricLBuehler commented Feb 24, 2025

Add avx impls

2c107ac

ivarflakstad reviewed Mar 3, 2025

EricLBuehler added 2 commits March 3, 2025 17:26

Use constant for f16 zero

06335f8

Undo change to tolerance in quantized_matmul_neg test

a977179
