Hqq support #21 (Draft)

wants to merge 17 commits into main

Conversation

@ElizaWszola commented Oct 14, 2024

Unit tests:

pytest tests/kernels/test_marlin_gemm.py -k test_hqq_marlin_gemm

Offline inference:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

model_path = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
quant_config = HqqConfig(nbits=4, group_size=64, axis=1)

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             torch_dtype=torch.float16,
                                             cache_dir='.',
                                             device_map="cuda:0",
                                             quantization_config=quant_config,
                                             low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

qp = "tinyllama_hqq"
model.save_pretrained(qp)
tokenizer.save_pretrained(qp)

llm = LLM(
    model=qp,
    quantization="hqq",
)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@robertgshaw2-neuralmagic (Collaborator) left a comment

The main thing that needs to be updated in this PR is that we should not make any changes to the vllm/model_executor/models directory (there should be no changes to llama.py). This lets us encapsulate the details of HQQ. Right now, it is coupled with llama.py, so it will only work for this model.

Just like the other quantization methods (e.g. GPTQMarlin), we should set up create_weights so that the state dict of the vLLM model matches the state dict of the serialized model (for example, this hqq_map should not be needed; instead, just name the parameter W_q rather than .qweight).

Additionally, the conversion from the serialized format to the kernel format should be handled in process_weights_after_loading. So create_weights should make tensors with the same type/shape as the serialized state dict, and then the functions that convert to the kernel format (e.g. unpack_4bit_u8) can do the conversion during process_weights_after_loading.

Is there something unique about HQQ that prevents us from following this pattern?
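
For illustration, here is a minimal sketch of the pattern suggested above, under simplified assumptions: the class name, method signatures, tensor shapes, and the group size of 64 are illustrative and do not reflect vLLM's actual LinearMethodBase API, and the nibble layout in the unpacking step follows the description later in this thread.

import torch

# Hedged sketch: create_weights registers parameters whose names, shapes, and
# dtypes match the serialized HQQ checkpoint (so no hqq_map-style renaming is
# needed at load time), and process_weights_after_loading performs the
# one-time conversion to the kernel layout.
class HqqMarlinLinearMethodSketch:

    def create_weights(self, layer, input_size, output_size, params_dtype,
                       group_size=64):
        # Packed 4-bit weights stored as uint8, exactly as serialized,
        # registered under the checkpoint's own name "W_q" (not ".qweight").
        layer.register_parameter(
            "W_q",
            torch.nn.Parameter(torch.empty(output_size // 2, input_size,
                                           dtype=torch.uint8),
                               requires_grad=False))
        # Per-group scales and zero points, also matching the checkpoint.
        for name in ("scale", "zero"):
            layer.register_parameter(
                name,
                torch.nn.Parameter(torch.empty(output_size,
                                               input_size // group_size,
                                               dtype=params_dtype),
                                   requires_grad=False))

    def process_weights_after_loading(self, layer):
        # One-time conversion from the serialized layout to the kernel layout:
        # unpack 4-bit -> 8-bit (low nibble = first half of the rows, high
        # nibble = second half), then repack into the Marlin format with the
        # kernel's repacking utility (omitted here, as it is kernel-specific).
        packed = layer.W_q.data
        unpacked = torch.cat([packed & 0x0F, (packed >> 4) & 0x0F], dim=0)
        layer.W_q = torch.nn.Parameter(unpacked, requires_grad=False)

if __name__ == "__main__":
    layer = torch.nn.Module()
    method = HqqMarlinLinearMethodSketch()
    method.create_weights(layer, input_size=1024, output_size=1024,
                          params_dtype=torch.float16)
    method.process_weights_after_loading(layer)
    print(layer.W_q.shape)  # torch.Size([1024, 1024]), unpacked from (512, 1024)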

@ElizaWszola (Author)

Is there something unique about HQQ that prevents us from following this pattern?

@robertgshaw2-neuralmagic My main difficulty has been the 4-bit quantization pattern, where a tensor A of size (2M, N) is quantized such that the low 4 bits of each 8-bit result element correspond to the first (M, N) elements of A, while the high 4 bits hold the last (M, N) elements of A. This was causing some issues with sharding, so I ended up unpacking from 4-bit to 8-bit when loading data with llama.py. It gets repacked into the Marlin format later.
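
To make the packing pattern concrete, here is a small illustrative sketch: the function names and nibble order follow the description above rather than HQQ's exact implementation, and the sharding note is one reading of the issue described.

import torch

def pack_4bit_u8(w_q: torch.Tensor) -> torch.Tensor:
    # Pack a (2M, N) tensor of 4-bit values into an (M, N) uint8 tensor:
    # low nibble <- first M rows, high nibble <- last M rows.
    m = w_q.shape[0] // 2
    return (w_q[:m] & 0x0F) | ((w_q[m:] & 0x0F) << 4)

def unpack_4bit_u8(packed: torch.Tensor) -> torch.Tensor:
    # Inverse of pack_4bit_u8: recover the (2M, N) tensor as 8-bit values.
    return torch.cat([packed & 0x0F, (packed >> 4) & 0x0F], dim=0)

# Each packed row i holds logical rows i and i + M, so a shard of the packed
# (M, N) tensor maps to two non-contiguous blocks of the logical (2M, N)
# weight, which does not line up with the usual per-rank split. Unpacking to
# 8-bit at load time sidesteps this; the weights are repacked into the Marlin
# format afterwards.
w = torch.randint(0, 16, (8, 4), dtype=torch.uint8)  # toy (2M, N) = (8, 4)
assert torch.equal(unpack_4bit_u8(pack_4bit_u8(w)), w)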
