Hqq support #21
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
The main thing that needs to be updated in this PR is that we should not make any changes to the `vllm/model_executor/models` directory (there should be no changes to `llama.py`). This allows us to encapsulate the details of HQQ. Right now, the implementation is coupled to `llama.py`, so it will only work for that model.

Just like the other quantization methods (e.g. `GPTQMarlin`), we should set up `create_weights` such that the state dict of the vLLM model matches the state dict of the serialized model. For example, this `hqq_map` should not be needed; instead, just name the parameter `W_q` rather than `qweight`.
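A minimal sketch of that idea, assuming the HQQ checkpoint stores the packed weight as `W_q` with per-group `scale` and `zero` tensors (the names, shapes, and `group_size` handling here are illustrative assumptions, not the actual HQQ serialization format or vLLM API surface):

```python
import torch
from torch.nn import Parameter


class HQQLinearMethod:
    def __init__(self, group_size: int = 64):
        self.group_size = group_size

    def create_weights(self, layer: torch.nn.Module, input_size: int,
                       output_size: int, params_dtype: torch.dtype) -> None:
        num_groups = (output_size * input_size) // self.group_size
        # Packed 4-bit weights: two nibbles per uint8 byte, registered under
        # the same name and shape as the serialized checkpoint, so weight
        # loading is a straight copy with no hqq_map-style renaming.
        layer.register_parameter(
            "W_q",
            Parameter(torch.empty(output_size * input_size // 2,
                                  dtype=torch.uint8),
                      requires_grad=False))
        # Per-group quantization metadata, also kept in the serialized layout.
        layer.register_parameter(
            "scale",
            Parameter(torch.empty(num_groups, dtype=params_dtype),
                      requires_grad=False))
        layer.register_parameter(
            "zero",
            Parameter(torch.empty(num_groups, dtype=params_dtype),
                      requires_grad=False))
```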
Additionally, the conversion from the serialized format to the kernel format should be handled in `process_weights_after_loading`. So `create_weights` should make tensors with the same type/shape as the serialized state dict, and then functions that convert to the kernel format (e.g. `unpack_4bit_u8`) can do the conversion during `process_weights_after_loading`.
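A sketch of that flow, where the unpacking function below is a stand-in for the PR's `unpack_4bit_u8` (the nibble order and the in-memory kernel layout are assumptions for illustration):

```python
import torch


def unpack_4bit_u8(packed: torch.Tensor) -> torch.Tensor:
    # Split each uint8 into its two 4-bit values (high nibble, then low).
    high = (packed >> 4).to(torch.uint8)
    low = (packed & 0x0F).to(torch.uint8)
    return torch.stack([high, low], dim=-1).reshape(-1)


class HQQLinearMethod:
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Convert from the serialized packed layout into the layout the
        # matmul kernel expects; W_q keeps its serialized form until here.
        unpacked = unpack_4bit_u8(layer.W_q.data)
        layer.W_q = torch.nn.Parameter(unpacked, requires_grad=False)
```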
Is there something unique about HQQ that prevents us from following this pattern?
@robertgshaw2-neuralmagic My main difficulty has been the 4-bit quantization pattern, where a tensor …
unit tests:
offline inference:
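A hypothetical offline-inference smoke test for this PR might look like the sketch below; the model path is a placeholder, and `quantization="hqq"` is the value this PR would presumably register, not an existing vLLM option:

```python
from vllm import LLM, SamplingParams

# Placeholder model path; an HQQ-quantized Llama checkpoint is assumed.
llm = LLM(model="path/to/hqq-llama", quantization="hqq")
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```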