[question] Wrapper Linear API and 2 bits #589
Thank you for your quick response. I'm currently working on an INT4 algorithm and need to export the model to another format due to specific requirements. We plan to use your repo as the default CUDA backend. Could you let me know if there is an interface available to replace the original Linear layer with your INT4 layer? I'm not familiar with the kernel part. This is our repo: https://github.com/intel/auto-round
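As background for what such an interface has to do: the usual PyTorch pattern is to walk the module tree and swap each nn.Linear for the quantized replacement. The sketch below is a generic illustration, not an API of this repo; `make_quant_linear` is a hypothetical user-supplied callable that builds the quantized module from the original layer.

```python
import torch.nn as nn

def replace_linear_layers(model: nn.Module, make_quant_linear) -> nn.Module:
    """Swap every nn.Linear in `model` for a quantized replacement.

    `make_quant_linear` is a user-supplied callable (hypothetical here) that
    takes the original nn.Linear and returns the quantized module.
    """
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, make_quant_linear(child))
        else:
            # Recurse into submodules that are not Linear layers
            replace_linear_layers(child, make_quant_linear)
    return model
```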
I have plans to create a torch.nn module for EXL2 linear layers, but I'm so busy with tensor-parallel inference at the moment that I'm not sure I'll get to it for at least a little while. In the meantime you could look at this, which is the AutoGPTQ implementation of a GPTQ(v2) Linear module using the ExLlamaV2 kernel. If you wanted to support the EXL2 format rather than GPTQ, note that it's symmetric only, uses quantized scales, and uses variable bitrates within each tensor (essentially by slicing it into rows and providing a variable number of 8-bit, 6-bit, 5-bit etc. rows, in that order, sorting the rows by activation order to place the more salient weights on top). Both the GPTQ and EXL2 implementations use an unmanaged object (a QMatrix handle, passed to the extension as the q_handle argument below). Either way, the interface for the extension is just:

```python
def gemm(
    x: torch.Tensor,   # Input tensor, FP16, contiguous
    q_handle: int,     # uintptr_t to QMatrix
    q4_width: int,     # out_features
    force_cuda: bool,  # Optionally disable the reconstruct/cuBLAS path for large inputs
):
    # Final shape of output tensor
    output_shape = x.shape[:-1] + (q4_width,)
    # Flatten input tensor to matrix
    x = x.view(-1, x.shape[-1])
    # Prepare empty tensor for result
    output = torch.empty((x.shape[0], q4_width), dtype=torch.half, device=x.device)
    # Call the extension function
    gemm_half_q_half(x, q_handle, output, force_cuda)
    # Restore output dimensions
    return output.view(output_shape)
```

What particular requirements would your format have? Is it the GPTQ tensor format, or does it deviate from it somehow?
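As an aside, here is a minimal sketch of how that gemm() wrapper could be hidden behind an nn.Linear-style module. The class name and its constructor are assumptions for illustration, not part of the ExLlamaV2 API; only the gemm signature above comes from the actual extension interface, and construction of the QMatrix handle from the packed tensors is assumed to happen elsewhere.

```python
from typing import Optional

import torch
import torch.nn as nn


class ExLlamaV2QuantLinear(nn.Module):
    """Hypothetical drop-in replacement for nn.Linear that delegates the
    matmul to the gemm() wrapper shown in the comment above."""

    def __init__(self, in_features: int, out_features: int, q_handle: int,
                 bias: Optional[torch.Tensor] = None):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.q_handle = q_handle  # uintptr_t to the unmanaged QMatrix object
        self.register_buffer("bias", bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The kernel expects a contiguous FP16 input
        x = x.to(torch.half).contiguous()
        # gemm() is the Python wrapper from the comment above
        out = gemm(x, self.q_handle, self.out_features, force_cuda=False)
        if self.bias is not None:
            out = out + self.bias
        return out
```

Replacing the original Linear layer would then amount to building the QMatrix handle from the packed weights and swapping the module into the model, as in the earlier replace_linear_layers sketch.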
Thank you for your detailed reply! Yes, we need a similar torch.nn module for EXL2 linear layers, which will make integration easier. AutoGPTQ should already support asymmetric quantization, while symmetric quantization performs poorly at 2 bits. Our format is built on GPTQ's but removes the qzero±1 offset. We also use different configurations to support mixed precisions and a broader range of devices; however, this should not place any additional requirements on the kernel side.
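For readers unfamiliar with the symmetric/asymmetric distinction being discussed, here is a small illustration of where the two schemes differ in the dequantization step. This is a generic sketch under simplified assumptions, not the actual kernel code or AutoRound's exact format; packing details and group handling are omitted.

```python
import torch

def dequant_asymmetric(q: torch.Tensor, scale: torch.Tensor, qzero: torch.Tensor) -> torch.Tensor:
    # Asymmetric dequantization: an explicit per-group zero point.
    # Classic GPTQ packing stores qzeros with a -1 offset that the kernel adds
    # back during unpacking; a format that removes the qzero-1 uses qzero as-is.
    return scale * (q.float() - qzero.float())

def dequant_symmetric(q: torch.Tensor, scale: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric dequantization: the zero point is implicitly the midpoint of the
    # unsigned range, so only scales need to be stored.
    midpoint = 1 << (bits - 1)
    return scale * (q.float() - midpoint)
```

At 2 bits there are only four representable levels, so pinning the zero point to the midpoint leaves little room to match skewed weight distributions, which is consistent with the observation above that symmetric quantization performs poorly at 2 bits.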
Thanks for your great work.
1. Is there an API for packing the linear layer and running inference in the GPTQv2 format, similar to what's provided here: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_exllamav2.py?
2. Is 2-bit quantization supported?