[question] Wrapper Linear API and 2bits #589

Open
wenhuach21 opened this issue Aug 14, 2024 · 4 comments

Comments

@wenhuach21

Thanks for your great work.

1. Is there an API for packing the linear layer and running inference in the GPTQv2 format, similar to what's provided here: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_exllamav2.py?

2. Is 2-bit quantization supported?

@turboderp
Owner

  1. I'm not sure what you mean; GPTQv2 has been supported for a little while now. The difference is just whether the qzeros tensor is offset by one or not, and ExLlama now figures that out from the config.json (see the sketch below).

  2. 2-bit quantization is supported in EXL2, but there's no kernel yet for 2-bit GPTQ tensors. It is planned, but I have so many other things to get to also.
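
To sketch what that offset difference amounts to in practice (illustrative only, not the actual conversion code; it assumes the usual packing of eight 4-bit zero-points per int32):

import torch

# Illustrative only: convert packed 4-bit qzeros from the older GPTQ convention
# (values stored as zero - 1) to the GPTQv2 convention (values stored as-is).
def qzeros_v1_to_v2(qzeros: torch.Tensor) -> torch.Tensor:
    z = qzeros.to(torch.int64) & 0xFFFFFFFF                        # view the bits as unsigned 32-bit
    shifts = torch.arange(0, 32, 4, dtype=torch.int64, device=qzeros.device)
    nibbles = (z.unsqueeze(-1) >> shifts) & 0xF                    # unpack eight 4-bit zero-points
    nibbles = (nibbles + 1) & 0xF                                  # undo the "zero - 1" offset
    packed = (nibbles << shifts).sum(dim=-1) & 0xFFFFFFFF          # repack into 32 bits
    packed = torch.where(packed >= 2**31, packed - 2**32, packed)  # back to the signed int32 range
    return packed.to(torch.int32)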

@wenhuach21
Author

wenhuach21 commented Aug 15, 2024

Thank you for your quick response.

1. I'm currently working on an INT4 algorithm and need to export the model to another format due to specific requirements. We plan to use your repo as the default CUDA backend. Could you let me know if there is an interface available to replace the original Linear layer with your INT4 layer? I'm not familiar with the kernel part.

This is our repo https://github.com/intel/auto-round

@turboderp
Owner

turboderp commented Aug 17, 2024

I have plans to create a torch.nn module for EXL2 linear layers, but I'm so busy with tensor parallel inference at the moment I'm not sure I'll get to it for at least a little while.

In the meantime you could look at this, which is the AutoGPTQ implementation of a GPTQ(v2) Linear module using the ExLlamaV2 kernel.

If you wanted to support the EXL2 format rather than GPTQ, note that it's symmetric only, uses quantized scales, and uses variable bitrates within each tensor (essentially by slicing it into rows and providing a variable number of 8-bit, 6-bit, 5-bit, etc. rows in that order, sorting the rows by activation order to place the more salient weights on top).
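
To make that layout concrete, here is a rough sketch of the idea (the helper name, proportions, and return structure are made up for illustration, not the actual EXL2 code):

import torch

# Illustrative only: plan per-row bit widths for one tensor. Rows are permuted by
# descending activation salience, then sliced into groups that get progressively
# fewer bits, so the most salient rows sit at the top with the widest format.
def plan_row_bitrates(act_order: torch.Tensor, proportions: dict) -> list:
    # act_order: row indices sorted from most to least salient
    # proportions: e.g. {8: 0.1, 6: 0.2, 5: 0.3, 4: 0.4}, fractions summing to 1.0
    n = act_order.numel()
    items = sorted(proportions.items(), reverse=True)    # highest bit width first
    plan, start = [], 0
    for i, (bits, frac) in enumerate(items):
        count = n - start if i == len(items) - 1 else round(frac * n)
        plan.append((bits, act_order[start:start + count]))
        start += count
    return plan                                           # list of (bit width, row indices)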

Both the GPTQ and EXL2 implementations use an unmanaged object (QMatrix) to store shapes and pointers for the weights. This reduces Python/pybind overhead and makes the matrix easily accessible from other C++ portions of ExLlama, but it probably isn't too relevant for a torch.nn implementation, and it would lead to slight memory leaks if the layers aren't explicitly unloaded before being garbage-collected.

Either way the interface for the extension is just:

import torch
# gemm_half_q_half is the compiled extension function; the exact import path
# depends on how the extension is built (e.g. something along the lines of
# `from exllamav2_ext import gemm_half_q_half`).

def gemm(
    x: torch.Tensor,     # Input tensor, FP16, contiguous
    q_handle: int,       # uintptr_t to QMatrix
    q4_width: int,       # out_features
    force_cuda: bool,    # Optionally disable the reconstruct/cuBLAS path for large inputs
) -> torch.Tensor:
    # Final shape of output tensor
    output_shape = x.shape[:-1] + (q4_width,)

    # Flatten input tensor to a matrix
    x = x.view(-1, x.shape[-1])

    # Prepare empty tensor for the result
    output = torch.empty((x.shape[0], q4_width), dtype=torch.half, device=x.device)

    # Call the extension function
    gemm_half_q_half(x, q_handle, output, force_cuda)

    # Restore output dimensions
    return output.view(output_shape)
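
For illustration, a minimal torch.nn wrapper around that interface could look like the following (QuantLinear is a placeholder name, and constructing/freeing the QMatrix behind q_handle is elided; this is not an actual ExLlama module):

import torch

# Illustrative only: a thin torch.nn.Module around the gemm() wrapper above.
# q_handle is assumed to be the uintptr_t of an already-constructed QMatrix;
# building it and explicitly freeing it (to avoid the leak mentioned above)
# are not shown here.
class QuantLinear(torch.nn.Module):
    def __init__(self, q_handle: int, out_features: int, force_cuda: bool = False):
        super().__init__()
        self.q_handle = q_handle
        self.out_features = out_features
        self.force_cuda = force_cuda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The kernel expects FP16, contiguous input
        return gemm(x.half().contiguous(), self.q_handle, self.out_features, self.force_cuda)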

What particular requirements would your format have? Is it the GPTQ tensor format, or does it deviate from it somehow?

@wenhuach21
Author

Thank you for your detailed reply!

Yes, we need a similar torch.nn module for EXL2 linear layers, which will make integration easier.

AutoGPTQ should already support asymmetric quantization, and symmetric quantization performs poorly at 2 bits.

Our format is built on GPTQ's but removes the qzeros ±1 offset. We also use different configurations to support mixed precision and a broader range of devices; however, this should not place any additional requirements on the kernel side.
