[Feature Request] W4A4 Quantization Support in torchao #1406

Open
xxw11 opened this issue Dec 12, 2024 · 5 comments
Labels
topic: new feature, topic: performance

xxw11 commented Dec 12, 2024

Dear team,

I would like to inquire about the possibility of W4A4 quantization support in torchao.

Torchao has proven to be an excellent quantization inference tool, particularly with its comprehensive support for W8A8. However, for 4-bit operations I've only noticed a W4A8 implementation (which currently uses INT8 GEMM operators under the hood). Given that many modern GPUs support INT4 GEMM operators with promising results, I was wondering if there are any plans to implement W4A4 in torchao?
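For concreteness, here is a minimal sketch of the W4A4 scheme in question, using symmetric per-tensor quantization and emulated integer math in plain PyTorch (the function names are illustrative, and a real kernel would pack two int4 values per byte and call an INT4 GEMM):

```python
import torch

def quantize_int4_symmetric(x: torch.Tensor):
    # Symmetric per-tensor quantization to the signed 4-bit range [-8, 7].
    scale = x.abs().amax().clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(x / scale), -8, 7)
    return q, scale

def w4a4_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Quantize both activations (A4) and weights (W4), multiply the integer
    # values, then rescale the result. This only emulates the arithmetic;
    # the values still live in float tensors.
    xq, sx = quantize_int4_symmetric(x)
    wq, sw = quantize_int4_symmetric(w)
    return (xq @ wq.t()) * (sx * sw)

x = torch.randn(16, 64)
w = torch.randn(32, 64)
print(w4a4_linear(x, w).shape)  # torch.Size([16, 32])
```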

Thank you for your attention to this matter.

Best regards

@gau-nernst
Collaborator

Accuracy-wise, I think getting W4A4 to work is quite challenging, especially for post-training quantization. Cutlass does provide some INT4 kernels and I have successfully run them.

FYI, H100 and newer GPUs don't have INT4 tensor cores anymore, so it might not be worth it to invest efforts into it.

@supriyar
Contributor

@xxw11 which GPU are you looking to utilize the W4A4 operation on? And are there any existing performant GEMM ops for this that utilize the INT4 tensor cores effectively?

We have plans to implement FP4 support in torchao once the spec is released and will welcome any community contributions around fast W4A4 kernels.

Agreed on the accuracy point raised by @gau-nernst; we can explore techniques like QAT to mitigate that.
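As a rough illustration of the QAT idea (not torchao's QAT API; the class names here are made up), one can fake-quantize weights and activations in the forward pass and let gradients flow through with a straight-through estimator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant4Bit(nn.Module):
    """Fake-quantizes a tensor to 4 bits; gradients pass straight through."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = x.abs().amax().clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round(x / scale), -8, 7) * scale
        # Straight-through estimator: forward sees q, backward sees identity.
        return x + (q - x).detach()

class W4A4QATLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.fq = FakeQuant4Bit()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fake-quantize both activations and weights during training so the
        # model learns to tolerate W4A4 rounding error.
        return F.linear(self.fq(x), self.fq(self.linear.weight), self.linear.bias)

layer = W4A4QATLinear(64, 32)
layer(torch.randn(8, 64)).sum().backward()  # gradients flow through fake-quant
```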

@drisspg added the topic: new feature and topic: performance labels Dec 12, 2024
@gau-nernst
Collaborator

I did explore INT4 tensor cores briefly. Here are benchmarks for some matmul sizes (see "Cutlass INT4"): https://github.com/gau-nernst/quantized-training?tab=readme-ov-file#matmul. You can see that INT4 perf on H100 is very bad because H100 doesn't have the hardware for INT4 (so I'm guessing it falls back to some kind of emulated math).
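For anyone who wants to reproduce this kind of comparison, a rough timing harness with torch.utils.benchmark might look like the following (shapes and dtypes are just examples, and it assumes a CUDA device; the INT4 kernels from the repo above would be timed the same way in place of the plain matmul):

```python
import torch
from torch.utils import benchmark

def matmul_tflops(m: int, n: int, k: int, dtype: torch.dtype) -> float:
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    timer = benchmark.Timer(stmt="a @ b", globals={"a": a, "b": b})
    median_s = timer.blocked_autorange().median  # seconds per matmul
    return 2 * m * n * k / median_s / 1e12       # 2*M*N*K ops per matmul

for dtype in (torch.float16, torch.bfloat16):
    print(dtype, f"{matmul_tflops(4096, 4096, 4096, dtype):.1f} TFLOPS")
```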

One work I remember that uses INT4 activations is from the BitNet series (https://arxiv.org/pdf/2411.04965), and it only uses INT4 activations for some layers. It's more extreme in the sense that it uses 1.58-bit weights, whereas we could use 4-bit instead. But again, accuracy is still an issue, as is the lack of INT4 support in newer and future GPUs.

@xxw11
Author

xxw11 commented Dec 13, 2024

> Accuracy-wise, I think getting W4A4 to work is quite challenging, especially for post-training quantization. Cutlass does provide some INT4 kernels and I have successfully run them.
>
> FYI, H100 and newer GPUs don't have INT4 tensor cores anymore, so it might not be worth it to invest efforts into it.

> @xxw11 which GPU are you looking to utilize the W4A4 operation on? And are there any existing performant GEMM ops for this that utilize the INT4 tensor cores effectively?
>
> We have plans to implement FP4 support in torchao once the spec is released and will welcome any community contributions around fast W4A4 kernels.
>
> Agreed on the accuracy point raised by @gau-nernst; we can explore techniques like QAT to mitigate that.

I am using consumer-grade GPUs like the RTX 4090, so INT4 compute capabilities would be particularly valuable for my use case. I find torchao's AffineQuantizedTensor implementation to be both elegant and versatile; it seems that supporting INT4 would only require incrementally adding the corresponding checks and kernel dispatch, which would be tremendously helpful for our work.
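For context, torchao's existing configs are applied roughly like this; a W4A4 config would presumably plug into quantize_ the same way (the int4_dynamic_activation_int4_weight name below is hypothetical and did not exist at the time of this thread):

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

model = nn.Sequential(nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# Existing W4A8 path: int4 weights with int8 dynamic activation quantization.
quantize_(model, int8_dynamic_activation_int4_weight())

# A W4A4 config could follow the same pattern, e.g. (hypothetical):
# quantize_(model, int4_dynamic_activation_int4_weight())
```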

I strongly agree with the assessment that PTQ algorithms generally perform poorly in W4A4 scenarios; this is a hot topic in many recent academic papers. Perhaps we could incorporate additional QAT algorithms to address this limitation. If I find suitable QAT algorithms, I would be happy to contribute their implementation to torchao.
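As a quick illustration of why PTQ degrades so sharply at 4 bits, one can compare the round-trip error of naive absmax quantization at 8 vs. 4 bits on a tensor with a single outlier (real activation distributions are heavy-tailed, which makes this much worse in practice):

```python
import torch

def rms_quant_error(x: torch.Tensor, bits: int) -> float:
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax() / qmax
    xq = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return (x - xq).pow(2).mean().sqrt().item()  # RMS round-trip error

x = torch.randn(4096)
x[0] = 20.0  # one activation outlier inflates the scale for every value
for bits in (8, 4):
    print(f"INT{bits} RMS error: {rms_quant_error(x, bits):.4f}")
```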

Thank you very much for your response and guidance on this matter.

@gau-nernst
Collaborator

Once #880 is merged, it would be easier to add cutlass-backed W4A4 since that PR will include cutlass in torchao (will need to check if that cutlass version will work with W4A4 without any extra patches...)
