[Feature Request] W4A4 Quantization Support in torchao #1406
Accuracy-wise, I think getting W4A4 to work is quite challenging, especially for post-training quantization. CUTLASS does provide some INT4 kernels, and I have run them successfully. FYI, H100 and newer GPUs no longer have INT4 tensor cores, so it might not be worth investing effort in them.
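To make the accuracy challenge concrete, here is a minimal sketch in plain PyTorch (not torchao API; the function name is hypothetical) that fake-quantizes both weights and activations to symmetric per-tensor int4 and compares against the fp32 matmul:

```python
import torch

def fake_quant_int4(x: torch.Tensor) -> torch.Tensor:
    # Symmetric int4: integer values in [-8, 7], one scale per tensor.
    scale = x.abs().amax() / 7.0
    q = torch.clamp(torch.round(x / scale), -8, 7)
    return q * scale  # dequantize back to float so we can compare

x = torch.randn(128, 512)   # activations
w = torch.randn(512, 256)   # weights
y_ref = x @ w
y_w4a4 = fake_quant_int4(x) @ fake_quant_int4(w)

rel_err = (y_w4a4 - y_ref).norm() / y_ref.norm()
print(f"relative error: {rel_err:.3f}")
```

Even this toy setup shows a much larger relative error than the same experiment at 8 bits, which is why per-group scales and QAT come up so often for W4A4.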
@xxw11 which GPU are you looking to utilize the W4A4 operation on? And are there any existing performant GEMM ops for this that utilize the INT4 tensor cores effectively? We have plans to implement FP4 support in torchao once the spec is released and will welcome any community contributions around fast W4A4 kernels. Agree on the accuracy point raised by @gau-nernst, we can explore techniques like QAT to mitigate that.
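As a sketch of what a QAT-style mitigation could look like, here is a minimal 4-bit fake-quant linear layer using a straight-through estimator (STE) for the rounding step; this is illustrative only, not torchao's actual QAT implementation, and all class names are hypothetical:

```python
import torch

class FakeQuant4Bit(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Symmetric per-tensor int4 fake quantization: round into [-8, 7].
        scale = x.abs().amax() / 7.0
        return torch.clamp(torch.round(x / scale), -8, 7) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: treat rounding as identity in backward.
        return grad_out

class W4A4QATLinear(torch.nn.Linear):
    def forward(self, x):
        return torch.nn.functional.linear(
            FakeQuant4Bit.apply(x), FakeQuant4Bit.apply(self.weight), self.bias
        )

# Usage: gradients flow through the fake-quant ops thanks to the STE.
layer = W4A4QATLinear(64, 64)
out = layer(torch.randn(8, 64, requires_grad=True))
out.sum().backward()
```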
I did explore INT4 tensor cores briefly. Here are benchmarks for some matmul sizes (see "Cutlass INT4" in https://github.com/gau-nernst/quantized-training?tab=readme-ov-file#matmul). You can see that INT4 perf on H100 is very bad because H100 lacks the hardware for INT4 (so I'm guessing it runs via some kind of emulated math). I remember one work that uses INT4 activations, from the BitNet series (https://arxiv.org/pdf/2411.04965), which only uses INT4 activations for some layers. It's more extreme in the sense that it uses 1.58-bit weights, whereas we could use 4-bit instead. But again, accuracy is still an issue, as is the lack of INT4 support in newer and future GPUs.
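If anyone wants to reproduce this kind of comparison locally, a quick harness with torch.utils.benchmark looks like the sketch below (int8 via the private torch._int_mm op vs fp16; INT4 would need a custom kernel, e.g. from CUTLASS, so it is not shown):

```python
import torch
import torch.utils.benchmark as bench

M = N = K = 4096  # _int_mm requires M > 16 and K, N divisible by 8
a_fp16 = torch.randn(M, K, device="cuda", dtype=torch.half)
b_fp16 = torch.randn(K, N, device="cuda", dtype=torch.half)
a_int8 = torch.randint(-128, 128, (M, K), device="cuda", dtype=torch.int8)
b_int8 = torch.randint(-128, 128, (K, N), device="cuda", dtype=torch.int8)

print(bench.Timer("a @ b", globals={"a": a_fp16, "b": b_fp16}).blocked_autorange())
print(bench.Timer("torch._int_mm(a, b)", globals={"a": a_int8, "b": b_int8}).blocked_autorange())
```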
I am using consumer-grade GPUs like the RTX 4090, so INT4 compute capabilities would be particularly valuable for my use case. I find torchao's AffineQuantizedTensor implementation to be both elegant and versatile. It seems that supporting INT4 would only require incrementally adding the corresponding checks and kernel implementations, which would be tremendously helpful for our work if achieved. I strongly agree with the assessment regarding the generally suboptimal accuracy of PTQ algorithms in W4A4 scenarios; this is a hot topic in many recent academic papers. Perhaps we could incorporate additional QAT algorithms to address this limitation. If I find suitable QAT algorithms, I would be happy to contribute to their implementation in torchao. Thank you very much for your response and guidance on this matter.
Once #880 is merged, it will be easier to add CUTLASS-backed W4A4, since that PR will include CUTLASS in torchao (we will need to check whether that CUTLASS version works with W4A4 without any extra patches...).
Dear team,
I would like to inquire about the possibility of W4A4 quantization support in torchao.
Torchao has proven to be an excellent quantization inference tool, particularly with its comprehensive support for W8A8. However, for 4-bit operations I've only noticed a W4A8 implementation (which currently utilizes INT8 GEMM operators under the hood). Given that many modern GPUs support INT4 GEMM operators with promising results, I was wondering whether there are any plans to implement W4A4 in torchao?
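For context, the W4A8 path I am referring to is exposed in torchao roughly as follows (a sketch only; API names may differ across torchao versions, and this assumes a CUDA device with bfloat16 weights):

```python
import torch
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
# Weights are stored as int4, activations are dynamically quantized to int8,
# and the matmul runs through an INT8 GEMM. A W4A4 config would presumably be
# a parallel entry point backed by an INT4 GEMM kernel instead.
quantize_(model, int8_dynamic_activation_int4_weight())
```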
Thank you for your attention to this matter.
Best regards