On the Hopper architecture, mixed_dtype_grouped_gemm currently supports only row-wise scaling. For AWQ-quantized weights, however, the precision loss with row-wise scaling alone is still quite significant.
Will CUTLASS support AWQ's zero-points and block-wise scaling (W4A16 / W4A8) for MoE models?
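
To make the request concrete, here is a minimal pure-Python sketch of the usual asymmetric scheme `w ≈ (q - z) * s`, applied once with a single scale/zero-point per row and once per block of the row. This is not CUTLASS code: every function name is hypothetical, the weights are made up, and the group size of 8 is only an assumption chosen to keep the example small.

```python
# Illustrative sketch only: hypothetical helpers, not part of CUTLASS or any AWQ library.

def quantize_group(values, bits=4):
    """Asymmetric quantization of one group; returns (codes, scale, zero_point)."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # Zero-point is clamped to the code range, so the group should straddle zero.
    zero = max(0, min(qmax, round(-lo / scale)))
    codes = [max(0, min(qmax, round(v / scale) + zero)) for v in values]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Reconstruct approximate weights: w ~= (q - z) * s."""
    return [(q - zero) * scale for q in codes]


def roundtrip(row, group_size):
    """Quantize and dequantize a row using one (scale, zero) pair per group."""
    out = []
    for start in range(0, len(row), group_size):
        codes, scale, zero = quantize_group(row[start:start + group_size])
        out.extend(dequantize_group(codes, scale, zero))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up row: a small-magnitude block followed by a large-magnitude block.
row = [(i - 4) * 0.01 for i in range(8)] + [(i - 4) * 0.5 for i in range(8)]

row_err = mean_abs_error(row, roundtrip(row, group_size=len(row)))  # one scale per row
block_err = mean_abs_error(row, roundtrip(row, group_size=8))       # one scale per block

print("row-wise mean abs error:  ", row_err)
print("block-wise mean abs error:", block_err)
```

With a single scale for the whole row, the small-magnitude block is quantized at the step size of the large one, which is where most of the extra error comes from in this toy case.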
Thanks~