AWQ Qwen3-235B-A22B and Qwen3-30B-A3B #1406
Comments
FYI - I also tried with w8a16 and it works; the problem is specific to AWQ.
Hi @ehartford , thanks for your interest in AWQ and for bringing this to our attention. While the non-MoE Qwen3 models ran fine, these MoE models hang while resolving the mappings. We are using string matches, and runtime increases dramatically when looping over 48 layers, each with 128 experts. This isn't an issue in AutoAWQ, which has custom wrappers for each model (Qwen3MoE example here). I will try to address this by the end of next week.
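To make the scale of that loop concrete, here is a rough back-of-the-envelope sketch (the module counts follow the Qwen3 MoE configuration described above; the projection names are typical for MoE experts and are assumptions, not the actual resolver code):

```python
# Illustrative only: why string-matching mapping resolution gets slow on MoE models.
# A dense decoder layer has a handful of Linear modules; a Qwen3 MoE layer has
# 128 experts, each with its own projections.
num_layers = 48
num_experts = 128
projections_per_expert = 3  # e.g. gate_proj, up_proj, down_proj (assumed names)

moe_linear_modules = num_layers * num_experts * projections_per_expert
print(moe_linear_modules)  # 18432 candidate modules to match

# If every candidate module name is compared against every mapping pattern via
# string/regex matching, the work grows as O(modules x patterns), thousands of
# times more than for a dense model of similar depth.
```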
I'm running your AWQ code on a single RTX A6000 (48 GB VRAM), and after allocating ~42 GB for the model it sits with no GPU utilization and a single CPU core spinning at 100% in python. I'll let it sit overnight; maybe it will eventually get through the 48 layers x 128 experts?
When I tried
I saw an open issue on the Hugging Face repo too: https://huggingface.co/Qwen/Qwen3-30B-A3B/discussions/12 Will check in later, thanks!
Ok, but I think it will hang there forever; I let mine sit overnight.
lmao, it seems like it got through the loop, but then of course it OOM'd when it went to do the actual quantization hahah
So if you have enough VRAM you might wake up to the world's first Qwen3-30B-A3B AWQ, who knows xD! Looking at the timestamps in the logs, it took a little over 30 minutes to work through the loop on an AMD Ryzen Threadripper PRO 7965WX (24 cores, though python ran single-threaded on one core).
Yes, it will likely OOM for larger models. We cache the calibrated activations for the entire model, rather than layer-by-layer, so memory requirements do not scale well with model size. AutoAWQ handles this, but we need to integrate our own pipelining abstraction and wanted to do that in a follow-up PR. We need to add that feature for our implementation of AWQ to be fully ready; what we have so far is a basic port of AutoAWQ, not quite ready for primetime. Related issue -- #1369 (comment)
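The memory difference between the two calibration strategies is easy to see with a toy example. This is only an illustrative sketch of the idea being discussed (whole-model activation caching vs. a sequential, layer-by-layer pipeline), not llm-compressor code:

```python
import torch
import torch.nn as nn

# Toy stand-in for a stack of decoder layers.
layers = [nn.Linear(1024, 1024) for _ in range(8)]
calib = torch.randn(256, 1024)  # calibration batch

# Strategy A: cache every layer's input activations up front, then calibrate.
# Peak memory holds len(layers) activation tensors at once, so it grows with depth.
cached_inputs, hidden = [], calib
for layer in layers:
    cached_inputs.append(hidden)
    hidden = layer(hidden)
# ...calibrate each layer from cached_inputs[i]...

# Strategy B (sequential pipeline): calibrate a layer using the current
# activations, then immediately propagate and discard them. Peak memory
# holds a single activation tensor regardless of depth.
hidden = calib
for layer in layers:
    # ...calibrate `layer` using `hidden` here...
    hidden = layer(hidden)
```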
Thanks! Yeah, and it seems there's no support for a CPU backend, as I found when I tried. I'd love to get AWQ going and output GGUFs to test against ik_llama.cpp imatrix quants, e.g. my ubergarm/Qwen3-30B-A3B-GGUF. Guessing inference speed with vllm would be better, and I'm not sure how to test perplexity, KLD, etc. on AWQ quants. Anyway, beyond the scope. Cheers and thanks for all your efforts!
@ehartford just got this running a moment ago; it takes about 17 GB of VRAM to load, plus as much extra as you allow for parallel inferencing slots:
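A vLLM invocation along these lines can serve an AWQ checkpoint (the repo name and flag values below are placeholders, not necessarily the exact command used here):

```bash
# Illustrative vLLM invocation for an AWQ-quantized Qwen3-30B-A3B checkpoint.
# Replace the model ID with whichever community AWQ repo you are testing.
vllm serve someuser/Qwen3-30B-A3B-AWQ \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90
```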
Not sure how they quantized their model, but maybe it's what you were trying, just with enough time and VRAM.
Hi @ubergarm , yes AWQ will require a GPU to run in a reasonable amount of time for most models. We've got that somewhat hard-coded for now, and we'll have better support for offloaded models in a future release. Yeah, I noticed Qwen publishes some AWQ-ed models (https://huggingface.co/Qwen/Qwen3-32B-AWQ) but no MoE models. There do seem to be lots in the community though 💪 |
SUMMARY:
- Add QuantizationMixin to AWQModifier so we don't have redundant inputs (num_bits, symmetric, group_size)
- Move AWQModifier to sequential pipelines, to avoid huge memory requirements of caching all activations at once.

Regression test results are acceptable: results are all roughly the same and within stderr, see test plan below.

Resolves #1409
Resolves #1369
Related to #1383
Related to #1406
Related to #1368
Related to #1410
More improvements split into #1435

TEST PLAN:
- [x] Rerun tests to validate

No regression in tests, comparing against those reported in the [original AWQ PR](#1177 (comment)). All gsm8k results are within stderr:

| Type | gsm8k | wikitext |
| ------ | ------ | ----- |
| Old AWQ+QuantModifier Sym | .1054, .1069 | 9.1931 |
| New AWQ+QuantMixin Sym | .1077, .1084 | 9.1841 |
| Old AWQ+QuantModifier Asym | .1274, .1281 | 9.0281 |
| New AWQ+QuantMixin Asym | .1312, .1350 | 9.0288 |

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Hello @brian-dellabetta - do you have any recommendation for quantizing a dense Qwen3 (like the 14B) to FP8? Especially which parts to ignore.
Hi @fpaupier , for a dense Qwen3 the setup is simpler than the MoE case: quantize the Linear layers to FP8 and keep lm_head in the ignore list.
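A rough sketch of that kind of FP8-dynamic recipe with llm-compressor, based on its published FP8 examples (the model ID, save path, and import paths are assumptions and can differ between releases):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-14B"  # dense model, so no MoE router/gate exclusions needed

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic quantization needs no calibration data; keep lm_head in
# higher precision by listing it under `ignore`.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

SAVE_DIR = "Qwen3-14B-FP8-Dynamic"  # placeholder output directory
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```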
Great, thanks for your insights @brian-dellabetta 👍 |
Hi @ehartford , @ubergarm: quick update: on the #1444 branch, I was able to quantize Qwen3-30B-A3B. Please note that two more key PRs need to land to improve memory requirements during saving and when running AWQ through the pipeline. I am hoping we can wrap these up soon and make a fresh release with AWQ in a much less experimental stage. Some of our wires got crossed in communicating the AWQ feature.
Oh nice, thanks for the update, and congrats on getting Qwen3-30B-A3B going; it is a pretty nice model in my testing, both for speed and reasonable quality. Hrmm, I wish I had some kind of test harness to get apples-to-apples perplexity and KL-divergence comparisons with these AWQ quants. I have been using ik_llama.cpp and exllamav3 for some Qwen3-30B-A3B comparisons, and the new QTIP/trellis/exl3/ikN_kt quants are looking pretty good by comparison. The graph is way overpacked though, sorry about that haha... I'm weak on the native transformers side of things given I've been mostly ik_llama.cpp focused lately. hrmm.. Anyways, thanks again for the test quant, maybe I'll figure something out to compare it!
@ubergarm very nice plot! For wikitext-2, we usually use lm_eval to calculate perplexity:
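For reference, a wikitext perplexity run with lm_eval on a vLLM backend looks roughly like this (the model ID and sequence length below are placeholders, not necessarily the exact invocation used here):

```bash
# Illustrative lm_eval invocation for wikitext word perplexity via vLLM.
lm_eval --model vllm \
  --model_args pretrained=someuser/Qwen3-30B-A3B-AWQ,add_bos_token=True,max_model_len=4096 \
  --tasks wikitext \
  --batch_size 1
```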
Will share with someone from the research team |
Very cool, thanks for showing me how to run that and for giving a clear example and result! I'd have to play with the parameters some, as it is always challenging to get apples-to-apples numbers/comparisons between different systems. Here are the perplexity values I had, which seem off from yours. These were done at 2k context size, I believe, which can affect things too if that is different. Thanks so much for your time and patience on this thread haha! Cheers!
Describe the bug
When I try to AWQ these models, it hangs forever.
Expected behavior
I expect it to quantize the model
Environment
Nvidia DGX A100
To Reproduce
I used examples/awq/awq_one_shot.py and modified it:
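The exact modification isn't reproduced here, but a minimal sketch of pointing the AWQ one-shot example at the MoE checkpoint looks like the following (argument names changed across releases: older versions of AWQModifier took num_bits/symmetric/group_size directly, newer ones take a scheme; the calibration dataset and sample counts below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"  # MoE checkpoint swapped in for the example's dense model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# W4A16 AWQ recipe, leaving lm_head unquantized.
recipe = AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset="open_platypus",       # placeholder calibration set; the example may use another
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
```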
The output