AWQ Qwen3-235B-A22B and Qwen3-30B-A3B #1406

Open
ehartford opened this issue May 1, 2025 · 17 comments
Labels: bug (Something isn't working)

@ehartford

ehartford commented May 1, 2025

Describe the bug
When I try to AWQ these models, it hangs forever.

Expected behavior
I expect it to quantize the model.

Environment
Nvidia DGX A100

To Reproduce

I used examples/awq/awq_one_shot.py and modified it:

from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"
DATASET_ID = "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
OUTPUT_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"


def get_calib_dataset(tokenizer):
    from datasets import load_dataset

    ds = load_dataset(
        DATASET_ID,
        split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES*100}]",
    )

    def preprocess(example):
        return {
            "input_ids": tokenizer.encode(example["text"].strip())[:MAX_SEQUENCE_LENGTH]
        }

    ds = (
        ds.shuffle(seed=42)
        .map(preprocess, remove_columns=ds.column_names)
        .filter(lambda example: len(example["input_ids"]) >= MAX_SEQUENCE_LENGTH)
        .select(range(NUM_CALIBRATION_SAMPLES))
    )

    return ds


if __name__ == "__main__":
    recipe = [
        AWQModifier(bits=4, symmetric=False),
        QuantizationModifier(
            # Ignore these layers during quantization
            ignore=[
                "lm_head",
                ".*norm.*",
                ".*gate.*",
            ],
            config_groups={
                "group_0": QuantizationScheme(
                    targets=["Linear"],
                    weights=QuantizationArgs(
                        num_bits=4,
                        type=QuantizationType.INT,
                        dynamic=False,
                        symmetric=False,
                        strategy=QuantizationStrategy.GROUP,
                        group_size=128,
                    ),
                )
            },
        ),
    ]

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

    oneshot(
        model=model,
        dataset=get_calib_dataset(tokenizer=tokenizer),
        recipe=recipe,
        output_dir=OUTPUT_DIR,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    print("Done! model saved to", OUTPUT_DIR)

The output:

(vllm) dgxuser@linux:~/workspace/llm-compressor$ python qwen_moe_awq.py 
Loading checkpoint shards: 100%|████████████████████████████████████████████| 16/16 [00:22<00:00,  1.39s/it]
Repo card metadata block was not found. Setting CardData to empty.
2025-04-30T18:26:27.175014-0700 | reset | INFO - Compression lifecycle reset
2025-04-30T18:26:27.175475-0700 | from_modifiers | INFO - Creating recipe from modifiers
ehartford added the bug label May 1, 2025
@ehartford
Author

FYI - I also tried with w8a16 and it works; the problem is in AWQ.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer 
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

MODEL_ID = "Qwen/Qwen3-235B-A22B"

device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=True,
    num_gpus=8,  
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained( # Reverted to AutoModelForCausalLM
    MODEL_ID, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

DATASET_ID = "neuralmagic/LLM_compression_calibration"
NUM_CALIBRATION_SAMPLES = 256 
MAX_SEQUENCE_LENGTH = 8192 

ds = load_dataset(DATASET_ID, split="train") 
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
  return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = ds.map(preprocess)

def tokenize(example):
    return tokenizer(
        example["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=["messages", "text"])

recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A16", 
    ignore=["lm_head", ".*gate.*", ".*norm.*"],
    dampening_frac=0.1, 
)

SAVE_DIR = "Qwen3-235B-A22B-quantized.w8a16"


oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    save_compressed=True,
    trust_remote_code_model=True,
    output_dir=SAVE_DIR,
)

input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))

brian-dellabetta self-assigned this May 1, 2025
@brian-dellabetta
Collaborator

brian-dellabetta commented May 1, 2025

Hi @ehartford , thanks for your interest in AWQ and for bringing this to our attention. While it seems the non-MoE Qwen3 models ran fine, these MoE models are hanging while resolving the mappings. We resolve them with string matches, and runtime increases dramatically when looping over 48 layers, each with 128 experts, in the case of Qwen/Qwen3-30B-A3B.

This isn't an issue in AutoAWQ, which has custom wrappers for each model (Qwen3MoE example here).

I will try to address this by the end of next week.
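
For intuition, here is a rough sketch, not the actual llm-compressor code, of why naive string matching over module names blows up on an MoE model; the layer and expert counts below are Qwen3-30B-A3B's:

# Hypothetical illustration of string-matched mapping resolution;
# NOT the actual llm-compressor implementation.
import re

NUM_LAYERS, NUM_EXPERTS = 48, 128
module_names = [
    f"model.layers.{layer}.mlp.experts.{expert}.{proj}"
    for layer in range(NUM_LAYERS)
    for expert in range(NUM_EXPERTS)
    for proj in ("gate_proj", "up_proj", "down_proj")
]  # ~18k expert Linear modules, before attention and shared modules

def resolve(pattern: str) -> list[str]:
    # Each mapping entry rescans every module name with a regex.
    return [name for name in module_names if re.search(pattern, name)]

# If a scan like this runs once per matched module, the total number of string
# comparisons approaches len(module_names) ** 2 (hundreds of millions), which
# is why the resolution step can pin a single CPU core for a long time.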

@ubergarm

ubergarm commented May 2, 2025

@ehartford

I'm running your AWQ code on a single RTX A6000 with 48 GB VRAM, and after allocating ~42 GB for the model it sits with no GPU utilization and a single CPU core spinning at 100% for Python. I'll let it sit overnight, and possibly it will loop over the 48 layers x 128 experts eventually?

$ CUDA_VISIBLE_DEVICES=0 python compressor.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:12<00:00,  1.30it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
Repo card metadata block was not found. Setting CardData to empty.
2025-05-02T01:15:02.456847-0400 | reset | INFO - Compression lifecycle reset
2025-05-02T01:15:02.457368-0400 | from_modifiers | INFO - Creating recipe from modifiers

When I tried AutoAWQ directly after updating transformers, it said

TypeError: qwen3_moe isn't supported yet.

I saw an open issue on the Hugging Face repo too: https://huggingface.co/Qwen/Qwen3-30B-A3B/discussions/12

Will check in later, thanks!

@ehartford
Author

OK, but I think it will hang there forever; I let mine sit overnight.

@ubergarm

ubergarm commented May 2, 2025

@ehartford

lmao, it seems like it got through the loop, but then of course it OOM'd when it went to do the actual thing hahah

2025-05-02T01:15:02.456847-0400 | reset | INFO - Compression lifecycle reset
2025-05-02T01:15:02.457368-0400 | from_modifiers | INFO - Creating recipe from modifiers
2025-05-02T01:47:01.661135-0400 | _set_resolved_mappings | INFO - Excluded 48 from resolved mappings due to shape mismatch
2025-05-02T01:47:02.270633-0400 | _calibrate | INFO - Running AWQModifier calibration with 256 samples...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [11:34<00:00,  2.71s/it]
2025-05-02T02:00:16.284428-0400 | _apply_smoothing | INFO - Smoothing activation scales...
  0%|                                                                                                                        | 0/6240 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/w/projects/vllm/compressor.py", line 76, in <module>
    oneshot(
  File "/home/w/projects/vllm/venv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 179, in oneshot
    one_shot()
  File "/home/w/projects/vllm/venv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 131, in __call__
    self.apply_recipe_modifiers(
.
.
.
  File "/home/w/projects/vllm/venv/lib/python3.12/site-packages/transformers/integrations/sdpa_attention.py", line 54, in sdpa_attention_forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacity of 47.41 GiB of which 131.44 MiB is free. Including non-PyTorch memory, this process has 47.26 GiB memory in use. Of the allocated memory 46.00 GiB is allocated by PyTorch, and 972.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

So if you have enough VRAM you might wake up to the world's first Qwen3-30B-A3B AWQ, who knows xD!

Looking at the timestamps in the logs, it took a little over 30 minutes to work through the loop on an AMD Ryzen Threadripper PRO 7965WX 24-core (running single-threaded on one core in Python).

@brian-dellabetta
Collaborator

brian-dellabetta commented May 2, 2025

Yes, it will likely OOM for larger models. We cache the calibrated activations for the entire model, rather than layer-by-layer, so memory requirements do not scale well with model size. AutoAWQ handles this, but we need to integrate our own pipelining abstraction and wanted to do that in a follow-up PR. We need to add that feature for our AWQ implementation to be fully ready; what we have so far is a basic port of AutoAWQ, not quite ready for primetime.

Related issue -- #1369 (comment)
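
To make the memory difference concrete, here is a simplified sketch, not the actual llm-compressor pipeline, contrasting whole-model activation caching with a sequential layer-by-layer pass:

# Illustrative only: whole-model caching vs. a sequential pipeline.
import torch

layers = [torch.nn.Linear(1024, 1024) for _ in range(48)]
calib_batches = [torch.randn(4, 1024) for _ in range(8)]

def cache_everything(layers, batches):
    # Keeps every layer's inputs resident at once, so peak memory grows with
    # (num layers x num calibration samples); this is what runs out of memory
    # on large MoE models.
    cache = {i: [] for i in range(len(layers))}
    for x in batches:
        for i, layer in enumerate(layers):
            cache[i].append(x)
            x = layer(x)
    return cache

def sequential_pipeline(layers, batches):
    # Holds only one layer's inputs at a time; outputs become the next layer's
    # inputs, so peak memory stays roughly constant in the number of layers.
    for layer in layers:
        inputs = batches                      # this layer's calibration activations
        # ... compute AWQ scales for `layer` from `inputs` here ...
        batches = [layer(x) for x in inputs]  # propagate to the next layer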

@ubergarm

ubergarm commented May 2, 2025

@brian-dellabetta

Thanks! Yeah, and it seems there's no support for a CPU backend; I tried CUDA_VISIBLE_DEVICES="NONE" and got RuntimeError: No CUDA GPUs are available.

I'd love to get AWQ going and output GGUFs to test against ik_llama.cpp imatrix quants, e.g. my ubergarm/Qwen3-30B-A3B-GGUF.

Guessing inference speed with vLLM would be better, but I'm not sure how to test perplexity and KLD etc. on AWQ quants. Anyway, beyond the scope. Cheers and thanks for all your efforts!

@ubergarm

ubergarm commented May 2, 2025

@ehartford just got this running a moment ago; it takes about 17 GB VRAM to load, plus extra for parallel inferencing slots:

CUDA_VISIBLE_DEVICES="0" \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
VLLM_USE_MODELSCOPE=True \
vllm \
  serve swift/Qwen3-30B-A3B-AWQ \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --served-model-name swift/Qwen3-30B-A3B-AWQ \
  --host 127.0.0.1 \
  --port 8080

Not sure how they quantized their model, but maybe it's what you were trying, given enough time and VRAM.
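
For anyone wanting a quick sanity check once that server is up, something like the following should work against vLLM's OpenAI-compatible endpoint; this is just a sketch, with host, port, and model name mirroring the serve flags above:

# Quick sanity check of the OpenAI-compatible endpoint started above
# (assumes the `requests` package is installed).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "swift/Qwen3-30B-A3B-AWQ",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])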

@brian-dellabetta
Collaborator

Hi @ubergarm , yes AWQ will require a GPU to run in a reasonable amount of time for most models. We've got that somewhat hard-coded for now, and we'll have better support for offloaded models in a future release.

Yeah, I noticed Qwen publishes some AWQ-ed models (https://huggingface.co/Qwen/Qwen3-32B-AWQ) but no MoE models. There do seem to be lots in the community though 💪

brian-dellabetta added a commit that referenced this issue May 15, 2025
SUMMARY:
- Add QuantizationMixin to AWQModifier so we don't have redundant inputs
(num_bits, symmetric, group_size)
- Move AWQModifier to sequential pipelines, to avoid huge memory
requirements of caching all activations at once.

Regression test results are acceptable; results are all roughly the same and within stderr. See test plan below.

Resolves #1409 
Resolves #1369 
Related to #1383
Related to #1406 
Related to #1368 
Related to #1410 

More improvements split into #1435

TEST PLAN:
- [x] Rerun tests to validate
No regression in tests, comparing against those reported in the [original AWQ PR](#1177 (comment)).
All gsm8k results are within stderr:

| Type                       | gsm8k        | wikitext |
| -------------------------- | ------------ | -------- |
| Old AWQ+QuantModifier Sym  | .1054, .1069 | 9.1931   |
| New AWQ+QuantMixin Sym     | .1077, .1084 | 9.1841   |
| Old AWQ+QuantModifier Asym | .1274, .1281 | 9.0281   |
| New AWQ+QuantMixin Asym    | .1312, .1350 | 9.0288   |

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
@fpaupier

> While it seems the non-MoE Qwen3 models ran,

Hello @brian-dellabetta - do you have any recommendation for quantizing a dense Qwen3 (like the 14B) to FP8? Especially which parts to ignore, as in ignore=["lm_head"]?

@brian-dellabetta
Collaborator

Hi @fpaupier , for Qwen/Qwen3-14B you should be fine quantizing it with the FP8_DYNAMIC scheme, just ignoring lm_head. With multimodal models we usually exclude the vision component from quantization; otherwise it's almost always ignore=["lm_head"].
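
For reference, a minimal sketch of that recommendation, following the usual llm-compressor FP8 example pattern; the output directory name is arbitrary, and FP8_DYNAMIC needs no calibration data:

from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-14B"
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: static per-channel FP8 weights, dynamic per-token FP8 activations,
# so no calibration dataset is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)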

@fpaupier

Great, thanks for your insights @brian-dellabetta 👍

@brian-dellabetta
Collaborator

Just as a heads-up, we are still working on this. A couple of PRs have landed, and we will cut a release soon, once a couple more fixes in flight (#1435 & #1444) are merged.

@brian-dellabetta
Collaborator

Hi @ehartford , @ubergarm:

Quick update: on the #1444 branch, I was able to quantize "Qwen/Qwen3-30B-A3B". I uploaded a checkpoint to https://huggingface.co/nm-testing/Qwen3-30B-A3B-awq-w4a16-g128-sym/tree/main in case others would like to check it out. You can find more details in the PR summary there; I hope to merge it in soon.

Please note that two more key PRs still need to land to improve memory requirements during saving and when running AWQ through the pipeline.

I am hoping we can wrap these up soon and make a fresh release with AWQ in a much less experimental stage. Some of our wires got crossed in communicating the AWQ feature in llm-compressor; we were premature in some of the announcements. It should be much more robust in the next release, and we will do a broader announcement at that time. Appreciate your interest in using it!

@ubergarm

ubergarm commented Jun 3, 2025

Oh nice, thanks for the update and congrats on getting Qwen3-30B-A3B going; it is a pretty nice model in my testing, both for speed and reasonable quality.

Hrmm, I wish I had some kind of test harness to get apples-to-apples perplexity and KL-divergence comparisons with these AWQ quants. I have been using ik_llama.cpp and exllamav3 for some Qwen3-30B-A3B comparisons, and the new QTIP/trellis/exl3/ikN_kt quants are looking pretty good in comparison. The graph is way overpacked though, sorry about that haha...

[Image: comparison plot of Qwen3-30B-A3B quants]

I'm weak on the native transformers side of things given I've been mostly ik_llama.cpp focused lately. Hrmm..

Anyways, thanks again for the test quant; maybe I'll figure something out to compare it!

@brian-dellabetta
Collaborator

@ubergarm very nice plot! For WikiText-2, we usually use lm_eval to calculate perplexity:

lm_eval --model vllm --model_args pretrained="nm-testing/Qwen3-30B-A3B-awq-w4a16-g128-sym",add_bos_token=True,dtype=bfloat16,max_model_len=4096,gpu_memory_utilization=0.8 --tasks wikitext

vllm (pretrained=nm-testing/Qwen3-30B-A3B-awq-w4a16-g128-sym,add_bos_token=True,dtype=bfloat16,max_model_len=4096,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
| Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext|      2|none  |     5|bits_per_byte  |↓  | 0.6671|±  |   N/A|
|        |       |none  |     5|byte_perplexity|↓  | 1.5879|±  |   N/A|
|        |       |none  |     5|word_perplexity|↓  |11.8539|±  |   N/A|

Will share with someone from the research team

@ubergarm

ubergarm commented Jun 3, 2025

@brian-dellabetta

Very cool, thanks for showing me how to run that and giving a clear example and result! I'd have to play with the parameters some, as it is always challenging to get apples-to-apples numbers/comparisons between different systems. Here are the perplexity values I had, which seem off from yours. These were done at 2k context size, I believe, which can affect things too if that is different.

[Image: table of perplexity values]

Thanks so much for your time and patience on this thread haha! Cheers!
