AWQ Qwen3-235B-A22B and Qwen3-30B-A3B #1406

Open
ehartford opened this issue May 1, 2025 · 17 comments
Labels: bug (Something isn't working)

@ehartford

ehartford commented May 1, 2025

Describe the bug
When I try to AWQ these models, it hangs forever.

Expected behavior
I expect it to quantize the model.

Environment
Nvidia DGX A100

To Reproduce

I used examples/awq/awq_one_shot.py and modified it:

from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"
DATASET_ID = "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
OUTPUT_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"


def get_calib_dataset(tokenizer):
    from datasets import load_dataset

    ds = load_dataset(
        DATASET_ID,
        split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES*100}]",
    )

    def preprocess(example):
        return {
            "input_ids": tokenizer.encode(example["text"].strip())[:MAX_SEQUENCE_LENGTH]
        }

    ds = (
        ds.shuffle(seed=42)
        .map(preprocess, remove_columns=ds.column_names)
        .filter(lambda example: len(example["input_ids"]) >= MAX_SEQUENCE_LENGTH)
        .select(range(NUM_CALIBRATION_SAMPLES))
    )

    return ds


if __name__ == "__main__":
    recipe = [
        AWQModifier(bits=4, symmetric=False),
        QuantizationModifier(
            # Ignore these layers during quantization
            ignore=[
                "lm_head",
                ".*norm.*",
                ".*gate.*",
            ],
            config_groups={
                "group_0": QuantizationScheme(
                    targets=["Linear"],
                    weights=QuantizationArgs(
                        num_bits=4,
                        type=QuantizationType.INT,
                        dynamic=False,
                        symmetric=False,
                        strategy=QuantizationStrategy.GROUP,
                        group_size=128,
                    ),
                )
            },
        ),
    ]

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

    oneshot(
        model=model,
        dataset=get_calib_dataset(tokenizer=tokenizer),
        recipe=recipe,
        output_dir=OUTPUT_DIR,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    print("Done! model saved to", OUTPUT_DIR)

The output:

(vllm) dgxuser@linux:~/workspace/llm-compressor$ python qwen_moe_awq.py 
Loading checkpoint shards: 100%|████████████████████████████████████████████| 16/16 [00:22<00:00,  1.39s/it]
Repo card metadata block was not found. Setting CardData to empty.
2025-04-30T18:26:27.175014-0700 | reset | INFO - Compression lifecycle reset
2025-04-30T18:26:27.175475-0700 | from_modifiers | INFO - Creating recipe from modifiers
ehartford added the bug label May 1, 2025
@ehartford
Author

FYI - I also tried with w8a16 and it works; the problem is in AWQ.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer 
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

MODEL_ID = "Qwen/Qwen3-235B-A22B"

device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=True,
    num_gpus=8,  
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained( # Reverted to AutoModelForCausalLM
    MODEL_ID, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

DATASET_ID = "neuralmagic/LLM_compression_calibration"
NUM_CALIBRATION_SAMPLES = 256 
MAX_SEQUENCE_LENGTH = 8192 

ds = load_dataset(DATASET_ID, split="train") 
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
  return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = ds.map(preprocess)

def tokenize(example):
    return tokenizer(
        example["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=["messages", "text"])

recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A16", 
    ignore=["lm_head", ".*gate.*", ".*norm.*"],
    dampening_frac=0.1, 
)

SAVE_DIR = "Qwen3-235B-A22B-quantized.w8a16"


oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    save_compressed=True,
    trust_remote_code_model=True,
    output_dir=SAVE_DIR,
)

input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))

brian-dellabetta self-assigned this May 1, 2025
@brian-dellabetta
Collaborator

brian-dellabetta commented May 1, 2025

Hi @ehartford , thanks for your interest in AWQ and for bringing this to our attention. While it seems the non-MoE Qwen3 models ran fine, these MoE models are hanging while resolving the mappings. We resolve them with string matches, and runtime increases dramatically when looping over 48 layers, each with 128 experts, in the case of Qwen/Qwen3-30B-A3B.

This isn't an issue in AutoAWQ, which has custom wrappers for each model (Qwen3MoE example here).

I will try to address this by the end of next week.
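
For intuition, here is a rough sketch, not the actual llm-compressor code, of why naive string matching over module names blows up on an MoE model; the layer and expert counts below are Qwen3-30B-A3B's:

# Hypothetical illustration of string-matched mapping resolution;
# NOT the actual llm-compressor implementation.
import re

NUM_LAYERS, NUM_EXPERTS = 48, 128
module_names = [
    f"model.layers.{layer}.mlp.experts.{expert}.{proj}"
    for layer in range(NUM_LAYERS)
    for expert in range(NUM_EXPERTS)
    for proj in ("gate_proj", "up_proj", "down_proj")
]  # ~18k expert Linear modules, before attention and shared modules

def resolve(pattern: str) -> list[str]:
    # Each mapping entry rescans every module name with a regex.
    return [name for name in module_names if re.search(pattern, name)]

# If a scan like this runs once per matched module, the total number of string
# comparisons approaches len(module_names) ** 2 (hundreds of millions), which
# is why the resolution step can pin a single CPU core for a long time.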

@ubergarm

ubergarm commented May 2, 2025

@ehartford

I'm running your AWQ code on a single RTX A6000 with 48 GB VRAM, and after allocating ~42 GB for the model it sits with no GPU utilization and a single CPU core spinning at 100% for Python. I'll let it sit overnight, and possibly it will loop over the 48 layers x 128 experts eventually?

$ CUDA_VISIBLE_DEVICES=0 python compressor.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:12<00:00,  1.30it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
Repo card metadata block was not found. Setting CardData to empty.
2025-05-02T01:15:02.456847-0400 | reset | INFO - Compression lifecycle reset
2025-05-02T01:15:02.457368-0400 | from_modifiers | INFO - Creating recipe from modifiers

When I tried AutoAWQ directly after updating transformers, it said

TypeError: qwen3_moe isn't supported yet.

I saw an open issue on the Hugging Face repo too: https://huggingface.co/Qwen/Qwen3-30B-A3B/discussions/12

Will check in later, thanks!

@ehartford
Author

OK, but I think it will hang there forever; I let mine sit overnight.

@ubergarm

ubergarm commented May 2, 2025

@ehartford

lmao, it seems like it got through the loop, but then of course it OOM'd when it went to do the actual thing hahah

2025-05-02T01:15:02.456847-0400 | reset | INFO - Compression lifecycle reset
2025-05-02T01:15:02.457368-0400 | from_modifiers | INFO - Creating recipe from modifiers
2025-05-02T01:47:01.661135-0400 | _set_resolved_mappings | INFO - Excluded 48 from resolved mappings due to shape mismatch
2025-05-02T01:47:02.270633-0400 | _calibrate | INFO - Running AWQModifier calibration with 256 samples...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [11:34<00:00,  2.71s/it]
2025-05-02T02:00:16.284428-0400 | _apply_smoothing | INFO - Smoothing activation scales...
  0%|                                                                                                                        | 0/6240 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/w/projects/vllm/compressor.py", line 76, in <module>
    oneshot(
  File "/home/w/projects/vllm/venv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 179, in oneshot
    one_shot()
  File "/home/w/projects/vllm/venv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 131, in __call__
    self.apply_recipe_modifiers(
.
.
.
  File "/home/w/projects/vllm/venv/lib/python3.12/site-packages/transformers/integrations/sdpa_attention.py", line 54, in sdpa_attention_forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacity of 47.41 GiB of which 131.44 MiB is free. Including non-PyTorch memory, this process has 47.26 GiB memory in use. Of the allocated memory 46.00 GiB is allocated by PyTorch, and 972.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

So if you have enough VRAM you might wake up to the world's first Qwen3-30B-A3B AWQ, who knows xD!

Looking at the timestamps in the logs, it took a little over 30 minutes to work through the loop on an AMD Ryzen Threadripper PRO 7965WX 24-core (running single-threaded on one core in Python).

@brian-dellabetta
Collaborator

brian-dellabetta commented May 2, 2025

Yes, it will likely OOM for larger models. We cache the calibrated activations for the entire model, rather than layer-by-layer, so memory requirements do not scale well with model size. AutoAWQ handles this, but we need to integrate our own pipelining abstraction and wanted to do that in a follow-up PR. We need to add that feature for our AWQ implementation to be fully ready; what we have so far is a basic port of AutoAWQ, not quite ready for primetime.

Related issue -- #1369 (comment)
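
To make the memory difference concrete, here is a simplified sketch, not the actual llm-compressor pipeline, contrasting whole-model activation caching with a sequential layer-by-layer pass:

# Illustrative only: whole-model caching vs. a sequential pipeline.
import torch

layers = [torch.nn.Linear(1024, 1024) for _ in range(48)]
calib_batches = [torch.randn(4, 1024) for _ in range(8)]

def cache_everything(layers, batches):
    # Keeps every layer's inputs resident at once, so peak memory grows with
    # (num layers x num calibration samples); this is what runs out of memory
    # on large MoE models.
    cache = {i: [] for i in range(len(layers))}
    for x in batches:
        for i, layer in enumerate(layers):
            cache[i].append(x)
            x = layer(x)
    return cache

def sequential_pipeline(layers, batches):
    # Holds only one layer's inputs at a time; outputs become the next layer's
    # inputs, so peak memory stays roughly constant in the number of layers.
    for layer in layers:
        inputs = batches                      # this layer's calibration activations
        # ... compute AWQ scales for `layer` from `inputs` here ...
        batches = [layer(x) for x in inputs]  # propagate to the next layer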

@ubergarm

ubergarm commented May 2, 2025

@brian-dellabetta

Thanks! Yeah, and it seems there's no support for a CPU backend; I tried CUDA_VISIBLE_DEVICES="NONE" and got RuntimeError: No CUDA GPUs are available.

I'd love to get AWQ going and output GGUFs to test against ik_llama.cpp imatrix quants, e.g. my ubergarm/Qwen3-30B-A3B-GGUF.

Guessing inference speed with vLLM would be better, but I'm not sure how to test perplexity and KLD etc. on AWQ quants. Anyway, beyond the scope. Cheers and thanks for all your efforts!

@ubergarm

ubergarm commented May 2, 2025

@ehartford just got this running a moment ago; it takes about 17 GB VRAM to load, plus extra for parallel inferencing slots:

CUDA_VISIBLE_DEVICES="0" \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
VLLM_USE_MODELSCOPE=True \
vllm \
  serve swift/Qwen3-30B-A3B-AWQ \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --served-model-name swift/Qwen3-30B-A3B-AWQ \
  --host 127.0.0.1 \
  --port 8080

Not sure how they quantized their model, but maybe it's what you were trying, given enough time and VRAM.
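
For anyone wanting a quick sanity check once that server is up, something like the following should work against vLLM's OpenAI-compatible endpoint; this is just a sketch, with host, port, and model name mirroring the serve flags above:

# Quick sanity check of the OpenAI-compatible endpoint started above
# (assumes the `requests` package is installed).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "swift/Qwen3-30B-A3B-AWQ",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])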

@brian-dellabetta
Collaborator

Hi @ubergarm , yes AWQ will require a GPU to run in a reasonable amount of time for most models. We've got that somewhat hard-coded for now, and we'll have better support for offloaded models in a future release.

Yeah, I noticed Qwen publishes some AWQ-ed models (https://huggingface.co/Qwen/Qwen3-32B-AWQ) but no MoE models. There do seem to be lots in the community though 💪

brian-dellabetta added a commit that referenced this issue May 15, 2025
SUMMARY:
- Add QuantizationMixin to AWQModifier so we don't have redundant inputs
(num_bits, symmetric, group_size)
- Move AWQModifier to sequential pipelines, to avoid huge memory
requirements of caching all activations at once.

Regression test results are acceptable; results are all roughly the same and within stderr. See test plan below.

Resolves #1409 
Resolves #1369 
Related to #1383
Related to #1406 
Related to #1368 
Related to #1410 

More improvements split into #1435

TEST PLAN:
- [x] Rerun tests to validate
No regression in tests, comparing against those reported in the [original AWQ PR](#1177 (comment)).
All gsm8k results are within stderr:

| Type                       | gsm8k        | wikitext |
| -------------------------- | ------------ | -------- |
| Old AWQ+QuantModifier Sym  | .1054, .1069 | 9.1931   |
| New AWQ+QuantMixin Sym     | .1077, .1084 | 9.1841   |
| Old AWQ+QuantModifier Asym | .1274, .1281 | 9.0281   |
| New AWQ+QuantMixin Asym    | .1312, .1350 | 9.0288   |

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
@fpaupier

> While it seems the non-MoE Qwen3 models ran,

Hello @brian-dellabetta - do you have any recommendation for quantizing a dense Qwen3 (like the 14B) to FP8? Especially which parts to ignore, as in ignore=["lm_head"]?

@brian-dellabetta
Collaborator

Hi @fpaupier , for Qwen/Qwen3-14B you should be fine quantizing it with the FP8_DYNAMIC scheme, just ignoring lm_head. With multimodal models we usually exclude the vision component from quantization; otherwise it's almost always ignore=["lm_head"].
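
For reference, a minimal sketch of that recommendation, following the usual llm-compressor FP8 example pattern; the output directory name is arbitrary, and FP8_DYNAMIC needs no calibration data:

from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-14B"
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: static per-channel FP8 weights, dynamic per-token FP8 activations,
# so no calibration dataset is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)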

@fpaupier

Great, thanks for your insights @brian-dellabetta 👍

@brian-dellabetta
Collaborator

Just as a heads-up, we are still working on this. A couple of PRs have landed, and we will cut a release soon, once a couple more fixes in flight (#1435 & #1444) are merged.

@brian-dellabetta
Collaborator

Hi @ehartford , @ubergarm:

Quick update: on the #1444 branch, I was able to quantize "Qwen/Qwen3-30B-A3B". I uploaded a checkpoint to https://huggingface.co/nm-testing/Qwen3-30B-A3B-awq-w4a16-g128-sym/tree/main in case others would like to check it out. You can find more details in the PR summary there; I hope to merge it in soon.

Please note that two more key PRs still need to land to improve memory requirements during saving and when running AWQ through the pipeline.

I am hoping we can wrap these up soon and make a fresh release with AWQ in a much less experimental stage. Some of our wires got crossed in communicating the AWQ feature in llm-compressor; we were premature in some of the announcements. It should be much more robust in the next release, and we will do a broader announcement at that time. Appreciate your interest in using it!

@ubergarm

ubergarm commented Jun 3, 2025

Oh nice, thanks for the update and congrats on getting Qwen3-30B-A3B going; it is a pretty nice model in my testing, both for speed and reasonable quality.

Hrmm, I wish I had some kind of test harness to get apples-to-apples perplexity and KL-divergence comparisons with these AWQ quants. I have been using ik_llama.cpp and exllamav3 for some Qwen3-30B-A3B comparisons, and the new QTIP/trellis/exl3/ikN_kt quants are looking pretty good in comparison. The graph is way overpacked though, sorry about that haha...

[Image: comparison plot of Qwen3-30B-A3B quants]

I'm weak on the native transformers side of things given I've been mostly ik_llama.cpp focused lately. Hrmm..

Anyways, thanks again for the test quant; maybe I'll figure something out to compare it!

@brian-dellabetta
Collaborator

@ubergarm very nice plot! For WikiText-2, we usually use lm_eval to calculate perplexity:

lm_eval --model vllm --model_args pretrained="nm-testing/Qwen3-30B-A3B-awq-w4a16-g128-sym",add_bos_token=True,dtype=bfloat16,max_model_len=4096,gpu_memory_utilization=0.8 --tasks wikitext

vllm (pretrained=nm-testing/Qwen3-30B-A3B-awq-w4a16-g128-sym,add_bos_token=True,dtype=bfloat16,max_model_len=4096,gpu_memory_utilization=0.8), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
| Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext|      2|none  |     5|bits_per_byte  |↓  | 0.6671|±  |   N/A|
|        |       |none  |     5|byte_perplexity|↓  | 1.5879|±  |   N/A|
|        |       |none  |     5|word_perplexity|↓  |11.8539|±  |   N/A|

Will share with someone from the research team

@ubergarm

ubergarm commented Jun 3, 2025

@brian-dellabetta

Very cool, thanks for showing me how to run that and giving a clear example and result! I'd have to play with the parameters some, as it is always challenging to get apples-to-apples numbers/comparisons between different systems. Here are the perplexity values I had, which seem off from yours. These were done at 2k context size, I believe, which can affect things too if that is different.

[Image: table of perplexity values]

Thanks so much for your time and patience on this thread haha! Cheers!
